quizbowl-submission / docs /goals-and-evaluation.md

Maharshi Gor

Added better documentation

0f6850b 17 days ago

	# Quizbowl Agent Goals and Evaluation

	## Objectives

	### Tossup Agents
	- Respond to each question with the best guess and a calibrated confidence score
	- Buzz as early as possible once there is sufficient information
	- Avoid incorrect buzzes
	- Maintain consistent performance across topics
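
One way to read the buzzing objectives above is as a threshold rule: reveal the question incrementally and buzz the first time confidence crosses a cutoff. The sketch below is only an illustration under that assumption. `Buzz`, `first_buzz`, `toy_guesser`, and the default threshold of 0.8 are hypothetical names and values, not part of the submission interface.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Buzz:
    position: int       # number of words revealed when the agent buzzed
    guess: str
    confidence: float


def first_buzz(
    words: List[str],
    guesser: Callable[[str], Tuple[str, float]],
    threshold: float = 0.8,
) -> Optional[Buzz]:
    """Reveal the question word by word and buzz at the first prefix
    whose confidence reaches the threshold. Returns None if it never does."""
    for i in range(1, len(words) + 1):
        guess, confidence = guesser(" ".join(words[:i]))
        if confidence >= threshold:
            return Buzz(position=i, guess=guess, confidence=confidence)
    return None


def toy_guesser(prefix: str) -> Tuple[str, float]:
    """Hypothetical stand-in for a model: confidence grows with each word."""
    n_words = len(prefix.split())
    return "Paris", min(1.0, n_words / 5)


question = "This city hosts the Eiffel Tower".split()
buzz = first_buzz(question, toy_guesser, threshold=0.8)
never = first_buzz(question, toy_guesser, threshold=1.1)
```

Raising the threshold trades earlier buzzes for fewer incorrect ones, which is the tension between the second and third objectives.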

	### Bonus Agents
	- Answer parts correctly with accurate confidence estimation
	- Provide a clear explanation of the reasoning, which human team members will use to validate or choose the suggested answer
	- Adapt to varying difficulty levels (easy, medium, hard)

	## Performance Metrics

	### Tossup Metrics
	- Accuracy: Percentage of correct answers
	- Average Buzz Position: How early in the question you buzz (earlier is better)
	- Confidence Calibration: How closely the confidence score matches actual accuracy
	- Score: Points earned based on buzz position and correctness
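
As a rough sketch of how the tossup metrics above could be computed on a development set, consider the following. The `TossupResult` fields and the scoring rule (+10 for a correct answer, -5 for an incorrect buzz before the question ends, 0 otherwise) are assumptions made for illustration. The official scoring may differ.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class TossupResult:
    correct: bool
    buzz_position: int    # words revealed when the agent buzzed
    question_length: int  # total words in the question
    confidence: float     # confidence reported at buzz time


def illustrative_score(r: TossupResult) -> int:
    """Assumed scoring rule, not the official one."""
    if r.correct:
        return 10
    if r.buzz_position < r.question_length:
        return -5  # incorrect early buzz
    return 0


def tossup_metrics(results: List[TossupResult]) -> Dict[str, float]:
    n = len(results)
    accuracy = sum(r.correct for r in results) / n
    # Buzz position as a fraction of the question; lower means earlier.
    avg_buzz = sum(r.buzz_position / r.question_length for r in results) / n
    mean_conf = sum(r.confidence for r in results) / n
    return {
        "accuracy": accuracy,
        "avg_buzz_position": avg_buzz,
        # Coarse calibration signal: gap between mean confidence and accuracy.
        "calibration_gap": abs(mean_conf - accuracy),
        "score": float(sum(illustrative_score(r) for r in results)),
    }


results = [
    TossupResult(True, 5, 10, 0.9),
    TossupResult(False, 4, 10, 0.7),
    TossupResult(True, 10, 10, 0.8),
    TossupResult(False, 10, 10, 0.4),
]
metrics = tossup_metrics(results)
```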

	### Bonus Metrics
	- Accuracy: Percentage of correct answers across all parts
	- Confidence Calibration: How closely the confidence score matches actual accuracy
	- Explanation Quality: Relevance and clarity of reasoning
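
Confidence calibration can be quantified in several ways. A common choice is expected calibration error (ECE), which bins predictions by confidence and averages the gap between confidence and accuracy in each bin. The sketch below assumes this metric and ten equal-width bins. Neither is confirmed as the measure used by the evaluation, and the function name is illustrative.

```python
from typing import List, Sequence, Tuple


def expected_calibration_error(
    confidences: Sequence[float],
    correct: Sequence[bool],
    n_bins: int = 10,
) -> float:
    """Weighted average of |mean confidence - accuracy| over confidence bins."""
    n = len(confidences)
    if n == 0:
        return 0.0
    bins: List[List[Tuple[float, bool]]] = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # Clamp so that a confidence of exactly 1.0 falls in the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / n) * abs(avg_conf - accuracy)
    return ece


# One confidence and correctness flag per bonus part.
ece = expected_calibration_error(
    [0.9, 0.9, 0.1, 0.1], [True, True, False, False]
)
perfect = expected_calibration_error([1.0, 0.0], [True, False])
```

A value near zero means the reported confidence can be taken at face value, which matters when human teammates rely on it to accept or reject a suggested answer.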

	## Evaluating Your Agent

	### Testing Baseline Performance
	1. Run the default agent configuration
	2. Record metrics (accuracy, confidence, buzz position)
	3. Identify specific weaknesses in performance
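
For step 3, one simple approach is to break accuracy down by topic and look at the lowest-scoring categories. This sketch assumes each evaluated question carries a category label. The record format and function names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def accuracy_by_category(records: Iterable[Tuple[str, bool]]) -> Dict[str, float]:
    """Map each category to the fraction of its questions answered correctly."""
    totals: Dict[str, int] = defaultdict(int)
    hits: Dict[str, int] = defaultdict(int)
    for category, correct in records:
        totals[category] += 1
        hits[category] += int(correct)
    return {category: hits[category] / totals[category] for category in totals}


def weakest_categories(per_category: Dict[str, float], k: int = 2) -> List[str]:
    """Return the k categories with the lowest accuracy."""
    return sorted(per_category, key=per_category.get)[:k]


records = [
    ("History", True),
    ("History", True),
    ("Science", False),
    ("Science", True),
    ("Literature", False),
]
per_category = accuracy_by_category(records)
weak = weakest_categories(per_category, k=2)
```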

	### Validating Improvements
	After each enhancement:
	1. Run the agent on the same development set of questions
	2. Compare metrics to the previous version
	3. Check for improvements in weak areas
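
The comparison in step 2 can be as simple as a per-metric difference between two runs on the same development set. The sketch below assumes metrics are stored as plain dictionaries. The metric names and the set of lower-is-better metrics are illustrative.

```python
from typing import Dict, Iterable, Tuple


def compare_metrics(
    baseline: Dict[str, float],
    candidate: Dict[str, float],
    lower_is_better: Iterable[str] = ("avg_buzz_position", "calibration_error"),
) -> Dict[str, Tuple[float, bool]]:
    """Return {metric: (delta, improved)} for metrics present in both runs."""
    lower = set(lower_is_better)
    report: Dict[str, Tuple[float, bool]] = {}
    for name in sorted(baseline.keys() & candidate.keys()):
        delta = candidate[name] - baseline[name]
        improved = delta < 0 if name in lower else delta > 0
        report[name] = (delta, improved)
    return report


baseline = {"accuracy": 0.60, "avg_buzz_position": 0.80, "calibration_error": 0.10}
candidate = {"accuracy": 0.65, "avg_buzz_position": 0.70, "calibration_error": 0.12}
report = compare_metrics(baseline, candidate)

for name, (delta, improved) in report.items():
    print(f"{name}: {delta:+.3f} ({'improved' if improved else 'not improved'})")
```

Keeping the development set fixed between runs is what makes these deltas meaningful. A change in the question set would confound the comparison.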

	### Final Evaluation Criteria
	Your final agent will be evaluated on:
	1. Overall accuracy across diverse questions
	2. Optimal buzz timing (neither too early nor too late)
	3. Confidence threshold calibration
	4. Explanation quality (for bonus agents)

	<!-- ## Setting Goals for Your Agent

	### Minimum Goals
	- Accuracy above 60%
	- Appropriate confidence threshold (0.7-0.9)
	- Reasonable buzz positions

	### Advanced Goals
	- Multi-step pipelines with specialized components
	- Accuracy above 85%
	- Strategic early buzzing on familiar topics
	- Detailed, accurate explanations for bonus questions -->