Maharshi Gor
# Quizbowl Agent Goals and Evaluation
## Objectives
### Tossup Agents
- Respond to each question with the best guess and a calibrated confidence score
- Buzz as early as possible once there is enough information to answer
- Avoid incorrect buzzes
- Maintain consistent performance across topics
### Bonus Agents
- Answer parts correctly with accurate confidence estimation
- Provide a clear explanation of the reasoning, which human team members will use to validate or pick the suggested answer
- Adapt to varying difficulty levels (easy, medium, hard)
## Performance Metrics
### Tossup Metrics
- **Accuracy**: Percentage of correct answers
- **Average Buzz Position**: How early in the question you buzz (earlier is better)
- **Confidence Calibration**: How closely the confidence scores match the actual rate of correct answers
- **Score**: Points earned based on buzz position and correctness
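
As a rough illustration of the first three metrics above, the sketch below computes accuracy, average buzz position, and a binned calibration error from a list of logged results. The `TossupResult` record, its field names, and the choice of expected calibration error are assumptions made for this example; they are not the competition's official scoring code, and the point-based **Score** is left out because its formula is not specified here.

```python
from dataclasses import dataclass


@dataclass
class TossupResult:
    """Hypothetical per-question log entry (field names are assumptions)."""
    correct: bool          # whether the guess at the buzz was right
    buzz_position: float   # fraction of the question revealed at the buzz (0.0-1.0)
    confidence: float      # confidence reported at the buzz (0.0-1.0)


def tossup_metrics(results, n_bins=5):
    """Return accuracy, mean buzz position, and a binned calibration error."""
    n = len(results)
    accuracy = sum(r.correct for r in results) / n
    avg_buzz = sum(r.buzz_position for r in results) / n

    # Group results into equal-width confidence bins.
    bins = [[] for _ in range(n_bins)]
    for r in results:
        idx = min(int(r.confidence * n_bins), n_bins - 1)
        bins[idx].append(r)

    # Expected calibration error: weighted gap between the mean confidence
    # and the observed accuracy inside each non-empty bin.
    ece = 0.0
    for b in bins:
        if not b:
            continue
        bin_acc = sum(r.correct for r in b) / len(b)
        bin_conf = sum(r.confidence for r in b) / len(b)
        ece += len(b) / n * abs(bin_acc - bin_conf)

    return {
        "accuracy": accuracy,
        "avg_buzz_position": avg_buzz,
        "calibration_error": ece,
    }
```

A calibration error near 0 means the reported confidence tracks how often the agent is right; a large value suggests the agent is over- or under-confident.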
### Bonus Metrics
- **Accuracy**: Percentage of correct answers across all parts
- **Confidence Calibration**: How closely the confidence scores match the actual rate of correct answers
- **Explanation Quality**: Relevance and clarity of reasoning
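
For the two quantitative bonus metrics, a minimal sketch might look like the following. It assumes each bonus part is logged as a `(correct, confidence)` pair and uses the Brier score as one possible calibration summary; both the log format and the choice of Brier score are assumptions for illustration. Explanation quality is not covered because it needs human or rubric-based judgment.

```python
def bonus_metrics(parts):
    """Summarize bonus performance.

    `parts` is a list of (correct, confidence) pairs, one per bonus part
    (a hypothetical log format assumed for this example).
    """
    n = len(parts)
    accuracy = sum(1 for correct, _ in parts if correct) / n

    # Brier score: mean squared gap between confidence and the 0/1 outcome.
    # 0.0 is perfect; always answering with confidence 0.5 scores 0.25.
    brier = sum((conf - float(correct)) ** 2 for correct, conf in parts) / n

    return {"accuracy": accuracy, "brier_score": brier}
```

Lower Brier scores are better, so a confident wrong answer is penalized more heavily than a hesitant one.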
## Evaluating Your Agent
### Testing Baseline Performance
1. Run the default agent configuration
2. Record metrics (accuracy, confidence, buzz position)
3. Identify specific weaknesses in performance
### Validating Improvements
After each enhancement:
1. Run the agent on the same development set of questions
2. Compare metrics to previous version
3. Check for improvements in weak areas
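
The comparison step above can be sketched as a small helper that diffs two metric dictionaries recorded on the same development set. The metric names and the set of "lower is better" metrics are assumptions for this example; adjust them to whatever you actually record.

```python
def compare_metrics(baseline, candidate,
                    lower_is_better=("avg_buzz_position", "calibration_error")):
    """Report the change in every metric present in both runs.

    `baseline` and `candidate` map metric names to numbers. Metrics named in
    `lower_is_better` count as improved when they decrease; all others count
    as improved when they increase.
    """
    report = {}
    for name in baseline.keys() & candidate.keys():
        delta = candidate[name] - baseline[name]
        improved = delta < 0 if name in lower_is_better else delta > 0
        report[name] = {"delta": delta, "improved": improved}
    return report


if __name__ == "__main__":
    before = {"accuracy": 0.60, "avg_buzz_position": 0.70}
    after = {"accuracy": 0.65, "avg_buzz_position": 0.75}
    for name, row in sorted(compare_metrics(before, after).items()):
        print(f"{name}: {row['delta']:+.3f} improved={row['improved']}")
```

Comparing on a fixed question set keeps the deltas attributable to the change in the agent and not to a change in the questions.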
### Final Evaluation Criteria
Your final agent will be evaluated on:
1. Overall accuracy across diverse questions
2. Optimal buzz timing (neither too early nor too late)
3. Confidence threshold calibration
4. Explanation quality (for bonus agents)
<!-- ## Setting Goals for Your Agent
### Minimum Goals
- Accuracy above 60%
- Appropriate confidence threshold (0.7-0.9)
- Reasonable buzz positions
### Advanced Goals
- Multi-step pipelines with specialized components
- Accuracy above 85%
- Strategic early buzzing on familiar topics
- Detailed, accurate explanations for bonus questions -->