# Quizbowl Agent Goals and Evaluation
## Objectives
### Tossup Agents
- Respond to each question with the best guess and a calibrated confidence score
- Buzz as early as possible once there is sufficient information to answer
- Avoid incorrect buzzes
- Maintain consistent performance across topics
### Bonus Agents
- Answer parts correctly with accurate confidence estimation
- Provide a clear explanation of the reasoning, which human team members will use to validate or select the suggested answer
- Adapt to varying difficulty levels (easy, medium, hard)
## Performance Metrics
### Tossup Metrics
- **Accuracy**: Percentage of correct answers
- **Average Buzz Position**: How early in the question you buzz (earlier is better)
- **Confidence Calibration**: How well the confidence score matches actual correctness
- **Score**: Points earned based on buzz position and correctness
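The tossup metrics above can be sketched in a few lines of Python. This is only an illustration: the `TossupResult` record, the choice to express buzz position as a fraction of the question revealed, and the scoring rule (a hypothetical 10 points scaled down for later buzzes, with a 5 point penalty for a wrong buzz) are all assumptions, not the evaluation's actual definitions.

```python
from dataclasses import dataclass


@dataclass
class TossupResult:
    """One tossup attempt (hypothetical record layout)."""
    correct: bool
    buzz_position: float  # assumed: fraction of the question revealed at buzz, 0.0 to 1.0
    confidence: float     # agent's confidence in its guess, 0.0 to 1.0


def tossup_score(result: TossupResult) -> float:
    """Hypothetical scoring rule: earlier correct buzzes earn more, wrong buzzes lose points."""
    if result.correct:
        return 10.0 * (1.0 - 0.5 * result.buzz_position)
    return -5.0


def tossup_metrics(results: list[TossupResult]) -> dict[str, float]:
    """Aggregate accuracy, average buzz position, and total score over a set of tossups."""
    if not results:
        raise ValueError("results must not be empty")
    n = len(results)
    return {
        "accuracy": sum(r.correct for r in results) / n,
        "avg_buzz_position": sum(r.buzz_position for r in results) / n,
        "score": sum(tossup_score(r) for r in results),
    }
```

Whatever the real scoring rule is, keeping it in a single function like `tossup_score` makes it easy to swap in once it is known.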
### Bonus Metrics
- **Accuracy**: Percentage of correct answers across all parts
- **Confidence Calibration**: How well the confidence score matches actual correctness
- **Explanation Quality**: Relevance and clarity of reasoning
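Confidence calibration appears in both metric lists, but this document does not say how it is measured. One common choice is expected calibration error (ECE), sketched below as an assumption: predictions are grouped into equal-width confidence bins, and the gap between average confidence and accuracy in each bin is averaged, weighted by bin size. The function name and the default of 10 bins are illustrative.

```python
def expected_calibration_error(confidences, outcomes, n_bins=10):
    """Expected calibration error over equal-width confidence bins.

    confidences: floats in [0, 1]; outcomes: booleans (True if the answer was correct).
    A value of 0.0 means confidence matched accuracy in every non-empty bin.
    """
    if len(confidences) != len(outcomes):
        raise ValueError("confidences and outcomes must have the same length")
    if not confidences:
        raise ValueError("at least one prediction is required")

    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, outcomes):
        # Clamp so that 1.0 falls in the last bin and stray values stay in range.
        idx = max(0, min(int(conf * n_bins), n_bins - 1))
        bins[idx].append((conf, ok))

    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / total) * abs(avg_conf - accuracy)
    return ece
```

A lower value is better. An agent that reports 0.9 confidence but is right only half the time in that bin contributes a gap of about 0.4 there.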
## Evaluating Your Agent
### Testing Baseline Performance
1. Run the default agent configuration
2. Record metrics (accuracy, confidence, buzz position)
3. Identify specific weaknesses in performance
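One way to approach step 3 is to break accuracy down by topic, which also speaks to the goal of consistent performance across topics. The sketch below assumes each evaluated question is available as a `(category, correct)` pair; that record shape and the function names are hypothetical.

```python
from collections import defaultdict


def accuracy_by_category(records):
    """Map each category to its accuracy, given (category, correct) pairs."""
    totals = defaultdict(lambda: [0, 0])  # category -> [correct count, total count]
    for category, correct in records:
        totals[category][1] += 1
        if correct:
            totals[category][0] += 1
    return {cat: hits / n for cat, (hits, n) in totals.items()}


def weakest_categories(records, k=3):
    """Return up to k category names, lowest accuracy first (ties broken by name)."""
    acc = accuracy_by_category(records)
    return sorted(acc, key=lambda cat: (acc[cat], cat))[:k]
```

Categories with only a handful of questions give noisy accuracies, so it is worth checking counts before acting on the ranking.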
### Validating Improvements
After each enhancement:
1. Run the agent on the same development set of questions
2. Compare metrics to the previous version
3. Check for improvements in weak areas
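The comparison in steps 2 and 3 can be made mechanical. The sketch below assumes each run's metrics are stored in a plain dictionary keyed by metric name. The key `avg_buzz_position` and the set of metrics where lower is better are assumptions for illustration.

```python
def compare_metrics(baseline, candidate, lower_is_better=("avg_buzz_position",)):
    """Report the change in each metric shared by two runs on the same question set.

    Returns {metric: {"delta": candidate - baseline, "improved": bool}}.
    """
    report = {}
    for name in sorted(baseline.keys() & candidate.keys()):
        delta = candidate[name] - baseline[name]
        if name in lower_is_better:
            improved = delta < 0
        else:
            improved = delta > 0
        report[name] = {"delta": delta, "improved": improved}
    return report
```

The comparison is only meaningful when both runs use the same development set, as step 1 requires.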
### Final Evaluation Criteria
Your final agent will be evaluated on:
1. Overall accuracy across diverse questions
2. Optimal buzz timing (neither too early nor too late)
3. Confidence threshold calibration
4. Explanation quality (for bonus agents)
<!-- ## Setting Goals for Your Agent
### Minimum Goals
- Accuracy above 60%
- Appropriate confidence threshold (0.7-0.9)
- Reasonable buzz positions
### Advanced Goals
- Multi-step pipelines with specialized components
- Accuracy above 85%
- Strategic early buzzing on familiar topics
- Detailed, accurate explanations for bonus questions -->