|
# Building an Effective Tossup Agent |
|
|
|
## Goals |
|
By the end of this guide, you will: |
|
- Create a tossup agent that answers questions accurately |
|
- Calibrate confidence thresholds for optimal buzzing |
|
- Test performance on sample questions |
|
- Submit your agent for evaluation |
|
|
|
## Baseline System Performance |
|
|
|
Let's import the baseline tossup pipeline `umdclip/simple-tossup-pipeline` and examine its configuration:
|
|
|
 |
|
|
|
The baseline system achieves: |
|
- Accuracy: ~30% on sample questions |
|
- Average Buzz Token Position: 40.40 |
|
- Average Confidence: 0.65 |
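
If you want to reproduce these numbers yourself, each metric is a simple aggregate over a batch of test runs. Here is a minimal sketch; the record fields are assumptions for illustration, not the platform's actual export schema:

```python
# Hypothetical per-question records from a test run.
runs = [
    {"correct": True, "buzz_position": 35, "confidence": 0.71},
    {"correct": False, "buzz_position": 46, "confidence": 0.59},
    # ... one record per test question
]

accuracy = sum(r["correct"] for r in runs) / len(runs)
avg_buzz = sum(r["buzz_position"] for r in runs) / len(runs)
avg_conf = sum(r["confidence"] for r in runs) / len(runs)
print(f"Accuracy: {accuracy:.0%}  Avg Buzz Position: {avg_buzz:.2f}  Avg Confidence: {avg_conf:.2f}")
```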
|
|
|
We'll improve this through targeted enhancements. |
|
|
|
## Enhancement 1: Basic Model Configuration |
|
|
|
### Current Performance |
|
The default configuration uses `gpt-4o-mini` with temperature `0.7` and a confidence threshold of `0.85` for the buzzer.
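
To make the threshold's role concrete, here is a minimal sketch of the buzz loop: the agent sees the question one token at a time and buzzes the first time its reported confidence clears the threshold. `ask_model` is a hypothetical stand-in for the pipeline's model call.

```python
def run_tossup(tokens, ask_model, threshold=0.85):
    """Reveal the question token by token; buzz once confidence clears the threshold."""
    answer, confidence = None, 0.0
    for i in range(1, len(tokens) + 1):
        prefix = " ".join(tokens[:i])
        answer, confidence = ask_model(prefix)  # hypothetical (answer, confidence) call
        if confidence >= threshold:
            return answer, i  # buzz at token position i
    return answer, len(tokens)  # never cleared the threshold: answer at the end
```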
|
|
|
### Implementing the Enhancement |
|
1. Navigate to "Tossup Agents" tab |
|
2. Select a stronger model (e.g., gpt-4o) |
|
3. Reduce temperature to 0.1 for more consistent outputs |
|
4. Test on sample questions |
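
Taken together, the changed settings amount to something like the following. This is a hypothetical view of the configuration, not the platform's actual export schema:

```python
# Hypothetical pipeline configuration after Enhancement 1.
config = {
    "model": "gpt-4o",       # stronger model than gpt-4o-mini
    "temperature": 0.1,      # lower temperature -> more consistent outputs
    "buzz_threshold": 0.85,  # unchanged for now; tuned in Enhancement 3
}
```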
|
|
|
### Validation |
|
Run the agent on test questions and check: |
|
- Has accuracy improved? |
|
- Are confidence scores more consistent? |
|
- Is your agent buzzing earlier? |
|
|
|
### Results |
|
With better model configuration: |
|
- Accuracy increases to ~80% |
|
- Average buzz token position increases to 59.60
|
|
|
## Enhancement 2: System Prompt Optimization |
|
|
|
### Current Performance |
|
The default prompt lacks specific instructions for: |
|
- Answer formatting |
|
- Confidence calibration |
|
- Domain-specific knowledge |
|
|
|
### Implementing the Enhancement |
|
1. Click "System Prompt" tab |
|
2. Add specific instructions: |
|
|
|
``` |
|
You are a professional quizbowl player answering tossup questions. |
|
|
|
Your task: |
|
1. Analyze clues in the question text |
|
2. Determine the most likely answer |
|
3. Assess confidence on a scale from 0.0 to 1.0 |
|
|
|
Important guidelines: |
|
- Give answers in the expected format (person's full name, complete title, etc.) |
|
- Use 0.8+ confidence ONLY when absolutely certain |
|
- For literature, include author's full name |
|
- For science, include complete technical terms |
|
``` |
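
The prompt above leaves the output format implicit. One option is to additionally instruct the model to end its response with `ANSWER:` and `CONFIDENCE:` lines and parse them out. The helper below is a sketch under that assumption, not part of the platform:

```python
import re

def parse_response(text):
    """Extract the answer and confidence from 'ANSWER: ...' / 'CONFIDENCE: 0.x' lines."""
    answer = re.search(r"ANSWER:\s*(.+)", text)
    conf = re.search(r"CONFIDENCE:\s*([01](?:\.\d+)?)", text)
    if not answer or not conf:
        return None, 0.0  # malformed output: treat as a no-buzz
    return answer.group(1).strip(), float(conf.group(1))
```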
|
|
|
### Validation |
|
Test on the same questions and check: |
|
- Are answers formatted more consistently? |
|
- Does confidence more accurately reflect correctness?

- Do the categories where you added domain knowledge show improvement?
|
|
|
### Results |
|
With optimized prompts: |
|
- Accuracy increases to ~75% |
|
- Confidence scores align better with actual performance |
|
- Answer formats become more consistent |
|
|
|
## Enhancement 3: Confidence Calibration |
|
|
|
### Current Performance |
|
Even with better prompts, confidence thresholds may be: |
|
- Too high (missing answerable questions) |
|
- Too low (buzzing incorrectly) |
|
|
|
### Implementing the Enhancement |
|
1. Scroll to "Buzzer Settings" |
|
2. Test different thresholds (0.7-0.9) |
|
3. Find optimal balance between: |
|
- Buzzing early enough to score points |
|
- Waiting for sufficient confidence |
|
|
|
 |
|
|
|
### Validation |
|
For each threshold: |
|
1. Run tests on multiple questions |
|
2. Check percentage of correct buzzes |
|
3. Monitor average buzz position |
|
|
|
### Results |
|
With a calibrated threshold (e.g., 0.75):

- A better balance between accuracy and early buzzing
|
- Fewer incorrect buzzes |
|
- Earlier correct buzzes |
|
|
|
## Enhancement 4: Multi-Step Pipeline |
|
|
|
### Current Performance |
|
Single-step pipelines often struggle with: |
|
- Accurately separating answer generation from confidence estimation |
|
- Providing consistent performance across question types |
|
|
|
### Implementing the Enhancement |
|
1. Click "+ Add Step" to create a two-step pipeline: |
|
- Step A: Answer Generator |
|
- Step B: Confidence Evaluator |
|
2. Configure each step: |
|
- Step A focuses only on generating the best answer |
|
- Step B evaluates confidence based on the answer and the question (see the sketch below)
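
Conceptually, the two steps chain like this. `call_llm` and both prompts are hypothetical stand-ins for whatever each step is actually configured with:

```python
def two_step_pipeline(question_text, call_llm):
    # Step A: generate the best answer, nothing else.
    answer = call_llm(
        system="You are a quizbowl expert. Output only the answer.",
        user=question_text,
    )
    # Step B: judge the proposed answer with the question as context.
    verdict = call_llm(
        system="Given a question prefix and a proposed answer, "
               "output a confidence between 0.0 and 1.0.",
        user=f"Question: {question_text}\nProposed answer: {answer}",
    )
    return answer, float(verdict)
```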
|
|
|
Let's load the multi-step pipeline `umdclip/two-step-justified-confidence`, which implements exactly this design.

For more details on the pipeline, see [Advanced Pipeline Examples](./advanced-pipeline-examples.md).
|
|
|
### Validation |
|
Test the multi-step pipeline and compare to single-step: |
|
- Does separation of concerns improve performance? |
|
- Are confidence scores more accurate? |
|
- Is there improvement in early buzz positions? |
|
|
|
## Final Evaluation and Submission |
|
|
|
1. Run comprehensive testing across categories |
|
2. Verify metrics match your goals |
|
3. Export your pipeline configuration |
|
4. Submit your agent for official evaluation |
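
If you want to keep a local snapshot of what you submit, and assuming the exported configuration is JSON-like (an assumption; check the UI's actual export format), saving it is one call:

```python
import json

# `config` stands in for the pipeline configuration exported from the UI.
config = {"model": "gpt-4o", "temperature": 0.1, "buzz_threshold": 0.75}

with open("tossup-pipeline.json", "w") as f:
    json.dump(config, f, indent=2)
```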
|
|
|
For complete UI reference, see [UI Reference](./ui-reference.md). |