Building an Effective Tossup Agent
Goals
By the end of this guide, you will:
- Create a tossup agent that answers questions accurately
- Calibrate confidence thresholds for optimal buzzing
- Test performance on sample questions
- Submit your agent for evaluation
Baseline System Performance
Let's import the simple tossup agent pipeline umdclip/simple-tossup-pipeline and examine its configuration:
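Before changing anything, it helps to know what the configuration bundles: a model, a sampling temperature, a system prompt, and a buzzer confidence threshold. A minimal sketch of that shape in Python (the field names are illustrative, not the pipeline's real schema):

```python
# Illustrative sketch only: the real umdclip/simple-tossup-pipeline schema may differ.
baseline_config = {
    "model": "gpt-4o-mini",   # LLM that reads the revealed clues and proposes an answer
    "temperature": 0.7,       # sampling temperature for answer generation
    "system_prompt": "You answer quizbowl tossup questions.",  # placeholder text
    "buzzer": {
        "confidence_threshold": 0.85,  # buzz only once confidence reaches this value
    },
}
```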
The baseline system achieves:
- Accuracy: ~30% on sample questions
- Average Buzz Token Position: 40.40
- Average Confidence: 0.65
We'll improve this through targeted enhancements.
Enhancement 1: Basic Model Configuration
Current Performance
The default configuration uses gpt-4o-mini with a temperature of 0.7 and a confidence threshold of 0.85 for the buzzer.
Implementing the Enhancement
- Navigate to the "Tossup Agents" tab
- Select a stronger model (e.g., gpt-4o)
- Reduce temperature to 0.1 for more consistent outputs
- Test on sample questions (a scripted equivalent of these settings is sketched below)
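If you want to reproduce the same settings outside the UI, here is a minimal sketch using the OpenAI Python client. Only the model name and temperature come from the steps above; the helper name and message contents are assumptions for illustration:

```python
# Sketch: mirrors the UI settings above with the OpenAI Python client.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def answer_tossup(question_text: str) -> str:
    """Ask the model for an answer to a partially revealed tossup (illustrative helper)."""
    response = client.chat.completions.create(
        model="gpt-4o",    # stronger model selected in this enhancement
        temperature=0.1,   # low temperature for more consistent outputs
        messages=[
            {"role": "system", "content": "You are a quizbowl player answering tossup questions."},
            {"role": "user", "content": question_text},
        ],
    )
    return response.choices[0].message.content
```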
Validation
Run the agent on test questions and check (a small summary loop is sketched after this list):
- Has accuracy improved?
- Are confidence scores more consistent?
- Is your agent buzzing earlier?
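A small summary loop for these checks, assuming you have exported per-question results with fields like correct, confidence, and buzz_position (those names are assumptions, not the platform's export format):

```python
# Sketch: summarize per-question results (field names are illustrative).
results = [
    {"correct": True, "confidence": 0.91, "buzz_position": 52},
    {"correct": False, "confidence": 0.72, "buzz_position": 38},
    # ... one record per sample question
]

accuracy = sum(r["correct"] for r in results) / len(results)
avg_confidence = sum(r["confidence"] for r in results) / len(results)
avg_buzz_position = sum(r["buzz_position"] for r in results) / len(results)

print(f"Accuracy:          {accuracy:.2%}")
print(f"Avg confidence:    {avg_confidence:.2f}")
print(f"Avg buzz position: {avg_buzz_position:.1f} tokens")
```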
Results
With better model configuration:
- Accuracy increases to ~80%
- Average buzz token position increases to 59.60 (the agent reads further into the question before buzzing)
Enhancement 2: System Prompt Optimization
Current Performance
The default prompt lacks specific instructions for:
- Answer formatting
- Confidence calibration
- Domain-specific knowledge
Implementing the Enhancement
- Click the "System Prompt" tab
- Add specific instructions (a sketch of wiring this prompt into a parseable answer/confidence call follows the prompt text):
You are a professional quizbowl player answering tossup questions.
Your task:
1. Analyze clues in the question text
2. Determine the most likely answer
3. Assess confidence on a scale from 0.0 to 1.0
Important guidelines:
- Give answers in the expected format (person's full name, complete title, etc.)
- Use 0.8+ confidence ONLY when absolutely certain
- For literature, include author's full name
- For science, include complete technical terms
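To drive a prompt like this programmatically, the model also needs to return the answer and confidence in a parseable form. Here is a minimal sketch that asks for JSON and parses it; the JSON field names and the helper are assumptions, not a platform API:

```python
# Sketch: request answer + confidence as JSON and parse it (output format is an assumption).
import json
from openai import OpenAI

client = OpenAI()

SYSTEM_PROMPT = (
    "You are a professional quizbowl player answering tossup questions. "
    'Reply ONLY with a JSON object: {"answer": "<your answer>", "confidence": <0.0 to 1.0>}'
)

def answer_with_confidence(question_text: str) -> tuple[str, float]:
    response = client.chat.completions.create(
        model="gpt-4o",
        temperature=0.1,
        response_format={"type": "json_object"},  # ask the API for strict JSON output
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": question_text},
        ],
    )
    data = json.loads(response.choices[0].message.content)
    return data["answer"], float(data["confidence"])
```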
Validation
Test on the same questions and check:
- Are answers formatted more consistently?
- Is confidence more accurately reflecting correctness?
- Do the categories where you added domain knowledge show improvement?
Results
With optimized prompts:
- Accuracy increases to ~75%
- Confidence scores align better with actual performance
- Answer formats become more consistent
Enhancement 3: Confidence Calibration
Current Performance
Even with better prompts, confidence thresholds may be:
- Too high (missing answerable questions)
- Too low (buzzing incorrectly)
Implementing the Enhancement
- Scroll to "Buzzer Settings"
- Test different thresholds (0.7-0.9)
- Find optimal balance between:
- Buzzing early enough to score points
- Waiting for sufficient confidence
Validation
For each threshold (a scripted sweep is sketched after this list):
- Run tests on multiple questions
- Check percentage of correct buzzes
- Monitor average buzz position
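If you kept the per-question records from earlier (correct, confidence, buzz_position, illustrative names), a small sweep makes the tradeoff explicit. Note the simplification: it scores each question by its final confidence, while the live buzzer checks confidence incrementally, so treat the numbers as a rough guide:

```python
# Sketch: sweep buzzer thresholds over logged results (field names are illustrative).
def sweep_thresholds(results, thresholds=(0.70, 0.75, 0.80, 0.85, 0.90)):
    for t in thresholds:
        buzzed = [r for r in results if r["confidence"] >= t]
        if not buzzed:
            print(f"threshold {t:.2f}: never buzzes")
            continue
        buzz_accuracy = sum(r["correct"] for r in buzzed) / len(buzzed)
        avg_position = sum(r["buzz_position"] for r in buzzed) / len(buzzed)
        print(
            f"threshold {t:.2f}: buzzes on {len(buzzed)}/{len(results)} questions, "
            f"{buzz_accuracy:.0%} correct, avg buzz position {avg_position:.1f}"
        )
```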
Results
With a calibrated threshold (e.g., 0.75):
- Balance between accuracy and early buzzing
- Fewer incorrect buzzes
- Earlier correct buzzes
Enhancement 4: Multi-Step Pipeline
Current Performance
Single-step pipelines often struggle with:
- Accurately separating answer generation from confidence estimation
- Providing consistent performance across question types
Implementing the Enhancement
- Click "+ Add Step" to create a two-step pipeline:
- Step A: Answer Generator
- Step B: Confidence Evaluator
- Configure each step:
- Step A focuses only on generating the best answer
- Step B evaluates confidence based on the answer and the question (see the sketch below)
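Conceptually the two steps chain as below. This is a plain-Python sketch of the split, not the internals of the pipeline loaded next; the model names, prompts, and JSON format are assumptions:

```python
# Sketch: a two-step answer / confidence pipeline (not the actual umdclip pipeline code).
import json
from openai import OpenAI

client = OpenAI()

def generate_answer(question_text: str) -> str:
    """Step A: focus only on producing the best answer."""
    response = client.chat.completions.create(
        model="gpt-4o",
        temperature=0.1,
        messages=[
            {"role": "system", "content": "Answer the quizbowl tossup with just the answer."},
            {"role": "user", "content": question_text},
        ],
    )
    return response.choices[0].message.content.strip()

def evaluate_confidence(question_text: str, answer: str) -> float:
    """Step B: judge how likely the proposed answer is correct, given the clues seen so far."""
    response = client.chat.completions.create(
        model="gpt-4o",
        temperature=0.0,
        response_format={"type": "json_object"},
        messages=[
            {"role": "system", "content": 'Return JSON: {"confidence": <0.0 to 1.0>} for the proposed answer.'},
            {"role": "user", "content": f"Question so far: {question_text}\nProposed answer: {answer}"},
        ],
    )
    return float(json.loads(response.choices[0].message.content)["confidence"])

def run_pipeline(question_text: str, threshold: float = 0.75):
    answer = generate_answer(question_text)
    confidence = evaluate_confidence(question_text, answer)
    return answer, confidence, confidence >= threshold  # (answer, confidence, buzz decision)
```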
Let's load the multi-step pipeline umdclip/two-step-justified-confidence, which implements exactly this split.
For more details on the pipeline, see Advanced Pipeline Examples.
Validation
Test the multi-step pipeline and compare to single-step:
- Does separation of concerns improve performance?
- Are confidence scores more accurate?
- Is there improvement in early buzz positions?
Final Evaluation and Submission
- Run comprehensive testing across categories
- Verify metrics match your goals
- Export your pipeline configuration
- Submit your agent for official evaluation
For complete UI reference, see UI Reference.