Building an Effective Tossup Agent

Goals

By the end of this guide, you will:

  • Create a tossup agent that answers questions accurately
  • Calibrate confidence thresholds for optimal buzzing
  • Test performance on sample questions
  • Submit your agent for evaluation

Baseline System Performance

Let's load the simple tossup agent pipeline umdclip/simple-tossup-pipeline and examine its configuration:

Default Tossup Configuration

The baseline system achieves:

  • Accuracy: ~30% on sample questions
  • Average Buzz Token Position: 40.40
  • Average Confidence: 0.65

We'll improve this through targeted enhancements.
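
First, to make these metrics concrete: evaluation reveals the question token by token, and the agent buzzes once its confidence crosses a threshold. The sketch below is a minimal illustration of that loop, assuming a hypothetical `agent(text) -> (answer, confidence)` callable and a list of question dicts, rather than the actual pipeline API:

```python
# Hypothetical sketch of tossup evaluation: reveal the question token by
# token and "buzz" once confidence crosses the threshold. `agent` and the
# question format are assumptions for illustration.

def run_tossup(agent, question_tokens, threshold=0.85):
    """Return (answer, buzz_token_position, confidence) for one question."""
    answer, confidence = None, 0.0
    for position in range(1, len(question_tokens) + 1):
        revealed = " ".join(question_tokens[:position])
        answer, confidence = agent(revealed)
        if confidence >= threshold:
            return answer, position, confidence
    # Never buzzed: fall back to the answer after the full question.
    return answer, len(question_tokens), confidence

def evaluate(agent, questions, threshold=0.85):
    """Compute accuracy, average buzz token position, and average confidence."""
    results = [run_tossup(agent, q["tokens"], threshold) for q in questions]
    accuracy = sum(ans == q["answer"] for (ans, _, _), q in zip(results, questions)) / len(questions)
    avg_buzz = sum(pos for _, pos, _ in results) / len(results)
    avg_conf = sum(conf for _, _, conf in results) / len(results)
    return accuracy, avg_buzz, avg_conf
```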

Enhancement 1: Basic Model Configuration

Current Performance

The default configuration uses gpt-4o-mini with a temperature of 0.7 and a buzzer confidence threshold of 0.85.
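
In rough configuration terms (field names here are illustrative assumptions, not the pipeline's exact schema), the default looks like this, with comments marking what this enhancement changes:

```python
# Illustrative default configuration; field names are assumptions.
config = {
    "model": "gpt-4o-mini",   # change to "gpt-4o" for a stronger model
    "temperature": 0.7,       # reduce to 0.1 for more consistent outputs
    "buzz_threshold": 0.85,   # left alone here; tuned in Enhancement 3
}
```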

Implementing the Enhancement

  1. Navigate to the "Tossup Agents" tab
  2. Select a stronger model (e.g., gpt-4o)
  3. Reduce temperature to 0.1 for more consistent outputs
  4. Test on sample questions

Validation

Run the agent on test questions and check:

  • Has accuracy improved?
  • Are confidence scores more consistent?
  • Is your agent buzzing earlier?

Results

With better model configuration:

  • Accuracy increases to ~80%
  • Average buzz position increases to 59.60

Enhancement 2: System Prompt Optimization

Current Performance

The default prompt lacks specific instructions for:

  • Answer formatting
  • Confidence calibration
  • Domain-specific knowledge

Implementing the Enhancement

  1. Click "System Prompt" tab
  2. Add specific instructions:
```
You are a professional quizbowl player answering tossup questions.

Your task:
1. Analyze clues in the question text
2. Determine the most likely answer
3. Assess confidence on a scale from 0.0 to 1.0

Important guidelines:
- Give answers in the expected format (person's full name, complete title, etc.)
- Use 0.8+ confidence ONLY when absolutely certain
- For literature, include author's full name
- For science, include complete technical terms
```
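
With instructions like these, a well-calibrated response might parse into something like the following (the exact schema is an assumption about how your pipeline extracts the answer and confidence):

```python
# Hypothetical parsed output; the real schema depends on your pipeline.
{"answer": "Gabriel García Márquez", "confidence": 0.9}
```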

Validation

Test on the same questions and check:

  • Are answers formatted more consistently?
  • Is confidence more accurately reflecting correctness?
  • Do the categories where you added domain knowledge show gains?

Results

With optimized prompts:

  • Accuracy increases to ~75%
  • Confidence scores align better with actual performance
  • Answer formats become more consistent

Enhancement 3: Confidence Calibration

Current Performance

Even with better prompts, confidence thresholds may be:

  • Too high (missing answerable questions)
  • Too low (buzzing incorrectly)

Implementing the Enhancement

  1. Scroll to "Buzzer Settings"
  2. Test different thresholds (0.7-0.9)
  3. Find optimal balance between:
    • Buzzing early enough to score points
    • Waiting for sufficient confidence

Buzzer Settings

Validation

For each threshold (a scripted sweep follows this list):

  1. Run tests on multiple questions
  2. Check percentage of correct buzzes
  3. Monitor average buzz position
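
A minimal sketch of that sweep, reusing the hypothetical `run_tossup` helper from the baseline section:

```python
# Sweep candidate thresholds and report the tradeoff between correct buzzes
# and average buzz position. Reuses the hypothetical `run_tossup` sketch.

def sweep_thresholds(agent, questions, thresholds=(0.70, 0.75, 0.80, 0.85, 0.90)):
    for t in thresholds:
        results = [run_tossup(agent, q["tokens"], threshold=t) for q in questions]
        correct = sum(ans == q["answer"] for (ans, _, _), q in zip(results, questions))
        avg_buzz = sum(pos for _, pos, _ in results) / len(results)
        print(f"threshold={t:.2f}  correct_buzzes={correct}/{len(questions)}  "
              f"avg_buzz_position={avg_buzz:.2f}")
```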

Results

With a calibrated threshold (e.g., 0.75):

  • Balance between accuracy and early buzzing
  • Fewer incorrect buzzes
  • Earlier correct buzzes

Enhancement 4: Multi-Step Pipeline

Current Performance

Single-step pipelines often struggle with:

  • Accurately separating answer generation from confidence estimation
  • Providing consistent performance across question types

Implementing the Enhancement

  1. Click "+ Add Step" to create a two-step pipeline:
    • Step A: Answer Generator
    • Step B: Confidence Evaluator
  2. Configure each step:
    • Step A focuses only on generating the best answer
    • Step B evaluates confidence based on the answer and question

Let's load the multi-step pipeline umdclip/two-step-justified-confidence, which implements exactly this design; the sketch below illustrates the idea. For more details on the pipeline, see Advanced Pipeline Examples.
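
Conceptually, the two steps decompose as follows. This is a minimal illustration assuming an OpenAI-style chat client with simplified prompts and parsing, not the actual implementation of umdclip/two-step-justified-confidence:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def generate_answer(question_text):
    # Step A: focus solely on producing the best answer.
    resp = client.chat.completions.create(
        model="gpt-4o",
        temperature=0.1,
        messages=[
            {"role": "system", "content": "You are a quizbowl player. Reply with only the answer."},
            {"role": "user", "content": question_text},
        ],
    )
    return resp.choices[0].message.content.strip()

def evaluate_confidence(question_text, answer):
    # Step B: judge how likely the candidate answer is correct, given the clues so far.
    resp = client.chat.completions.create(
        model="gpt-4o",
        temperature=0.1,
        messages=[
            {"role": "system", "content": "Given a partial tossup question and a candidate answer, reply with only a confidence between 0.0 and 1.0."},
            {"role": "user", "content": f"Question so far: {question_text}\nCandidate answer: {answer}"},
        ],
    )
    return float(resp.choices[0].message.content.strip())
```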

Validation

Test the multi-step pipeline and compare to single-step:

  • Does separation of concerns improve performance?
  • Are confidence scores more accurate?
  • Is there improvement in early buzz positions?

Final Evaluation and Submission

  1. Run comprehensive testing across categories (a per-category sketch follows this list)
  2. Verify metrics match your goals
  3. Export your pipeline configuration
  4. Submit your agent for official evaluation
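
For step 1, a per-category accuracy breakdown helps confirm that your gains are not concentrated in a single domain. Again a sketch, assuming the hypothetical `run_tossup` helper and question dicts carrying a "category" field:

```python
from collections import defaultdict

# Per-category accuracy (assumes each question dict has "category",
# "tokens", and "answer" keys).
def accuracy_by_category(agent, questions, threshold=0.75):
    totals, correct = defaultdict(int), defaultdict(int)
    for q in questions:
        ans, _, _ = run_tossup(agent, q["tokens"], threshold)
        totals[q["category"]] += 1
        correct[q["category"]] += ans == q["answer"]
    return {cat: correct[cat] / totals[cat] for cat in totals}
```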

For complete UI reference, see UI Reference.