# Building an Effective Tossup Agent

## Goals
By the end of this guide, you will:
- Create a tossup agent that answers questions accurately
- Calibrate confidence thresholds for optimal buzzing
- Test performance on sample questions
- Submit your agent for evaluation

## Baseline System Performance

Let's import the simple tossup agent pipeline `umdclip/simple-tossup-pipeline` and examine the configuration:

![Default Tossup Configuration](./imgs/tossup-agent-pipeline.png)

The baseline system achieves:
- Accuracy: ~30% on sample questions
- Average Buzz Token Position: 40.40
- Average Confidence: 0.65

We'll improve this through targeted enhancements.
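
As a point of reference, these three aggregates are easy to recompute from your own run logs. A minimal sketch, assuming hypothetical record fields `correct`, `buzz_position`, and `confidence` (not the arena's actual export schema):

```python
from statistics import mean

# Hypothetical run records; the field names are illustrative only.
runs = [
    {"correct": True, "buzz_position": 38, "confidence": 0.71},
    {"correct": False, "buzz_position": 45, "confidence": 0.62},
    {"correct": True, "buzz_position": 41, "confidence": 0.68},
]

accuracy = mean(r["correct"] for r in runs)        # fraction answered correctly
avg_buzz = mean(r["buzz_position"] for r in runs)  # mean token index at buzz time
avg_conf = mean(r["confidence"] for r in runs)     # mean reported confidence

print(f"Accuracy: {accuracy:.0%}, "
      f"Avg Buzz Position: {avg_buzz:.2f}, "
      f"Avg Confidence: {avg_conf:.2f}")
```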

## Enhancement 1: Basic Model Configuration

### Current Performance
The default configuration uses `gpt-4o-mini` with temperature `0.7` and a confidence threshold of `0.85` for the buzzer.

### Implementing the Enhancement
1. Navigate to the "Tossup Agents" tab
2. Select a stronger model (e.g., `gpt-4o`)
3. Reduce the temperature to 0.1 for more consistent outputs (see the sketch after this list)
4. Test on sample questions
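
If you were reproducing these settings outside the UI, a minimal sketch follows. It assumes the OpenAI Python SDK with an `OPENAI_API_KEY` in the environment; the system prompt here is a placeholder, not the pipeline's actual prompt.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def answer_tossup(question_text: str) -> str:
    """Query the stronger model with a low temperature for stable outputs."""
    response = client.chat.completions.create(
        model="gpt-4o",   # stronger model than the gpt-4o-mini default
        temperature=0.1,  # low temperature -> more consistent answers
        messages=[
            {"role": "system", "content": "You are a quizbowl tossup player."},
            {"role": "user", "content": question_text},
        ],
    )
    return response.choices[0].message.content
```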

### Validation
Run the agent on test questions and check:
- Has accuracy improved?
- Are confidence scores more consistent?
- How has the average buzz position shifted?

### Results
With the better model configuration:
- Accuracy increases to ~80%
- Average buzz token position rises to 59.60 (later than the baseline's 40.40, but the buzzes are far more often correct)

## Enhancement 2: System Prompt Optimization

### Current Performance
The default prompt lacks specific instructions for:
- Answer formatting
- Confidence calibration
- Domain-specific knowledge

### Implementing the Enhancement
1. Click the "System Prompt" tab
2. Add specific instructions:

```
You are a professional quizbowl player answering tossup questions.

Your task:
1. Analyze clues in the question text
2. Determine the most likely answer
3. Assess confidence on a scale from 0.0 to 1.0

Important guidelines:
- Give answers in the expected format (person's full name, complete title, etc.)
- Use 0.8+ confidence ONLY when absolutely certain
- For literature, include author's full name
- For science, include complete technical terms
```
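
For the buzzer to act on that 0.0 to 1.0 confidence, the model's reply has to be machine-readable. A minimal parsing sketch, assuming (hypothetically) that the prompt above is extended to request JSON output like `{"answer": ..., "confidence": ...}`:

```python
import json

# Hypothetical raw reply; assumes the prompt also instructs the model to
# respond as JSON. The exact schema is an assumption, not the pipeline's.
raw_reply = '{"answer": "Fyodor Dostoevsky", "confidence": 0.86}'

parsed = json.loads(raw_reply)
answer = parsed["answer"]
confidence = float(parsed["confidence"])

# Clamp defensively: the threshold logic downstream assumes [0.0, 1.0].
confidence = max(0.0, min(1.0, confidence))
print(answer, confidence)
```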

### Validation
Test on the same questions and check:
- Are answers formatted more consistently?
- Does confidence more accurately reflect correctness?
- Do the categories where you added domain knowledge show gains?

### Results
With optimized prompts:
- Accuracy increases to ~75%
- Confidence scores align better with actual performance
- Answer formats become more consistent

## Enhancement 3: Confidence Calibration

### Current Performance
Even with better prompts, confidence thresholds may be:
- Too high (missing answerable questions)
- Too low (buzzing incorrectly)

### Implementing the Enhancement
1. Scroll to the "Buzzer Settings" section
2. Test different thresholds (0.7-0.9)
3. Find optimal balance between:
   - Buzzing early enough to score points
   - Waiting for sufficient confidence

![Buzzer Settings](./imgs/buzzer-settings.png)

### Validation
For each candidate threshold (see the sweep sketch after this list):
1. Run tests on multiple questions
2. Check percentage of correct buzzes
3. Monitor average buzz position
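
You can also run this sweep offline. A minimal simulation sketch over hypothetical per-token confidence traces (real traces would come from your own test runs):

```python
from statistics import mean

def simulate_buzz(confidences, correct, threshold):
    """Return (buzz_position, was_correct) for a single question.

    `confidences` holds the model's confidence after each revealed token;
    the agent buzzes at the first position where confidence >= threshold.
    """
    for position, conf in enumerate(confidences):
        if conf >= threshold:
            return position, correct
    return len(confidences), False  # never buzzed: scored as incorrect

# Hypothetical traces; the values are illustrative only.
questions = [
    {"confidences": [0.20, 0.50, 0.80, 0.90], "correct": True},
    {"confidences": [0.40, 0.60, 0.70, 0.95], "correct": False},
    {"confidences": [0.30, 0.75, 0.85, 0.90], "correct": True},
]

for threshold in (0.70, 0.75, 0.80, 0.85, 0.90):
    outcomes = [simulate_buzz(q["confidences"], q["correct"], threshold)
                for q in questions]
    acc = mean(correct for _, correct in outcomes)
    avg_pos = mean(pos for pos, _ in outcomes)
    print(f"threshold={threshold:.2f}  accuracy={acc:.0%}  avg_buzz={avg_pos:.1f}")
```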

### Results
With a calibrated threshold (e.g., 0.75):
- A better balance between accuracy and early buzzing
- Fewer incorrect buzzes
- Earlier correct buzzes

## Enhancement 4: Multi-Step Pipeline

### Current Performance
Single-step pipelines often struggle with:
- Accurately separating answer generation from confidence estimation
- Providing consistent performance across question types

### Implementing the Enhancement
1. Click "+ Add Step" to create a two-step pipeline:
   - Step A: Answer Generator
   - Step B: Confidence Evaluator
2. Configure each step:
   - Step A focuses only on generating the best answer
   - Step B evaluates confidence based on the answer and question

Let's load the multi-step pipeline `umdclip/two-step-justified-confidence`, which implements exactly this split.
For more details on the pipeline, see [Advanced Pipeline Examples](./advanced-pipeline-examples.md).
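
The actual wiring of `umdclip/two-step-justified-confidence` is documented in the linked examples; purely to illustrate the separation of concerns, here is a hedged two-call sketch (assuming the OpenAI Python SDK; both prompts are placeholders, not the pipeline's own):

```python
from openai import OpenAI

client = OpenAI()

def generate_answer(question_text: str) -> str:
    """Step A: produce the best answer, with no confidence bookkeeping."""
    response = client.chat.completions.create(
        model="gpt-4o",
        temperature=0.1,
        messages=[
            {"role": "system",
             "content": "Answer the quizbowl tossup. Output only the answer."},
            {"role": "user", "content": question_text},
        ],
    )
    return response.choices[0].message.content.strip()

def evaluate_confidence(question_text: str, answer: str) -> float:
    """Step B: judge how well the revealed clues support the answer."""
    response = client.chat.completions.create(
        model="gpt-4o",
        temperature=0.0,
        messages=[
            {"role": "system",
             "content": "Given a tossup and a proposed answer, output only "
                        "a confidence between 0.0 and 1.0."},
            {"role": "user",
             "content": f"Question: {question_text}\nProposed answer: {answer}"},
        ],
    )
    return float(response.choices[0].message.content.strip())

question = "This Russian author wrote about Raskolnikov..."
answer = generate_answer(question)
confidence = evaluate_confidence(question, answer)
print(answer, confidence)
```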

### Validation
Test the multi-step pipeline and compare it to the single-step version:
- Does the separation of concerns improve performance?
- Are confidence scores more accurate?
- Do correct buzzes come earlier?

## Final Evaluation and Submission

1. Run comprehensive testing across categories
2. Verify that the metrics meet your goals
3. Export your pipeline configuration
4. Submit your agent for official evaluation

For complete UI reference, see [UI Reference](./ui-reference.md).