Building an Effective Tossup Agent

Goals

By the end of this guide, you will:

  • Create a tossup agent that answers questions accurately
  • Calibrate confidence thresholds for optimal buzzing
  • Test performance on sample questions
  • Submit your agent for evaluation

Baseline System Performance

Let's load the simple tossup agent pipeline umdclip/simple-tossup-pipeline and examine its configuration:

Default Tossup Configuration

The baseline system achieves:

  • Accuracy: ~30% on sample questions
  • Average Buzz Token Position: 40.40
  • Average Confidence: 0.65

We'll improve this through targeted enhancements.
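
First, to make these metrics concrete: evaluation reveals the question token by token, and the agent buzzes once its confidence crosses a threshold. The sketch below is a minimal illustration of that loop, assuming a hypothetical `agent(text) -> (answer, confidence)` callable and a list of question dicts, rather than the actual pipeline API:

```python
# Hypothetical sketch of tossup evaluation: reveal the question token by
# token and "buzz" once confidence crosses the threshold. `agent` and the
# question format are assumptions for illustration.

def run_tossup(agent, question_tokens, threshold=0.85):
    """Return (answer, buzz_token_position, confidence) for one question."""
    answer, confidence = None, 0.0
    for position in range(1, len(question_tokens) + 1):
        revealed = " ".join(question_tokens[:position])
        answer, confidence = agent(revealed)
        if confidence >= threshold:
            return answer, position, confidence
    # Never buzzed: fall back to the answer after the full question.
    return answer, len(question_tokens), confidence

def evaluate(agent, questions, threshold=0.85):
    """Compute accuracy, average buzz token position, and average confidence."""
    results = [run_tossup(agent, q["tokens"], threshold) for q in questions]
    accuracy = sum(ans == q["answer"] for (ans, _, _), q in zip(results, questions)) / len(questions)
    avg_buzz = sum(pos for _, pos, _ in results) / len(results)
    avg_conf = sum(conf for _, _, conf in results) / len(results)
    return accuracy, avg_buzz, avg_conf
```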

Enhancement 1: Basic Model Configuration

Current Performance

The default configuration uses gpt-4o-mini with a temperature of 0.7 and a buzzer confidence threshold of 0.85.
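
In rough configuration terms (field names here are illustrative assumptions, not the pipeline's exact schema), the default looks like this, with comments marking what this enhancement changes:

```python
# Illustrative default configuration; field names are assumptions.
config = {
    "model": "gpt-4o-mini",   # change to "gpt-4o" for a stronger model
    "temperature": 0.7,       # reduce to 0.1 for more consistent outputs
    "buzz_threshold": 0.85,   # left alone here; tuned in Enhancement 3
}
```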

Implementing the Enhancement

  1. Navigate to the "Tossup Agents" tab
  2. Select a stronger model (e.g., gpt-4o)
  3. Reduce temperature to 0.1 for more consistent outputs
  4. Test on sample questions

Validation

Run the agent on test questions and check:

  • Has accuracy improved?
  • Are confidence scores more consistent?
  • Is your agent buzzing earlier?

Results

With better model configuration:

  • Accuracy increases to ~80%
  • Average buzz position increases to 59.60

Enhancement 2: System Prompt Optimization

Current Performance

The default prompt lacks specific instructions for:

  • Answer formatting
  • Confidence calibration
  • Domain-specific knowledge

Implementing the Enhancement

  1. Click "System Prompt" tab
  2. Add specific instructions:
```
You are a professional quizbowl player answering tossup questions.

Your task:
1. Analyze clues in the question text
2. Determine the most likely answer
3. Assess confidence on a scale from 0.0 to 1.0

Important guidelines:
- Give answers in the expected format (person's full name, complete title, etc.)
- Use 0.8+ confidence ONLY when absolutely certain
- For literature, include author's full name
- For science, include complete technical terms
```
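
With instructions like these, a well-calibrated response might parse into something like the following (the exact schema is an assumption about how your pipeline extracts the answer and confidence):

```python
# Hypothetical parsed output; the real schema depends on your pipeline.
{"answer": "Gabriel García Márquez", "confidence": 0.9}
```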

Validation

Test on the same questions and check:

  • Are answers formatted more consistently?
  • Is confidence more accurately reflecting correctness?
  • Do the categories where you added domain knowledge show gains?

Results

With optimized prompts:

  • Accuracy increases to ~75%
  • Confidence scores align better with actual performance
  • Answer formats become more consistent

Enhancement 3: Confidence Calibration

Current Performance

Even with better prompts, confidence thresholds may be:

  • Too high (missing answerable questions)
  • Too low (buzzing incorrectly)

Implementing the Enhancement

  1. Scroll to "Buzzer Settings"
  2. Test different thresholds (0.7-0.9)
  3. Find optimal balance between:
    • Buzzing early enough to score points
    • Waiting for sufficient confidence

Buzzer Settings

Validation

For each threshold (a scripted sweep follows this list):

  1. Run tests on multiple questions
  2. Check percentage of correct buzzes
  3. Monitor average buzz position
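
A minimal sketch of that sweep, reusing the hypothetical `run_tossup` helper from the baseline section:

```python
# Sweep candidate thresholds and report the tradeoff between correct buzzes
# and average buzz position. Reuses the hypothetical `run_tossup` sketch.

def sweep_thresholds(agent, questions, thresholds=(0.70, 0.75, 0.80, 0.85, 0.90)):
    for t in thresholds:
        results = [run_tossup(agent, q["tokens"], threshold=t) for q in questions]
        correct = sum(ans == q["answer"] for (ans, _, _), q in zip(results, questions))
        avg_buzz = sum(pos for _, pos, _ in results) / len(results)
        print(f"threshold={t:.2f}  correct_buzzes={correct}/{len(questions)}  "
              f"avg_buzz_position={avg_buzz:.2f}")
```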

Results

With a calibrated threshold (e.g., 0.75):

  • Balance between accuracy and early buzzing
  • Fewer incorrect buzzes
  • Earlier correct buzzes

Enhancement 4: Multi-Step Pipeline

Current Performance

Single-step pipelines often struggle with:

  • Accurately separating answer generation from confidence estimation
  • Providing consistent performance across question types

Implementing the Enhancement

  1. Click "+ Add Step" to create a two-step pipeline:
    • Step A: Answer Generator
    • Step B: Confidence Evaluator
  2. Configure each step:
    • Step A focuses only on generating the best answer
    • Step B evaluates confidence based on the answer and question

Let's load the multi-step pipeline umdclip/two-step-justified-confidence, which implements exactly this design; the sketch below illustrates the idea. For more details on the pipeline, see Advanced Pipeline Examples.
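
Conceptually, the two steps decompose as follows. This is a minimal illustration assuming an OpenAI-style chat client with simplified prompts and parsing, not the actual implementation of umdclip/two-step-justified-confidence:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def generate_answer(question_text):
    # Step A: focus solely on producing the best answer.
    resp = client.chat.completions.create(
        model="gpt-4o",
        temperature=0.1,
        messages=[
            {"role": "system", "content": "You are a quizbowl player. Reply with only the answer."},
            {"role": "user", "content": question_text},
        ],
    )
    return resp.choices[0].message.content.strip()

def evaluate_confidence(question_text, answer):
    # Step B: judge how likely the candidate answer is correct, given the clues so far.
    resp = client.chat.completions.create(
        model="gpt-4o",
        temperature=0.1,
        messages=[
            {"role": "system", "content": "Given a partial tossup question and a candidate answer, reply with only a confidence between 0.0 and 1.0."},
            {"role": "user", "content": f"Question so far: {question_text}\nCandidate answer: {answer}"},
        ],
    )
    return float(resp.choices[0].message.content.strip())
```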

Validation

Test the multi-step pipeline and compare to single-step:

  • Does separation of concerns improve performance?
  • Are confidence scores more accurate?
  • Is there improvement in early buzz positions?

Final Evaluation and Submission

  1. Run comprehensive testing across categories (a per-category sketch follows this list)
  2. Verify metrics match your goals
  3. Export your pipeline configuration
  4. Submit your agent for official evaluation
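
For step 1, a per-category accuracy breakdown helps confirm that your gains are not concentrated in a single domain. Again a sketch, assuming the hypothetical `run_tossup` helper and question dicts carrying a "category" field:

```python
from collections import defaultdict

# Per-category accuracy (assumes each question dict has "category",
# "tokens", and "answer" keys).
def accuracy_by_category(agent, questions, threshold=0.75):
    totals, correct = defaultdict(int), defaultdict(int)
    for q in questions:
        ans, _, _ = run_tossup(agent, q["tokens"], threshold)
        totals[q["category"]] += 1
        correct[q["category"]] += ans == q["answer"]
    return {cat: correct[cat] / totals[cat] for cat in totals}
```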

For complete UI reference, see UI Reference.