Maharshi Gor
committed on
Commit · 0f6850b
1 Parent(s): 3a1af80
Added better documentation
Browse files
- docs/README.md +22 -0
- docs/advanced-pipeline-examples.md +182 -73
- docs/goals-and-evaluation.md +60 -0
- docs/imgs/bonus-output-panel.png +3 -0
- docs/imgs/import-pipeline.png +3 -0
- docs/imgs/pipeline-preview.png +3 -0
- docs/imgs/tossup-agent-pipeline.png +3 -0
- docs/tossup-agent-guide.md +143 -0
- docs/ui-reference.md +147 -0
docs/README.md
ADDED
@@ -0,0 +1,22 @@
# Quizbowl Agent Documentation

## Overview

This documentation helps you build effective quizbowl agents for tossup and bonus questions.

## Documentation Files

- **[Goals and Evaluation](./goals-and-evaluation)**: Objectives and metrics for your agents across different competition modes
- **[Tossup Agent Guide](./tossup-agent-guide)**: Step-by-step guide to building tossup agents
- **[Bonus Agent Guide](./bonus-agent-guide)**: Step-by-step guide to building bonus agents
- **[Advanced Pipeline Examples](./advanced-pipeline-examples)**: Complex pipeline configurations with examples
- **[UI Reference](./ui-reference)**: Complete reference for the web interface
- **[Best Practices](./best-practices)**: Tips and strategies for optimal performance

## Getting Started

1. Read [Goals and Evaluation](./goals-and-evaluation) to understand success metrics
2. Follow the [Tossup Agent Guide](./tossup-agent-guide) to build your first agent
3. Test and iterate using the validation steps
4. Submit your agent for evaluation

For questions, check the GitHub repository issues or contact the competition organizers.
docs/advanced-pipeline-examples.md
CHANGED
@@ -1,73 +1,182 @@
# Advanced Pipeline Examples

This guide shows how to implement and validate sophisticated multi-step pipelines for quizbowl agents.

## Goals

Using advanced pipelines, you will:
- Improve accuracy by 15-25% over single-step agents
- Create specialized components for different tasks
- Implement effective confidence calibration
- Build robust buzzer strategies

## Two-Step Justified Confidence Pipeline

### Baseline Performance

Standard single-step agents typically achieve:
- Accuracy: ~65-70%
- Poorly calibrated confidence
- Limited explanation for answers

### Loading the Pipeline Example

1. Navigate to the "Tossup Agents" tab
2. Click "Select Pipeline to Import..." and choose "two-step-justified-confidence.yaml"
3. Click "Import Pipeline"

### Understanding the Pipeline Structure

This pipeline has two distinct steps:

#### Step A: Answer Generator
- Uses OpenAI/gpt-4o-mini
- Takes question text as input
- Generates an answer candidate
- Focuses solely on accurate answer generation

#### Step B: Confidence Evaluator
- Uses Cohere/command-r-plus
- Takes question text AND the generated answer from Step A
- Evaluates confidence and provides justification
- Specialized for confidence assessment
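For orientation, here is a minimal sketch of what the exported YAML for such a two-step pipeline might look like. The structure and field names (`steps`, `inputs`, `outputs`, `buzzer`, and so on) are illustrative assumptions, not the app's exact export schema; export a working pipeline from the UI to see the real format.

```yaml
# Hypothetical sketch of an exported two-step pipeline (field names are assumptions).
steps:
  - id: A
    name: Answer Generator
    model: OpenAI/gpt-4o-mini
    temperature: 0.3
    system_prompt: |
      You are a professional quizbowl player. Read the partial question text
      and produce the single most likely answer.
    inputs:
      - variable: question_text   # pipeline variable fed to this step
        name: question
    outputs:
      - name: answer
        type: str
  - id: B
    name: Confidence Evaluator
    model: Cohere/command-r-plus
    temperature: 0.0
    system_prompt: |
      You are an expert confidence evaluator for quizbowl answers. Judge only
      the correctness of the provided answer and justify your score.
    inputs:
      - variable: question_text
        name: question
      - variable: A.answer        # output of Step A
        name: answer
    outputs:
      - name: confidence
        type: float
      - name: justification
        type: str
buzzer:
  confidence_threshold: 0.85
  method: AND
```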
### Validation

Test the pipeline and check:
- Is accuracy improved compared to single-step?
- Are confidence scores better calibrated?
- Does the justification explain the reasoning clearly?

### Results

Two-step justified confidence typically achieves:
- Accuracy: ~80-85%
- Well-calibrated confidence scores
- Clear justification for answers and confidence
- More strategic buzzing

## Enhancing the Two-Step Pipeline

### Step 1: Upgrade Answer Generator

#### Current Performance

The default example uses gpt-4o-mini, which may lack:
- Specialized knowledge in some areas
- Consistent answer formatting

#### Implementation

1. Click on Step A
2. Change the model to a stronger option (e.g., gpt-4o)
3. Modify the system prompt to focus on answer precision

#### Validation

Test with sample questions and check:
- Has answer accuracy improved?
- Is formatting more consistent?

#### Results

With the upgraded answer generator:
- Accuracy increases to ~85-90%
- More consistent answer formats

### Step 2: Improve Confidence Evaluator

#### Current Performance

The default evaluator may:
- Over-estimate confidence on some topics
- Provide limited justification

#### Implementation

1. Click on Step B
2. Enhance the system prompt:

```
You are an expert confidence evaluator for quizbowl answers.

Your task:
1. Evaluate ONLY the correctness of the provided answer
2. Consider question completeness and available clues
3. Provide specific justification for your confidence score
4. Be especially critical of answers with limited supporting evidence

Remember:
- Early, difficult clues justify lower confidence
- Later, obvious clues justify higher confidence
- Domain expertise should be reflected in your assessment
```

#### Validation

Test and verify:
- Are confidence scores better aligned with correctness?
- Does the justification include specific clues from the question?
- Is confidence calibrated appropriately for question position?

#### Results

With the improved evaluator:
- More accurate confidence calibration
- Detailed justifications citing specific clues
- Better buzzing decisions

## Three-Step Pipeline with Analysis

### Concept

Adding a dedicated analysis step before answer generation:

1. **Step A: Question Analyzer**
   - Identifies key clues, entities, and relationships
   - Determines question category and format

2. **Step B: Answer Generator**
   - Uses the analysis to generate accurate answers
   - Focuses on formatting and precision

3. **Step C: Confidence Evaluator**
   - Assesses answer quality based on analysis and clues
   - Determines optimal buzz timing

### Implementation

Create this pipeline from scratch or modify the two-step example; an illustrative fragment is sketched below.
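With the same caveat as before (the schema shown is an assumption, not the app's exact export format), the added analyzer step and its wiring into the generator could look roughly like this:

```yaml
# Hypothetical fragment: a Question Analyzer step feeding the Answer Generator.
steps:
  - id: A
    name: Question Analyzer
    model: OpenAI/gpt-4o-mini
    temperature: 0.0
    system_prompt: |
      Extract the key clues, named entities, likely category, and the answer
      format the question asks for. Do NOT answer the question.
    inputs:
      - variable: question_text
        name: question
    outputs:
      - name: key_clues
        type: str
      - name: category
        type: str
  - id: B
    name: Answer Generator
    inputs:
      - variable: question_text
        name: question
      - variable: A.key_clues     # analysis produced by Step A
        name: clues
    outputs:
      - name: answer
        type: str
  # Step C (Confidence Evaluator) is the same as in the two-step example,
  # optionally also reading A.key_clues.
```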
### Validation

Compare to the two-step pipeline:
- Does the analysis step improve answer accuracy?
- Does it provide better performance on difficult questions?
- Are there improvements in early buzzing?

### Results

Three-step pipelines typically achieve:
- Accuracy: ~90-95%
- Earlier correct buzzes
- Exceptional performance on difficult questions

## Specialty Pipeline: Literature Focus

### Concept

Create a pipeline specialized for literature questions:

1. **Step A: Literary Analyzer**
   - Identifies literary techniques, periods, and styles
   - Recognizes author-specific clues

2. **Step B: Answer Generator**
   - Specialized for literary works and authors
   - Formats answers according to literary conventions

3. **Step C: Confidence Evaluator**
   - Calibrated specifically for literature questions

### Implementation

Create specialized system prompts for each step focusing on literary knowledge; one possible analyzer prompt is sketched below.
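As a starting point (illustrative only; tune it against your own test questions), the Literary Analyzer's system prompt might read:

```
You are a literary analyst for quizbowl tossup questions.

Your task:
1. Identify the literary period, movement, genre, and techniques referenced
2. Extract character names, plot details, quotations, and translated titles
3. Note clues that point to a specific author or work
4. Do NOT answer the question; output only the structured analysis

Remember:
- Early clues often cite lesser-known works or critical commentary
- Later clues usually name famous works, characters, or opening lines
```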
### Validation

Test specifically on literature questions and compare to the general pipeline.

### Results

Specialty pipelines can achieve:
- 95%+ accuracy in their specialized domain
- Earlier buzzing on category-specific questions
- Better performance on difficult clues

## Best Practices for Advanced Pipelines

1. **Focused Components**: Each step should have a clear, single responsibility
2. **Efficient Communication**: Pass only necessary information between steps
3. **Strong Fundamentals**: Start with a solid two-step pipeline before adding complexity
4. **Consistent Testing**: Validate each change against the same test set
5. **Strategic Model Selection**: Use different models for tasks where they excel
docs/goals-and-evaluation.md
ADDED
@@ -0,0 +1,60 @@
# Quizbowl Agent Goals and Evaluation

## Objectives

### Tossup Agents
- Respond to questions with a best-guess answer and a calibrated confidence score
- Buzz at the earliest possible moment with sufficient information
- Avoid incorrect buzzes
- Maintain consistent performance across topics

### Bonus Agents
- Answer parts correctly with accurate confidence estimation
- Provide a clear explanation of the reasoning, which human team members use to validate or pick the suggested answer
- Adapt to varying difficulty levels (easy, medium, hard)

## Performance Metrics

### Tossup Metrics
- **Accuracy**: Percentage of correct answers
- **Average Buzz Position**: How early in the question you buzz (earlier is better)
- **Confidence Calibration**: How well the confidence score matches actual performance
- **Score**: Points earned based on buzz position and correctness

### Bonus Metrics
- **Accuracy**: Percentage of correct answers across all parts
- **Confidence Calibration**: How well the confidence score matches actual performance
- **Explanation Quality**: Relevance and clarity of reasoning
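One simple way to check calibration on your own test runs (an informal sanity check, not necessarily the metric used by the official evaluator) is to bin predictions by confidence and compare each bin's average confidence to its accuracy, e.g., via expected calibration error:

$$\mathrm{ECE} = \sum_{b=1}^{B} \frac{n_b}{N}\,\bigl|\mathrm{acc}(b) - \mathrm{conf}(b)\bigr|$$

where $n_b$ is the number of predictions in confidence bin $b$, $\mathrm{acc}(b)$ is the fraction of those that are correct, and $\mathrm{conf}(b)$ is the bin's mean confidence. For example, if answers given with confidence around 0.8 are only correct 55% of the time, the agent is over-confident and its prompt or buzz threshold should be adjusted.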
|
31 |
+
### Testing Baseline Performance
|
32 |
+
1. Run the default agent configuration
|
33 |
+
2. Record metrics (accuracy, confidence, buzz position)
|
34 |
+
3. Identify specific weaknesses in performance
|
35 |
+
|
36 |
+
### Validating Improvements
|
37 |
+
After each enhancement:
|
38 |
+
1. Run the agent on the same development set of questions
|
39 |
+
2. Compare metrics to previous version
|
40 |
+
3. Check for improvements in weak areas
|
41 |
+
|
42 |
+
### Final Evaluation Criteria
|
43 |
+
Your final agent will be evaluated on:
|
44 |
+
1. Overall accuracy across diverse questions
|
45 |
+
2. Optimal buzz timing (neither too early nor too late)
|
46 |
+
3. Confidence threshold calibration
|
47 |
+
4. Explanation quality (for bonus agents)
|
48 |
+
|
49 |
+
<!-- ## Setting Goals for Your Agent
|
50 |
+
|
51 |
+
### Minimum Goals
|
52 |
+
- Accuracy above 60%
|
53 |
+
- Appropriate confidence threshold (0.7-0.9)
|
54 |
+
- Reasonable buzz positions
|
55 |
+
|
56 |
+
### Advanced Goals
|
57 |
+
- Multi-step pipelines with specialized components
|
58 |
+
- Accuracy above 85%
|
59 |
+
- Strategic early buzzing on familiar topics
|
60 |
+
- Detailed, accurate explanations for bonus questions -->
|
docs/imgs/bonus-output-panel.png
ADDED
Git LFS Details

docs/imgs/import-pipeline.png
ADDED
Git LFS Details

docs/imgs/pipeline-preview.png
ADDED
Git LFS Details

docs/imgs/tossup-agent-pipeline.png
ADDED
Git LFS Details
docs/tossup-agent-guide.md
ADDED
@@ -0,0 +1,143 @@
# Building an Effective Tossup Agent

## Goals
By the end of this guide, you will:
- Create a tossup agent that answers questions accurately
- Calibrate confidence thresholds for optimal buzzing
- Test performance on sample questions
- Submit your agent for evaluation

## Baseline System Performance

Let's import the simple tossup agent pipeline `umdclip/simple-tossup-pipeline` and examine the configuration:



The baseline system achieves:
- Accuracy: ~30% on sample questions
- Average Buzz Token Position: 40.40
- Average Confidence: 0.65

We'll improve this through targeted enhancements.

## Enhancement 1: Basic Model Configuration

### Current Performance
The default configuration uses `gpt-4o-mini` with temperature `0.7` and a buzzer confidence threshold of `0.85`.

### Implementing the Enhancement
1. Navigate to the "Tossup Agents" tab
2. Select a stronger model (e.g., gpt-4o)
3. Reduce temperature to 0.1 for more consistent outputs
4. Test on sample questions

### Validation
Run the agent on test questions and check:
- Has accuracy improved?
- Are confidence scores more consistent?
- Is your agent buzzing earlier?

### Results
With the better model configuration:
- Accuracy increases to ~80%
- Average buzz position increases to 59.60

## Enhancement 2: System Prompt Optimization

### Current Performance
The default prompt lacks specific instructions for:
- Answer formatting
- Confidence calibration
- Domain-specific knowledge

### Implementing the Enhancement
1. Click the "System Prompt" tab
2. Add specific instructions:

```
You are a professional quizbowl player answering tossup questions.

Your task:
1. Analyze clues in the question text
2. Determine the most likely answer
3. Assess confidence on a scale from 0.0 to 1.0

Important guidelines:
- Give answers in the expected format (person's full name, complete title, etc.)
- Use 0.8+ confidence ONLY when absolutely certain
- For literature, include the author's full name
- For science, include complete technical terms
```

### Validation
Test on the same questions and check:
- Are answers formatted more consistently?
- Does confidence more accurately reflect correctness?
- Check specific categories where you added domain knowledge

### Results
With optimized prompts:
- Accuracy increases to ~75%
- Confidence scores align better with actual performance
- Answer formats become more consistent

## Enhancement 3: Confidence Calibration

### Current Performance
Even with better prompts, the confidence threshold may be:
- Too high (missing answerable questions)
- Too low (buzzing incorrectly)

### Implementing the Enhancement
1. Scroll to "Buzzer Settings"
2. Test different thresholds (0.7-0.9)
3. Find the optimal balance between:
   - Buzzing early enough to score points
   - Waiting for sufficient confidence


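To make the tradeoff concrete, here is a purely hypothetical illustration (these numbers are not from real runs): at a threshold of 0.9 the agent might buzz on only 60% of questions but answer 95% of those correctly, while at 0.7 it might buzz on 85% of questions with only 75% precision. Sweep the threshold on the same development questions and pick the value that maximizes your overall score rather than raw accuracy alone.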
### Validation
For each threshold:
1. Run tests on multiple questions
2. Check the percentage of correct buzzes
3. Monitor the average buzz position

### Results
With a calibrated threshold (e.g., 0.75):
- A balance between accuracy and early buzzing
- Fewer incorrect buzzes
- Earlier correct buzzes

## Enhancement 4: Multi-Step Pipeline

### Current Performance
Single-step pipelines often struggle with:
- Accurately separating answer generation from confidence estimation
- Providing consistent performance across question types

### Implementing the Enhancement
1. Click "+ Add Step" to create a two-step pipeline:
   - Step A: Answer Generator
   - Step B: Confidence Evaluator
2. Configure each step:
   - Step A focuses only on generating the best answer
   - Step B evaluates confidence based on the answer and question

Let's load a multi-step pipeline, `umdclip/two-step-justified-confidence`, that does exactly this.
For more details on the pipeline, see [Advanced Pipeline Examples](./advanced-pipeline-examples.md).

### Validation
Test the multi-step pipeline and compare it to the single-step version:
- Does the separation of concerns improve performance?
- Are confidence scores more accurate?
- Is there improvement in early buzz positions?

## Final Evaluation and Submission

1. Run comprehensive testing across categories
2. Verify metrics match your goals
3. Export your pipeline configuration
4. Submit your agent for official evaluation

For a complete UI reference, see [UI Reference](./ui-reference.md).
docs/ui-reference.md
ADDED
@@ -0,0 +1,147 @@
# Quizbowl Agent Web Interface Reference

This guide explains all elements of the web interface for creating and testing quizbowl agents.

## Navigation

The interface has four main tabs:
- **Tossup Agents**: Create and test agents for tossup questions
- **Bonus Round Agents**: Create and test agents for bonus questions
- **Leaderboard**: View the leaderboard of submitted agents
- **Help**: Access documentation and support resources

## Pipeline Creation Components

Let's walk through the components of the Tossup Agent pipeline creation interface.



### Model Step Management

A model step is a single LLM call in the pipeline. Your pipeline can have multiple model steps.
- **+ Add Step**: Adds a new step to your pipeline
- **Step ID**: Unique identifier for each step (A, B, C, etc.)
- **Step Name**: Descriptive name for the step
- Available when the pipeline has more than one model step:
  - **Delete Step** (×): Removes a step from the pipeline
  - **Move Up** (↑): Moves a step up in the pipeline
  - **Move Down** (↓): Moves a step down in the pipeline

### Model Selection

- **Model Dropdown**: Select the language model provider and model
- **Temperature Slider**: Adjust randomness of outputs (0.0-1.0)
  - Lower values (0.1-0.3): More consistent, deterministic outputs
  - Higher values (0.7-1.0): More creative, varied outputs

### System Prompt

- **System Prompt Tab**: Contains instructions for the model
- **Text Editor**: Edit instructions directly; unfocus the editor to apply your changes to the system prompt

### Input/Output Configuration

#### Inputs Tab



- **Variable Used**: Reference name in the pipeline (e.g., question_text)
- **Input Name**: Name the model sees (e.g., question)
- **Description**: Explains the input's purpose
- **+ Button**: Adds a new input variable
- **× Button**: Removes an input variable

#### Outputs Tab



- **Output Field**: Name of the output variable (e.g., answer)
- **Type Dropdown**: Data type (str, float, list, bool)
- **Description**: Explains what the output represents
- **Arrow Buttons**: Change the output order
- **+ Button**: Adds a new output
- **× Button**: Removes an output

### Output Panel



#### Output Variables

Tossup agents are required to collect the following output variables:
- `answer`: The answer to the input question
- `confidence`: The confidence score of the answer

#### Buzzer Settings (For Tossup Agents)

- **Confidence Threshold**: Minimum value of the `confidence` output variable required to consider a buzz (0.0-1.0)
- **Buzz Probability**: Minimum value of the normalized probability of the output tokens from the LLM, computed from the `logprobs` of the output tokens: $p(y \mid x) = \exp\left(\sum_{y_i \in y} \operatorname{logprob}(y_i)\right)$. Note that only some models support `logprobs`.
- **Method Dropdown**:
  - AND: Both conditions must be true to buzz
  - OR: Either condition can trigger a buzz
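As a worked illustration of the buzz probability (hypothetical numbers): if the generated answer consists of three tokens with log-probabilities -0.1, -0.2, and -0.05, then

$$p(y \mid x) = \exp(-0.1 - 0.2 - 0.05) = \exp(-0.35) \approx 0.70,$$

so with a Buzz Probability setting of 0.8 this condition is not met, and under the AND method the agent would not buzz even if its reported `confidence` exceeded the Confidence Threshold.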
## Testing Components

### Question Selection

- **Question ID**: Enter an ID to load a specific question
- **Sample Question**: Use a provided sample
- **Run Button**: Process the question with the current pipeline

### Results Visualization

#### Tossup Visualization



- **Highlighted Question Text**:
  - Highlighted tokens mark the positions at which the model is probed with the question text up to that point
  - Gray/green/red highlighting indicates whether the model has buzzed, buzzed correctly, or buzzed incorrectly
  - Hover for answer/confidence details

- **Answer Popup**:
  - Shows the final answer
  - Displays the confidence score
  - Indicates correctness

- **Buzz Confidence Graph**:
  - X-axis: Token position
  - Y-axis: Confidence (0.0-1.0)
  - Blue line: Confidence progression

#### Bonus Visualization

- **Question Display**: Shows the leadin and parts
- **Results Table**:
  - Part number
  - Correctness indicator
  - Confidence score
  - Prediction
  - Explanation

## Pipeline Management

### Import/Export


- **Select Pipeline to Import** dropdown: Load an existing pipeline configuration
- **Import Pipeline**: Apply the selected pipeline configuration


- **Export Pipeline**: Save the configuration as YAML
- **Pipeline Preview**: View and edit the pipeline configuration in YAML format

### Evaluation and Submission

- **Evaluate**: Run a comprehensive assessment
- **Model Name**: Name for the submission
- **Description**: Details about your agent
- **Sign in with Hugging Face**: Authentication
- **Submit**: Submit the agent for official evaluation

## Tips for Effective Use

- Use the system prompt to give clear instructions
- Test different confidence thresholds to find optimal settings
- Monitor buzz positions in the visualization
- Examine confidence trends to identify problem areas
- Use multi-step pipelines for complex tasks