Maharshi Gor committed
Commit 0f6850b · 1 Parent(s): 3a1af80

Added better documentation
docs/README.md ADDED
@@ -0,0 +1,22 @@
1
+ # Quizbowl Agent Documentation
2
+
3
+ ## Overview
4
+ This documentation helps you build effective quizbowl agents for tossup and bonus questions.
5
+
6
+ ## Documentation Files
7
+
8
+ - **[Goals and Evaluation](./goals-and-evaluation)**: Objectives and metrics for your agents across different competition modes.
9
+ - **[Tossup Agent Guide](./tossup-agent-guide)**: Step-by-step guide to build tossup agents
10
+ - **[Bonus Agent Guide](./bonus-agent-guide)**: Step-by-step guide to build bonus agents
11
+ - **[Advanced Pipeline Examples](./advanced-pipeline-examples)**: Complex pipeline configurations with examples
12
+ - **[UI Reference](./ui-reference)**: Complete reference for the web interface
13
+ - **[Best Practices](./best-practices)**: Tips and strategies for optimal performance
14
+
15
+ ## Getting Started
16
+
17
+ 1. Read [Goals and Evaluation](./goals-and-evaluation) to understand success metrics
18
+ 2. Follow [Tossup Agent Guide](./tossup-agent-guide) to build your first agent
19
+ 3. Test and iterate using the validation steps
20
+ 4. Submit your agent for evaluation
21
+
22
+ For questions, check the GitHub repository issues or contact the competition organizers.
docs/advanced-pipeline-examples.md CHANGED
@@ -1,73 +1,182 @@
1
- # Working with Advanced Pipeline Examples
2
-
3
- This guide demonstrates how to load, modify, and run an existing advanced pipeline example, focusing on the two-step justified confidence model for tossup questions.
4
-
5
- ## Loading the Two-Step Justified Confidence Example
6
-
7
- 1. Navigate to the "Tossup Agents" tab at the top of the interface.
8
-
9
- 2. Click the "Select Pipeline to Import..." dropdown and choose "two-step-justified-confidence.yaml".
10
-
11
- 3. Click "Import Pipeline" to load the example into the interface.
12
-
13
- ## Understanding the Two-Step Pipeline Structure
14
-
15
- The loaded pipeline has two distinct steps:
16
-
17
- 1. **Step A: Answer Generator**
18
- - Uses OpenAI/gpt-4o-mini
19
- - Takes question text as input
20
- - Generates an answer candidate
21
- - Uses a focused system prompt for answer generation only
22
-
23
- 2. **Step B: Confidence Evaluator**
24
- - Uses Cohere/command-r-plus
25
- - Takes the question text AND the generated answer from Step A
26
- - Evaluates confidence and provides justification
27
- - Uses a specialized system prompt for confidence evaluation
28
-
29
- This separation of concerns allows each model to focus on a specific task:
30
- - The first model concentrates solely on generating the most accurate answer
31
- - The second model evaluates how confident we should be in that answer
32
-
33
- ## Modifying the Pipeline for Better Performance
34
-
35
- Here are some ways to enhance the pipeline:
36
-
37
- 1. **Upgrade the Answer Generator**:
38
- - Click on Step A in the interface
39
- - Change the model from gpt-4o-mini to a more powerful model like gpt-4o
40
- - Modify the system prompt to include more specific instructions about quizbowl answer formatting
41
-
42
- 2. **Improve the Confidence Evaluator**:
43
- - Click on Step B
44
- - Add specific domain knowledge to the system prompt
45
- - For example, add: "Consider question length when evaluating confidence. Shorter, incomplete questions with less information revealed typically result in lower confidence scores."
46
- - Change the order of input variables so that model produces justification before confidence score, and hence conditions its confidence score on the justification.
47
-
48
- ## Running and Testing Your Modified Pipeline
49
-
50
- 1. After making your modifications, scroll down to adjust the buzzer settings:
51
- - Consider changing the confidence threshold based on the performance of your enhanced model
52
- - You might want to lower it slightly if you've improved the confidence evaluator
53
-
54
- 2. Test your modified pipeline:
55
- - Select a Question ID or use the provided sample question
56
- - Click "Run on Tossup Question"
57
- - Observe the answer, confidence score, and justification
58
-
59
- 3. Check the "Buzz Confidence" chart to see how confidence evolved during question processing
60
-
61
- ## Advantages of Multi-Step Pipelines
62
-
63
- Multi-step pipelines offer several benefits:
64
-
65
- 1. **Specialized Models**: Use different models for different tasks (e.g., GPT for general knowledge, Claude for reasoning)
66
-
67
- 2. **Focused Prompting**: Each step can have a targeted system prompt optimized for its specific task
68
-
69
- 3. **Chain of Thought**: Build sophisticated reasoning by connecting steps in a logical sequence
70
-
71
- 4. **Better Confidence Calibration**: Dedicated confidence evaluation typically results in more reliable buzzing
72
-
73
- 5. **Transparency**: The justification output helps you understand why the model made certain decisions
1
+ # Advanced Pipeline Examples
2
+
3
+ This guide shows how to implement and validate sophisticated multi-step pipelines for quizbowl agents.
4
+
5
+ ## Goals
6
+ Using advanced pipelines, you will:
7
+ - Improve accuracy by 15-25% over single-step agents
8
+ - Create specialized components for different tasks
9
+ - Implement effective confidence calibration
10
+ - Build robust buzzer strategies
11
+
12
+ ## Two-Step Justified Confidence Pipeline
13
+
14
+ ### Baseline Performance
15
+ Standard single-step agents typically achieve:
16
+ - Accuracy: ~65-70%
17
+ - Poorly calibrated confidence
18
+ - Limited explanation for answers
19
+
20
+ ### Loading the Pipeline Example
21
+
22
+ 1. Navigate to the "Tossup Agents" tab
23
+ 2. Click "Select Pipeline to Import..." and choose "two-step-justified-confidence.yaml"
24
+ 3. Click "Import Pipeline"
25
+
26
+ ### Understanding the Pipeline Structure
27
+
28
+ This pipeline has two distinct steps; a conceptual sketch of the data flow follows the step descriptions:
29
+
30
+ #### Step A: Answer Generator
31
+ - Uses OpenAI/gpt-4o-mini
32
+ - Takes question text as input
33
+ - Generates an answer candidate
34
+ - Focuses solely on accurate answer generation
35
+
36
+ #### Step B: Confidence Evaluator
37
+ - Uses Cohere/command-r-plus
38
+ - Takes question text AND generated answer from Step A
39
+ - Evaluates confidence and provides justification
40
+ - Specialized for confidence assessment
41
+
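+ Conceptually, the data flow between the two steps can be sketched in a few lines of Python. This is only an illustration: `call_llm` is a hypothetical stand-in for whatever model call each configured step makes, not part of the interface.
+
+ ```python
+ # Minimal sketch of the two-step data flow (hypothetical helper, not the real API).
+ def call_llm(model: str, system_prompt: str, **inputs) -> dict:
+     """Stand-in for one model step; returns that step's declared output variables."""
+     raise NotImplementedError  # the pipeline runner performs the actual call
+
+ def run_two_step(question_text: str) -> dict:
+     # Step A: Answer Generator (OpenAI/gpt-4o-mini in the example)
+     step_a = call_llm(
+         model="OpenAI/gpt-4o-mini",
+         system_prompt="Generate the most likely quizbowl answer.",
+         question=question_text,
+     )  # -> {"answer": ...}
+
+     # Step B: Confidence Evaluator (Cohere/command-r-plus in the example)
+     step_b = call_llm(
+         model="Cohere/command-r-plus",
+         system_prompt="Justify, then score confidence in the provided answer.",
+         question=question_text,
+         answer=step_a["answer"],
+     )  # -> {"confidence": ..., "justification": ...}
+
+     return {"answer": step_a["answer"], **step_b}
+ ```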
42
+ ### Validation
43
+ Test the pipeline and check:
44
+ - Is accuracy improved compared to single-step?
45
+ - Are confidence scores better calibrated?
46
+ - Does the justification explain reasoning clearly?
47
+
48
+ ### Results
49
+ Two-step justified confidence typically achieves:
50
+ - Accuracy: ~80-85%
51
+ - Well-calibrated confidence scores
52
+ - Clear justification for answers and confidence
53
+ - More strategic buzzing
54
+
55
+ ## Enhancing the Two-Step Pipeline
56
+
57
+ ### Step 1: Upgrade Answer Generator
58
+
59
+ #### Current Performance
60
+ The default example uses gpt-4o-mini, which may lack:
61
+ - Specialized knowledge in some areas
62
+ - Consistent answer formatting
63
+
64
+ #### Implementation
65
+ 1. Click on Step A
66
+ 2. Change model to a stronger option (e.g., gpt-4o)
67
+ 3. Modify system prompt to focus on answer precision
68
+
69
+ #### Validation
70
+ Test with sample questions and check:
71
+ - Has answer accuracy improved?
72
+ - Is formatting more consistent?
73
+
74
+ #### Results
75
+ With upgraded answer generator:
76
+ - Accuracy increases to ~85-90%
77
+ - More consistent answer formats
78
+
79
+ ### Step 2: Improve Confidence Evaluator
80
+
81
+ #### Current Performance
82
+ The default evaluator may:
83
+ - Over-estimate confidence on some topics
84
+ - Provide limited justification
85
+
86
+ #### Implementation
87
+ 1. Click on Step B
88
+ 2. Enhance the system prompt:
89
+ ```
90
+ You are an expert confidence evaluator for quizbowl answers.
91
+
92
+ Your task:
93
+ 1. Evaluate ONLY the correctness of the provided answer
94
+ 2. Consider question completeness and available clues
95
+ 3. Provide specific justification for your confidence score
96
+ 4. Be especially critical of answers with limited supporting evidence
97
+
98
+ Remember:
99
+ - Early, difficult clues justify lower confidence
100
+ - Later, obvious clues justify higher confidence
101
+ - Domain expertise should be reflected in your assessment
102
+ ```
103
+
104
+ #### Validation
105
+ Test and verify:
106
+ - Are confidence scores better aligned with correctness?
107
+ - Does justification include specific clues from questions?
108
+ - Is confidence calibrated appropriately for question position?
109
+
110
+ #### Results
111
+ With improved evaluator:
112
+ - More accurate confidence calibration
113
+ - Detailed justifications citing specific clues
114
+ - Better buzzing decisions
115
+
116
+ ## Three-Step Pipeline with Analysis
117
+
118
+ ### Concept
119
+ Adding a dedicated analysis step before answer generation:
120
+
121
+ 1. **Step A: Question Analyzer**
122
+ - Identifies key clues, entities, and relationships
123
+ - Determines question category and format
124
+
125
+ 2. **Step B: Answer Generator**
126
+ - Uses analysis to generate accurate answers
127
+ - Focuses on formatting and precision
128
+
129
+ 3. **Step C: Confidence Evaluator**
130
+ - Assesses answer quality based on analysis and clues
131
+ - Determines optimal buzz timing
132
+
133
+ ### Implementation
134
+ Create this pipeline from scratch or modify the two-step example.
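+ If you extend the two-step example, the flow simply gains an analysis step whose output feeds the later steps. A rough sketch, reusing the hypothetical `call_llm` helper from the two-step sketch above (model choices left unspecified):
+
+ ```python
+ # Illustrative three-step flow (hypothetical helper, not the real API).
+ def run_three_step(question_text: str) -> dict:
+     # Step A: Question Analyzer - key clues, entities, category
+     analysis = call_llm(model="...", system_prompt="Analyze the question.",
+                         question=question_text)["analysis"]
+     # Step B: Answer Generator - conditioned on the analysis
+     answer = call_llm(model="...", system_prompt="Answer using the analysis.",
+                       question=question_text, analysis=analysis)["answer"]
+     # Step C: Confidence Evaluator - sees question, analysis, and answer
+     evaluation = call_llm(model="...", system_prompt="Justify, then score confidence.",
+                           question=question_text, analysis=analysis, answer=answer)
+     return {"answer": answer, **evaluation}
+ ```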
135
+
136
+ ### Validation
137
+ Compare to the two-step pipeline:
138
+ - Does the analysis step improve answer accuracy?
139
+ - Does it provide better performance on difficult questions?
140
+ - Are there improvements in early buzzing?
141
+
142
+ ### Results
143
+ Three-step pipelines typically achieve:
144
+ - Accuracy: ~90-95%
145
+ - Earlier correct buzzes
146
+ - Exceptional performance on difficult questions
147
+
148
+ ## Specialty Pipeline: Literature Focus
149
+
150
+ ### Concept
151
+ Create a pipeline specialized for literature questions:
152
+
153
+ 1. **Step A: Literary Analyzer**
154
+ - Identifies literary techniques, periods, and styles
155
+ - Recognizes author-specific clues
156
+
157
+ 2. **Step B: Answer Generator**
158
+ - Specialized for literary works and authors
159
+ - Formats answers according to literary conventions
160
+
161
+ 3. **Step C: Confidence Evaluator**
162
+ - Calibrated specifically for literature questions
163
+
164
+ ### Implementation
165
+ Create specialized system prompts for each step focusing on literary knowledge.
166
+
167
+ ### Validation
168
+ Test specifically on literature questions and compare to general pipeline.
169
+
170
+ ### Results
171
+ Specialty pipelines can achieve:
172
+ - 95%+ accuracy in their specialized domain
173
+ - Earlier buzzing on category-specific questions
174
+ - Better performance on difficult clues
175
+
176
+ ## Best Practices for Advanced Pipelines
177
+
178
+ 1. **Focused Components**: Each step should have a clear, single responsibility
179
+ 2. **Efficient Communication**: Pass only necessary information between steps
180
+ 3. **Strong Fundamentals**: Start with a solid two-step pipeline before adding complexity
181
+ 4. **Consistent Testing**: Validate each change against the same test set
182
+ 5. **Strategic Model Selection**: Use different models for tasks where they excel
docs/goals-and-evaluation.md ADDED
@@ -0,0 +1,60 @@
1
+ # Quizbowl Agent Goals and Evaluation
2
+
3
+ ## Objectives
4
+
5
+ ### Tossup Agents
6
+ - Respond to questions with your best guess and a calibrated confidence score
7
+ - Buzz at the earliest possible moment with sufficient information
8
+ - Avoid incorrect buzzes
9
+ - Maintain consistent performance across topics
10
+
11
+ ### Bonus Agents
12
+ - Answer parts correctly with accurate confidence estimation
13
+ - Provide a clear explanation of your reasoning, which human team members will use to validate or pick the suggested answer
14
+ - Adapt to varying difficulty levels (easy, medium, hard)
15
+
16
+ ## Performance Metrics
17
+
18
+ ### Tossup Metrics
19
+ - **Accuracy**: Percentage of correct answers
20
+ - **Average Buzz Position**: How early in the question you buzz (earlier is better)
21
+ - **Confidence Calibration**: How well confidence scores match actual performance
22
+ - **Score**: Points earned based on buzz position and correctness
23
+
24
+ ### Bonus Metrics
25
+ - **Accuracy**: Percentage of correct answers across all parts
26
+ - **Confidence Calibration**: How well confidence scores match actual performance (one simple check is sketched after this list)
27
+ - **Explanation Quality**: Relevance and clarity of reasoning
28
+
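+ As a concrete illustration of confidence calibration (not the official metric, just one simple check), you can bucket your agent's predictions by confidence and compare each bucket's average confidence with its accuracy. The record format below is hypothetical.
+
+ ```python
+ # Simple calibration check over recorded runs; each record is (confidence, correct).
+ def calibration_report(records: list[tuple[float, bool]], n_bins: int = 5) -> None:
+     bins: list[list[tuple[float, bool]]] = [[] for _ in range(n_bins)]
+     for conf, correct in records:
+         bins[min(int(conf * n_bins), n_bins - 1)].append((conf, correct))
+     for i, bucket in enumerate(bins):
+         if not bucket:
+             continue
+         avg_conf = sum(c for c, _ in bucket) / len(bucket)
+         accuracy = sum(ok for _, ok in bucket) / len(bucket)
+         print(f"conf {i / n_bins:.1f}-{(i + 1) / n_bins:.1f}: "
+               f"avg confidence {avg_conf:.2f}, accuracy {accuracy:.2f}, n={len(bucket)}")
+ ```
+
+ A well-calibrated agent shows average confidence close to accuracy in every bucket.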
29
+ ## Evaluating Your Agent
30
+
31
+ ### Testing Baseline Performance
32
+ 1. Run the default agent configuration
33
+ 2. Record metrics (accuracy, confidence, buzz position)
34
+ 3. Identify specific weaknesses in performance
35
+
36
+ ### Validating Improvements
37
+ After each enhancement:
38
+ 1. Run the agent on the same development set of questions
39
+ 2. Compare metrics to previous version
40
+ 3. Check for improvements in weak areas
41
+
42
+ ### Final Evaluation Criteria
43
+ Your final agent will be evaluated on:
44
+ 1. Overall accuracy across diverse questions
45
+ 2. Optimal buzz timing (neither too early nor too late)
46
+ 3. Confidence threshold calibration
47
+ 4. Explanation quality (for bonus agents)
48
+
49
+ <!-- ## Setting Goals for Your Agent
50
+
51
+ ### Minimum Goals
52
+ - Accuracy above 60%
53
+ - Appropriate confidence threshold (0.7-0.9)
54
+ - Reasonable buzz positions
55
+
56
+ ### Advanced Goals
57
+ - Multi-step pipelines with specialized components
58
+ - Accuracy above 85%
59
+ - Strategic early buzzing on familiar topics
60
+ - Detailed, accurate explanations for bonus questions -->
docs/imgs/bonus-output-panel.png ADDED

Git LFS Details

  • SHA256: 3580165a3e2a660beed6ef44a6a8e3871a17cb7c67a1f0f4dd033c680b2a0106
  • Pointer size: 130 Bytes
  • Size of remote file: 43.1 kB
docs/imgs/import-pipeline.png ADDED

Git LFS Details

  • SHA256: 90b2eb6415c2c5dd2169fd7bfdb9c61f5571e7e7be8fc19418bb0af8a41e17ae
  • Pointer size: 130 Bytes
  • Size of remote file: 68.6 kB
docs/imgs/pipeline-preview.png ADDED

Git LFS Details

  • SHA256: 5fe94a28fe40cd03a08df8c3700174d4e50c2224af2356a23d913f7a1af068de
  • Pointer size: 131 Bytes
  • Size of remote file: 199 kB
docs/imgs/tossup-agent-pipeline.png ADDED

Git LFS Details

  • SHA256: 6c849673fabe17a738ce4bc56f7f0a4a23ff57faca128169d0ce5bfaeb6d9ad9
  • Pointer size: 131 Bytes
  • Size of remote file: 212 kB
docs/tossup-agent-guide.md ADDED
@@ -0,0 +1,143 @@
1
+ # Building an Effective Tossup Agent
2
+
3
+ ## Goals
4
+ By the end of this guide, you will:
5
+ - Create a tossup agent that answers questions accurately
6
+ - Calibrate confidence thresholds for optimal buzzing
7
+ - Test performance on sample questions
8
+ - Submit your agent for evaluation
9
+
10
+ ## Baseline System Performance
11
+
12
+ Let's import the simple tossup agent pipeline `umdclip/simple-tossup-pipeline` and examine the configuration:
13
+
14
+ ![Default Tossup Configuration](./imgs/tossup-agent-pipeline.png)
15
+
16
+ The baseline system achieves:
17
+ - Accuracy: ~30% on sample questions
18
+ - Average Buzz Token Position: 40.40
19
+ - Average Confidence: 0.65
20
+
21
+ We'll improve this through targeted enhancements.
22
+
23
+ ## Enhancement 1: Basic Model Configuration
24
+
25
+ ### Current Performance
26
+ The default configuration uses `gpt-4o-mini` with a temperature of `0.7` and a buzzer confidence threshold of `0.85`.
27
+
28
+ ### Implementing the Enhancement
29
+ 1. Navigate to "Tossup Agents" tab
30
+ 2. Select a stronger model (e.g., gpt-4o)
31
+ 3. Reduce temperature to 0.1 for more consistent outputs
32
+ 4. Test on sample questions
33
+
34
+ ### Validation
35
+ Run the agent on test questions and check:
36
+ - Has accuracy improved?
37
+ - Are confidence scores more consistent?
38
+ - Is your agent buzzing earlier?
39
+
40
+ ### Results
41
+ With better model configuration:
42
+ - Accuracy increases to ~80%
43
+ - Average buzz position increases to 59.60
44
+
45
+ ## Enhancement 2: System Prompt Optimization
46
+
47
+ ### Current Performance
48
+ The default prompt lacks specific instructions for:
49
+ - Answer formatting
50
+ - Confidence calibration
51
+ - Domain-specific knowledge
52
+
53
+ ### Implementing the Enhancement
54
+ 1. Click "System Prompt" tab
55
+ 2. Add specific instructions:
56
+
57
+ ```
58
+ You are a professional quizbowl player answering tossup questions.
59
+
60
+ Your task:
61
+ 1. Analyze clues in the question text
62
+ 2. Determine the most likely answer
63
+ 3. Assess confidence on a scale from 0.0 to 1.0
64
+
65
+ Important guidelines:
66
+ - Give answers in the expected format (person's full name, complete title, etc.)
67
+ - Use 0.8+ confidence ONLY when absolutely certain
68
+ - For literature, include author's full name
69
+ - For science, include complete technical terms
70
+ ```
71
+
72
+ ### Validation
73
+ Test on the same questions and check:
74
+ - Are answers formatted more consistently?
75
+ - Is confidence more accurately reflecting correctness?
76
+ - Check specific categories where you added domain knowledge
77
+
78
+ ### Results
79
+ With optimized prompts:
80
+ - Accuracy increases to ~75%
81
+ - Confidence scores align better with actual performance
82
+ - Answer formats become more consistent
83
+
84
+ ## Enhancement 3: Confidence Calibration
85
+
86
+ ### Current Performance
87
+ Even with better prompts, confidence thresholds may be:
88
+ - Too high (missing answerable questions)
89
+ - Too low (buzzing incorrectly)
90
+
91
+ ### Implementing the Enhancement
92
+ 1. Scroll to "Buzzer Settings"
93
+ 2. Test different thresholds (0.7-0.9)
94
+ 3. Find optimal balance between:
95
+ - Buzzing early enough to score points
96
+ - Waiting for sufficient confidence
97
+
98
+ ![Buzzer Settings](./imgs/buzzer-settings.png)
99
+
100
+ ### Validation
101
+ For each threshold (a small offline sweep is sketched after this list):
102
+ 1. Run tests on multiple questions
103
+ 2. Check percentage of correct buzzes
104
+ 3. Monitor average buzz position
105
+
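+ If you log each run's buzz confidence and outcome, a quick offline sweep can compare thresholds before you commit to one. This is only a sketch over hypothetical records you collect yourself; the values shown are toy data.
+
+ ```python
+ # Sketch: compare candidate buzzer thresholds on recorded runs.
+ # Each record is (confidence, correct, buzz_token_position); toy data below.
+ records = [(0.92, True, 35), (0.71, False, 28), (0.88, True, 52), (0.60, False, 20)]
+
+ for threshold in (0.70, 0.75, 0.80, 0.85, 0.90):
+     buzzes = [(ok, pos) for conf, ok, pos in records if conf >= threshold]
+     if not buzzes:
+         print(f"threshold {threshold:.2f}: never buzzes")
+         continue
+     correct_rate = sum(ok for ok, _ in buzzes) / len(buzzes)
+     avg_pos = sum(pos for _, pos in buzzes) / len(buzzes)
+     print(f"threshold {threshold:.2f}: {len(buzzes)} buzzes, "
+           f"{correct_rate:.0%} correct, avg position {avg_pos:.1f}")
+ ```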
106
+ ### Results
107
+ With a calibrated threshold (e.g., 0.75):
108
+ - Balance between accuracy and early buzzing
109
+ - Fewer incorrect buzzes
110
+ - Earlier correct buzzes
111
+
112
+ ## Enhancement 4: Multi-Step Pipeline
113
+
114
+ ### Current Performance
115
+ Single-step pipelines often struggle with:
116
+ - Accurately separating answer generation from confidence estimation
117
+ - Providing consistent performance across question types
118
+
119
+ ### Implementing the Enhancement
120
+ 1. Click "+ Add Step" to create a two-step pipeline:
121
+ - Step A: Answer Generator
122
+ - Step B: Confidence Evaluator
123
+ 2. Configure each step:
124
+ - Step A focuses only on generating the best answer
125
+ - Step B evaluates confidence based on the answer and question
126
+
127
+ Alternatively, load the multi-step pipeline `umdclip/two-step-justified-confidence`, which implements this same structure.
128
+ For more details on the pipeline, see [Advanced Pipeline Examples](./advanced-pipeline-examples.md).
129
+
130
+ ### Validation
131
+ Test the multi-step pipeline and compare to single-step:
132
+ - Does separation of concerns improve performance?
133
+ - Are confidence scores more accurate?
134
+ - Is there improvement in early buzz positions?
135
+
136
+ ## Final Evaluation and Submission
137
+
138
+ 1. Run comprehensive testing across categories
139
+ 2. Verify metrics match your goals
140
+ 3. Export your pipeline configuration
141
+ 4. Submit your agent for official evaluation
142
+
143
+ For complete UI reference, see [UI Reference](./ui-reference.md).
docs/ui-reference.md ADDED
@@ -0,0 +1,147 @@
1
+ # Quizbowl Agent Web Interface Reference
2
+
3
+ This guide explains all elements of the web interface for creating and testing quizbowl agents.
4
+
5
+ ## Navigation
6
+
7
+ The interface has four main tabs:
8
+ - **Tossup Agents**: Create and test agents for tossup questions
9
+ - **Bonus Round Agents**: Create and test agents for bonus questions
10
+ - **Leaderboard**: View leaderboard of agents
11
+ - **Help**: Access documentation and support resources
12
+
13
+ ## Pipeline Creation Components
14
+
15
+ Let's walk through the components of the Tossup Agent pipeline creation interface.
16
+ ![Tossup Agent Pipeline Creation Interface](./imgs/tossup-agent-pipeline.png)
17
+
18
+ ### Model Step Management
19
+
20
+ A model step is a single LLM call in the pipeline. Your pipeline can have multiple model steps.
21
+ - **+ Add Step**: Adds a new step to your pipeline
22
+ - **Step ID**: Unique identifier for each step (A, B, C, etc.)
23
+ - **Step Name**: Descriptive name for the step
24
+ - Available when the pipeline has more than one model step:
25
+ - **Delete Step** (×): Removes a step from the pipeline
26
+ - **Move Up** (↑): Moves a step up in the pipeline
27
+ - **Move Down** (↓): Moves a step down in the pipeline
28
+
29
+ ### Model Selection
30
+
31
+ - **Model Dropdown**: Select language model provider and model
32
+ - **Temperature Slider**: Adjust randomness of outputs (0.0-1.0)
33
+ - Lower values (0.1-0.3): More consistent, deterministic outputs
34
+ - Higher values (0.7-1.0): More creative, varied outputs
35
+
36
+ ### System Prompt
37
+
38
+ - **System Prompt Tab**: Contains instructions for the model
39
+ - **Text Editor**: Edit instructions directly; click away (unfocus) to apply the changes to the system prompt
40
+
41
+ ### Input/Output Configuration
42
+
43
+ #### Inputs Tab
44
+
45
+ ![Inputs Tab](./imgs/inputs-tab.png)
46
+
47
+ - **Variable Used**: Reference name in pipeline (e.g., question_text)
48
+ - **Input Name**: Name the model sees (e.g., question)
49
+ - **Description**: Explains the input's purpose
50
+ - **+ Button**: Adds a new input variable
51
+ - **× Button**: Removes an input variable
52
+
53
+ #### Outputs Tab
54
+
55
+ ![Outputs Tab](./imgs/outputs-tab.png)
56
+
57
+ - **Output Field**: Name of the output variable (e.g., answer)
58
+ - **Type Dropdown**: Data type (str, float, list, bool)
59
+ - **Description**: Explains what the output represents
60
+ - **Arrow Buttons**: Change output order
61
+ - **+ Button**: Adds a new output
62
+ - **× Button**: Removes an output
63
+
64
+ ### Output Panel
65
+
66
+ ![Buzzer Settings](./imgs/buzzer-settings.png)
67
+
68
+ #### Output Variables
69
+
70
+ Tossup agents are required to collect the following output variables:
71
+ - `answer`: The answer to the input question
72
+ - `confidence`: The confidence score of the answer
73
+
74
+ #### Buzzer Settings (For Tossup Agents)
75
+
76
+ - **Confidence Threshold**: Minimum value of the `confidence` output variable to consider a buzz (0.0-1.0)
77
+ - **Buzz Probability**: Minimum value of the normalized probability of the LLM's output tokens, computed from their `logprobs`: $p(y \mid x) = \exp\left(\sum_{y_i \in y} \text{logprob}(y_i)\right)$. Note that only some models support `logprobs`.
78
+ - **Method Dropdown**:
79
+ - AND: Both conditions must be true to buzz
80
+ - OR: Either condition alone can trigger a buzz (the sketch below shows how these settings combine)
81
+
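+ As an illustration of how these settings interact, the buzz decision reduces to something like the sketch below. The handling of models without `logprobs` is an assumption here; the interface may behave differently.
+
+ ```python
+ import math
+
+ def output_probability(logprobs: list[float]) -> float:
+     """p(y|x) = exp(sum of token logprobs) for the generated output."""
+     return math.exp(sum(logprobs))
+
+ def should_buzz(confidence: float, logprobs: list[float] | None,
+                 conf_threshold: float, prob_threshold: float, method: str) -> bool:
+     conf_ok = confidence >= conf_threshold
+     # Assumption: without logprobs, the probability condition simply cannot be met.
+     prob_ok = logprobs is not None and output_probability(logprobs) >= prob_threshold
+     return (conf_ok and prob_ok) if method == "AND" else (conf_ok or prob_ok)
+ ```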
82
+ ## Testing Components
83
+
84
+ ### Question Selection
85
+
86
+ - **Question ID**: Enter ID to load specific question
87
+ - **Sample Question**: Use provided sample
88
+ - **Run Button**: Process question with current pipeline
89
+
90
+ ### Results Visualization
91
+
92
+ #### Tossup Visualization
93
+
94
+ ![Tossup Results](./imgs/tossup-viz.png)
95
+
96
+ - **Highlighted Question Text**:
97
+ - Highlighted tokens mark the points at which the model is probed with the question text revealed so far
98
+ - Gray, green, or red highlighting indicates whether the model buzzed and, if so, whether the buzz was correct or incorrect
99
+ - Hover for answer/confidence details
100
+
101
+ - **Answer Popup**:
102
+ - Shows final answer
103
+ - Displays confidence score
104
+ - Indicates correctness
105
+
106
+ - **Buzz Confidence Graph**:
107
+ - X-axis: Token position
108
+ - Y-axis: Confidence (0.0-1.0)
109
+ - Blue line: Confidence progression
110
+
111
+ #### Bonus Visualization
112
+
113
+ - **Question Display**: Shows leadin and parts
114
+ - **Results Table**:
115
+ - Part number
116
+ - Correctness indicator
117
+ - Confidence score
118
+ - Prediction
119
+ - Explanation
120
+
121
+ ## Pipeline Management
122
+
123
+ ### Import/Export
124
+
125
+ ![Import Pipeline](./imgs/import-pipeline.png)
126
+ - **Select Pipeline to Import** dropdown: Load existing pipeline configuration
127
+ - **Import Pipeline**: Apply selected pipeline configuration
128
+
129
+ ![Export Pipeline](./imgs/pipeline-preview.png)
130
+ - **Export Pipeline**: Save configuration as YAML
131
+ - **Pipeline Preview**: View and edit pipeline configuration in YAML format
132
+
133
+ ### Evaluation and Submission
134
+
135
+ - **Evaluate**: Run comprehensive assessment
136
+ - **Model Name**: Name for submission
137
+ - **Description**: Details about your agent
138
+ - **Sign in with Hugging Face**: Authentication
139
+ - **Submit**: Submit agent for official evaluation
140
+
141
+ ## Tips for Effective Use
142
+
143
+ - Use the system prompt to give clear instructions
144
+ - Test different confidence thresholds to find optimal settings
145
+ - Monitor buzz positions in the visualization
146
+ - Examine confidence trends to identify problem areas
147
+ - Use multi-step pipelines for complex tasks