Maharshi Gor committed
Commit 0f6850b · 1 Parent(s): 3a1af80

Added better documentation
docs/README.md ADDED
@@ -0,0 +1,22 @@
1
+ # Quizbowl Agent Documentation
2
+
3
+ ## Overview
4
+ This documentation helps you build effective quizbowl agents for tossup and bonus questions.
5
+
6
+ ## Documentation Files
7
+
8
+ - **[Goals and Evaluation](./goals-and-evaluation)**: Objectives and metrics for your agents across different competition modes.
9
+ - **[Tossup Agent Guide](./tossup-agent-guide)**: Step-by-step guide to build tossup agents
10
+ - **[Bonus Agent Guide](./bonus-agent-guide)**: Step-by-step guide to build bonus agents
11
+ - **[Advanced Pipeline Examples](./advanced-pipeline-examples)**: Complex pipeline configurations with examples
12
+ - **[UI Reference](./ui-reference)**: Complete reference for the web interface
13
+ - **[Best Practices](./best-practices)**: Tips and strategies for optimal performance
14
+
15
+ ## Getting Started
16
+
17
+ 1. Read [Goals and Evaluation](./goals-and-evaluation) to understand success metrics
18
+ 2. Follow [Tossup Agent Guide](./tossup-agent-guide) to build your first agent
19
+ 3. Test and iterate using the validation steps
20
+ 4. Submit your agent for evaluation
21
+
22
+ For questions, check the GitHub repository issues or contact the competition organizers.
docs/advanced-pipeline-examples.md CHANGED
@@ -1,73 +1,182 @@
1
- # Working with Advanced Pipeline Examples
2
-
3
- This guide demonstrates how to load, modify, and run an existing advanced pipeline example, focusing on the two-step justified confidence model for tossup questions.
4
-
5
- ## Loading the Two-Step Justified Confidence Example
6
-
7
- 1. Navigate to the "Tossup Agents" tab at the top of the interface.
8
-
9
- 2. Click the "Select Pipeline to Import..." dropdown and choose "two-step-justified-confidence.yaml".
10
-
11
- 3. Click "Import Pipeline" to load the example into the interface.
12
-
13
- ## Understanding the Two-Step Pipeline Structure
14
-
15
- The loaded pipeline has two distinct steps:
16
-
17
- 1. **Step A: Answer Generator**
18
- - Uses OpenAI/gpt-4o-mini
19
- - Takes question text as input
20
- - Generates an answer candidate
21
- - Uses a focused system prompt for answer generation only
22
-
23
- 2. **Step B: Confidence Evaluator**
24
- - Uses Cohere/command-r-plus
25
- - Takes the question text AND the generated answer from Step A
26
- - Evaluates confidence and provides justification
27
- - Uses a specialized system prompt for confidence evaluation
28
-
29
- This separation of concerns allows each model to focus on a specific task:
30
- - The first model concentrates solely on generating the most accurate answer
31
- - The second model evaluates how confident we should be in that answer
32
-
33
- ## Modifying the Pipeline for Better Performance
34
-
35
- Here are some ways to enhance the pipeline:
36
-
37
- 1. **Upgrade the Answer Generator**:
38
- - Click on Step A in the interface
39
- - Change the model from gpt-4o-mini to a more powerful model like gpt-4o
40
- - Modify the system prompt to include more specific instructions about quizbowl answer formatting
41
-
42
- 2. **Improve the Confidence Evaluator**:
43
- - Click on Step B
44
- - Add specific domain knowledge to the system prompt
45
- - For example, add: "Consider question length when evaluating confidence. Shorter, incomplete questions with less information revealed typically result in lower confidence scores."
46
- - Change the order of input variables so that model produces justification before confidence score, and hence conditions its confidence score on the justification.
47
-
48
- ## Running and Testing Your Modified Pipeline
49
-
50
- 1. After making your modifications, scroll down to adjust the buzzer settings:
51
- - Consider changing the confidence threshold based on the performance of your enhanced model
52
- - You might want to lower it slightly if you've improved the confidence evaluator
53
-
54
- 2. Test your modified pipeline:
55
- - Select a Question ID or use the provided sample question
56
- - Click "Run on Tossup Question"
57
- - Observe the answer, confidence score, and justification
58
-
59
- 3. Check the "Buzz Confidence" chart to see how confidence evolved during question processing
60
-
61
- ## Advantages of Multi-Step Pipelines
62
-
63
- Multi-step pipelines offer several benefits:
64
-
65
- 1. **Specialized Models**: Use different models for different tasks (e.g., GPT for general knowledge, Claude for reasoning)
66
-
67
- 2. **Focused Prompting**: Each step can have a targeted system prompt optimized for its specific task
68
-
69
- 3. **Chain of Thought**: Build sophisticated reasoning by connecting steps in a logical sequence
70
-
71
- 4. **Better Confidence Calibration**: Dedicated confidence evaluation typically results in more reliable buzzing
72
-
73
- 5. **Transparency**: The justification output helps you understand why the model made certain decisions
1
+ # Advanced Pipeline Examples
2
+
3
+ This guide shows how to implement and validate sophisticated multi-step pipelines for quizbowl agents.
4
+
5
+ ## Goals
6
+ Using advanced pipelines, you will:
7
+ - Improve accuracy by 15-25% over single-step agents
8
+ - Create specialized components for different tasks
9
+ - Implement effective confidence calibration
10
+ - Build robust buzzer strategies
11
+
12
+ ## Two-Step Justified Confidence Pipeline
13
+
14
+ ### Baseline Performance
15
+ Standard single-step agents typically achieve:
16
+ - Accuracy: ~65-70%
17
+ - Poorly calibrated confidence
18
+ - Limited explanation for answers
19
+
20
+ ### Loading the Pipeline Example
21
+
22
+ 1. Navigate to the "Tossup Agents" tab
23
+ 2. Click "Select Pipeline to Import..." and choose "two-step-justified-confidence.yaml"
24
+ 3. Click "Import Pipeline"
25
+
26
+ ### Understanding the Pipeline Structure
27
+
28
+ This pipeline has two distinct steps; a conceptual sketch of the data flow follows the step descriptions:
29
+
30
+ #### Step A: Answer Generator
31
+ - Uses OpenAI/gpt-4o-mini
32
+ - Takes question text as input
33
+ - Generates an answer candidate
34
+ - Focuses solely on accurate answer generation
35
+
36
+ #### Step B: Confidence Evaluator
37
+ - Uses Cohere/command-r-plus
38
+ - Takes question text AND generated answer from Step A
39
+ - Evaluates confidence and provides justification
40
+ - Specialized for confidence assessment
41
+
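+ Conceptually, the data flow between the two steps can be sketched in a few lines of Python. This is only an illustration: `call_llm` is a hypothetical stand-in for whatever model call each configured step makes, not part of the interface.
+
+ ```python
+ # Minimal sketch of the two-step data flow (hypothetical helper, not the real API).
+ def call_llm(model: str, system_prompt: str, **inputs) -> dict:
+     """Stand-in for one model step; returns that step's declared output variables."""
+     raise NotImplementedError  # the pipeline runner performs the actual call
+
+ def run_two_step(question_text: str) -> dict:
+     # Step A: Answer Generator (OpenAI/gpt-4o-mini in the example)
+     step_a = call_llm(
+         model="OpenAI/gpt-4o-mini",
+         system_prompt="Generate the most likely quizbowl answer.",
+         question=question_text,
+     )  # -> {"answer": ...}
+
+     # Step B: Confidence Evaluator (Cohere/command-r-plus in the example)
+     step_b = call_llm(
+         model="Cohere/command-r-plus",
+         system_prompt="Justify, then score confidence in the provided answer.",
+         question=question_text,
+         answer=step_a["answer"],
+     )  # -> {"confidence": ..., "justification": ...}
+
+     return {"answer": step_a["answer"], **step_b}
+ ```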
42
+ ### Validation
43
+ Test the pipeline and check:
44
+ - Is accuracy improved compared to single-step?
45
+ - Are confidence scores better calibrated?
46
+ - Does the justification explain reasoning clearly?
47
+
48
+ ### Results
49
+ Two-step justified confidence typically achieves:
50
+ - Accuracy: ~80-85%
51
+ - Well-calibrated confidence scores
52
+ - Clear justification for answers and confidence
53
+ - More strategic buzzing
54
+
55
+ ## Enhancing the Two-Step Pipeline
56
+
57
+ ### Step 1: Upgrade Answer Generator
58
+
59
+ #### Current Performance
60
+ The default example uses gpt-4o-mini, which may lack:
61
+ - Specialized knowledge in some areas
62
+ - Consistent answer formatting
63
+
64
+ #### Implementation
65
+ 1. Click on Step A
66
+ 2. Change model to a stronger option (e.g., gpt-4o)
67
+ 3. Modify system prompt to focus on answer precision
68
+
69
+ #### Validation
70
+ Test with sample questions and check:
71
+ - Has answer accuracy improved?
72
+ - Is formatting more consistent?
73
+
74
+ #### Results
75
+ With upgraded answer generator:
76
+ - Accuracy increases to ~85-90%
77
+ - More consistent answer formats
78
+
79
+ ### Step 2: Improve Confidence Evaluator
80
+
81
+ #### Current Performance
82
+ The default evaluator may:
83
+ - Over-estimate confidence on some topics
84
+ - Provide limited justification
85
+
86
+ #### Implementation
87
+ 1. Click on Step B
88
+ 2. Enhance the system prompt:
89
+ ```
90
+ You are an expert confidence evaluator for quizbowl answers.
91
+
92
+ Your task:
93
+ 1. Evaluate ONLY the correctness of the provided answer
94
+ 2. Consider question completeness and available clues
95
+ 3. Provide specific justification for your confidence score
96
+ 4. Be especially critical of answers with limited supporting evidence
97
+
98
+ Remember:
99
+ - Early, difficult clues justify lower confidence
100
+ - Later, obvious clues justify higher confidence
101
+ - Domain expertise should be reflected in your assessment
102
+ ```
103
+
104
+ #### Validation
105
+ Test and verify:
106
+ - Are confidence scores better aligned with correctness?
107
+ - Does justification include specific clues from questions?
108
+ - Is confidence calibrated appropriately for question position?
109
+
110
+ #### Results
111
+ With improved evaluator:
112
+ - More accurate confidence calibration
113
+ - Detailed justifications citing specific clues
114
+ - Better buzzing decisions
115
+
116
+ ## Three-Step Pipeline with Analysis
117
+
118
+ ### Concept
119
+ Adding a dedicated analysis step before answer generation:
120
+
121
+ 1. **Step A: Question Analyzer**
122
+ - Identifies key clues, entities, and relationships
123
+ - Determines question category and format
124
+
125
+ 2. **Step B: Answer Generator**
126
+ - Uses analysis to generate accurate answers
127
+ - Focuses on formatting and precision
128
+
129
+ 3. **Step C: Confidence Evaluator**
130
+ - Assesses answer quality based on analysis and clues
131
+ - Determines optimal buzz timing
132
+
133
+ ### Implementation
134
+ Create this pipeline from scratch or modify the two-step example.
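+ If you extend the two-step example, the flow simply gains an analysis step whose output feeds the later steps. A rough sketch, reusing the hypothetical `call_llm` helper from the two-step sketch above (model choices left unspecified):
+
+ ```python
+ # Illustrative three-step flow (hypothetical helper, not the real API).
+ def run_three_step(question_text: str) -> dict:
+     # Step A: Question Analyzer - key clues, entities, category
+     analysis = call_llm(model="...", system_prompt="Analyze the question.",
+                         question=question_text)["analysis"]
+     # Step B: Answer Generator - conditioned on the analysis
+     answer = call_llm(model="...", system_prompt="Answer using the analysis.",
+                       question=question_text, analysis=analysis)["answer"]
+     # Step C: Confidence Evaluator - sees question, analysis, and answer
+     evaluation = call_llm(model="...", system_prompt="Justify, then score confidence.",
+                           question=question_text, analysis=analysis, answer=answer)
+     return {"answer": answer, **evaluation}
+ ```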
135
+
136
+ ### Validation
137
+ Compare to the two-step pipeline:
138
+ - Does the analysis step improve answer accuracy?
139
+ - Does it provide better performance on difficult questions?
140
+ - Are there improvements in early buzzing?
141
+
142
+ ### Results
143
+ Three-step pipelines typically achieve:
144
+ - Accuracy: ~90-95%
145
+ - Earlier correct buzzes
146
+ - Exceptional performance on difficult questions
147
+
148
+ ## Specialty Pipeline: Literature Focus
149
+
150
+ ### Concept
151
+ Create a pipeline specialized for literature questions:
152
+
153
+ 1. **Step A: Literary Analyzer**
154
+ - Identifies literary techniques, periods, and styles
155
+ - Recognizes author-specific clues
156
+
157
+ 2. **Step B: Answer Generator**
158
+ - Specialized for literary works and authors
159
+ - Formats answers according to literary conventions
160
+
161
+ 3. **Step C: Confidence Evaluator**
162
+ - Calibrated specifically for literature questions
163
+
164
+ ### Implementation
165
+ Create specialized system prompts for each step focusing on literary knowledge.
166
+
167
+ ### Validation
168
+ Test specifically on literature questions and compare to general pipeline.
169
+
170
+ ### Results
171
+ Specialty pipelines can achieve:
172
+ - 95%+ accuracy in their specialized domain
173
+ - Earlier buzzing on category-specific questions
174
+ - Better performance on difficult clues
175
+
176
+ ## Best Practices for Advanced Pipelines
177
+
178
+ 1. **Focused Components**: Each step should have a clear, single responsibility
179
+ 2. **Efficient Communication**: Pass only necessary information between steps
180
+ 3. **Strong Fundamentals**: Start with a solid two-step pipeline before adding complexity
181
+ 4. **Consistent Testing**: Validate each change against the same test set
182
+ 5. **Strategic Model Selection**: Use different models for tasks where they excel
docs/goals-and-evaluation.md ADDED
@@ -0,0 +1,60 @@
1
+ # Quizbowl Agent Goals and Evaluation
2
+
3
+ ## Objectives
4
+
5
+ ### Tossup Agents
6
+ - Respond to questions with your best guess and a calibrated confidence score
7
+ - Buzz at the earliest possible moment with sufficient information
8
+ - Avoid incorrect buzzes
9
+ - Maintain consistent performance across topics
10
+
11
+ ### Bonus Agents
12
+ - Answer parts correctly with accurate confidence estimation
13
+ - Provide a clear explanation of your reasoning, which human team members will use to validate or pick the suggested answer
14
+ - Adapt to varying difficulty levels (easy, medium, hard)
15
+
16
+ ## Performance Metrics
17
+
18
+ ### Tossup Metrics
19
+ - **Accuracy**: Percentage of correct answers
20
+ - **Average Buzz Position**: How early in the question you buzz (earlier is better)
21
+ - **Confidence Calibration**: How well confidence scores match actual performance
22
+ - **Score**: Points earned based on buzz position and correctness
23
+
24
+ ### Bonus Metrics
25
+ - **Accuracy**: Percentage of correct answers across all parts
26
+ - **Confidence Calibration**: How well confidence scores match actual performance (one simple check is sketched after this list)
27
+ - **Explanation Quality**: Relevance and clarity of reasoning
28
+
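+ As a concrete illustration of confidence calibration (not the official metric, just one simple check), you can bucket your agent's predictions by confidence and compare each bucket's average confidence with its accuracy. The record format below is hypothetical.
+
+ ```python
+ # Simple calibration check over recorded runs; each record is (confidence, correct).
+ def calibration_report(records: list[tuple[float, bool]], n_bins: int = 5) -> None:
+     bins: list[list[tuple[float, bool]]] = [[] for _ in range(n_bins)]
+     for conf, correct in records:
+         bins[min(int(conf * n_bins), n_bins - 1)].append((conf, correct))
+     for i, bucket in enumerate(bins):
+         if not bucket:
+             continue
+         avg_conf = sum(c for c, _ in bucket) / len(bucket)
+         accuracy = sum(ok for _, ok in bucket) / len(bucket)
+         print(f"conf {i / n_bins:.1f}-{(i + 1) / n_bins:.1f}: "
+               f"avg confidence {avg_conf:.2f}, accuracy {accuracy:.2f}, n={len(bucket)}")
+ ```
+
+ A well-calibrated agent shows average confidence close to accuracy in every bucket.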
29
+ ## Evaluating Your Agent
30
+
31
+ ### Testing Baseline Performance
32
+ 1. Run the default agent configuration
33
+ 2. Record metrics (accuracy, confidence, buzz position)
34
+ 3. Identify specific weaknesses in performance
35
+
36
+ ### Validating Improvements
37
+ After each enhancement:
38
+ 1. Run the agent on the same development set of questions
39
+ 2. Compare metrics to previous version
40
+ 3. Check for improvements in weak areas
41
+
42
+ ### Final Evaluation Criteria
43
+ Your final agent will be evaluated on:
44
+ 1. Overall accuracy across diverse questions
45
+ 2. Optimal buzz timing (neither too early nor too late)
46
+ 3. Confidence threshold calibration
47
+ 4. Explanation quality (for bonus agents)
48
+
49
+ <!-- ## Setting Goals for Your Agent
50
+
51
+ ### Minimum Goals
52
+ - Accuracy above 60%
53
+ - Appropriate confidence threshold (0.7-0.9)
54
+ - Reasonable buzz positions
55
+
56
+ ### Advanced Goals
57
+ - Multi-step pipelines with specialized components
58
+ - Accuracy above 85%
59
+ - Strategic early buzzing on familiar topics
60
+ - Detailed, accurate explanations for bonus questions -->
docs/imgs/bonus-output-panel.png ADDED

Git LFS Details

  • SHA256: 3580165a3e2a660beed6ef44a6a8e3871a17cb7c67a1f0f4dd033c680b2a0106
  • Pointer size: 130 Bytes
  • Size of remote file: 43.1 kB
docs/imgs/import-pipeline.png ADDED

Git LFS Details

  • SHA256: 90b2eb6415c2c5dd2169fd7bfdb9c61f5571e7e7be8fc19418bb0af8a41e17ae
  • Pointer size: 130 Bytes
  • Size of remote file: 68.6 kB
docs/imgs/pipeline-preview.png ADDED

Git LFS Details

  • SHA256: 5fe94a28fe40cd03a08df8c3700174d4e50c2224af2356a23d913f7a1af068de
  • Pointer size: 131 Bytes
  • Size of remote file: 199 kB
docs/imgs/tossup-agent-pipeline.png ADDED

Git LFS Details

  • SHA256: 6c849673fabe17a738ce4bc56f7f0a4a23ff57faca128169d0ce5bfaeb6d9ad9
  • Pointer size: 131 Bytes
  • Size of remote file: 212 kB
docs/tossup-agent-guide.md ADDED
@@ -0,0 +1,143 @@
1
+ # Building an Effective Tossup Agent
2
+
3
+ ## Goals
4
+ By the end of this guide, you will:
5
+ - Create a tossup agent that answers questions accurately
6
+ - Calibrate confidence thresholds for optimal buzzing
7
+ - Test performance on sample questions
8
+ - Submit your agent for evaluation
9
+
10
+ ## Baseline System Performance
11
+
12
+ Let's import the simple tossup agent pipeline `umdclip/simple-tossup-pipeline` and examine the configuration:
13
+
14
+ ![Default Tossup Configuration](./imgs/tossup-agent-pipeline.png)
15
+
16
+ The baseline system achieves:
17
+ - Accuracy: ~30% on sample questions
18
+ - Average Buzz Token Position: 40.40
19
+ - Average Confidence: 0.65
20
+
21
+ We'll improve this through targeted enhancements.
22
+
23
+ ## Enhancement 1: Basic Model Configuration
24
+
25
+ ### Current Performance
26
+ The default configuration uses `gpt-4o-mini` with a temperature of `0.7` and a buzzer confidence threshold of `0.85`.
27
+
28
+ ### Implementing the Enhancement
29
+ 1. Navigate to "Tossup Agents" tab
30
+ 2. Select a stronger model (e.g., gpt-4o)
31
+ 3. Reduce temperature to 0.1 for more consistent outputs
32
+ 4. Test on sample questions
33
+
34
+ ### Validation
35
+ Run the agent on test questions and check:
36
+ - Has accuracy improved?
37
+ - Are confidence scores more consistent?
38
+ - Is your agent buzzing earlier?
39
+
40
+ ### Results
41
+ With better model configuration:
42
+ - Accuracy increases to ~80%
43
+ - Average buzz position increases to 59.60
44
+
45
+ ## Enhancement 2: System Prompt Optimization
46
+
47
+ ### Current Performance
48
+ The default prompt lacks specific instructions for:
49
+ - Answer formatting
50
+ - Confidence calibration
51
+ - Domain-specific knowledge
52
+
53
+ ### Implementing the Enhancement
54
+ 1. Click "System Prompt" tab
55
+ 2. Add specific instructions:
56
+
57
+ ```
58
+ You are a professional quizbowl player answering tossup questions.
59
+
60
+ Your task:
61
+ 1. Analyze clues in the question text
62
+ 2. Determine the most likely answer
63
+ 3. Assess confidence on a scale from 0.0 to 1.0
64
+
65
+ Important guidelines:
66
+ - Give answers in the expected format (person's full name, complete title, etc.)
67
+ - Use 0.8+ confidence ONLY when absolutely certain
68
+ - For literature, include author's full name
69
+ - For science, include complete technical terms
70
+ ```
71
+
72
+ ### Validation
73
+ Test on the same questions and check:
74
+ - Are answers formatted more consistently?
75
+ - Is confidence more accurately reflecting correctness?
76
+ - Check specific categories where you added domain knowledge
77
+
78
+ ### Results
79
+ With optimized prompts:
80
+ - Accuracy increases to ~75%
81
+ - Confidence scores align better with actual performance
82
+ - Answer formats become more consistent
83
+
84
+ ## Enhancement 3: Confidence Calibration
85
+
86
+ ### Current Performance
87
+ Even with better prompts, confidence thresholds may be:
88
+ - Too high (missing answerable questions)
89
+ - Too low (buzzing incorrectly)
90
+
91
+ ### Implementing the Enhancement
92
+ 1. Scroll to "Buzzer Settings"
93
+ 2. Test different thresholds (0.7-0.9)
94
+ 3. Find optimal balance between:
95
+ - Buzzing early enough to score points
96
+ - Waiting for sufficient confidence
97
+
98
+ ![Buzzer Settings](./imgs/buzzer-settings.png)
99
+
100
+ ### Validation
101
+ For each threshold (a small offline sweep is sketched after this list):
102
+ 1. Run tests on multiple questions
103
+ 2. Check percentage of correct buzzes
104
+ 3. Monitor average buzz position
105
+
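+ If you log each run's buzz confidence and outcome, a quick offline sweep can compare thresholds before you commit to one. This is only a sketch over hypothetical records you collect yourself; the values shown are toy data.
+
+ ```python
+ # Sketch: compare candidate buzzer thresholds on recorded runs.
+ # Each record is (confidence, correct, buzz_token_position); toy data below.
+ records = [(0.92, True, 35), (0.71, False, 28), (0.88, True, 52), (0.60, False, 20)]
+
+ for threshold in (0.70, 0.75, 0.80, 0.85, 0.90):
+     buzzes = [(ok, pos) for conf, ok, pos in records if conf >= threshold]
+     if not buzzes:
+         print(f"threshold {threshold:.2f}: never buzzes")
+         continue
+     correct_rate = sum(ok for ok, _ in buzzes) / len(buzzes)
+     avg_pos = sum(pos for _, pos in buzzes) / len(buzzes)
+     print(f"threshold {threshold:.2f}: {len(buzzes)} buzzes, "
+           f"{correct_rate:.0%} correct, avg position {avg_pos:.1f}")
+ ```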
106
+ ### Results
107
+ With a calibrated threshold (e.g., 0.75):
108
+ - Balance between accuracy and early buzzing
109
+ - Fewer incorrect buzzes
110
+ - Earlier correct buzzes
111
+
112
+ ## Enhancement 4: Multi-Step Pipeline
113
+
114
+ ### Current Performance
115
+ Single-step pipelines often struggle with:
116
+ - Accurately separating answer generation from confidence estimation
117
+ - Providing consistent performance across question types
118
+
119
+ ### Implementing the Enhancement
120
+ 1. Click "+ Add Step" to create a two-step pipeline:
121
+ - Step A: Answer Generator
122
+ - Step B: Confidence Evaluator
123
+ 2. Configure each step:
124
+ - Step A focuses only on generating the best answer
125
+ - Step B evaluates confidence based on the answer and question
126
+
127
+ Alternatively, load the multi-step pipeline `umdclip/two-step-justified-confidence`, which implements this same structure.
128
+ For more details on the pipeline, see [Advanced Pipeline Examples](./advanced-pipeline-examples.md).
129
+
130
+ ### Validation
131
+ Test the multi-step pipeline and compare to single-step:
132
+ - Does separation of concerns improve performance?
133
+ - Are confidence scores more accurate?
134
+ - Is there improvement in early buzz positions?
135
+
136
+ ## Final Evaluation and Submission
137
+
138
+ 1. Run comprehensive testing across categories
139
+ 2. Verify metrics match your goals
140
+ 3. Export your pipeline configuration
141
+ 4. Submit your agent for official evaluation
142
+
143
+ For complete UI reference, see [UI Reference](./ui-reference.md).
docs/ui-reference.md ADDED
@@ -0,0 +1,147 @@
1
+ # Quizbowl Agent Web Interface Reference
2
+
3
+ This guide explains all elements of the web interface for creating and testing quizbowl agents.
4
+
5
+ ## Navigation
6
+
7
+ The interface has four main tabs:
8
+ - **Tossup Agents**: Create and test agents for tossup questions
9
+ - **Bonus Round Agents**: Create and test agents for bonus questions
10
+ - **Leaderboard**: View leaderboard of agents
11
+ - **Help**: Access documentation and support resources
12
+
13
+ ## Pipeline Creation Components
14
+
15
+ Let's walk through the components of the Tossup Agent pipeline creation interface.
16
+ ![Tossup Agent Pipeline Creation Interface](./imgs/tossup-agent-pipeline.png)
17
+
18
+ ### Model Step Management
19
+
20
+ A model step is a single LLM call in the pipeline. Your pipeline can have multiple model steps.
21
+ - **+ Add Step**: Adds a new step to your pipeline
22
+ - **Step ID**: Unique identifier for each step (A, B, C, etc.)
23
+ - **Step Name**: Descriptive name for the step
24
+ - Available when the pipeline has more than one model step:
25
+ - **Delete Step** (×): Removes a step from the pipeline
26
+ - **Move Up** (↑): Moves a step up in the pipeline
27
+ - **Move Down** (↓): Moves a step down in the pipeline
28
+
29
+ ### Model Selection
30
+
31
+ - **Model Dropdown**: Select language model provider and model
32
+ - **Temperature Slider**: Adjust randomness of outputs (0.0-1.0)
33
+ - Lower values (0.1-0.3): More consistent, deterministic outputs
34
+ - Higher values (0.7-1.0): More creative, varied outputs
35
+
36
+ ### System Prompt
37
+
38
+ - **System Prompt Tab**: Contains instructions for the model
39
+ - **Text Editor**: Edit instructions directly; click away (unfocus) to apply the changes to the system prompt
40
+
41
+ ### Input/Output Configuration
42
+
43
+ #### Inputs Tab
44
+
45
+ ![Inputs Tab](./imgs/inputs-tab.png)
46
+
47
+ - **Variable Used**: Reference name in pipeline (e.g., question_text)
48
+ - **Input Name**: Name the model sees (e.g., question)
49
+ - **Description**: Explains the input's purpose
50
+ - **+ Button**: Adds a new input variable
51
+ - **× Button**: Removes an input variable
52
+
53
+ #### Outputs Tab
54
+
55
+ ![Outputs Tab](./imgs/outputs-tab.png)
56
+
57
+ - **Output Field**: Name of the output variable (e.g., answer)
58
+ - **Type Dropdown**: Data type (str, float, list, bool)
59
+ - **Description**: Explains what the output represents
60
+ - **Arrow Buttons**: Change output order
61
+ - **+ Button**: Adds a new output
62
+ - **× Button**: Removes an output
63
+
64
+ ### Output Panel
65
+
66
+ ![Buzzer Settings](./imgs/buzzer-settings.png)
67
+
68
+ #### Output Variables
69
+
70
+ Tossup agents are required to collect the following output variables:
71
+ - `answer`: The answer to the input question
72
+ - `confidence`: The confidence score of the answer
73
+
74
+ #### Buzzer Settings (For Tossup Agents)
75
+
76
+ - **Confidence Threshold**: Minimum value of the `confidence` output variable to consider a buzz (0.0-1.0)
77
+ - **Buzz Probability**: Minimum value of the normalized probability of the LLM's output tokens, computed from their `logprobs`: $p(y \mid x) = \exp\left(\sum_{y_i \in y} \text{logprob}(y_i)\right)$. Note that only some models support `logprobs`.
78
+ - **Method Dropdown**:
79
+ - AND: Both conditions must be true to buzz
80
+ - OR: Either condition alone can trigger a buzz (the sketch below shows how these settings combine)
81
+
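+ As an illustration of how these settings interact, the buzz decision reduces to something like the sketch below. The handling of models without `logprobs` is an assumption here; the interface may behave differently.
+
+ ```python
+ import math
+
+ def output_probability(logprobs: list[float]) -> float:
+     """p(y|x) = exp(sum of token logprobs) for the generated output."""
+     return math.exp(sum(logprobs))
+
+ def should_buzz(confidence: float, logprobs: list[float] | None,
+                 conf_threshold: float, prob_threshold: float, method: str) -> bool:
+     conf_ok = confidence >= conf_threshold
+     # Assumption: without logprobs, the probability condition simply cannot be met.
+     prob_ok = logprobs is not None and output_probability(logprobs) >= prob_threshold
+     return (conf_ok and prob_ok) if method == "AND" else (conf_ok or prob_ok)
+ ```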
82
+ ## Testing Components
83
+
84
+ ### Question Selection
85
+
86
+ - **Question ID**: Enter ID to load specific question
87
+ - **Sample Question**: Use provided sample
88
+ - **Run Button**: Process question with current pipeline
89
+
90
+ ### Results Visualization
91
+
92
+ #### Tossup Visualization
93
+
94
+ ![Tossup Results](./imgs/tossup-viz.png)
95
+
96
+ - **Highlighted Question Text**:
97
+ - Highlighted tokens mark the points at which the model is probed with the question text revealed so far
98
+ - Gray, green, or red highlighting indicates whether the model buzzed and, if so, whether the buzz was correct or incorrect
99
+ - Hover for answer/confidence details
100
+
101
+ - **Answer Popup**:
102
+ - Shows final answer
103
+ - Displays confidence score
104
+ - Indicates correctness
105
+
106
+ - **Buzz Confidence Graph**:
107
+ - X-axis: Token position
108
+ - Y-axis: Confidence (0.0-1.0)
109
+ - Blue line: Confidence progression
110
+
111
+ #### Bonus Visualization
112
+
113
+ - **Question Display**: Shows leadin and parts
114
+ - **Results Table**:
115
+ - Part number
116
+ - Correctness indicator
117
+ - Confidence score
118
+ - Prediction
119
+ - Explanation
120
+
121
+ ## Pipeline Management
122
+
123
+ ### Import/Export
124
+
125
+ ![Import Pipeline](./imgs/import-pipeline.png)
126
+ - **Select Pipeline to Import** dropdown: Load existing pipeline configuration
127
+ - **Import Pipeline**: Apply selected pipeline configuration
128
+
129
+ ![Export Pipeline](./imgs/pipeline-preview.png)
130
+ - **Export Pipeline**: Save configuration as YAML
131
+ - **Pipeline Preview**: View and edit pipeline configuration in YAML format
132
+
133
+ ### Evaluation and Submission
134
+
135
+ - **Evaluate**: Run comprehensive assessment
136
+ - **Model Name**: Name for submission
137
+ - **Description**: Details about your agent
138
+ - **Sign in with Hugging Face**: Authentication
139
+ - **Submit**: Submit agent for official evaluation
140
+
141
+ ## Tips for Effective Use
142
+
143
+ - Use the system prompt to give clear instructions
144
+ - Test different confidence thresholds to find optimal settings
145
+ - Monitor buzz positions in the visualization
146
+ - Examine confidence trends to identify problem areas
147
+ - Use multi-step pipelines for complex tasks