Spaces: qanta-challenge/quizbowl-submission

Maharshi Gor committed on Apr 10

Commit

7acf14e

1 Parent(s): 02b7dec

Configure Git LFS for PNG files

Files changed (7)

.gitattributes +1 -0
docs/advanced-pipeline-examples.md +73 -0
docs/imgs/buzzer-settings.png +3 -0
docs/imgs/inputs-tab.png +3 -0
docs/imgs/outputs-tab.png +3 -0
docs/imgs/tossup-viz.png +3 -0
docs/walkthrough.md +169 -0

.gitattributes CHANGED

@@ -33,3 +33,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
 *.zip filter=lfs diff=lfs merge=lfs -text
 *.zst filter=lfs diff=lfs merge=lfs -text
 *tfevents* filter=lfs diff=lfs merge=lfs -text
+*.png filter=lfs diff=lfs merge=lfs -text

docs/advanced-pipeline-examples.md ADDED

	@@ -0,0 +1,73 @@

+# Working with Advanced Pipeline Examples
+This guide demonstrates how to load, modify, and run an existing advanced pipeline example, focusing on the two-step justified confidence model for tossup questions.
+## Loading the Two-Step Justified Confidence Example
+1. Navigate to the "Tossup Agents" tab at the top of the interface.
+2. Click the "Select Pipeline to Import..." dropdown and choose "two-step-justified-confidence.yaml".
+3. Click "Import Pipeline" to load the example into the interface.
+## Understanding the Two-Step Pipeline Structure
+The loaded pipeline has two distinct steps:
+1. **Step A: Answer Generator**
+   - Uses OpenAI/gpt-4o-mini
+   - Takes question text as input
+   - Generates an answer candidate
+   - Uses a focused system prompt for answer generation only
+2. **Step B: Confidence Evaluator**
+   - Uses Cohere/command-r-plus
+   - Takes the question text AND the generated answer from Step A
+   - Evaluates confidence and provides justification
+   - Uses a specialized system prompt for confidence evaluation
+This separation of concerns allows each model to focus on a specific task:
+- The first model concentrates solely on generating the most accurate answer
+- The second model evaluates how confident we should be in that answer
+## Modifying the Pipeline for Better Performance
+Here are some ways to enhance the pipeline:
+1. **Upgrade the Answer Generator**:
+   - Click on Step A in the interface
+   - Change the model from gpt-4o-mini to a more powerful model like gpt-4o
+   - Modify the system prompt to include more specific instructions about quizbowl answer formatting
+2. **Improve the Confidence Evaluator**:
+   - Click on Step B
+   - Add specific domain knowledge to the system prompt
+   - For example, add: "Consider question length when evaluating confidence. Shorter, incomplete questions with less information revealed typically result in lower confidence scores."
+   - Change the order of the output variables so that the model produces the justification before the confidence score and therefore conditions its confidence on the justification.
+## Running and Testing Your Modified Pipeline
+1. After making your modifications, scroll down to adjust the buzzer settings:
+   - Consider changing the confidence threshold based on the performance of your enhanced model
+   - You might want to lower it slightly if you've improved the confidence evaluator
+2. Test your modified pipeline:
+   - Select a Question ID or use the provided sample question
+   - Click "Run on Tossup Question"
+   - Observe the answer, confidence score, and justification
+3. Check the "Buzz Confidence" chart to see how confidence evolved during question processing
+## Advantages of Multi-Step Pipelines
+Multi-step pipelines offer several benefits:
+1. **Specialized Models**: Use different models for different tasks (e.g., GPT for general knowledge, Claude for reasoning)
+2. **Focused Prompting**: Each step can have a targeted system prompt optimized for its specific task
+3. **Chain of Thought**: Build sophisticated reasoning by connecting steps in a logical sequence
+4. **Better Confidence Calibration**: Dedicated confidence evaluation typically results in more reliable buzzing
+5. **Transparency**: The justification output helps you understand why the model made certain decisions
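
The data flow of the two-step pipeline described in this guide can be sketched in plain Python. This is only an illustration under stated assumptions: `run_two_step`, `stub_answer_step`, and `stub_confidence_step` are hypothetical names, and the stubs stand in for the hosted model calls that the web interface makes. The point is the ordering: Step B receives both the question text and Step A's answer, and returns the justification before the confidence.

```python
from typing import Callable, Dict, Tuple

# Hypothetical type aliases; the real interface calls hosted LLMs instead.
AnswerStep = Callable[[str], str]
ConfidenceStep = Callable[[str, str], Tuple[str, float]]


def run_two_step(question_text: str,
                 answer_step: AnswerStep,
                 confidence_step: ConfidenceStep) -> Dict[str, object]:
    """Chain Step A (answer) into Step B (justification, then confidence)."""
    answer = answer_step(question_text)
    justification, confidence = confidence_step(question_text, answer)
    # Clamp to the 0.0-1.0 range used for confidence scores.
    confidence = max(0.0, min(1.0, confidence))
    return {
        "answer": answer,
        "justification": justification,
        "confidence": confidence,
    }


def stub_answer_step(question_text: str) -> str:
    # Stand-in for Step A: a keyword lookup instead of a model call.
    if "psychoanalysis" in question_text.lower():
        return "Sigmund Freud"
    return "unknown"


def stub_confidence_step(question_text: str, answer: str) -> Tuple[str, float]:
    # Stand-in for Step B: justification first, confidence second.
    if answer == "unknown":
        return "No clue so far points to a specific answer.", 0.1
    return "The clue names the field this person founded.", 0.9


result = run_two_step(
    "This founder of psychoanalysis wrote The Interpretation of Dreams.",
    stub_answer_step,
    stub_confidence_step,
)
print(result["answer"], result["confidence"])
```

Swapping either stub for a different callable mirrors changing the model or system prompt of one step without touching the other.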

docs/imgs/buzzer-settings.png ADDED

Git LFS Details

SHA256: 370ed9110cd42aee698e060e65fdad8649d373e78fb423a8962718707ece1b17
Pointer size: 131 Bytes
Size of remote file: 103 kB

docs/imgs/inputs-tab.png ADDED

Git LFS Details

SHA256: 6d40a821ea6ce78756240761a61b23145066d83d2878d75aa15a2b40f68b038b
Pointer size: 130 Bytes
Size of remote file: 55.2 kB

docs/imgs/outputs-tab.png ADDED

Git LFS Details

SHA256: 48bc47f2b4defd58cdda91faf478c9ab7dac9d29cc903f3c7c3389ad83ef7a34
Pointer size: 130 Bytes
Size of remote file: 87.7 kB

docs/imgs/tossup-viz.png ADDED

Git LFS Details

SHA256: ff53dc572479564e4c0f26fb73ce664a85ef49e4f0c3eafcd675a30de944ffe3
Pointer size: 131 Bytes
Size of remote file: 574 kB
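
The pointer sizes listed for these images (130 or 131 bytes) are consistent with the Git LFS pointer file layout: a version line, a SHA-256 `oid` line, and a `size` line. The sketch below rebuilds that text to show where the one-byte difference comes from. The helper name `lfs_pointer` is hypothetical, and the exact byte counts are assumptions, since the page only reports rounded sizes.

```python
def lfs_pointer(sha256_hex: str, size_bytes: int) -> str:
    """Build the text of a Git LFS pointer file (spec v1 layout)."""
    return (
        "version https://git-lfs.github.com/spec/v1\n"
        f"oid sha256:{sha256_hex}\n"
        f"size {size_bytes}\n"
    )


# The hashes are copied from the listing; the byte counts are hypothetical
# stand-ins for the rounded sizes shown (103 kB and 55.2 kB).
buzzer_pointer = lfs_pointer(
    "370ed9110cd42aee698e060e65fdad8649d373e78fb423a8962718707ece1b17",
    103_000,
)
inputs_pointer = lfs_pointer(
    "6d40a821ea6ce78756240761a61b23145066d83d2878d75aa15a2b40f68b038b",
    55_200,
)

print(len(buzzer_pointer.encode()))  # 131: the size has six digits
print(len(inputs_pointer.encode()))  # 130: the size has five digits
```

Only the number of digits in the `size` field varies, so files between 100,000 and 999,999 bytes get a 131-byte pointer and files between 10,000 and 99,999 bytes get a 130-byte pointer.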

docs/walkthrough.md ADDED

	@@ -0,0 +1,169 @@

+# Quizbowl Agent Web Interface Walkthrough
+This walkthrough guide will help you create, test, and submit your quizbowl agent for both tossup and bonus questions using our web interface.
+## Overview
+Our web interface allows you to:
+- Create and import pipeline workflows
+- Configure tossup and bonus agents
+- Test agents on sample questions
+- Visualize agent performance
+- Export your pipeline configurations to YAML files
+- Submit your agents to our competition for full evaluation
+## Creating a Tossup Agent
+1. Navigate to the "Tossup Agents" tab at the top of the interface.
+2. **Creating a Pipeline**:
+   - Note the input variable `question_text` and output variables `answer`, `confidence` required for tossup agents.
+   - The default setup includes a single agent step labeled "A: Tossup Agent".
+   - You can add more steps using the "+ Add Step" button for multi-step pipelines.
+3. **Configuring Your Agent**:
+   - Select your preferred model from the dropdown (e.g., `OpenAI/gpt-4o-mini`).
+   - Adjust the temperature slider (higher values produce more varied output; lower values make the output more deterministic).
+   - Click on the "System Prompt" tab to customize your agent's instructions.
+   - Your system prompt is crucial: it tells the LLM how to interpret questions and format answers.
+4. **Managing Input and Output Variables**:
+   ![Inputs Tab](./imgs/inputs-tab.png)
+   - Click the "Inputs" tab to view and modify input variables:
+     - Each input has a "Variable Used" field (how it's referenced in the pipeline)
+     - An "Input Name" field (what the model sees)
+     - A "Description" field (explains the input's purpose)
+     - Use the "+" button to add a new input variable
+     - Use the "×" button to remove an input variable
+   ![Outputs Tab](./imgs/outputs-tab.png)
+   - Click the "Outputs" tab to manage output variables:
+     - Each output has an "Output Field" name
+     - A "Type" dropdown (str, float, etc.) that defines the data type
+     - A "Description" that explains the output's purpose
+     - Use arrow buttons to change output order
+     - Use the "+" button to add a new output
+     - Use the "×" button to remove an output
+   - When adding a new output, consider the following types:
+     - `str`: For text outputs like answers
+     - `float`: For numerical outputs like confidence scores (0.0-1.0)
+     - `bool`: For true/false outputs
+     - `list`: For array outputs like multiple answer candidates
+5. **Buzzer Settings**:
+   ![Buzzer Settings](./imgs/buzzer-settings.png)
+   - Scroll down to the "Buzzer settings" section.
+   - Set your confidence threshold (e.g., 0.85). Your agent buzzes only when its confidence exceeds this value.
+   - Choose a method (AND/OR) if combining multiple conditions.
+   - Adjust probability settings if needed.
+6. **Testing Your Agent**:
+   - Enter a Question ID or use the provided sample question.
+   - Check "Early Stop" if you want to stop processing once the agent buzzes.
+   - Click "Run on Tossup Question" to test your agent.
+   - Review the answer and click "Buzz Confidence" to see confidence metrics.
+7. **Understanding Tossup Results Visualization**:
+   ![Tossup Results Visualization showing a question about Sigmund Freud with a confidence graph](./imgs/tossup-viz.png)
+   After running a tossup question, you'll see the results displayed in several ways:
+   - **Highlighted Question Text**:
+     - Key terms are highlighted throughout the question at the points where the agent was evaluated.
+     - Click any highlighted word to see an answer popup with the confidence at that point.
+     - The buzz point (e.g., "Eckstein's") appears in green if the agent was correct and in red if it was not.
+   - **Answer Popup**:
+     - Displays final answer, confidence score, and correctness indicator
+     - Appears when hovering over the buzz point or over highlighted terms
+   - **Buzz Confidence Graph**:
+     - X-axis: token position; Y-axis: confidence (0.0-1.0)
+     - Blue line shows confidence progression
+     - Amber vertical line marks buzz point
+     - Dashed horizontal line shows confidence threshold
+   This visualization helps evaluate:
+   - Most informative clues
+   - Confidence threshold calibration
+   - Whether the agent should be more aggressive or conservative
+8. **Exporting Your Pipeline**:
+   - Once you're satisfied with your agent, click "Export Pipeline" to save your configuration.
+   - Click the "Pipeline Preview" dropdown to see the YAML configuration.
+   - You can download this configuration for future use or submission.
+## Creating a Bonus Agent
+1. Navigate to the "Bonus Round Agents" tab at the top of the interface.
+2. **Creating a Pipeline**:
+   - Note the input variables `leadin`, `part` and output variables `answer`, `confidence`, `explanation` required for bonus agents.
+   - The default setup includes a single agent step labeled "A: Bonus Agent".
+3. **Configuring Your Agent**:
+   - Select your preferred model from the dropdown.
+   - Adjust the temperature slider.
+   - Customize the system prompt to provide instructions for handling bonus questions.
+   - Include clear formatting expectations for answer, confidence, and explanation.
+4. **Testing Your Agent**:
+   - Enter a bonus question with leadin and part text.
+   - Click the appropriate run button to test your agent.
+   - Review the agent's answer, confidence, and explanation.
+5. **Exporting Your Pipeline**:
+   - Click "Export Pipeline" to save your configuration.
+## Advanced Features
+### Multi-step Pipelines
+1. **Adding Steps**:
+   - Click "+ Add Step" to add additional processing steps.
+   - Each step can use different models or system prompts.
+   - Use steps for different tasks like analysis, answer generation, and confidence calculation.
+2. **Variable Mapping**:
+   - Connect outputs from earlier steps to inputs for later steps.
+   - The final output variables must map to your defined outputs (answer, confidence, etc.).
+### Importing Existing Pipelines
+1. Click the "Select Pipeline to Import..." dropdown.
+2. Choose an existing pipeline to load its configuration.
+3. Click "Import Pipeline" to load it into the interface.
+4. Modify as needed for your specific use case.
+For a detailed example of importing and modifying a sophisticated multi-step pipeline, see our [Advanced Pipeline Examples](./advanced-pipeline-examples.md) guide, which walks through enhancing the two-step justified confidence model.
+## Submitting Your Agent
+1. **Evaluate Your Agent**:
+   - Before submission, click "Evaluate" to run a thorough assessment.
+   - This helps identify potential issues before formal submission.
+2. **Model Submission**:
+   - Fill in the "Model Name" and "Description" fields with appropriate information.
+   - Click "Sign in with Hugging Face" to authenticate.
+   - Click "Submit" to submit your agent for official evaluation.
+## Best Practices
+1. **System Prompts**:
+   - Be specific about output formats in your system prompts.
+   - For tossups, instruct the model to provide confidence scores.
+   - For bonuses, instruct the model to explain its reasoning.
+2. **Confidence Calibration**:
+   - Fine-tune your buzzer threshold based on testing.
+   - Too high: agent might miss answerable questions.
+   - Too low: agent might buzz incorrectly.
+3. **Testing Thoroughly**:
+   - Test your agent on various question types and difficulties.
+   - Check performance on both early and late clues for tossups.
+Good luck with your quizbowl agent submission!
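
The buzzer behavior described under "Buzzer Settings" and "Confidence Calibration" can be sketched as a threshold check over a sequence of confidence scores. This is a minimal illustration, not the interface's actual implementation: `first_buzz` is a hypothetical helper, and the confidence trace is made up. It follows the rule stated in the walkthrough that the agent buzzes only when its confidence exceeds the threshold.

```python
from typing import List, Optional


def first_buzz(confidences: List[float], threshold: float) -> Optional[int]:
    """Return the first position whose confidence exceeds the threshold."""
    for position, confidence in enumerate(confidences):
        if confidence > threshold:
            return position  # With "Early Stop", processing would end here.
    return None  # The agent never buzzes on this question.


# Made-up confidence scores from successive evaluation points in one question.
confidence_trace = [0.20, 0.45, 0.70, 0.90, 0.97]

for threshold in (0.60, 0.85, 0.99):
    print(threshold, first_buzz(confidence_trace, threshold))
```

With this trace, a threshold of 0.60 buzzes at position 2, 0.85 buzzes at position 3, and 0.99 never buzzes, which illustrates the trade-off between buzzing too early and missing answerable questions.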