# Evaluations

This directory contains end-to-end pipelines for AI-enhanced evaluation. We will introduce the evaluation pipeline and the data format in this document.
## Generate Answers

### ChatGPT (gpt-3.5-turbo)

Make sure you have set up the OpenAI API key in your environment. Then run:

```bash
python qa_baseline_gpt35.py --question table/question.jsonl --output table/answer/answer_gpt35.jsonl
```
### Bard

Unfortunately, Bard has not released a public API so far. You may have to enter the answers manually, or you can use a third-party project that interfaces with Bard.
### Vicuna and others

To generate answers with Vicuna or other models, specify the path to the model checkpoint, then run:

```bash
python model_qa.py --model-name /model/path --question-file table/question.jsonl --answer-file table/answer/answer.jsonl
```
## Evaluate Answers Automatically

### Generate Reviews with GPT-4

Note: If you do not currently have access to the GPT-4 API but do have access to the GPT-4 chatbot, you can evaluate the answers manually according to the instructions in the **Data Format** section. `table/review/*.jsonl` are some examples of reviews.

TODO: add instructions
## Visualize Results

You can generate the data for the webpage by running:

```bash
python eval/generate_webpage_data_from_table.py
```

Then you can serve the static website in `webpage` (e.g., with `python3 -m http.server`) to see the results.
## Data Format

If you want to have a deeper understanding of our evaluation pipeline or want to contribute to the evaluation process, you need to learn the data format we used for evaluation.

Our evaluation data are encoded with [JSON Lines](https://jsonlines.org/).
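For reference, here is a minimal sketch of how these JSON Lines files can be read and written in Python. The helper names are illustrative and not part of the pipeline itself:

```python
import json

def read_jsonl(path):
    """Read a JSON Lines file into a list of dicts (one JSON object per line)."""
    with open(path) as f:
        return [json.loads(line) for line in f if line.strip()]

def append_jsonl(path, record):
    """Append a single record to a JSON Lines file."""
    with open(path, "a") as f:
        f.write(json.dumps(record) + "\n")

# Example: load the evaluation questions.
questions = read_jsonl("table/question.jsonl")
```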
### Random ID Generation

We use the `shortuuid` Python library for generating short random UUIDs.

```python
import shortuuid

# Returns a new short random UUID as a string.
shortuuid.uuid()  # -> str
```
### Models

`model.jsonl` contains information about the models we used for generating answers.

Each row contains a record of a model with the following fields:

* `model_id` (str): A unique ID for a model. Models with different IDs are supposed to have different performance. This ID is formatted as `{model_name}:{model_version}`.
* `model_name` (str): The name of a model. This is not unique, because a model could be trained and updated continuously, but it is still considered the same model with different versions.
* `model_version` (str): The version of a model.
* `model_metadata` (Any): Any metadata of a model (descriptions etc). This is optional.
For example:

```json
{
    "model_id": "vicuna-13b:v1",
    "model_name": "vicuna-13b",
    "model_version": "v1",
    "model_metadata": "learning rate 1e-5, 3 epochs, 13b"
}
```
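As a quick illustration of the ID scheme (this snippet is not part of the pipeline), the `model_id` is simply the name and version joined by a colon:

```python
model_name = "vicuna-13b"
model_version = "v1"

# model_id is formatted as "{model_name}:{model_version}".
model_id = f"{model_name}:{model_version}"
assert model_id == "vicuna-13b:v1"
```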
### Prompts

We store prompts in `prompt.jsonl`. Each row contains a record of a prompt with the following fields:

* `prompt_id` (int): A unique integer ID for a prompt. Prompts with different IDs are supposed to have different purposes.
* `system_prompt` (str): The system prompt given to a model. This is the prompt that the model sees first.
* `prompt_template` (str): The prompt body. This is the user prompt that the model sees after the system prompt. It is a Python f-string template, so that we can fill in the inputs later.
* `defaults` (dict): A dictionary of default values for the prompt template. It can be empty.
* `description` (str): A description of the functionality of the prompt.
For example:

```json
{
    "prompt_id": 1,
    "system_prompt": "You are a helpful assistant.",
    "prompt_template": "[Question]\n{question}\n\n[Assistant 1]\n{answer_1}\n\n[End of Assistant 1]\n\n[Assistant 2]\n{answer_2}\n\n[End of Assistant 2]\n\n[System]\n{prompt}\n\n",
    "defaults": {"prompt": "Which assistant is more helpful?"},
    "description": "Compare two assistants' answers to a question."
}
```
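To show how a prompt record is used, here is an illustrative sketch (not the actual pipeline code) that fills the template with a question and two answers via `str.format`, falling back on `defaults` for the rest. The sample question and answers are made up:

```python
prompt_record = {
    "prompt_template": "[Question]\n{question}\n\n[Assistant 1]\n{answer_1}\n\n"
                       "[End of Assistant 1]\n\n[Assistant 2]\n{answer_2}\n\n"
                       "[End of Assistant 2]\n\n[System]\n{prompt}\n\n",
    "defaults": {"prompt": "Which assistant is more helpful?"},
}

inputs = {
    "question": "What are five tips for staying healthy?",
    "answer_1": "Here are five tips...",
    "answer_2": "Five suggestions are...",
}

# Fill the template; unspecified fields (here, "prompt") come from `defaults`.
user_prompt = prompt_record["prompt_template"].format(**prompt_record["defaults"], **inputs)
print(user_prompt)
```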
### Reviewers

`reviewer.jsonl` contains information about the reviewers we used for reviewing answers generated by different models. Each row contains a record of a reviewer with the following fields:

* `reviewer_id` (str): A unique ID for a reviewer. Reviewers with different IDs are supposed to have different reviewing performance.
* `prompt_id` (int): The ID of the prompt given to the reviewer (e.g., an AI assistant). Different prompts could result in different reviewing performance.
* `metadata` (dict): Metadata of a reviewer about its configurations.
* `description` (str): A description of the reviewer.
For example:

```json
{
    "reviewer_id": "gpt-4-0328-default",
    "prompt_id": 1,
    "metadata": {"temperature": 0.2, "max_tokens": 8192},
    "description": "GPT-4 for generic questions."
}
```
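As a hedged sketch of how a reviewer record might be consumed (assuming the legacy `openai<1.0` Python client; the actual pipeline code may differ), the `metadata` fields map naturally onto the sampling parameters of a chat-completion request:

```python
import openai  # assumes the legacy openai<1.0 client

reviewer = {
    "reviewer_id": "gpt-4-0328-default",
    "prompt_id": 1,
    "metadata": {"temperature": 0.2, "max_tokens": 8192},
}

# These would normally come from the prompt record referenced by prompt_id.
system_prompt = "You are a helpful assistant."
user_prompt = "..."  # the filled-in prompt template (see the Prompts section)

response = openai.ChatCompletion.create(
    model="gpt-4",  # assumed model name; adjust to your deployment
    messages=[
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_prompt},
    ],
    **reviewer["metadata"],  # temperature, max_tokens
)
review_text = response["choices"][0]["message"]["content"]
```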
### Questions

`question.jsonl` contains questions we used for evaluation. Each row contains a record of a question with the following fields:

* `question_id` (int): A unique integer for a question. Questions with different IDs are supposed to be different.
* `text` (str): The question text.
* `category` (str): The category of the question. Questions with the same category are supposed to be similar or originate from the same source.
### Answers

`answer/xxx.jsonl` contains answers generated by different models. Each row contains a record of an answer with the following fields:

* `answer_id` (str): A unique UUID for an answer. Answers with different IDs are supposed to be different.
* `question_id` (int): The ID of the question the answer is generated for.
* `model_id` (str): The ID of the model the answer is generated by.
* `text` (str): The answer text.
* `metadata` (dict): Any metadata of the answer.
Example:

```json
{
    "answer_id": "[short uuid]",
    "question_id": 1,
    "model_id": "vicuna-13b:v1",
    "text": "Here are five tips...",
    "metadata": {}
}
```
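For illustration only (not pipeline code), an answer record can be assembled with a fresh short UUID from the Random ID Generation section and appended to an answer file:

```python
import json
import shortuuid

answer = {
    "answer_id": shortuuid.uuid(),
    "question_id": 1,
    "model_id": "vicuna-13b:v1",
    "text": "Here are five tips...",
    "metadata": {},
}

# Append the record as one JSON object per line.
with open("table/answer/answer.jsonl", "a") as f:
    f.write(json.dumps(answer) + "\n")
```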
### Reviews

`review/xxx.jsonl` contains reviews given by reviewers, comparing performance between a pair of models. Each row contains a record of a review with the following fields:

* `review_id` (str): A unique UUID for a review. Reviews with different IDs are supposed to be different.
* `question_id` (int): The ID of the question the review is given for.
* `answer1_id` (str): The ID of the first answer.
* `answer2_id` (str): The ID of the second answer.
* `text` (str): The review text.
* `score` (list): A list of scores given by the reviewer. The first score is for the first answer, and the second score is for the second answer.
* `reviewer_id` (str): The ID of the reviewer.
* `metadata` (dict): Any metadata of the review.

Example:
```json
{
    "review_id": "[short uuid]",
    "question_id": 1,
    "answer1_id": "[answer1_id]",
    "answer2_id": "[answer2_id]",
    "text": "Assistant 2 is better...",
    "score": [7.5, 9.0],
    "reviewer_id": "gpt-4-0328-default",
    "metadata": {}
}
```
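Finally, here is a hypothetical sketch of how the `score` pairs in a review file could be aggregated into a per-model comparison; the file name is illustrative and matches the `table/review/*.jsonl` pattern mentioned above:

```python
import json

def summarize(review_path):
    """Average the two scores across all reviews in a review file.

    score[0] belongs to the model behind answer1, score[1] to the model behind answer2.
    """
    totals, count = [0.0, 0.0], 0
    with open(review_path) as f:
        for line in f:
            review = json.loads(line)
            totals[0] += review["score"][0]
            totals[1] += review["score"][1]
            count += 1
    return totals[0] / count, totals[1] / count

avg1, avg2 = summarize("table/review/review_example.jsonl")
print(f"Assistant 1 average score: {avg1:.2f}, Assistant 2 average score: {avg2:.2f}")
```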