|
# Alpaca Instruction Following Dataset |
|
|
|
## Motivation |
|
### For what purpose was the dataset created? |
|
To enable more open-source research on instruction-following large language models, we generated 52K instruction-following demonstrations using OpenAI's `text-davinci-003` model.
|
|
|
### Who created the dataset?
|
- [Rohan Taori](https://www.rohantaori.com/) |
|
- [Ishaan Gulrajani](https://ishaan.io/) |
|
- [Tianyi Zhang](https://tiiiger.github.io/) |
|
- [Yann Dubois](https://yanndubs.github.io/) |
|
- [Xuechen Li](https://www.lxuechen.com/) |
|
- [Carlos Guestrin](https://guestrin.su.domains/) |
|
- [Percy Liang](https://cs.stanford.edu/~pliang/) |
|
- [Tatsunori B. Hashimoto](https://thashim.github.io/) |
|
|
|
## Composition |
|
|
|
### What do the instances that comprise the dataset represent (e.g., documents, photos, people, countries)? |
|
The instruction-following demonstrations were bootstrapped from the [seed set](https://github.com/yizhongw/self-instruct/blob/main/data/seed_tasks.jsonl) released by the self-instruct project.

Because the data is machine-generated, it is difficult to pinpoint whom or what the instances represent.
|
|
|
### How many instances are there in total?
|
In total, there are 52,002 instances in the dataset. |
|
|
|
### Does the dataset contain all possible instances or is it a sample (not necessarily random) of instances from a larger set? |
|
Not applicable; the dataset is machine-generated rather than sampled from a larger collection.
|
|
|
### What data does each instance consist of? |
|
|
|
- `instruction`: `str`, describes the task the model should perform. Each of the 52K instructions is unique. |
|
- `input`: `str`, optional context or input for the task. For example, when the instruction is "Summarize the following article", the input is the article. Around 40% of the examples have an input. |
|
- `output`: `str`, the answer to the instruction as generated by `text-davinci-003`. |
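
For illustration, here is a minimal sketch of two instances in this schema; the field values below are invented and do not appear in the dataset. In the released file, instances without additional context use an empty `input` string.

```python
# Hypothetical instance with an input (values invented for illustration).
with_input = {
    "instruction": "Summarize the following article in one sentence.",
    "input": "Full text of the article to be summarized ...",
    "output": "A one-sentence summary of the article.",
}

# Hypothetical instance without an input: `input` is the empty string.
without_input = {
    "instruction": "Give three tips for staying healthy.",
    "input": "",
    "output": "1. Eat a balanced diet. 2. Exercise regularly. 3. Sleep well.",
}
```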
|
|
|
### Is any information missing from individual instances? |
|
No.
|
|
|
### Are relationships between individual instances made explicit (e.g., users’ movie ratings, social network links)? |
|
Not applicable.
|
|
|
### Is there a label or target associated with each instance? |
|
Yes. The fine-tuning target is the response generated by `text-davinci-003`, i.e., the `output` field.
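
For illustration, the sketch below shows one way to map an instance to a `(prompt, target)` pair for supervised fine-tuning. The prompt template mirrors the one in the Alpaca repository's training code, but the repository is the authoritative reference for the exact format.

```python
def build_example(instance: dict) -> tuple[str, str]:
    """Map a dataset instance to a (prompt, target) pair for fine-tuning."""
    if instance["input"]:
        prompt = (
            "Below is an instruction that describes a task, paired with an "
            "input that provides further context. Write a response that "
            "appropriately completes the request.\n\n"
            f"### Instruction:\n{instance['instruction']}\n\n"
            f"### Input:\n{instance['input']}\n\n"
            "### Response:"
        )
    else:
        prompt = (
            "Below is an instruction that describes a task. Write a response "
            "that appropriately completes the request.\n\n"
            f"### Instruction:\n{instance['instruction']}\n\n"
            "### Response:"
        )
    return prompt, instance["output"]
```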
|
|
|
### Are there recommended data splits (e.g., training, development/validation, testing)? |
|
The Alpaca models (both the demo model and the models that will be released) are trained on all 52K examples.

There is no recommended data split for the dataset.
|
|
|
### Are there any errors, sources of noise, or redundancies in the dataset? |
|
All 52K instructions are unique. However, some generated instructions may not be sensible, i.e., there may be no good response to them.
|
|
|
### Is the dataset self-contained, or does it link to or otherwise rely on external resources (e.g., websites, tweets, other datasets)? |
|
The dataset is self-contained.
|
|
|
### Does the dataset contain data that might be considered confidential (e.g., data that is protected by legal privilege or by doctor-patient confidentiality, data that includes the content of individuals’ non-public communications)? |
|
No.
|
|
|
### Does the dataset contain data that, if viewed directly, might be offensive, insulting, threatening, or might otherwise cause anxiety? |
|
The generated data may contain a few inappropriate responses. In our preliminary testing, we did not encounter any offensive responses.
|
|
|
## Collection process |
|
The [GitHub repository](https://github.com/tatsu-lab/stanford_alpaca) contains the code used to generate the dataset: instructions and outputs are produced by prompting `text-davinci-003` with examples bootstrapped from the self-instruct seed set, as described above. A simplified sketch of this loop follows.
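
The sketch below illustrates the general shape of that self-instruct-style loop, assuming the legacy `openai` Python client (pre-1.0). The prompt wording, decoding parameters, and the file name `seed_tasks.jsonl` are simplified placeholders, not the pipeline's actual values; consult the repository for the real implementation.

```python
import json
import random

import openai  # legacy client (openai<1.0); the 1.x client differs

openai.api_key = "sk-..."  # placeholder; supply your own key

# Seed tasks from the self-instruct project bootstrap the generation.
with open("seed_tasks.jsonl") as f:
    seed_tasks = [json.loads(line) for line in f]


def generate_demonstrations(num_seed_demos: int = 3) -> str:
    """Ask text-davinci-003 for new tasks, conditioned on seed examples.

    The prompt below is a simplified placeholder, not the exact prompt
    used by the Alpaca pipeline.
    """
    demos = random.sample(seed_tasks, num_seed_demos)
    prompt = "Come up with new task instructions and their answers.\n\n"
    for demo in demos:
        prompt += f"Instruction: {demo['instruction']}\n\n"
    response = openai.Completion.create(
        model="text-davinci-003",
        prompt=prompt,
        max_tokens=1024,
        temperature=1.0,
    )
    # Raw model text; the real pipeline parses, filters, and deduplicates it.
    return response.choices[0].text
```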
|
|
|
## Uses |
|
|
|
### Has the dataset been used for any tasks already? |
|
The dataset has been used to train the Alpaca models, both the model behind the demo and the models to be released.
|
|
|
### Is there a repository that links to any or all papers or systems that use the dataset? |
|
Please see the [GitHub repository](https://github.com/tatsu-lab/stanford_alpaca).
|
|
|
### Is there anything about the composition of the dataset or the way it was collected and preprocessed/cleaned/labeled that might impact future uses? |
|
This dataset was generated using OpenAI's API. Under OpenAI's terms of use, it therefore cannot be used for commercial purposes that compete with OpenAI.
|
|
|
### Are there tasks for which the dataset should not be used? |
|
The dataset should not be used for commercial purposes that compete with OpenAI.
|
|
|
## Distribution |
|
### Will the dataset be distributed to third parties outside of the entity (e.g., company, institution, organization) on behalf of which the dataset was created? |
|
Yes; the dataset can be freely downloaded.
|
|
|
### How will the dataset be distributed (e.g., tarball on website, API, GitHub)?
|
The dataset can be downloaded from the [GitHub repository](https://github.com/tatsu-lab/stanford_alpaca) as a JSON file.
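
A minimal sketch of fetching and loading the file, assuming it is named `alpaca_data.json` on the repository's `main` branch (check the repository if the path differs):

```python
import json
import urllib.request

# Assumed file name and branch; verify against the repository.
URL = (
    "https://raw.githubusercontent.com/tatsu-lab/stanford_alpaca/"
    "main/alpaca_data.json"
)

with urllib.request.urlopen(URL) as resp:
    data = json.load(resp)

print(len(data))        # expected: 52002
print(sorted(data[0]))  # expected: ['input', 'instruction', 'output']
```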
|
|
|
### Will the dataset be distributed under a copyright or other intellectual property (IP) license, and/or under applicable terms of use (ToU)? |
|
This dataset is distributed under [the ODC-By license](https://opendatacommons.org/licenses/by/1-0/). |
|
|
|
### Have any third parties imposed IP-based or other restrictions on the data associated with the instances? |
|
No.
|
|
|
### Do any export controls or other regulatory restrictions apply to the dataset or to individual instances? |
|
No.
|
|
|
## Maintenance |
|
|
|
### Who is supporting/hosting/maintaining the dataset? |
|
The dataset is hosted on GitHub, and the repository is maintained by Rohan Taori, Ishaan Gulrajani, Tianyi Zhang, Yann Dubois, and Xuechen Li.
|
|
|
### How can the owner/curator/manager of the dataset be contacted (e.g., email address)? |
|
Please open an issue in the [GitHub repository](https://github.com/tatsu-lab/stanford_alpaca).
|
|
|
### Will the dataset be updated (e.g., to correct labeling errors, add new instances, delete instances)? |
|
We do not currently plan to update the dataset.