# How we used ShareGPT to create our benchmark dataset
## sg_90k_part1_html_cleaned.json
### Download ShareGPT dataset
```
wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/HTML_cleaned_raw_dataset/sg_90k_part1_html_cleaned.json
```
### Install FastChat
```
pip install fschat
```
### Clean data
```
pip install polyglot pyicu pycld2
python -m fastchat.data.optional_clean --in sg_90k_part1_html_cleaned.json --out sg_90k_part1_html_cleaned_lang.json --keep-lang en
```
### Extract first prompt
```
python extract_first.py --in-file sg_90k_part1_html_cleaned_lang.json --out-file sg_90k_part1_html_cleaned_lang_first.json
```
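`extract_first.py` is a local helper, not part of FastChat. A minimal sketch of what it might look like, assuming the standard ShareGPT schema (a list of records whose "conversations" field is an array of {"from", "value"} turns):
```
# extract_first.py -- hypothetical sketch: keep only the first human turn
# of each conversation (assumes the standard ShareGPT schema).
import argparse
import json


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--in-file", required=True)
    parser.add_argument("--out-file", required=True)
    args = parser.parse_args()

    with open(args.in_file) as f:
        data = json.load(f)

    out = []
    for record in data:
        convs = record.get("conversations", [])
        # Keep the record only if it opens with a human turn.
        if convs and convs[0]["from"] == "human":
            record["conversations"] = convs[:1]
            out.append(record)

    with open(args.out_file, "w") as f:
        json.dump(out, f, indent=2)


if __name__ == "__main__":
    main()
```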
### Sample data
```
python -m fastchat.data.sample --in sg_90k_part1_html_cleaned_lang_first.json --out sg_90k_part1_html_cleaned_lang_first_sampled.json --end 10000 --max-length 10000
```
### Sort data
We sort the requests by sequence length, longest first. This minimizes the amount of padding required and lets out-of-memory errors surface early in a run.
```
python sort.py --data-dir sg_90k_part1_html_cleaned_lang_first_sampled.json --out-file sg_90k_part1_html_cleaned_lang_first_sampled_sorted.json
```
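`sort.py` is likewise a local script. A sketch under the same schema assumption, ordering records by the character length of their conversations, longest first (a tokenizer-based length would track actual sequence length more closely):
```
# sort.py -- hypothetical sketch: order records by prompt length, longest
# first, so padding is minimized and OOM shows up at the start of a run.
import argparse
import json


def prompt_length(record):
    # Character length of all turns; a token count would be more faithful.
    return sum(len(turn["value"]) for turn in record.get("conversations", []))


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--data-dir", required=True)
    parser.add_argument("--out-file", required=True)
    args = parser.parse_args()

    with open(args.data_dir) as f:
        data = json.load(f)

    data.sort(key=prompt_length, reverse=True)

    with open(args.out_file, "w") as f:
        json.dump(data, f, indent=2)


if __name__ == "__main__":
    main()
```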
## ShareGPT_V3_filtered.json
### Download ShareGPT dataset
```
wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json
```
### Install Transformers
```
pip install transformers
```
### Filter out conversations with overly long prompts or responses and conversations not started by "human", extract the first turn, and randomly sample 500 prompts
```
python filter_dataset.py
```
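`filter_dataset.py` is also local to this repo. A sketch of the filtering described above; the token limit, file names, and tokenizer are assumptions (only the sample size of 500 comes from the heading), and the first response is retained alongside the first prompt so the comparison step below has response lengths to plot:
```
# filter_dataset.py -- hypothetical sketch of the filtering described above.
# The token limit, file names, and tokenizer are assumptions.
import json
import random

from transformers import AutoTokenizer

MAX_TOKENS = 1024  # assumed per-turn token limit
NUM_SAMPLES = 500  # fixed by the heading above
IN_FILE = "ShareGPT_V3_unfiltered_cleaned_split.json"
OUT_FILE = "ShareGPT_V3_filtered.json"

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # assumed tokenizer


def num_tokens(text):
    return len(tokenizer(text).input_ids)


with open(IN_FILE) as f:
    data = json.load(f)

filtered = []
for record in data:
    convs = record.get("conversations", [])
    # Drop conversations not started by "human" (or with no response at all).
    if len(convs) < 2 or convs[0]["from"] != "human":
        continue
    prompt, response = convs[0]["value"], convs[1]["value"]
    # Drop conversations whose first prompt or first response is too long.
    if num_tokens(prompt) > MAX_TOKENS or num_tokens(response) > MAX_TOKENS:
        continue
    # Keep only the first exchange (prompt and its response).
    record["conversations"] = convs[:2]
    filtered.append(record)

random.seed(0)  # assumed fixed seed for reproducibility
sampled = random.sample(filtered, min(NUM_SAMPLES, len(filtered)))

with open(OUT_FILE, "w") as f:
    json.dump(sampled, f, indent=2)
```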
### Compare the response length distribution of the sampled dataset with that of the initial dataset
```
pip install matplotlib numpy
python compare_distributions.py
```
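`compare_distributions.py` is local as well. A sketch that overlays the two response-length histograms; the file names and the character-based length measure are assumptions:
```
# compare_distributions.py -- hypothetical sketch: overlay response-length
# histograms for the initial and sampled datasets.
import json

import matplotlib.pyplot as plt
import numpy as np


def response_lengths(path):
    with open(path) as f:
        data = json.load(f)
    # Character length of every "gpt" turn; a token count could be used instead.
    return np.array([
        len(turn["value"])
        for record in data
        for turn in record.get("conversations", [])
        if turn["from"] == "gpt"
    ])


initial = response_lengths("ShareGPT_V3_unfiltered_cleaned_split.json")
sampled = response_lengths("ShareGPT_V3_filtered.json")

# Normalize both histograms so datasets of different sizes are comparable.
plt.hist(initial, bins=50, density=True, alpha=0.5, label="initial")
plt.hist(sampled, bins=50, density=True, alpha=0.5, label="sampled")
plt.xlabel("response length (characters)")
plt.ylabel("density")
plt.legend()
plt.savefig("response_length_distributions.png")
```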