# Data Scripts
## `preprocess_data.py`
Takes a raw dataset, splits it up, tokenizes it, and saves it as numpy files that can be memmapped and used efficiently by the training code.
```
usage: preprocess_data.py [-h] --input INPUT [--jsonl-keys JSONL_KEYS [JSONL_KEYS ...]] [--num-docs NUM_DOCS]
--tokenizer-type
{HFGPT2Tokenizer,HFTokenizer,GPT2BPETokenizer,CharLevelTokenizer,TiktokenTokenizer,SPMTokenizer}
[--vocab-file VOCAB_FILE] [--merge-file MERGE_FILE] [--append-eod] [--ftfy] --output-prefix
OUTPUT_PREFIX [--dataset-impl {lazy,cached,mmap}] [--workers WORKERS]
[--log-interval LOG_INTERVAL]
options:
-h, --help show this help message and exit
input data:
--input INPUT Path to input jsonl files or lmd archive(s) - if using multiple archives, put them in a comma
separated list
--jsonl-keys JSONL_KEYS [JSONL_KEYS ...]
Space-separated list of keys to extract from the jsonl. Default: text
--num-docs NUM_DOCS Optional: Number of documents in the input data (if known) for an accurate progress bar.
tokenizer:
--tokenizer-type {HFGPT2Tokenizer,HFTokenizer,GPT2BPETokenizer,CharLevelTokenizer,TiktokenTokenizer,SPMTokenizer}
What type of tokenizer to use.
--vocab-file VOCAB_FILE
Path to the vocab file
--merge-file MERGE_FILE
Path to the BPE merge file (if necessary).
--append-eod Append an <eod> token to the end of a document.
--ftfy Use ftfy to clean text
output data:
--output-prefix OUTPUT_PREFIX
Path to binary output file without suffix
--dataset-impl {lazy,cached,mmap}
Dataset implementation to use. Default: mmap
runtime:
--workers WORKERS Number of worker processes to launch
--log-interval LOG_INTERVAL
Interval between progress updates
```
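For example, a typical invocation might look like the following; all input, vocab, and merge paths here are placeholders, and the flags are taken from the usage text above:
```bash
# All paths below are hypothetical placeholders; substitute your own files.
python preprocess_data.py \
    --input ./data/mydataset.jsonl \
    --output-prefix ./data/mydataset \
    --tokenizer-type GPT2BPETokenizer \
    --vocab-file ./data/gpt2-vocab.json \
    --merge-file ./data/gpt2-merges.txt \
    --dataset-impl mmap \
    --append-eod \
    --workers 8
```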
## `preprocess_data_with_mask.py`
Does the same as `preprocess_data.py`, but also creates `label` tensors if the dataset has labels.
N.B. If using this script, you **must** specify your data when training/finetuning with the following configs:
```json
"train_data_paths": ["train_documents"],
"test_data_paths": ["test_documents"],
"valid_data_paths": ["test_documents"],
"label_data_paths": ["label_documents"]
```
the `"data_path"` option will not work with `"label_data_paths"`.
```
usage: preprocess_data_with_mask.py [-h] --input INPUT [--jsonl-keys JSONL_KEYS [JSONL_KEYS ...]]
[--mask-before-token MASK_BEFORE_TOKEN] [--num-docs NUM_DOCS] --tokenizer-type
{HFGPT2Tokenizer,HFTokenizer,GPT2BPETokenizer,CharLevelTokenizer}
[--vocab-file VOCAB_FILE] [--merge-file MERGE_FILE] [--append-eod] [--ftfy]
--output-prefix OUTPUT_PREFIX [--dataset-impl {lazy,cached,mmap}]
[--workers WORKERS] [--log-interval LOG_INTERVAL]
options:
-h, --help show this help message and exit
input data:
--input INPUT Path to input jsonl files or lmd archive(s) - if using multiple archives, put them in a comma
separated list
--jsonl-keys JSONL_KEYS [JSONL_KEYS ...]
Space-separated list of keys to extract from the jsonl. Default: text
--mask-before-token MASK_BEFORE_TOKEN
apply loss masks before certain token(s). If multi-token pattern, separate by commas without
space, e.g. --mask-before-token 0,1,1270 to use the token pattern [0,1,1270].
--num-docs NUM_DOCS Optional: Number of documents in the input data (if known) for an accurate progress bar.
tokenizer:
--tokenizer-type {HFGPT2Tokenizer,HFTokenizer,GPT2BPETokenizer,CharLevelTokenizer}
What type of tokenizer to use.
--vocab-file VOCAB_FILE
Path to the vocab file
--merge-file MERGE_FILE
Path to the BPE merge file (if necessary).
--append-eod Append an <eod> token to the end of a document.
--ftfy Use ftfy to clean text
output data:
--output-prefix OUTPUT_PREFIX
Path to binary output file without suffix
--dataset-impl {lazy,cached,mmap}
Dataset implementation to use. Default: mmap
runtime:
--workers WORKERS Number of worker processes to launch
--log-interval LOG_INTERVAL
Interval between progress updates
```
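For example, to mask the loss on everything before the token pattern `[0,1,1270]` (a sketch with placeholder paths; the token IDs are the example pattern from the `--mask-before-token` help text above):
```bash
# Paths are hypothetical placeholders; the token IDs come from the
# --mask-before-token example in the usage text above.
python preprocess_data_with_mask.py \
    --input ./data/mydataset.jsonl \
    --output-prefix ./data/mydataset \
    --tokenizer-type GPT2BPETokenizer \
    --vocab-file ./data/gpt2-vocab.json \
    --merge-file ./data/gpt2-merges.txt \
    --mask-before-token 0,1,1270 \
    --workers 8
```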
## `preprocess_data_with_chat_template.py`
Similar, but uses Hugging Face's [chat templates](https://huggingface.co/docs/transformers/main/en/chat_templating) to
tokenize the data, supporting multi-turn conversations and more complicated use cases.
N.B. If using this script, you **must** specify your data when training/finetuning with the following configs:
```json
"train_data_paths": ["train_documents"],
"test_data_paths": ["test_documents"],
"valid_data_paths": ["test_documents"],
"label_data_paths": ["label_documents"]
```
the `"data_path"` option will not work with `"label_data_paths"`.
```
usage: preprocess_data_with_chat_template.py [-h] --input INPUT [--jsonl-keys JSONL_KEYS [JSONL_KEYS ...]] [--no-mask]
[--generation-role GENERATION_ROLE] [--only-last] [--num-docs NUM_DOCS]
--tokenizer-path TOKENIZER_PATH [--ftfy] --output-prefix OUTPUT_PREFIX
[--dataset-impl {lazy,cached,mmap}] [--workers WORKERS]
[--log-interval LOG_INTERVAL]
options:
-h, --help show this help message and exit
input data:
--input INPUT Path to input jsonl files or lmd archive(s) - if using multiple archives, put them in a comma separated list
--jsonl-keys JSONL_KEYS [JSONL_KEYS ...]
Space-separated list of keys to extract from the jsonl. Default: text
--no-mask If set, this will not mask any tokens in the input data.
--generation-role GENERATION_ROLE
The role of the model generating the chat, usually 'assistant'. Default: assistant
--only-last If set, this will mask everything except the last turn in the chat.
--num-docs NUM_DOCS Optional: Number of documents in the input data (if known) for an accurate progress bar.
tokenizer:
--tokenizer-path TOKENIZER_PATH
Path to HF Tokenizer.
--ftfy Use ftfy to clean text
output data:
--output-prefix OUTPUT_PREFIX
Path to binary output file without suffix
--dataset-impl {lazy,cached,mmap}
Dataset implementation to use. Default: mmap
runtime:
--workers WORKERS Number of worker processes to launch
--log-interval LOG_INTERVAL
Interval between progress updates
```
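A sketch of a typical invocation; the tokenizer path and the jsonl key (`messages`) are placeholders, so point `--tokenizer-path` at your Hugging Face tokenizer directory and `--jsonl-keys` at whichever field holds your conversations:
```bash
# The paths and the jsonl key below are hypothetical placeholders.
python preprocess_data_with_chat_template.py \
    --input ./data/chats.jsonl \
    --output-prefix ./data/chats \
    --tokenizer-path ./checkpoints/my-hf-tokenizer \
    --jsonl-keys messages \
    --generation-role assistant \
    --workers 8
```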
## `multinode_prepare_data.sh`
Does the same, but distributes the work over multiple nodes.
```
# USAGE:
# This script allows you to prepare your dataset using multiple nodes by chunking the individual files and distributing
# the chunks over the processes.
# It takes a single text file as its input argument; each line of that file contains a valid filepath to a jsonl file.
# Furthermore, environment variables for the rank and the world size need to be set.
# These default to the SLURM and OMPI variables, in that order of priority, but they can also be set manually
# via $RANK and $WORLD_SIZE, which override the cluster-specific variables.
# You can also pass any arguments of the prepare_data.py script to this script and it will simply pass them through.
```
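A hypothetical invocation for one node of a 4-node run, following the notes above; `filelist.txt` is a placeholder file listing one jsonl path per line, and `RANK`/`WORLD_SIZE` only need to be set manually when no SLURM/OMPI variables are available:
```bash
# Hypothetical example: filelist.txt holds one jsonl filepath per line.
# Run once per node, with RANK set to that node's index; RANK and
# WORLD_SIZE override any SLURM/OMPI-provided values. Extra arguments
# are passed through to prepare_data.py.
RANK=0 WORLD_SIZE=4 bash multinode_prepare_data.sh filelist.txt
```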
## `corpora.py`
Contains information about common datasets. Primarily meant for use by the top-level `prepare_data.py` script.