|
# Pre-train LLaMA on RedPajama |
|
|
|
This howto will walk you through setting up the RedPajama dataset and launching the pre-training script. |
|
|
|
## What's RedPajama |
|
|
|
[RedPajama](https://github.com/togethercomputer/RedPajama-Data) is an open-source reproduction of the original LLaMA training dataset. |
|
|
|
It contains a total of 1.2 trillion tokens, divided as follows:
|
|
|
```text |
|
Commoncrawl 878B |
|
C4 175B |
|
GitHub 59B |
|
Books 26B |
|
ArXiv 28B |
|
Wikipedia 24B |
|
StackExchange 20B |
|
``` |
|
|
|
The [RedPajama repo](https://github.com/togethercomputer/RedPajama-Data) contains the source code for collecting and preparing |
|
the dataset, and it is Apache 2.0 licensed. |
|
|
|
The data itself is licensed according to the original licenses under which its individual parts were released.
|
The GitHub datasets are limited to MIT, BSD, or Apache 2.0 repositories. |
|
|
|
Along with the full [RedPajama-1T dataset](https://huggingface.co/datasets/togethercomputer/RedPajama-Data-1T), the 1B-token [RedPajama-1T-Sample](https://huggingface.co/datasets/togethercomputer/RedPajama-Data-1T-Sample) dataset is also available for development.
|
|
|
You can download the data using git lfs: |
|
|
|
```bash |
|
# Make sure you have git-lfs installed (https://git-lfs.com): git lfs install |
|
git clone https://huggingface.co/datasets/togethercomputer/RedPajama-Data-1T data/RedPajama-Data-1T |
|
``` |
|
|
|
```bash |
|
# Make sure you have git-lfs installed (https://git-lfs.com): git lfs install |
|
git clone https://huggingface.co/datasets/togethercomputer/RedPajama-Data-1T-Sample data/RedPajama-Data-1T-Sample |
|
``` |
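
If you prefer a pure-Python route, the same download can be done with the `huggingface_hub` client. This is a minimal sketch, assuming a recent `huggingface_hub` is installed (`pip install huggingface_hub`); the git-lfs clone above is equivalent:

```python
# Sketch: download the sample dataset via the Hugging Face Hub client
# (assumes `pip install huggingface_hub`; equivalent to the git-lfs clone above).
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="togethercomputer/RedPajama-Data-1T-Sample",
    repo_type="dataset",
    local_dir="data/RedPajama-Data-1T-Sample",
)
```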
|
|
|
## Prepare RedPajama for training |
|
|
|
The dataset consists of 2084 `jsonl` files (the sample dataset contains 11). In order to start pre-training lit-llama |
|
on it, you need to read, tokenize, and write the data in binary chunks. This will leverage the `PackedDataset` |
|
streaming dataset that comes with lit-llama. |
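
Conceptually, the preparation step reads each `jsonl` file, tokenizes the `text` field of every record with the SentencePiece tokenizer, and writes the resulting token ids into binary chunks on disk. The sketch below only illustrates that idea; the actual `scripts/prepare_redpajama.py` uses lit-llama's packed-dataset writer and takes care of chunk sizes, headers, and filenames.

```python
# Simplified illustration of the read -> tokenize -> pack idea.
# The real scripts/prepare_redpajama.py uses lit-llama's packed-dataset writer
# and handles chunk sizes, headers, and filenames for you.
import json
from pathlib import Path

import numpy as np
import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file="checkpoints/lit-llama/tokenizer.model")

destination = Path("data/lit-redpajama-sample")
destination.mkdir(parents=True, exist_ok=True)

token_ids = []
for path in sorted(Path("data/RedPajama-Data-1T-Sample").glob("*.jsonl")):
    with open(path) as f:
        for line in f:
            record = json.loads(line)
            # RedPajama records keep the raw document under the "text" key
            token_ids.extend(sp.encode(record["text"]))

# A 32000-token vocabulary fits in 16-bit ids
np.array(token_ids, dtype=np.uint16).tofile(destination / "example.bin")
```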
|
|
|
To prepare the full dataset, run
|
|
|
```bash |
|
python scripts/prepare_redpajama.py --source_path data/RedPajama-Data-1T --tokenizer_path checkpoints/lit-llama/tokenizer.model --destination_path data/lit-redpajama |
|
``` |
|
|
|
or |
|
|
|
```bash |
|
python scripts/prepare_redpajama.py --source_path data/RedPajama-Data-1T-Sample --tokenizer_path checkpoints/lit-llama/tokenizer.model --destination_path data/lit-redpajama-sample --sample True |
|
``` |
|
|
|
for the sample dataset. |
|
|
|
The above assumes you are using the same tokenizer as LLaMA, but any trained [SentencePiece](https://github.com/google/sentencepiece) tokenizer with a vocabulary size of 32000 will work.
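
If you do bring your own tokenizer, a quick sanity check with the `sentencepiece` library (assumed installed) can confirm the vocabulary size before kicking off the long preparation run:

```python
# Sanity-check a SentencePiece tokenizer before preparing the dataset
import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file="checkpoints/lit-llama/tokenizer.model")
assert sp.vocab_size() == 32000, f"expected a 32000-token vocabulary, got {sp.vocab_size()}"
print(sp.encode("RedPajama is an open-source reproduction of the LLaMA dataset."))
```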
|
|
|
The script will take a while to run, so time for :tea: |
|
|
|
## Pre-training |
|
|
|
Running the pre-training script requires at least 4 GPUs with 40 GB+ of memory each (e.g. A100).
|
|
|
```bash |
|
python train_redpajama.py --devices 4 --train_data_dir data/lit-redpajama |
|
``` |
|
|
|
For running on the sample dataset: |
|
|
|
```bash |
|
python train_redpajama.py --devices 4 --train_data_dir data/lit-redpajama-sample |
|
``` |
|
|
|
The script will save checkpoints periodically to the folder `out/`. |
|
|
|
The `train_redpajama.py` script pre-trains the LLaMA 7B model using FSDP, `bfloat16` precision, and gradient accumulation.
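
For orientation, the rough shape of such a setup with Lightning Fabric is sketched below. This is an assumption-laden outline only; the actual `train_redpajama.py` additionally configures the FSDP wrapping policy, the learning-rate schedule, and the `PackedDataset` data loaders.

```python
# Rough sketch of an FSDP + bfloat16 Fabric setup (not the exact script).
from lightning.fabric import Fabric
from lightning.fabric.strategies import FSDPStrategy

fabric = Fabric(devices=4, strategy=FSDPStrategy(), precision="bf16-mixed")
fabric.launch()
# model, optimizer = fabric.setup(model, optimizer)  # shard the model and optimizer across GPUs
```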
|
|
|
You can easily change the size of the model by passing a different string to |
|
|
|
```python |
|
config = LLaMAConfig.from_name("7B") |
|
``` |
|
|
|
in the `main` function. |
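
For example, assuming the other standard configurations (`"13B"`, `"30B"`, `"65B"`) ship with lit-llama, switching to the 13B model is a one-line change; memory requirements grow accordingly:

```python
# Assumed-available larger configuration; memory requirements grow accordingly
config = LLaMAConfig.from_name("13B")
```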
|
|
|
Keep in mind that the original LLaMA training run for the 7B model required roughly 83,000 A100 (80 GB) GPU-hours, so you'll need access to a cluster.
|
|
|
Once you have access to a cluster, you can follow [these instructions](https://lightning.ai/docs/fabric/stable/guide/multi_node/other.html)
to launch the script across machines:
|
|
|
- [SLURM cluster](https://lightning.ai/docs/fabric/stable/guide/multi_node/slurm.html) |
|
- [Barebones cluster](https://lightning.ai/docs/fabric/stable/guide/multi_node/barebones.html) |
|
- [MPI](https://lightning.ai/docs/fabric/stable/guide/multi_node/other.html) |
|
|
|
The script contains several configurations and hyperparameters you can tweak: |
|
|
|
```python |
|
out_dir = "out/training" |
|
save_interval = 1000 |
|
eval_interval = 1000 |
|
eval_iters = 100 |
|
log_interval = 1 |
|
|
|
# Hyperparameters |
|
learning_rate = 6e-4 |
|
batch_size = 125 |
|
micro_batch_size = 5 |
|
max_iters = 600000 # num_epochs * epoch_size // devices |
|
weight_decay = 1e-1 |
|
beta1 = 0.9 |
|
beta2 = 0.95 |
|
grad_clip = 1.0 |
|
decay_lr = True |
|
warmup_iters = 2000 |
|
lr_decay_iters = max_iters |
|
min_lr = 6e-5 |
|
``` |
|
|
|
In particular, `micro_batch_size` should be adjusted to make full use of the available GPU memory.
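
With the defaults above, the script reaches the target batch size per device by accumulating gradients over multiple micro-batches before each optimizer step. Roughly:

```python
batch_size = 125        # target per-device batch size from the config above
micro_batch_size = 5    # what fits in GPU memory per forward/backward pass

# Micro-batches accumulated before each optimizer step
gradient_accumulation_steps = batch_size // micro_batch_size
print(gradient_accumulation_steps)  # 25
```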
|
|
|
Finally, logging is kept minimal in the script. To use a particular logger, refer to <https://lightning.ai/docs/fabric/stable/api/loggers.html> or call a logging client library like `wandb` directly.
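
As an illustration, a minimal `wandb` integration could look like the sketch below. The project name and the logged values are placeholders, and `wandb` is assumed to be installed and configured (`pip install wandb`, `wandb login`).

```python
# Minimal wandb logging sketch (project name and logged values are placeholders).
import wandb

wandb.init(project="lit-llama-redpajama", config={"learning_rate": 6e-4, "batch_size": 125})

# ... inside the training loop, after computing the loss:
# wandb.log({"train_loss": loss.item(), "iter": iter_num})
```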
|
|