Full Finetuning

Full finetuning updates all layers in the pretrained LLaMA model. This regular finetuning procedure is typically considered the baseline for parameter-efficient alternatives such as Low-Rank Adaptation (LoRA) or LLaMA-Adapter.

The finetune_full.py script we currently provide uses 4 A100 GPUs with a fully-sharded data parallel strategy to finetune Lit-LLaMA 7B on the Alpaca dataset. The A100 GPUs have 40 GB each, but finetuning this model may require less memory than that.

Preparation

The steps here only need to be done once:

Follow the instructions in the README to install the dependencies.
Download and convert the weights and save them in the ./checkpoints folder as described here.
Download the data and generate the Alpaca instruction tuning dataset:
```
python scripts/prepare_alpaca.py
```
or prepare your own dataset.

Running the finetuning

```
python finetune_full.py
```

You can speed up training by setting the devices variable in the script to use more GPUs, if available, or by increasing the batch_size. Depending on the available GPU memory, you can also tune the micro_batch_size parameter to use the GPU efficiently.

For example, the following settings will let you finetune the model in 32 hours using a fully-sharded data parallel strategy:

```
devices = 4
batch_size = 128 // devices
micro_batch_size = 4
```
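
To see how these three settings relate, the sketch below works through the arithmetic. It assumes, as is common in scripts of this kind, that each device accumulates gradients over `batch_size // micro_batch_size` micro-batches before taking an optimizer step. The names `gradient_accumulation_iters` and `global_batch_size` are illustrative and may not match the variables used in finetune_full.py.

```python
# Settings from the example above.
devices = 4
batch_size = 128 // devices   # samples per device per optimizer step
micro_batch_size = 4          # samples per device per forward/backward pass

# Assumption: gradients are accumulated over this many micro-batches
# before each optimizer step (illustrative name, not taken from the script).
gradient_accumulation_iters = batch_size // micro_batch_size

# Samples contributing to one optimizer step across all devices.
global_batch_size = batch_size * devices

print(f"per-device batch size:       {batch_size}")
print(f"gradient accumulation iters: {gradient_accumulation_iters}")
print(f"global batch size:           {global_batch_size}")
```

Under this assumption, raising micro_batch_size reduces the number of accumulation passes per step at the cost of more memory per pass, while the global batch size stays the same.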

This script will save checkpoints periodically to the folder out/.

Note: All scripts support argument customization.

Test the model

You can test the finetuned model with your own instructions by running:

```
python generate_full.py \
    --prompt "Recommend a movie to watch on the weekend." \
    --quantize llm.int8
```

Output:

A good movie to watch on the weekend would be The Lion King, since it's a classic family film that everyone can enjoy...

If your GPU supports bfloat16, the script will automatically use it. Together with --quantize llm.int8, this brings the memory consumption down to ~8 GB.
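
As a rough sanity check on that figure, the sketch below estimates the memory needed for the weights alone at different precisions. It is a back-of-the-envelope calculation, not a measurement: it assumes "7B" means 7 billion parameters, and it ignores activations, the key-value cache, and any layers kept in higher precision. The helper `weight_memory_gb` is hypothetical and not part of the repository.

```python
# Rough weight-memory estimate; ignores activations, the KV cache,
# and layers that may be kept in higher precision.
n_params = 7e9  # assumption: "7B" taken as 7 billion parameters

bytes_per_param = {"float32": 4, "bfloat16": 2, "int8": 1}


def weight_memory_gb(n_params: float, dtype: str) -> float:
    """Return the approximate weight storage in GB (1 GB = 1e9 bytes)."""
    return n_params * bytes_per_param[dtype] / 1e9


for dtype in bytes_per_param:
    print(f"{dtype:>8}: ~{weight_memory_gb(n_params, dtype):.0f} GB")
```

With one byte per parameter, the int8 weights come to roughly 7 GB, which is in line with the ~8 GB total once runtime overhead is included.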

Tune on your dataset

With only a few modifications, you can prepare and train on your own instruction dataset.

Create a JSON file in which each row holds one instruction-response pair. A row has an entry for 'instruction', 'input', and 'output', where 'input' is optional and can be the empty string if the instruction doesn't require a context. Below is an example JSON file:
```
[
    {
        "instruction": "Arrange the given numbers in ascending order.",
        "input": "2, 4, 0, 8, 3",
        "output": "0, 2, 3, 4, 8"
    },
    ...
]
```
Make a copy of scripts/prepare_alpaca.py and name it what you want:
```
cp scripts/prepare_alpaca.py scripts/prepare_mydata.py
```
Modify scripts/prepare_mydata.py to read the JSON data file.
Run the script to generate the preprocessed, tokenized train-val split:
```
python scripts/prepare_mydata.py --destination_path data/mydata/
```
Run finetune_full.py by passing in the location of your data (and optionally other parameters):
```
python finetune_full.py --data_dir data/mydata/ --out_dir out/myexperiment
```
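
Before running the preparation script, it can help to check that the JSON file from the first step has the expected shape. The following is a minimal sketch of such a check, not part of the repository: the helper `validate_records`, the file name `mydata.json`, and the second sample record are made up for illustration. It requires 'instruction' and 'output' in every row and fills in an empty 'input' where none is given.

```python
import json
import tempfile
from pathlib import Path

REQUIRED_KEYS = ("instruction", "output")  # 'input' is optional


def validate_records(records):
    """Check each record and fill in an empty 'input' where it is missing."""
    cleaned = []
    for i, rec in enumerate(records):
        missing = [k for k in REQUIRED_KEYS if not rec.get(k)]
        if missing:
            raise ValueError(f"record {i} is missing {missing}")
        cleaned.append({
            "instruction": rec["instruction"],
            "input": rec.get("input", ""),
            "output": rec["output"],
        })
    return cleaned


records = [
    {
        "instruction": "Arrange the given numbers in ascending order.",
        "input": "2, 4, 0, 8, 3",
        "output": "0, 2, 3, 4, 8",
    },
    {
        # Made-up record: no context needed, so 'input' is omitted.
        "instruction": "Name the capital of France.",
        "output": "Paris",
    },
]

cleaned = validate_records(records)

# Hypothetical file name; a temporary directory keeps the sketch side-effect free.
out_path = Path(tempfile.mkdtemp()) / "mydata.json"
out_path.write_text(json.dumps(cleaned, indent=4), encoding="utf-8")

# Read the file back to confirm it round-trips as a JSON list of rows.
loaded = json.loads(out_path.read_text(encoding="utf-8"))
print(f"wrote {len(loaded)} records to {out_path}")
```

Normalizing the optional 'input' field up front means the copied preparation script can treat every row the same way.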

Troubleshooting

If you run into a CUDA error "Expected is_sm80 to be true, but got false", uncomment the line torch.backends.cuda.enable_flash_sdp(False) in the finetuning script (see https://github.com/Lightning-AI/lit-llama/issues/101).