# Full Finetuning
Full finetuning updates all layers in the pretrained LLaMA model. This *regular* finetuning procedure is typically considered the baseline for parameter-efficient alternatives such as Low-Rank Adaptation (LoRA) or LLaMA-Adapter.
The current [finetune_full.py](../scripts/finetune_full.py) script finetunes Lit-LLaMA 7B on the [Alpaca](https://github.com/tatsu-lab/stanford_alpaca) dataset using 4 A100 GPUs with a fully-sharded data parallel strategy. Each of these A100 GPUs has 40 GB of memory, although finetuning this model may require less.
## Preparation
The steps here only need to be done once:
1. Follow the instructions in the [README](README.md) to install the dependencies.
2. Download and convert the weights and save them in the `./checkpoints` folder as described [here](download_weights.md).
3. Download the data and generate the Alpaca instruction tuning dataset:
```bash
python scripts/prepare_alpaca.py
```
Alternatively, [prepare your own dataset](#tune-on-your-dataset).
## Running the finetuning
```bash
python finetune_full.py
```
You can speed up training by increasing the `devices` variable in the script to use more GPUs, if available, or by increasing `batch_size`.
Depending on the available GPU memory, you can also tune the `micro_batch_size` parameter to utilize the GPU efficiently.
For example, the following settings will let you finetune the model in 32 hours using a fully-sharded data parallel strategy:
```python
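devices = 4                    # number of GPUs used for fully-sharded data parallel training
batch_size = 128 // devices    # per-device share of a global batch of 128 samples
micro_batch_size = 4           # samples per forward/backward pass; lower it if you run out of GPU memory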
```
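The three settings interact roughly as follows. This is a minimal sketch, assuming that `batch_size` is the per-device batch (as `128 // devices` suggests) and that the script reaches it by accumulating gradients over `batch_size // micro_batch_size` micro-batches; the helper `batch_plan` is hypothetical and not part of `finetune_full.py`.

```python
def batch_plan(global_batch_size: int, devices: int, micro_batch_size: int) -> dict:
    """Derive per-device batch size and gradient-accumulation steps (hypothetical helper)."""
    # Each device processes an equal share of the global batch.
    batch_size = global_batch_size // devices
    # A micro-batch is what fits in GPU memory for one forward/backward pass;
    # gradients are accumulated over several micro-batches before each optimizer step.
    if batch_size % micro_batch_size != 0:
        raise ValueError("batch_size must be divisible by micro_batch_size")
    gradient_accumulation_iters = batch_size // micro_batch_size
    return {
        "batch_size": batch_size,
        "gradient_accumulation_iters": gradient_accumulation_iters,
        "samples_per_optimizer_step": batch_size * devices,
    }


plan = batch_plan(global_batch_size=128, devices=4, micro_batch_size=4)
print(plan)
# {'batch_size': 32, 'gradient_accumulation_iters': 8, 'samples_per_optimizer_step': 128}
```

Under these assumptions, lowering `micro_batch_size` reduces peak memory at the cost of more accumulation steps, while the number of samples per optimizer step stays the same.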
This script will save checkpoints periodically to the folder `out/`.
> **Note**
> All scripts support argument [customization](customize_paths.md).
## Test the model
You can test the finetuned model with your own instructions by running:
```bash
python generate_full.py \
--prompt "Recommend a movie to watch on the weekend." \
--quantize llm.int8
```
Output:
```
A good movie to watch on the weekend would be The Lion King, since it's a classic family film that everyone can enjoy...
```
If your GPU supports `bfloat16`, the script will automatically use it. Together with `--quantize llm.int8`, this brings the memory consumption down to ~8 GB.
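A rough check of the ~8 GB figure: the weights of a 7B-parameter model take about 2 bytes per parameter in `bfloat16` and roughly 1 byte per parameter with int8 quantization. The sketch below counts weights only and ignores activations, the KV cache, layers that may stay unquantized, and CUDA overhead, which make up the remainder; the 7e9 parameter count is a round approximation.

```python
# Approximate storage cost per parameter for common dtypes.
BYTES_PER_PARAM = {"float32": 4, "bfloat16": 2, "int8": 1}


def weight_memory_gb(num_params: float, dtype: str) -> float:
    """Approximate memory for model weights only, in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


for dtype in ("float32", "bfloat16", "int8"):
    print(f"{dtype}: {weight_memory_gb(7e9, dtype):.1f} GB")
```

This gives about 28, 14, and 7 GB respectively, which is why int8 weights plus runtime overhead land in the ~8 GB range while `bfloat16` alone does not.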
## Tune on your dataset
With only a few modifications, you can prepare and train on your own instruction dataset.
1. Create a JSON file in which each entry holds one instruction-response pair.
Each entry has the keys 'instruction', 'input', and 'output', where 'input' is optional and can be
the empty string if the instruction doesn't require a context. Below is an example JSON file:
```json
[
{
"instruction": "Arrange the given numbers in ascending order.",
"input": "2, 4, 0, 8, 3",
"output": "0, 2, 3, 4, 8"
},
...
]
```
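Before running the preparation script, it can help to check that the file has this shape. A minimal sketch using only the standard library; the function `validate_records` and the file name `mydata.json` are hypothetical, and the sketch assumes 'input' may be either absent or an (optionally empty) string.

```python
import json


def validate_records(records) -> None:
    """Raise ValueError if the data does not match the instruction/input/output format."""
    if not isinstance(records, list):
        raise ValueError("top-level JSON value must be a list of records")
    for i, record in enumerate(records):
        if not isinstance(record, dict):
            raise ValueError(f"record {i} must be a JSON object")
        for key in ("instruction", "output"):
            value = record.get(key)
            if not isinstance(value, str) or not value.strip():
                raise ValueError(f"record {i}: '{key}' must be a non-empty string")
        # 'input' is optional; when present it may be the empty string.
        if not isinstance(record.get("input", ""), str):
            raise ValueError(f"record {i}: 'input' must be a string")


records = [
    {"instruction": "Arrange the given numbers in ascending order.",
     "input": "2, 4, 0, 8, 3",
     "output": "0, 2, 3, 4, 8"},
    # No context needed, so 'input' is the empty string.
    {"instruction": "Name a primary color.", "input": "", "output": "Red"},
]
validate_records(records)

# Write the validated records to a hypothetical dataset file.
with open("mydata.json", "w", encoding="utf-8") as f:
    json.dump(records, f, indent=2)
```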
2. Make a copy of `scripts/prepare_alpaca.py` and give it a name of your choice:
```bash
cp scripts/prepare_alpaca.py scripts/prepare_mydata.py
```
3. Modify `scripts/prepare_mydata.py` to read your JSON data file.
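The exact changes in step 3 depend on the script, but the core logic is loading the records, turning each one into a prompt, and holding out a validation split. The sketch below shows that logic with the standard library only. The prompt template is the one published with the Stanford Alpaca dataset; whether `prepare_alpaca.py` uses exactly this text, and how it tokenizes the result, should be checked in the script itself. The names `load_records`, `make_prompt`, and `train_val_split`, the 5% validation fraction, and the seed are hypothetical.

```python
import json
import random

# Prompt templates from the Stanford Alpaca release (assumed to match the prepare script).
PROMPT_WITH_INPUT = (
    "Below is an instruction that describes a task, paired with an input that provides "
    "further context. Write a response that appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Input:\n{input}\n\n### Response:"
)
PROMPT_NO_INPUT = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Response:"
)


def load_records(path: str) -> list:
    """Read the list of instruction records from a JSON file."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def make_prompt(record: dict) -> str:
    """Format one record; use the shorter template when 'input' is missing or empty."""
    if record.get("input"):
        return PROMPT_WITH_INPUT.format(instruction=record["instruction"], input=record["input"])
    return PROMPT_NO_INPUT.format(instruction=record["instruction"])


def train_val_split(records: list, val_fraction: float = 0.05, seed: int = 42):
    """Shuffle a copy with a fixed seed so the split is reproducible, then slice it."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n_val = max(1, int(len(shuffled) * val_fraction))
    return shuffled[n_val:], shuffled[:n_val]


# Usage on in-memory data; with a real file, start from load_records("data/mydata.json").
examples = [{"instruction": f"Say {i}.", "input": "", "output": str(i)} for i in range(20)]
train, val = train_val_split(examples)
print(len(train), len(val))
print(make_prompt(train[0]) + " " + train[0]["output"])
```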
4. Run the script to generate the preprocessed, tokenized train-val split:
```bash
python scripts/prepare_mydata.py --destination_path data/mydata/
```
5. Run `finetune_full.py` by passing in the location of your data (and optionally other parameters):
```bash
python finetune_full.py --data_dir data/mydata/ --out_dir out/myexperiment
```
## Troubleshooting
If you run into the CUDA error "Expected is_sm80 to be true, but got false", uncomment the line
`torch.backends.cuda.enable_flash_sdp(False)` in the finetuning script (see https://github.com/Lightning-AI/lit-llama/issues/101).