|
## Setup Notes |
|
|
|
For this model, a VM with two NVIDIA T4 GPUs was used.
|
|
|
To utilize both GPUs simultaneously during training, the following command was used to launch it.
|
|
|
WORLD_SIZE=2 CUDA_VISIBLE_DEVICES=0,1 torchrun --nproc_per_node=2 --master_port=1234 finetune.py --base_model 'decapoda-research/llama-7b-hf' --data_path 'b-mc2/sql-create-context' --output_dir './lora-alpaca' --num_epochs 1 --micro_batch_size 16 |
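
Here, torchrun spawns one process per GPU (--nproc_per_node=2) and sets the rank environment variables each process uses to claim its device. A minimal sketch of how a script like finetune.py can detect the distributed launch (illustrative, not a verbatim excerpt from the repo):

```python
import os

# torchrun sets WORLD_SIZE and LOCAL_RANK for each process it spawns.
world_size = int(os.environ.get("WORLD_SIZE", 1))
ddp = world_size != 1

device_map = "auto"
if ddp:
    # Pin each process to its own GPU so both T4s train simultaneously.
    device_map = {"": int(os.environ.get("LOCAL_RANK") or 0)}
```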
|
|
|
Note 1. The micro batch size was increased from the default of 4 to 16. Based on other training runs, it could likely be increased further; this was a first attempt. See the sketch below for how the micro batch size interacts with the fixed effective batch size.
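
The effective batch size stays fixed at 128 (see the params echoed in the log below), so raising the micro batch size only reduces how many gradient-accumulation steps each optimizer update needs. A back-of-the-envelope sketch, assuming the usual alpaca-lora accounting:

```python
batch_size = 128        # effective batch per optimizer step (echoed in the log)
micro_batch_size = 16   # per-forward-pass batch from the command line
world_size = 2          # two T4 GPUs

# Gradients are accumulated until the effective batch size is reached,
# and the accumulation is split across the data-parallel workers.
grad_accum_steps = batch_size // micro_batch_size // world_size
print(grad_accum_steps)  # 4 micro-batches per GPU per optimizer step
```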
|
|
|
Note 2. The output directory was initially lora-alpaca; its contents were moved to a new folder when the git repository was initialized.
|
|
|
|
|
## Log |
|
|
|
(sqltest) chrisdono@deep-learning-duo-t4-3:~/alpaca-lora$ WORLD_SIZE=2 CUDA_VISIBLE_DEVICES=0,1 torchrun --nproc_per_node=2 --master_port=1234 finetune.py --base_model 'decapoda-research/llama-7b-hf' --data_path 'b-mc2/sql-create-context' --output_dir './lora-alpaca' --num_epochs 1 --micro_batch_size 16
|
WARNING:torch.distributed.run: |
|
***************************************** |
|
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
|
***************************************** |
|
|
|
|
|
===================================BUG REPORT=================================== |
|
Welcome to bitsandbytes. For bug reports, please submit your error trace to: https://github.com/TimDettmers/bitsandbytes/issues |
|
================================================================================ |
|
===================================BUG REPORT=================================== |
|
Welcome to bitsandbytes. For bug reports, please submit your error trace to: https://github.com/TimDettmers/bitsandbytes/issues |
|
================================================================================ |
|
/opt/conda/envs/sqltest/lib/python3.10/site-packages/bitsandbytes/cuda_setup/main.py:136: UserWarning: /opt/conda/envs/sqltest did not contain libcudart.so as expected! Searching further paths...
|
warn(msg) |
|
CUDA SETUP: CUDA runtime path found: /usr/local/cuda/lib64/libcudart.so |
|
CUDA SETUP: Highest compute capability among GPUs detected: 7.5 |
|
CUDA SETUP: Detected CUDA version 113 |
|
CUDA SETUP: Loading binary /opt/conda/envs/sqltest/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cuda113.so... |
|
/opt/conda/envs/sqltest/lib/python3.10/site-packages/bitsandbytes/cuda_setup/main.py:136: UserWarning: /opt/conda/envs/sqltest did not contain libcudart.so as expected! Searching further paths...
|
warn(msg) |
|
CUDA SETUP: CUDA runtime path found: /usr/local/cuda/lib64/libcudart.so |
|
CUDA SETUP: Highest compute capability among GPUs detected: 7.5 |
|
CUDA SETUP: Detected CUDA version 113 |
|
CUDA SETUP: Loading binary /opt/conda/envs/sqltest/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cuda113.so... |
|
Training Alpaca-LoRA model with params: |
|
base_model: decapoda-research/llama-7b-hf |
|
data_path: b-mc2/sql-create-context |
|
output_dir: ./lora-alpaca |
|
batch_size: 128 |
|
micro_batch_size: 16 |
|
num_epochs: 1 |
|
learning_rate: 0.0003 |
|
cutoff_len: 256 |
|
val_set_size: 2000 |
|
lora_r: 8 |
|
lora_alpha: 16 |
|
lora_dropout: 0.05 |
|
lora_target_modules: ['q_proj', 'v_proj'] |
|
train_on_inputs: True |
|
add_eos_token: False |
|
group_by_length: False |
|
wandb_project: |
|
wandb_run_name: |
|
wandb_watch: |
|
wandb_log_model: |
|
resume_from_checkpoint: False |
|
prompt template: alpaca |
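
The lora_* values above correspond to a PEFT LoraConfig along these lines (a sketch of the equivalent configuration, not a verbatim excerpt from finetune.py; bias and task_type are assumed to be the usual values for this setup):

```python
from peft import LoraConfig

# Mirrors the lora_* hyperparameters printed in the log above.
config = LoraConfig(
    r=8,
    lora_alpha=16,
    target_modules=["q_proj", "v_proj"],  # attention query/value projections only
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)
```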
|
|
|
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████| 33/33 [01:24<00:00, 2.57s/it]

Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████| 33/33 [01:24<00:00, 2.57s/it]
|
The tokenizer class you load from this checkpoint is not the same type as the class this function is called from. It may result in unexpected tokenization. |
|
The tokenizer class you load from this checkpoint is 'LLaMATokenizer'. |
|
The class this function is called from is 'LlamaTokenizer'. |
|
The tokenizer class you load from this checkpoint is not the same type as the class this function is called from. It may result in unexpected tokenization. |
|
The tokenizer class you load from this checkpoint is 'LLaMATokenizer'. |
|
The class this function is called from is 'LlamaTokenizer'. |
|
Found cached dataset json (/home/chrisdono/.cache/huggingface/datasets/b-mc2___json/b-mc2--sql-create-context-d62c31544f758e00/0.0.0/fe5dd6ea2639a6df622901539cb550cf8797e5a6b2dd7af1cf934bed8e233e6e)
|
0%| | 0/1 [00:00<?, ?it/s] |
|
Found cached dataset json (/home/chrisdono/.cache/huggingface/datasets/b-mc2___json/b-mc2--sql-create-context-d62c31544f758e00/0.0.0/fe5dd6ea2639a6df622901539cb550cf8797e5a6b2dd7af1cf934bed8e233e6e)
|
100%|████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 9.30it/s]

100%|████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 7.83it/s]
|
trainable params: 4194304 || all params: 6742609920 || trainable%: 0.06220594176090199 |
|
trainable params: 4194304 || all params: 6742609920 || trainable%: 0.06220594176090199 |
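
The trainable-parameter count is exactly the size of the LoRA adapters: a rank-8 matrix pair for q_proj and for v_proj in each of the model's 32 layers. A quick sanity check, assuming the standard LLaMA-7B dimensions:

```python
hidden_size = 4096   # LLaMA-7B hidden dimension
num_layers = 32      # LLaMA-7B decoder layers
r = 8                # lora_r

# Each adapted projection adds two low-rank matrices: A (hidden x r) and B (r x hidden).
per_projection = 2 * hidden_size * r       # 65,536
total = per_projection * 2 * num_layers    # q_proj and v_proj in every layer
print(total)                               # 4194304
```

Dividing by the 6,742,609,920 total parameters also reproduces the ~0.062% trainable fraction reported above.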
|
Loading cached split indices for dataset at /home/chrisdono/.cache/huggingface/datasets/b-mc2___json/b-mc2--sql-create-context-d62c31544f758e00/0.0.0/fe5dd6ea2639a6df622901539cb550cf8797e5a6b2dd7af1cf934bed8e233e6e/cache-5a5ac0bd39fc20e0.arrow and /home/chrisdono/.cache/huggingface/datasets/b-mc2___json/b-mc2--sql-create-context-d62c31544f758e00/0.0.0/fe5dd6ea2639a6df622901539cb550cf8797e5a6b2dd7af1cf934bed8e233e6e/cache-782fec259d4b8f6a.arrow
|
Loading cached split indices for dataset at /home/chrisdono/.cache/huggingface/datasets/b-mc2___json/b-mc2--sql-create-context-d62c31544f758e00/0.0.0/fe5dd6ea2639a6df622901539cb550cf8797e5a6b2dd7af1cf934bed8e233e6e/cache-5a5ac0bd39fc20e0.arrow and /home/chrisdono/.cache/huggingface/datasets/b-mc2___json/b-mc2--sql-create-context-d62c31544f758e00/0.0.0/fe5dd6ea2639a6df622901539cb550cf8797e5a6b2dd7af1cf934bed8e233e6e/cache-782fec259d4b8f6a.arrow
|
{'loss': 2.7003, 'learning_rate': 2.9999999999999997e-05, 'epoch': 0.02} |
|
{'loss': 2.566, 'learning_rate': 5.9999999999999995e-05, 'epoch': 0.03} |
|
{'loss': 2.2648, 'learning_rate': 8.999999999999999e-05, 'epoch': 0.05} |
|
{'loss': 1.657, 'learning_rate': 0.00011099999999999999, 'epoch': 0.07} |
|
{'loss': 1.1599, 'learning_rate': 0.00014099999999999998, 'epoch': 0.08} |
|
{'loss': 0.9037, 'learning_rate': 0.00017099999999999998, 'epoch': 0.1} |
|
{'loss': 0.8137, 'learning_rate': 0.000201, 'epoch': 0.12} |
|
{'loss': 0.7827, 'learning_rate': 0.00023099999999999998, 'epoch': 0.13} |
|
{'loss': 0.7554, 'learning_rate': 0.000261, 'epoch': 0.15} |
|
{'loss': 0.7357, 'learning_rate': 0.00029099999999999997, 'epoch': 0.17} |
|
{'loss': 0.6893, 'learning_rate': 0.0002957831325301205, 'epoch': 0.18} |
|
{'loss': 0.6606, 'learning_rate': 0.00028975903614457827, 'epoch': 0.2} |
|
{'loss': 0.6506, 'learning_rate': 0.0002837349397590361, 'epoch': 0.22} |
|
{'loss': 0.6462, 'learning_rate': 0.00027771084337349395, 'epoch': 0.23}
|
{'loss': 0.6315, 'learning_rate': 0.0002716867469879518, 'epoch': 0.25} |
|
{'loss': 0.6337, 'learning_rate': 0.0002656626506024096, 'epoch': 0.27} |
|
{'loss': 0.6223, 'learning_rate': 0.00025963855421686746, 'epoch': 0.28} |
|
{'loss': 0.6136, 'learning_rate': 0.00025361445783132525, 'epoch': 0.3} |
|
{'loss': 0.6198, 'learning_rate': 0.00024759036144578314, 'epoch': 0.32} |
|
{'loss': 0.6084, 'learning_rate': 0.00024156626506024095, 'epoch': 0.33} |
|
{'eval_loss': 0.608456552028656, 'eval_runtime': 123.856, 'eval_samples_per_second': 16.148, 'eval_steps_per_second': 1.009, 'epoch': 0.33} |
|
{'loss': 0.6021, 'learning_rate': 0.00023554216867469876, 'epoch': 0.35} |
|
{'loss': 0.5949, 'learning_rate': 0.0002295180722891566, 'epoch': 0.37} |
|
{'loss': 0.5972, 'learning_rate': 0.00022349397590361444, 'epoch': 0.38} |
|
{'loss': 0.5922, 'learning_rate': 0.00021746987951807228, 'epoch': 0.4} |
|
{'loss': 0.5876, 'learning_rate': 0.0002114457831325301, 'epoch': 0.42} |
|
{'loss': 0.5788, 'learning_rate': 0.00020542168674698793, 'epoch': 0.43} |
|
{'loss': 0.5894, 'learning_rate': 0.0001993975903614458, 'epoch': 0.45} |
|
{'loss': 0.5877, 'learning_rate': 0.0001933734939759036, 'epoch': 0.47} |
|
{'loss': 0.5835, 'learning_rate': 0.00018734939759036142, 'epoch': 0.48} |
|
{'loss': 0.5791, 'learning_rate': 0.00018132530120481925, 'epoch': 0.5} |
|
{'loss': 0.5841, 'learning_rate': 0.00017530120481927712, 'epoch': 0.52} |
|
{'loss': 0.5728, 'learning_rate': 0.00016927710843373493, 'epoch': 0.53} |
|
{'loss': 0.569, 'learning_rate': 0.00016325301204819274, 'epoch': 0.55} |
|
{'loss': 0.5709, 'learning_rate': 0.00015722891566265058, 'epoch': 0.57} |
|
{'loss': 0.5762, 'learning_rate': 0.00015120481927710845, 'epoch': 0.58} |
|
{'loss': 0.5704, 'learning_rate': 0.00014518072289156626, 'epoch': 0.6} |
|
{'loss': 0.5661, 'learning_rate': 0.0001391566265060241, 'epoch': 0.62} |
|
{'loss': 0.5662, 'learning_rate': 0.00013313253012048193, 'epoch': 0.63} |
|
{'loss': 0.5674, 'learning_rate': 0.00012710843373493975, 'epoch': 0.65} |
|
{'loss': 0.5635, 'learning_rate': 0.00012108433734939758, 'epoch': 0.67} |
|
{'eval_loss': 0.568750262260437, 'eval_runtime': 122.9061, 'eval_samples_per_second': 16.273, 'eval_steps_per_second': 1.017, 'epoch': 0.67} |
|
{'loss': 0.5609, 'learning_rate': 0.00011506024096385541, 'epoch': 0.69} |
|
{'loss': 0.5724, 'learning_rate': 0.00010903614457831325, 'epoch': 0.7} |
|
{'loss': 0.5603, 'learning_rate': 0.00010301204819277107, 'epoch': 0.72} |
|
{'loss': 0.5599, 'learning_rate': 9.698795180722891e-05, 'epoch': 0.74} |
|
{'loss': 0.5655, 'learning_rate': 9.096385542168674e-05, 'epoch': 0.75} |
|
{'loss': 0.5578, 'learning_rate': 8.493975903614457e-05, 'epoch': 0.77} |
|
{'loss': 0.5577, 'learning_rate': 7.89156626506024e-05, 'epoch': 0.79} |
|
{'loss': 0.5606, 'learning_rate': 7.289156626506024e-05, 'epoch': 0.8} |
|
{'loss': 0.5496, 'learning_rate': 6.686746987951806e-05, 'epoch': 0.82} |
|
{'loss': 0.5635, 'learning_rate': 6.08433734939759e-05, 'epoch': 0.84} |
|
{'loss': 0.5522, 'learning_rate': 5.481927710843373e-05, 'epoch': 0.85} |
|
{'loss': 0.5572, 'learning_rate': 4.879518072289156e-05, 'epoch': 0.87} |
|
{'loss': 0.5454, 'learning_rate': 4.2771084337349395e-05, 'epoch': 0.89} |
|
{'loss': 0.5485, 'learning_rate': 3.6746987951807227e-05, 'epoch': 0.9} |
|
{'loss': 0.5592, 'learning_rate': 3.072289156626506e-05, 'epoch': 0.92} |
|
{'loss': 0.5499, 'learning_rate': 2.469879518072289e-05, 'epoch': 0.94} |
|
{'loss': 0.55, 'learning_rate': 1.867469879518072e-05, 'epoch': 0.95} |
|
{'loss': 0.5511, 'learning_rate': 1.2650602409638553e-05, 'epoch': 0.97} |
|
{'loss': 0.5531, 'learning_rate': 6.626506024096385e-06, 'epoch': 0.99} |
|
100%|██████████████████████████████████████████████████████████████████████████████| 598/598 [4:45:30<00:00, 27.59s/it]
|
{'train_runtime': 17131.1027, 'train_samples_per_second': 4.47, 'train_steps_per_second': 0.035, 'train_loss': 0.7246327424129116, 'epoch': 1.0} |
|
100%|██████████████████████████████████████████████████████████████████████████████| 598/598 [4:45:30<00:00, 28.65s/it]
|
|
|
|