Update README.md

Over time, several base models and fine-tuning approaches were tested. The best …

Other base models were also tested: [Mistral 7B v0.1](https://huggingface.co/mistralai/Mistral-7B-v0.1), [Meta-Llama/Llama-2-7b-chat-hf](https://huggingface.co/Meta-Llama/Llama-2-7b-chat-hf), and the base model of this experiment.

Since Bayesian methods for hyperparameter search are prone to getting stuck in local optima, I performed a semi-grid search over several fine-tuning techniques: [LoRA](https://arxiv.org/abs/2106.09685), [DoRA](https://arxiv.org/abs/2402.09353), [LoRA+](https://arxiv.org/abs/2402.12354), [(LO)ReFT](https://arxiv.org/abs/2404.03592), and [qLoRA](https://arxiv.org/abs/2305.14314).
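
As a purely illustrative sketch (the method names, grid values, and `run_experiment` stub are assumptions for illustration, not this repository's actual search code), such a search can be organized as a loop over a small hyperparameter grid, keeping the configuration with the lowest evaluation cross-entropy:

```python
# Illustrative outline of a (semi-)grid search over fine-tuning configurations.
# Method names, grid values, and run_experiment() are placeholders only.
import random
from itertools import product

methods = ["lora", "lora_plus", "dora", "loreft", "qlora"]
ranks = [4, 8, 16]
dropouts = [0.10, 0.15, 0.20]
learning_rates = [5e-5, 1e-4, 2e-4]

def run_experiment(method: str, rank: int, dropout: float, lr: float) -> float:
    """Stand-in for one fine-tuning run; a real run would train the adapter
    and return the final evaluation cross-entropy."""
    return random.random()  # random placeholder so the sketch runs end to end

results = {
    cfg: run_experiment(*cfg)
    for cfg in product(methods, ranks, dropouts, learning_rates)
}
best_config = min(results, key=results.get)  # configuration with lowest eval CE
```
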

With LoRA, LoRA+, and DoRA, I found that a rank of 8 (with the paper-recommended alpha of twice the rank, i.e. 16) achieved the best performance; my dataset was on the smaller side, and a higher rank would have led to overfitting even with additional regularization through gradient clipping. I also tested LoRA dropout rates between 10% and 20%, but raising the rate started to cause underfitting, so I stuck with 10%.
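
As a rough sketch of what this adapter configuration looks like with the Hugging Face `peft` library (the `target_modules` list is my assumption, not necessarily what this repository targets):

```python
# Minimal LoRA setup mirroring the hyperparameters above: rank 8, alpha 16,
# dropout 0.10. The target_modules list is an assumption.
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3.1-8B-Instruct")

lora_config = LoraConfig(
    r=8,                # rank that performed best on this dataset
    lora_alpha=16,      # alpha = 2 * rank
    lora_dropout=0.10,  # higher dropout (up to 20%) started to underfit
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumed attention projections
    bias="none",
    task_type="CAUSAL_LM",
)

model = get_peft_model(base, lora_config)
model.print_trainable_parameters()
```
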

After applying the [linear scaling rule](https://arxiv.org/pdf/1706.02677), I settled on a batch size of 8 and found that a starting learning rate of 1e-4 yielded the best results. With the AdamW optimizer, there was no significant difference between cosine and linear learning-rate decay.
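
In code, this corresponds roughly to the following AdamW and cosine-decay setup, reusing `model` from the sketch above; the warmup and total step counts are assumptions, not values from the repository:

```python
# AdamW with a 1e-4 starting learning rate and cosine decay, as described above.
# Warmup and total step counts are illustrative assumptions.
import torch
from transformers import get_cosine_schedule_with_warmup

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

num_training_steps = 1_000  # assumed: roughly (dataset_size / batch_size_of_8) * num_epochs
scheduler = get_cosine_schedule_with_warmup(
    optimizer,
    num_warmup_steps=50,  # assumed short warmup
    num_training_steps=num_training_steps,
)

# Each training step would also clip gradients before optimizer.step(), e.g.:
# torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # max_norm assumed
```
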

For ReFT, the nodes in the last 8 layers were unfrozen with attention to allow t…

After 3 to 4 epochs, the model began to overfit regardless of the strategies employed. Increasing both batch size and the number of epochs resulted in higher final training and evaluation cross-entropy.

Following an extensive grid search, combined with a form of Bayesian optimization to narrow the search space, supervised fine-tuning of [Llama 3.1-8B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3.1-8B-Instruct) with LoRA+ and the parameters described below yielded the lowest training and evaluation cross-entropy.

For LoRA+, I set the learning-rate ratio between the B and A matrices to 8. The A matrices were initialized with the He method, while the B matrices started at zero; Gaussian initializations were also tried but led to suboptimal results. Since this required a custom optimizer, I am sharing that [code](https://github.com/nenad1002/QuantumScienceBotModel-LLM/blob/main/lora_plus_optimizer.py). The rest of the code, including the pre-training, the CustomSFTTrainer, and the scoring scripts, currently lives in a private repo and will be made public once it is ready.
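
The linked optimizer builds on the LoRA+ idea of giving the B matrices a larger learning rate than the A matrices. Below is a minimal sketch of that parameter grouping only, reusing `model` from the earlier sketch; it is not the actual lora_plus_optimizer.py:

```python
# LoRA+-style parameter groups: the B matrices get an 8x larger learning rate
# than the A matrices. Sketch only; not the repository's lora_plus_optimizer.py.
import torch

base_lr = 1e-4
lr_ratio = 8  # learning-rate ratio between the B and A matrices

a_params, b_params, other_params = [], [], []
for name, param in model.named_parameters():
    if not param.requires_grad:
        continue
    if "lora_A" in name:
        torch.nn.init.kaiming_uniform_(param)  # He initialization for A
        a_params.append(param)
    elif "lora_B" in name:
        torch.nn.init.zeros_(param)            # B starts at zero
        b_params.append(param)
    else:
        other_params.append(param)

optimizer = torch.optim.AdamW([
    {"params": a_params, "lr": base_lr},
    {"params": b_params, "lr": base_lr * lr_ratio},  # B learns faster, per LoRA+
    {"params": other_params, "lr": base_lr},
])
```
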
#### Preprocessing [optional]