Update README.md

Over time, several models and fine-tuning approaches were tested as the base model.

Other base models were also tested: [Mistral 7B v0.1](https://huggingface.co/mistralai/Mistral-7B-v0.1), [Llama-2-7b-chat-hf](https://huggingface.co/meta-llama/Llama-2-7b-chat-hf), and the base model of this experiment.

Since Bayesian methods for hyperparameter search are prone to getting stuck in local optima, I performed a grid search over several parameter-efficient fine-tuning techniques: [LoRA](https://arxiv.org/abs/2106.09685), [DoRA](https://arxiv.org/abs/2402.09353), [LoRA+](https://arxiv.org/abs/2402.12354), [(LO)ReFT](https://arxiv.org/abs/2404.03592), and [qLoRA](https://arxiv.org/abs/2305.14314).
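
As a rough illustration of what such a sweep looks like, here is a minimal sketch; the grid values and the `train_and_eval` helper are hypothetical placeholders, not the exact search space used in this project.

```python
from itertools import product

# Hypothetical grid over fine-tuning method, rank, and dropout; the real
# search space is only described qualitatively in the text above.
methods = ["lora", "lora+", "dora", "loreft", "qlora"]
ranks = [4, 8, 16]
dropouts = [0.10, 0.15, 0.20]

for method, rank, dropout in product(methods, ranks, dropouts):
    config = {"method": method, "rank": rank, "alpha": 2 * rank, "dropout": dropout}
    # loss = train_and_eval(**config)  # placeholder for the project's training loop
    print(config)
```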

With LoRA, LoRA+, and DoRA, I found that a rank of 8 (with the paper-recommended alpha of 16, i.e. double the rank) achieved the best performance, particularly since my dataset was on the smaller side and a higher rank would have led to overfitting. I tested LoRA dropout rates between 10% and 20%, but above 10% the model began to jump over better local minima in every fine-tuning approach, so I stuck with 10%.
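
For reference, a minimal sketch of that adapter configuration using the Hugging Face `peft` library (which this repository may or may not use); the target module names are an assumption for a Llama-style architecture and are not taken from the repository.

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

# Rank 8, alpha 16 (double the rank), dropout 0.1, as described above.
lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    lora_dropout=0.1,
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",   # attention projections
        "gate_proj", "up_proj", "down_proj",      # MLP projections
    ],
    modules_to_save=["embed_tokens", "lm_head"],  # keep embeddings fully trainable
    task_type="CAUSAL_LM",
)

base_model = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3.1-8B-Instruct")
model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()
```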

After applying the [linear scaling rule](https://arxiv.org/pdf/1706.02677), I settled on a batch size of 8 and found that a starting learning rate of 1e-4 yielded the best results. There was no significant difference between cosine and linear learning-rate decay when using the AdamW optimizer.
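
The linear scaling rule simply ties the learning rate proportionally to the batch size; a tiny sketch of the arithmetic (the batch size of 16 below is only an example, the final choice above is batch size 8 at 1e-4).

```python
def scaled_lr(base_lr: float, base_batch_size: int, new_batch_size: int) -> float:
    """Linear scaling rule: scale the learning rate proportionally to the batch size."""
    return base_lr * new_batch_size / base_batch_size

# With 1e-4 at batch size 8, doubling the batch size suggests doubling the LR.
print(scaled_lr(1e-4, 8, 16))  # -> 0.0002
```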

For ReFT, the nodes in the last 8 layers were unfrozen with attention to allow […]

After 3 to 4 epochs, the model began to overfit regardless of the strategies employed. Increasing both the batch size and the number of epochs resulted in higher final training and evaluation cross-entropy.
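
Since overfitting set in after 3 to 4 epochs, one way to cap training at that point is evaluation-based early stopping; the sketch below uses the Hugging Face `Trainer` callback, with a patience value that is an illustrative choice rather than a setting from this repository.

```python
from transformers import EarlyStoppingCallback

# Stop once evaluation loss stops improving instead of training past the
# 3-4 epoch mark. Requires per-epoch evaluation and
# load_best_model_at_end=True in the TrainingArguments.
early_stopping = EarlyStoppingCallback(early_stopping_patience=1)
# trainer = Trainer(..., callbacks=[early_stopping])
```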

Following an extensive grid search, supervised fine-tuning of [Llama 3.1-8B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3.1-8B-Instruct) with LoRA+ and the parameters listed below yielded the lowest training and evaluation cross-entropy.

I chose a size ratio of 8 between the matrices A and B. The weights of matrix A were initialized with the He method, while matrix B was initialized to zero. A Gaussian initialization of the weights was also considered, but it led to suboptimal results.
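
A minimal sketch of that initialization, plus a LoRA+-style optimizer with separate learning rates for the A and B matrices, continuing from the `model` built in the earlier sketch; the parameter-name matching and the B/A learning-rate ratio are assumptions, not values taken from this repository.

```python
import torch

# He (Kaiming) init for the A matrices, zeros for B (assumes the default
# peft naming, where adapter weights contain "lora_A" / "lora_B").
for name, param in model.named_parameters():
    if "lora_A" in name:
        torch.nn.init.kaiming_uniform_(param)
    elif "lora_B" in name:
        torch.nn.init.zeros_(param)

# LoRA+-style optimizer: a larger learning rate for the B matrices than for A.
base_lr = 1e-4
a_params = [p for n, p in model.named_parameters() if "lora_A" in n and p.requires_grad]
b_params = [p for n, p in model.named_parameters() if "lora_B" in n and p.requires_grad]
other = [p for n, p in model.named_parameters()
         if "lora_" not in n and p.requires_grad]  # e.g. the unfrozen embeddings
optimizer = torch.optim.AdamW(
    [
        {"params": a_params, "lr": base_lr},
        {"params": b_params, "lr": base_lr * 16},  # illustrative B/A ratio
        {"params": other, "lr": base_lr},
    ],
    weight_decay=0.01,
)
```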
#### Preprocessing [optional]
[Coming soon]

- LoRA rank: 8
- LoRA alpha: 16
- LoRA dropout: 0.1
- Weight decay: 0.01 (provided satisfying regularization)
- Unfrozen nodes: attention, MLP, and embeddings
- Optimizer: AdamW
- LR: 1e-4
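
Putting the listed values together, a hedged end-to-end sketch of the training setup with the Hugging Face `Trainer` (which this repository may or may not use); `model` and `optimizer` come from the earlier sketches, and the output directory and dataset names are placeholders.

```python
from transformers import Trainer, TrainingArguments

# Hyperparameters from the list above; 3 epochs because overfitting set in
# after 3-4 epochs, cosine decay because cosine vs. linear made no
# significant difference.
training_args = TrainingArguments(
    output_dir="llama-3.1-8b-instruct-loraplus",  # placeholder
    per_device_train_batch_size=8,
    learning_rate=1e-4,
    weight_decay=0.01,
    num_train_epochs=3,
    lr_scheduler_type="cosine",
)

# trainer = Trainer(
#     model=model,                   # peft-wrapped model from the LoRA sketch
#     args=training_args,
#     train_dataset=train_ds,        # placeholder dataset objects
#     eval_dataset=eval_ds,
#     optimizers=(optimizer, None),  # LoRA+ optimizer from the sketch above
# )
# trainer.train()
```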