nenad1002 committed on
Commit 5101c59
1 Parent(s): f369db5

Update README.md

Files changed (1)
  1. README.md +2 -1
README.md CHANGED
@@ -79,7 +79,7 @@ Over time, several base models and fine-tuning approaches were tested. The best
  Other base models were also tested: [Mistral 7B v0.1](https://huggingface.co/mistralai/Mistral-7B-v0.1), [Meta-Llama/Llama-2-7b-chat-hf](https://huggingface.co/Meta-Llama/Llama-2-7b-chat-hf), and the base model of this experiment.

  Since Bayesian methods for parameter search are prone to getting stuck in local maxima, I performed a grid search over several optimization techniques such as [LoRA](https://arxiv.org/abs/2106.09685), [DoRA](https://arxiv.org/abs/2402.09353), [LoRA+](https://arxiv.org/abs/2402.12354), [(LO)ReFT](https://arxiv.org/abs/2404.03592), and [qLoRA](https://arxiv.org/abs/2305.14314).
- With LoRA, LoRA+, and DoRA, I found that a rank of 8 (with the paper-recommended double alpha of 16) achieved the best performance, particularly since my dataset was on the smaller side, which would otherwise have led to overfitting even with additional regularization. Various LoRA dropout rates between 10% and 20% were tested, but increasing the rate began to cause underfitting, so I stuck with 10%.
+ With LoRA, LoRA+, and DoRA, I found that a rank of 8 (with the paper-recommended double alpha of 16) achieved the best performance, particularly since my dataset was on the smaller side, which would otherwise have led to overfitting even with additional regularization through gradient clipping. Various LoRA dropout rates between 10% and 20% were tested, but increasing the rate began to cause underfitting, so I stuck with 10%.
  After applying the [linear scaling rule](https://arxiv.org/pdf/1706.02677), I settled on a batch size of 8 and found that a starting learning rate of 10^-4 yielded the best results. There was no significant difference between using cosine or linear decay for the learning rate when employing the AdamW optimizer.

  Regarding the nodes, training only the attention nodes performed very poorly on both training and evaluation data. The results improved slightly with the addition of the MLP projections, but none of the models or fine-tuning approaches achieved an evaluation cross-entropy below 0.5. However, when the embedding layer was included (despite the significant increase in the number of trainable parameters), the model began to generalize well. I assume this is because the new terminology in the data requires the model to adjust its embeddings slightly to capture the new semantics. I did not modify the LM head, as no significant performance improvements were observed.
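
As a rough illustration of the adapter settings described in the hunk above (not code taken from this repository), the rank, alpha, dropout, and target modules could be expressed with a Hugging Face peft `LoraConfig` along these lines. The module names assume a Llama-style architecture, and the base-model identifier is a placeholder.

```python
# Hypothetical sketch, not from this repo: the described adapter settings as a peft LoraConfig.
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("base-model-id")  # placeholder; substitute the experiment's base model

lora_config = LoraConfig(
    r=8,               # rank 8 worked best on the smaller dataset
    lora_alpha=16,     # paper-recommended alpha = 2 * rank
    lora_dropout=0.1,  # 10%; higher rates (up to 20%) started to underfit
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",  # attention projections (Llama-style names)
        "gate_proj", "up_proj", "down_proj",     # MLP projections
    ],
    modules_to_save=["embed_tokens"],  # also train the embedding layer; the LM head stays frozen
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # the embeddings noticeably increase the trainable-parameter count
```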
@@ -104,6 +104,7 @@ I've chosen the size ratio between the matrices A and B of 8. The matrix A weigh
  - LoRA alpha: 16
  - LoRA dropout: 0.1
  - Weight decay: 0.01 -> provided satisfying regularization
+ - Grad clipping: 0.3 -> various values tried, but settled on this one
  - Unfrozen nodes: attention, MLP, and embeddings
  - Optimizer: AdamW
  - LR: 1e-4
 
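The hyperparameters listed above translate fairly directly into a transformers `TrainingArguments` sketch; again, this is an illustrative reconstruction rather than the repository's actual training script, and the output directory and epoch count are placeholders. The linear scaling rule mentioned earlier simply scales the learning rate in proportion to the batch size relative to a reference configuration.

```python
# Hypothetical sketch, not from this repo: the listed hyperparameters as TrainingArguments.
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="out",                # placeholder output path
    per_device_train_batch_size=8,   # batch size picked after applying the linear scaling rule
    learning_rate=1e-4,              # starting LR; cosine vs. linear decay made little difference
    lr_scheduler_type="cosine",
    optim="adamw_torch",             # AdamW optimizer
    weight_decay=0.01,               # 0.01 gave satisfying regularization
    max_grad_norm=0.3,               # gradient-clipping threshold settled on after trying several values
    num_train_epochs=3,              # placeholder; the epoch count is not stated in this diff
)
```

A `Trainer` (or `SFTTrainer`) would then combine these arguments with the peft-wrapped model and the dataset.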