Update README.md
README.md
CHANGED
@@ -76,8 +76,7 @@ Various training procedures were explored alongside multiple models, however, al
Over time, several base models and fine-tuning approaches were tested. The best accuracy was achieved with [Llama 3.1 70B Instruct](https://huggingface.co/meta-llama/Meta-Llama-3.1-70B-Instruct) and qLoRA, but the training duration was extensive, and optimizing hyperparameters proved to be highly challenging.
-Other base models were also tested: [Mistral 7B v0.1](https://huggingface.co/mistralai/Mistral-7B-v0.1), [Meta-Llama/Llama-2-7b-chat-hf](https://huggingface.co/Meta-Llama/Llama-2-7b-chat-hf)
-Over time, several base models and fine-tuning approaches were tested. The best accuracy was achieved with Llama 3.1 70B Instruct, and the base model of this experiment.
+Other base models were also tested: [Mistral 7B v0.1](https://huggingface.co/mistralai/Mistral-7B-v0.1), [Meta-Llama/Llama-2-7b-chat-hf](https://huggingface.co/Meta-Llama/Llama-2-7b-chat-hf), and the base model of this experiment.
Since Bayesian methods for parameter search are prone to getting stuck in local maxima, I performed a grid search with several optimization techniques such as [LoRA](https://arxiv.org/abs/2106.09685), [DoRA](https://arxiv.org/abs/2402.09353), [LoRA+](https://arxiv.org/abs/2402.12354), [(LO)ReFT](https://arxiv.org/abs/2404.03592), and [qLoRA](https://arxiv.org/abs/2305.14314).
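
As a rough illustration of what such a grid could look like, here is a minimal sketch assuming the Hugging Face `peft` API; the value ranges are illustrative and not the exact grid used in this project:

```python
from itertools import product

from peft import LoraConfig

# Illustrative grid over the adapter hyperparameters discussed below; the real
# search also compared LoRA+, (LO)ReFT, and qLoRA variants.
ranks = [4, 8, 16]
dropouts = [0.10, 0.15, 0.20]
dora_flags = [False, True]

grid = []
for r, p_drop, dora in product(ranks, dropouts, dora_flags):
    grid.append(
        LoraConfig(
            r=r,
            lora_alpha=2 * r,   # paper-recommended alpha = 2 * rank
            lora_dropout=p_drop,
            use_dora=dora,      # DoRA is a flag on top of plain LoRA in peft
            task_type="CAUSAL_LM",
        )
    )

# Each config would then be handed to the training loop and scored on a
# held-out validation set; the best-scoring combination is kept.
print(f"{len(grid)} configurations to evaluate")
```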
With LoRA, LoRA+, and DoRA, I found that a rank of 8 (with the paper-recommended alpha of twice the rank, i.e. 16) achieved the best performance, particularly since my dataset was on the smaller side and higher ranks would have led to overfitting. Various LoRA dropout rates between 10% and 20% were tested, but increasing the rate started to lead to underfitting, so I stuck with 10%.
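
Those numbers translate to roughly the following adapter configuration (a sketch using `peft`; the `target_modules` list is a common choice for Llama-style models and is an assumption, not taken from this repository):

```python
from peft import LoraConfig

# Rank 8, alpha = 2 * rank = 16, 10% adapter dropout, as described above.
lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    lora_dropout=0.1,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumed, common for Llama-style models
    bias="none",
    task_type="CAUSAL_LM",
)
```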
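
For the best-performing setup mentioned above (Llama 3.1 70B Instruct fine-tuned with qLoRA), loading the base model in 4-bit could look roughly like this, reusing the `lora_config` sketched above. This is a sketch based on the usual transformers + bitsandbytes + peft stack, not the project's actual training script:

```python
import torch
from peft import get_peft_model, prepare_model_for_kbit_training
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# qLoRA = LoRA adapters on top of a 4-bit (NF4) quantized base model, which is
# what makes fine-tuning a 70B model feasible on limited GPU memory.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

base = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3.1-70B-Instruct",
    quantization_config=bnb_config,
    device_map="auto",
)
base = prepare_model_for_kbit_training(base)

# Attach the LoRA adapter defined above; only the adapter weights are trained.
model = get_peft_model(base, lora_config)
model.print_trainable_parameters()
```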