Update README.md
### Out-of-Scope Use

Although this model should be able to generalize well, quantum-science terminology and context are very complex, so the model might struggle with simplification; hence, it should not be used in that context.

## Bias, Risks, and Limitations
### Training Data

The model was initially trained on a bit less than 3k entries; the dataset was later expanded to 5k high-quality question-and-answer pairs to make the most of supervised fine-tuning. The evaluation set consisted of roughly 200 entries in the final training round.

The dataset was generated by crawling the https://quantum-journal.org/ site and passing the data to OpenAI's gpt-4-turbo model with various prompts to ensure high-quality data generation.
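
As a rough illustration of that pipeline (not the exact script used), the sketch below fetches one article page and asks gpt-4-turbo to turn it into a question-answer pair; the prompt wording, JSON schema, and example URL are assumptions.

```python
# Hypothetical sketch of the data-generation step: crawl one quantum-journal.org
# article and have gpt-4-turbo produce a Q&A pair from it.
import json

import requests
from bs4 import BeautifulSoup
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment


def article_text(url: str) -> str:
    """Fetch an article page and strip it down to plain text."""
    html = requests.get(url, timeout=30).text
    return BeautifulSoup(html, "html.parser").get_text(" ", strip=True)


def make_qa_pair(text: str) -> dict:
    """Ask gpt-4-turbo for one question-answer pair grounded in the article."""
    response = client.chat.completions.create(
        model="gpt-4-turbo",
        response_format={"type": "json_object"},
        messages=[
            {
                "role": "system",
                "content": (
                    "You write graduate-level Q&A pairs about quantum science. "
                    "Use only the provided article text. "
                    'Return JSON of the form {"question": "...", "answer": "..."}.'
                ),
            },
            {"role": "user", "content": text[:12000]},  # stay within the context window
        ],
    )
    return json.loads(response.choices[0].message.content)


if __name__ == "__main__":
    url = "https://quantum-journal.org/papers/"  # placeholder; iterate over crawled article URLs
    print(make_qa_pair(article_text(url)))
```
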
Various training procedures were explored alongside multiple models.

Over time, several base models and fine-tuning approaches were tested. The best performance was achieved with [Llama 3.1 70B Instruct](https://huggingface.co/meta-llama/Meta-Llama-3.1-70B-Instruct) and qLoRA, but training took a long time and optimizing the hyperparameters proved highly challenging.

Other base models were also tested: [Mistral 7B v0.1](https://huggingface.co/mistralai/Mistral-7B-v0.1), [Llama 2 7B Chat](https://huggingface.co/meta-llama/Llama-2-7b-chat-hf), and the base model of this experiment.
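
As a reference point, a minimal qLoRA-style setup for that base model looks roughly like the sketch below (4-bit NF4 quantization via bitsandbytes, then preparing the model for adapter training with peft); the exact loading arguments used for this run are not documented, so treat these values as typical assumptions.

```python
# Rough qLoRA setup sketch: load the 70B base model in 4-bit NF4 and prepare it
# for adapter training. Values are common defaults, not the exact config used here.
import torch
from peft import prepare_model_for_kbit_training
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "meta-llama/Meta-Llama-3.1-70B-Instruct"

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # 4-bit weights, the core of qLoRA
    bnb_4bit_quant_type="nf4",              # NormalFloat4 as in the qLoRA paper
    bnb_4bit_use_double_quant=True,         # quantize the quantization constants too
    bnb_4bit_compute_dtype=torch.bfloat16,  # run the matmuls in bf16
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
)
model = prepare_model_for_kbit_training(model)  # gradient checkpointing, fp32 layer norms, etc.
```
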
I performed a grid search over several optimization techniques such as [LoRA](https://arxiv.org/abs/2106.09685), [DoRA](https://arxiv.org/abs/2402.09353), [LoRA+](https://arxiv.org/abs/2402.12354), [(LO)ReFT](https://arxiv.org/abs/2404.03592), and [qLoRA](https://arxiv.org/abs/2305.14314).
With LoRA, LoRA+, and DoRA, I found that a rank of 8 (with the paper-recommended doubled alpha of 16) achieved the best performance, particularly since my dataset was on the smaller side, which otherwise would have led to overfitting. Various LoRA dropout rates between 10% and 20% were also tested, but in all fine-tuning approaches the model began to jump over better local minima, so I stuck with 10%.
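
In peft terms, the best-performing configuration corresponds roughly to the sketch below; the target modules are an assumption, since they are not listed above.

```python
# Approximate LoRA configuration for the settings described above:
# rank 8, alpha 16 (2x the rank), 10% dropout. Target modules are assumed.
from peft import LoraConfig, get_peft_model

lora_config = LoraConfig(
    r=8,               # low rank to avoid overfitting the small dataset
    lora_alpha=16,     # paper-recommended double the rank
    lora_dropout=0.1,  # 10% worked better than 15-20% in the grid search
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumed attention projections
    task_type="CAUSAL_LM",
    # use_dora=True,   # uncomment to try DoRA instead of plain LoRA (peft >= 0.9)
)

model = get_peft_model(model, lora_config)  # `model` from the quantized loading sketch above
model.print_trainable_parameters()
```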