Update README.md
README.md
CHANGED
@@ -53,18 +53,12 @@ Although this model should be able to generalize well, the quantum science termi

[More Information Needed]

-
-
-<!-- This section is meant to convey recommendations with respect to the bias, risk, and technical limitations. -->
-
-Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model. More information needed for further recommendations.

## How to Get Started with the Model

Please refer to the instructions for the Meta Instruct models; the principle is the same.

-[More Information Needed]
-
## Training Details

### Training Data
@@ -83,16 +77,9 @@ Over time, several models and fine-tuning approaches were tested as the base mod

Two other base models were also tested: the Mistral 7B v0.1 base model, Meta-Llama/Llama-2-7b-chat-hf, and the base model of this experiment.

-I've performed the grid search with several optimization techniques such as [LoRA](https://arxiv.org/abs/2106.09685), [DoRA](https://arxiv.org/abs/2402.09353), [LoRA+](https://arxiv.org/abs/2402.12354), [(LO)ReFT](https://arxiv.org/abs/2404.03592), and [qLoRA](https://arxiv.org/abs/2305.14314)
-
-After applying the linear scaling rule, I settled on a batch size of 8 and found that a starting learning rate of $10^{-4}$ yielded the best results. There was no significant difference between using cosine or linear decay for the learning rate when employing the AdamW optimizer.

Regarding the nodes, training on only attention nodes performed very poorly on both training and evaluation data. The results improved slightly with the addition of MLP projections, but none of the models or fine-tuning approaches achieved an evaluation cross-entropy below 0.5. However, when including the embedding layer—despite the significant increase in the number of training parameters—the model began to generalize well. I assume this is due to the introduction of new terminology, requiring the model to adjust its embeddings slightly. I did not modify the LM head, as no significant performance improvements were observed.

@@ -104,8 +91,7 @@ Following an extensive grid search, supervised fine-tuning of Llama 3.1-8B with

#### Preprocessing [optional]

-[
-

#### Training Hyperparameters

@@ -125,16 +111,19 @@ Following an extensive grid search, supervised fine-tuning of Llama 3.1-8B with

#### Speeds, Sizes, Times [optional]

-
-
-[More Information Needed]

## Evaluation

#### Metrics

-Since the fine-tuned model is designed to summarize newly learned data, ROUGE and BERTScore metrics were measured on a sample of 50 manually crafted questions. The reference answers were constructed during the creation of the training and evaluation sets.
Given that GPT-4-turbo was already used in this context, I did not compare my model against it. Instead, I chose to compare it against the following models:

| Metric | quantum-research-bot-v1.0 | Meta-Llama-3.1-8B | gemini-1.5-pro |
@@ -145,8 +134,7 @@ Given that GPT-4-turbo was already used in this context, I did not compare my mo
| **ROUGE-L**| 0.5809 | 0.2902 | 0.4856 |


-
-[More Information Needed]

### Results


[More Information Needed]

+Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model.

## How to Get Started with the Model

Please refer to the instructions for the Meta Instruct models; the principle is the same.

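For convenience, here is a minimal chat-style inference sketch that follows the standard Meta Llama 3.1 Instruct usage with `transformers`; the repository id below is a placeholder rather than this model's actual Hub id, and the prompt is only an example:

```python
# Minimal inference sketch following the Meta Llama 3.1 Instruct examples.
# NOTE: the repository id is a placeholder, not the model's actual Hub id.
import torch
from transformers import pipeline

model_id = "your-username/quantum-research-bot-v1.0"  # placeholder

generator = pipeline(
    "text-generation",
    model=model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

messages = [
    {"role": "system", "content": "You are an assistant specialized in recent quantum science research."},
    {"role": "user", "content": "Summarize the most recent approaches to qubit error correction."},
]

outputs = generator(messages, max_new_tokens=256)
print(outputs[0]["generated_text"][-1]["content"])  # the assistant's reply
```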

## Training Details

### Training Data

Two other base models were tested alongside the base model of this experiment: the Mistral 7B v0.1 base model and Meta-Llama/Llama-2-7b-chat-hf.

+I performed a grid search over several optimization techniques, including [LoRA](https://arxiv.org/abs/2106.09685), [DoRA](https://arxiv.org/abs/2402.09353), [LoRA+](https://arxiv.org/abs/2402.12354), [(LO)ReFT](https://arxiv.org/abs/2404.03592), and [qLoRA](https://arxiv.org/abs/2305.14314).
+With LoRA, LoRA+, and DoRA, I found that a rank of 8 (with the paper-recommended alpha of 16, i.e. twice the rank) achieved the best performance, particularly since my dataset was on the smaller side and higher ranks would have led to overfitting. I also tested LoRA dropout rates between 10% and 20%, but above 10% the model began to jump over better local minima in every fine-tuning approach, so I stuck with 10%.
+After applying the [linear scaling rule](https://arxiv.org/pdf/1706.02677), I settled on a batch size of 8 and found that a starting learning rate of 10^-4 yielded the best results. There was no significant difference between using cosine or linear decay for the learning rate when employing the AdamW optimizer.
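
For illustration only (a sketch, not the exact training script), the rank, alpha, dropout, batch size, learning rate, scheduler, and optimizer described above could be expressed with Hugging Face PEFT and `transformers` roughly as follows; the target-module selection is sketched after the next paragraph:

```python
# Hypothetical sketch of the fine-tuning configuration described above.
from peft import LoraConfig
from transformers import TrainingArguments

lora_config = LoraConfig(
    r=8,               # rank 8 worked best on this smaller dataset
    lora_alpha=16,     # paper-recommended alpha = 2 * rank
    lora_dropout=0.1,  # dropout above 10% skipped better local minima
    task_type="CAUSAL_LM",
    # use_dora=True,   # the DoRA variant can be toggled in recent PEFT releases
)

training_args = TrainingArguments(
    output_dir="quantum-research-bot-v1.0",  # placeholder output path
    per_device_train_batch_size=8,           # batch size chosen via the linear scaling rule
    learning_rate=1e-4,                      # starting learning rate of 10^-4
    lr_scheduler_type="cosine",              # cosine vs. linear decay made no real difference
    optim="adamw_torch",                     # AdamW optimizer
    num_train_epochs=4,                      # 4 epochs (see Speeds, Sizes, Times below)
)
```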

Regarding the target modules, training only the attention projections performed very poorly on both training and evaluation data. The results improved slightly with the addition of the MLP projections, but none of the models or fine-tuning approaches achieved an evaluation cross-entropy below 0.5. However, when including the embedding layer—despite the significant increase in the number of training parameters—the model began to generalize well. I assume this is due to the introduction of new terminology, requiring the model to adjust its embeddings slightly. I did not modify the LM head, as no significant performance improvements were observed.

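The module selection that finally generalized (attention and MLP projections plus the embedding matrix, with the LM head left untouched) might look roughly like this in PEFT; the module names below assume the standard Llama architecture:

```python
# Hypothetical module selection matching the observations above
# (standard Llama attention/MLP projection names).
from peft import LoraConfig, get_peft_model

lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    lora_dropout=0.1,
    task_type="CAUSAL_LM",
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",  # attention projections
        "gate_proj", "up_proj", "down_proj",     # MLP projections
    ],
    modules_to_save=["embed_tokens"],            # train the embedding layer in full
    # "lm_head" is deliberately left out.
)

# peft_model = get_peft_model(base_model, lora_config)  # base_model: the loaded Llama 3.1-8B
```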

#### Preprocessing [optional]

+[Coming soon]

#### Training Hyperparameters


#### Speeds, Sizes, Times [optional]

+Fine-tuning updated ~550 million parameters, took a little over 30 minutes, and ran for 4 epochs. GPU utilization stayed above 90% throughout training.
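
As a rough sanity check (not taken from the original training run), the number of trainable parameters of a PEFT-wrapped model can be inspected with a small helper like the one below, or with PEFT's built-in `print_trainable_parameters()`:

```python
# Helper to report how many parameters are actually updated during fine-tuning.
def report_trainable_parameters(model) -> None:
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    total = sum(p.numel() for p in model.parameters())
    print(f"trainable params: {trainable:,} ({100 * trainable / total:.2f}% of {total:,})")

# PEFT models expose an equivalent built-in helper:
# peft_model.print_trainable_parameters()
```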

## Evaluation

+Please see the cross-entropy curve below:
+
+<img src="https://i.ibb.co/SB4gyQf/crossentropy.png" alt="Cross-entropy curve" style="width:50%;"/>
+
+The final evaluation cross-entropy ended up at around 0.4.

#### Metrics

+Since the fine-tuned model is designed to explain and, where possible, summarize newly learned data, ROUGE and BERTScore metrics were measured on a sample of 50 manually crafted questions. The reference answers were constructed during the creation of the training and evaluation sets.
Given that GPT-4-turbo was already used in this context, I did not compare my model against it. Instead, I chose to compare it against the following models:

| Metric | quantum-research-bot-v1.0 | Meta-Llama-3.1-8B | gemini-1.5-pro |

| **ROUGE-L**| 0.5809 | 0.2902 | 0.4856 |

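For reference, the ROUGE and BERTScore numbers can be computed with the Hugging Face `evaluate` library along these lines (a sketch only; the 50 evaluation questions and reference answers are placeholders here):

```python
# Sketch of the metric computation over the 50 hand-crafted evaluation questions.
import evaluate

predictions = ["model answer 1", "model answer 2"]          # placeholder model outputs
references = ["reference answer 1", "reference answer 2"]   # placeholder reference answers

rouge = evaluate.load("rouge")
bertscore = evaluate.load("bertscore")

rouge_scores = rouge.compute(predictions=predictions, references=references)
bert_scores = bertscore.compute(predictions=predictions, references=references, lang="en")

print("ROUGE-L:", rouge_scores["rougeL"])
print("BERTScore F1 (mean):", sum(bert_scores["f1"]) / len(bert_scores["f1"]))
```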

+[More metrics coming in the future]

### Results
