nenad1002 committed
Commit
c36eab0
1 Parent(s): 5276d64

Update README.md

Files changed (1)
  1. README.md +13 -25
README.md CHANGED
@@ -53,18 +53,12 @@ Although this model should be able to generalize well, the quantum science termi
 
 [More Information Needed]
 
- ### Recommendations
-
- <!-- This section is meant to convey recommendations with respect to the bias, risk, and technical limitations. -->
-
- Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model. More information needed for further recommendations.
 
 ## How to Get Started with the Model
 
 Please refer to the instructions for the Meta Instruct models; the principle is the same.
 
- [More Information Needed]
-
 ## Training Details
 
 ### Training Data
@@ -83,16 +77,9 @@ Over time, several models and fine-tuning approaches were tested as the base mod
 
 In addition to the base model of this experiment, two other base models were tested: Mistral 7B v0.1 and Meta-Llama/Llama-2-7b-chat-hf.
 
- I've performed the grid search with several optimization techniques such as [LoRA](https://arxiv.org/abs/2106.09685), [DoRA](https://arxiv.org/abs/2402.09353), [LoRA+](https://arxiv.org/abs/2402.12354), [(LO)ReFT](https://arxiv.org/abs/2404.03592), and [qLoRA](https://arxiv.org/abs/2305.14314)
- WWith LoRA, LoRA+, and DoRA, I found that a rank of 8 (with the paper-recommended double alpha of 16) achieved the best performance, particularly since my dataset was on the smaller side, which otherwise would have led to overfitting. Various LoRA dropout rates were tested between 10% and 20%, but in all fine-tuning approaches, the model began to jump over better local minima. Hence, I sticked to 10%.
- After applying the linear scaling rule, I settled on a batch size of 8 and found that a starting learning rate of
- 10^{-4}
- yielded the best results. There was no significant difference between using cosine or linear decay for the learning rate when employing the AdamW optimizer.
 
 Regarding the target modules, training only the attention modules performed very poorly on both training and evaluation data. The results improved slightly with the addition of the MLP projections, but none of the models or fine-tuning approaches achieved an evaluation cross-entropy below 0.5. However, when the embedding layer was included, despite the significant increase in the number of trainable parameters, the model began to generalize well. I assume this is due to the introduction of new terminology, which requires the model to adjust its embeddings slightly. I did not modify the LM head, as no significant performance improvements were observed.
 
@@ -104,8 +91,7 @@ Following an extensive grid search, supervised fine-tuning of Llama 3.1-8B with
 
 #### Preprocessing [optional]
 
- [More Information Needed]
-
 
 #### Training Hyperparameters
 
@@ -125,16 +111,19 @@ Following an extensive grid search, supervised fine-tuning of Llama 3.1-8B with
 
 #### Speeds, Sizes, Times [optional]
 
- <!-- This section provides information about throughput, start/end time, checkpoint size if relevant, etc. -->
-
- [More Information Needed]
 
 ## Evaluation
 
 #### Metrics
 
- Since the fine-tuned model is designed to summarize newly learned data, ROUGE and BERTScore metrics were measured on a sample of 50 manually crafted questions. The reference answers were constructed during the creation of the training and evaluation sets.
 Given that GPT-4-turbo was already used in this context, I did not compare my model against it. Instead, I chose to compare it against the following models:
 
 | Metric | quantum-research-bot-v1.0 | Meta-Llama-3.1-8B | gemini-1.5-pro |
@@ -145,8 +134,7 @@ Given that GPT-4-turbo was already used in this context, I did not compare my mo
 | **ROUGE-L**| 0.5809 | 0.2902 | 0.4856 |
 
 
-
- [More Information Needed]
 
 ### Results
 
 
 
 [More Information Needed]
 
+ Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model.
 
 ## How to Get Started with the Model
 
 Please refer to the instructions for the Meta Instruct models; the principle is the same.
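
For example, a minimal inference sketch with the `transformers` chat pipeline is shown below. The repository id `nenad1002/quantum-research-bot-v1.0` and the prompt are assumptions for illustration; substitute the actual model id if it differs.

```python
# Minimal inference sketch, mirroring the Meta Instruct usage pattern.
# The repository id below is an assumption, not confirmed by this card.
import torch
from transformers import pipeline

model_id = "nenad1002/quantum-research-bot-v1.0"  # assumed repository id

generator = pipeline(
    "text-generation",
    model=model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

messages = [
    {"role": "system", "content": "You are an assistant specialized in recent quantum science research."},
    {"role": "user", "content": "Summarize the latest developments in logical qubit error correction."},
]

outputs = generator(messages, max_new_tokens=256)
print(outputs[0]["generated_text"][-1]["content"])
```
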
 ## Training Details
 
 ### Training Data
 
 In addition to the base model of this experiment, two other base models were tested: Mistral 7B v0.1 and Meta-Llama/Llama-2-7b-chat-hf.
 
+ I performed a grid search with several optimization techniques such as [LoRA](https://arxiv.org/abs/2106.09685), [DoRA](https://arxiv.org/abs/2402.09353), [LoRA+](https://arxiv.org/abs/2402.12354), [(LO)ReFT](https://arxiv.org/abs/2404.03592), and [qLoRA](https://arxiv.org/abs/2305.14314).
+ With LoRA, LoRA+, and DoRA, I found that a rank of 8 (with the paper-recommended double alpha of 16) achieved the best performance, particularly since my dataset was on the smaller side, where a larger rank would have led to overfitting. Various LoRA dropout rates between 10% and 20% were tested, but at higher values the model began to skip over better local minima in every fine-tuning approach, so I stuck with 10%.
+ After applying the [linear scaling rule](https://arxiv.org/pdf/1706.02677), I settled on a batch size of 8 and found that a starting learning rate of 10^-4 yielded the best results. There was no significant difference between using cosine or linear decay for the learning rate when employing the AdamW optimizer.
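
As an illustration only (not the author's actual training script), these reported settings map onto `transformers.TrainingArguments` roughly as follows; the output directory and precision flag are assumptions.

```python
# Illustrative sketch of the reported hyperparameters; paths and precision are assumptions.
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="quantum-research-bot-v1.0",  # placeholder output path
    per_device_train_batch_size=8,           # batch size chosen after applying the linear scaling rule
    learning_rate=1e-4,                      # starting learning rate that worked best
    lr_scheduler_type="cosine",              # cosine vs. linear decay made no significant difference
    optim="adamw_torch",                     # AdamW optimizer
    num_train_epochs=4,                      # 4 epochs, as noted under Speeds, Sizes, Times below
    bf16=True,                               # assumption: bfloat16 mixed precision
)
```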
 
 Regarding the target modules, training only the attention modules performed very poorly on both training and evaluation data. The results improved slightly with the addition of the MLP projections, but none of the models or fine-tuning approaches achieved an evaluation cross-entropy below 0.5. However, when the embedding layer was included, despite the significant increase in the number of trainable parameters, the model began to generalize well. I assume this is due to the introduction of new terminology, which requires the model to adjust its embeddings slightly. I did not modify the LM head, as no significant performance improvements were observed.
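
A sketch of a matching `peft` LoRA configuration is given below. The Llama-style module names and the choice to train the embedding layer in full via `modules_to_save` are assumptions based on the description above (full embedding training is also consistent with the roughly 550 million trainable parameters reported later), not a verbatim copy of the training setup.

```python
# Sketch of a LoRA configuration matching the description above; module names and the
# embedding-layer handling are assumptions, not the author's exact configuration.
from peft import LoraConfig

lora_config = LoraConfig(
    r=8,                      # rank 8 worked best on the relatively small dataset
    lora_alpha=16,            # double the rank, as recommended in the LoRA paper
    lora_dropout=0.1,         # 10% dropout; higher values skipped better local minima
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",  # attention projections
        "gate_proj", "up_proj", "down_proj",     # MLP projections
    ],
    modules_to_save=["embed_tokens"],  # include the embedding layer as fully trainable
    # The LM head is deliberately left untouched: tuning it brought no significant improvement.
    task_type="CAUSAL_LM",
)
```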
 
 #### Preprocessing [optional]
 
+ [Coming soon]
 
 #### Training Hyperparameters
 
 
 #### Speeds, Sizes, Times [optional]
 
+ Roughly 550 million parameters were trained; the run lasted a bit more than 30 minutes and went through 4 epochs. GPU utilization stayed above 90% throughout training.
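
A quick, generic way to confirm the trainable-parameter count for a configuration like the LoRA sketch above (the base-model id is an assumption, and this is a check rather than output from the actual run):

```python
# Generic trainable-parameter check; the base-model id is an assumption.
from transformers import AutoModelForCausalLM
from peft import get_peft_model

base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B")
peft_model = get_peft_model(base, lora_config)  # lora_config from the sketch above

trainable = sum(p.numel() for p in peft_model.parameters() if p.requires_grad)
print(f"trainable parameters: {trainable / 1e6:.1f}M")  # expected to be in the same ballpark as ~550M
```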
 
 
 ## Evaluation
 
+ Please see the graph of the evaluation cross-entropy below:
+
+ <img src="https://i.ibb.co/SB4gyQf/crossentropy.png" alt="Evaluation cross-entropy during training" style="width:50%;"/>
+
+ The final evaluation cross-entropy settled at around 0.4.
 
 #### Metrics
 
+ Since the fine-tuned model is designed to explain and, where possible, summarize newly learned data, ROUGE and BERTScore metrics were measured on a sample of 50 manually crafted questions (a scoring sketch follows the comparison table below). The reference answers were constructed during the creation of the training and evaluation sets.
 Given that GPT-4-turbo was already used in this context, I did not compare my model against it. Instead, I chose to compare it against the following models:
 
 | Metric | quantum-research-bot-v1.0 | Meta-Llama-3.1-8B | gemini-1.5-pro |
 
 | **ROUGE-L**| 0.5809 | 0.2902 | 0.4856 |
 
 
+ [More metrics coming in the future]
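
As referenced above, here is a rough sketch of how such a comparison can be scored with the `evaluate` library; the prediction and reference lists are placeholders, not the actual 50-question set.

```python
# Rough scoring sketch using the `evaluate` library; the lists below are placeholders.
import evaluate

predictions = ["Model answer to question 1", "Model answer to question 2"]
references = ["Reference answer to question 1", "Reference answer to question 2"]

rouge = evaluate.load("rouge")
bertscore = evaluate.load("bertscore")

rouge_scores = rouge.compute(predictions=predictions, references=references)
bert_scores = bertscore.compute(predictions=predictions, references=references, lang="en")

print(rouge_scores["rouge1"], rouge_scores["rouge2"], rouge_scores["rougeL"])
print(sum(bert_scores["f1"]) / len(bert_scores["f1"]))  # mean BERTScore F1
```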
 
 
 ### Results
 