Update README.md
README.md CHANGED
@@ -45,10 +45,9 @@ You can use the model to ask questions about the latest developments in quantum

Although this model should generalize well, quantum-science terminology and context are highly complex, so the model may struggle to simplify them correctly; it should not be used for that purpose.

-[More Information Needed]
-
## Bias, Risks, and Limitations

+The model does hallucinate on certain edge cases (more coming soon).
<!-- This section is meant to convey both technical and sociotechnical limitations. -->

[More Information Needed]
@@ -67,8 +66,6 @@ Initially trained on a bit less than 3k entries, it was later expanded to 5k high

The dataset was generated by crawling https://quantum-journal.org/ and passing the crawled data to OpenAI's gpt-4-turbo model with various prompts to ensure high-quality data generation.

-[More Information Needed]
-
### Training Procedure

Various training procedures were explored alongside multiple models.
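For a concrete picture of the pipeline described in the hunk above, here is a minimal sketch of the crawl-and-generate step. The library choices (`requests`, `beautifulsoup4`, the `openai` client), the CSS selector, and the prompt wording are all assumptions; the card does not publish the actual crawler.

```python
# Hypothetical sketch of the crawl-and-generate pipeline described above.
# Library choices and the selector/prompt are assumptions, not the card's code.
import requests
from bs4 import BeautifulSoup
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def fetch_abstracts(listing_url: str) -> list[str]:
    """Collect paper abstracts from a quantum-journal.org listing page."""
    html = requests.get(listing_url, timeout=30).text
    soup = BeautifulSoup(html, "html.parser")
    # "p.abstract" is an assumed selector for the site's abstract paragraphs.
    return [p.get_text(strip=True) for p in soup.select("p.abstract")]

def to_qa_pairs(abstract: str) -> str:
    """Ask gpt-4-turbo to turn one abstract into Q&A training entries."""
    response = client.chat.completions.create(
        model="gpt-4-turbo",
        messages=[
            {"role": "system", "content": "Generate question-answer pairs "
             "about the following quantum computing abstract."},
            {"role": "user", "content": abstract},
        ],
    )
    return response.choices[0].message.content
```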
@@ -95,7 +92,7 @@ Following an extensive grid search, supervised fine-tuning of Llama 3.1-8B with

#### Training Hyperparameters

-- **Training regime:**
+- **Training regime:**
  - bfloat16 precision
  - LoRA rank: 8
  - LoRA alpha: 16
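The hyperparameters above map directly onto a `peft` LoRA configuration. The snippet below is a minimal sketch consistent with the listed values (bfloat16, rank 8, alpha 16); the base-model checkpoint name, target modules, and any setting the card does not state are assumptions.

```python
# Minimal LoRA setup matching the listed hyperparameters, using
# Hugging Face transformers + peft. Unlisted settings are assumptions.
import torch
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3.1-8B",   # assumed base checkpoint
    torch_dtype=torch.bfloat16,       # bfloat16 precision, as listed above
)

lora_config = LoraConfig(
    r=8,                # LoRA rank
    lora_alpha=16,      # LoRA alpha
    task_type="CAUSAL_LM",
    # target_modules left to peft's defaults for Llama-style models.
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # sanity-check the trainable fraction
```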
@@ -123,7 +120,7 @@ The final evaluation cross-entropy ended around 0.4.

#### Metrics

-Since the fine-tuned model is designed to
+Since the fine-tuned model is designed to explain and, where possible, summarize newly learned data, ROUGE and BERTScore metrics were measured on a sample of 50 manually crafted questions. The reference answers were constructed during the creation of the training and evaluation sets.
Given that GPT-4-turbo was already used in this context, I did not compare my model against it. Instead, I chose to compare it against the following models:

| Metric | quantum-research-bot-v1.0 | Meta-Llama-3.1-8B | gemini-1.5-pro |
@@ -134,11 +131,15 @@ Given that GPT-4-turbo was already used in this context, I did not compare my mo
| **ROUGE-L** | 0.5809 | 0.2902 | 0.4856 |


+Most other metrics, such as TruthfulQA, MMLU, and similar benchmarks, are not applicable here because this model has been fine-tuned for a very specific domain of knowledge.
+
[More Metrics Coming In Future]

### Results

-
+While the model outperforms baselines and other general-purpose models on most tasks, it still struggles with certain edge cases, particularly those involving rare terms and sentences that differ significantly in structure.
+These results show the potential of fine-tuning large models for specialized tasks and suggest that further exploration of hybrid optimization techniques could yield even better performance.
+Additionally, greater investment in creating more robust and comprehensive datasets could lead to further improvements in model accuracy and generalization.

#### Summary
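The ROUGE/BERTScore evaluation described in the Metrics hunk can be reproduced in spirit with the Hugging Face `evaluate` library. The following is a sketch under assumptions: the question/reference lists and the `generate_answer` helper are placeholders for the card's unpublished 50-item evaluation set and inference code.

```python
# Sketch of the ROUGE / BERTScore evaluation over 50 manually crafted
# questions. Data and the inference helper below are placeholders.
import evaluate

rouge = evaluate.load("rouge")
bertscore = evaluate.load("bertscore")

def generate_answer(question: str) -> str:
    """Placeholder for running quantum-research-bot-v1.0 on one question."""
    return "model output goes here"

questions = ["What did the latest quantum error-correction paper show?"]  # 50 in total
references = ["reference answer written alongside the training set"]

predictions = [generate_answer(q) for q in questions]

rouge_scores = rouge.compute(predictions=predictions, references=references)
bert_scores = bertscore.compute(predictions=predictions, references=references, lang="en")

print(rouge_scores["rougeL"])                            # ROUGE-L, as in the table
print(sum(bert_scores["f1"]) / len(bert_scores["f1"]))   # mean BERTScore F1
```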