nenad1002 committed
Commit acf3a5d
1 Parent(s): c36eab0

Update README.md

Files changed (1)
  1. README.md +8 -7
README.md CHANGED
@@ -45,10 +45,9 @@ You can use the model to ask questions about the latest developments in quantum
 
 Although this model should be able to generalize well, quantum science terminology and context are very complex, so it might struggle with correct simplification and hence should not be used in that context.
 
- [More Information Needed]
-
 ## Bias, Risks, and Limitations
 
+ The model does hallucinate on certain edge cases (more coming soon).
 <!-- This section is meant to convey both technical and sociotechnical limitations. -->
 
 [More Information Needed]
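As a usage illustration for the hunk above, here is a minimal inference sketch. The repository id `nenad1002/quantum-research-bot-v1.0` and the chat-style prompt are assumptions; the diff names the model only as quantum-research-bot-v1.0.

```python
# Minimal usage sketch. The repo id "nenad1002/quantum-research-bot-v1.0" and
# chat-style prompting are assumptions, not confirmed by this commit.
import torch
from transformers import pipeline

generator = pipeline(
    "text-generation",
    model="nenad1002/quantum-research-bot-v1.0",
    torch_dtype=torch.bfloat16,  # matches the bfloat16 precision listed in the card
)

messages = [
    {"role": "user",
     "content": "What are the latest developments in quantum error correction?"},
]
print(generator(messages, max_new_tokens=256)[0]["generated_text"])
```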
@@ -67,8 +66,6 @@ Initially trained on a bit less than 3k entries, it was later expanded to 5k high
 
 The dataset was generated by crawling the https://quantum-journal.org/ site and passing the data into the OpenAI gpt-4-turbo model with various prompts to ensure high-quality data generation.
 
- [More Information Needed]
-
 ### Training Procedure
 
 Various training procedures were explored alongside multiple models.
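To make the data-generation step in the hunk above concrete, here is a hedged sketch of a crawl-and-generate pipeline. The listing URL, the CSS selector, and the prompts are assumptions; only the source site and the gpt-4-turbo model come from the card.

```python
# Hypothetical reconstruction of the data-generation step described in the card.
# The listing URL, selector, and prompts are assumptions, not the author's code.
import requests
from bs4 import BeautifulSoup
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

def fetch_abstracts(listing_url: str) -> list[str]:
    """Collect article abstracts from a quantum-journal.org listing page."""
    soup = BeautifulSoup(requests.get(listing_url, timeout=30).text, "html.parser")
    return [p.get_text(strip=True) for p in soup.select("p.abstract")]  # selector is a guess

def to_qa_pairs(abstract: str) -> str:
    """Ask gpt-4-turbo to turn one abstract into Q&A training entries."""
    response = client.chat.completions.create(
        model="gpt-4-turbo",
        messages=[
            {"role": "system", "content": "You write high-quality Q&A pairs for fine-tuning."},
            {"role": "user", "content": f"Create question/answer pairs from:\n{abstract}"},
        ],
    )
    return response.choices[0].message.content

for abstract in fetch_abstracts("https://quantum-journal.org/papers/"):  # URL is an assumption
    print(to_qa_pairs(abstract))
```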
@@ -95,7 +92,7 @@ Following an extensive grid search, supervised fine-tuning of Llama 3.1-8B with
 
 #### Training Hyperparameters
 
- - **Training regime:** [More Information Needed]
+ - **Training regime:**
 - bfloat16 precision
 - LoRA rank: 8
 - LoRA alpha: 16
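A sketch of how the listed hyperparameters map onto a PEFT LoRA setup. The rank, alpha, and bfloat16 precision come from the list above; the target modules and dropout are assumptions, since the card does not say which modules were adapted.

```python
# Sketch of a PEFT LoRA setup matching the listed hyperparameters
# (rank 8, alpha 16, bfloat16). Target modules and dropout are assumptions.
import torch
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3.1-8B",
    torch_dtype=torch.bfloat16,   # bfloat16 precision, as listed
)

lora_config = LoraConfig(
    r=8,                          # LoRA rank, as listed
    lora_alpha=16,                # LoRA alpha, as listed
    lora_dropout=0.05,            # assumption; not stated in the card
    target_modules=["q_proj", "v_proj"],  # assumption; a common minimal choice for Llama models
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
```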
@@ -123,7 +120,7 @@ The final evaluation cross-entropy ended around 0.4.
 
 #### Metrics
 
- Since the fine-tuned model is designed to explaind, and if possible, summarize newly learned data, ROUGE and BERTScore metrics were measured on a sample of 50 manually crafted questions. The reference answers were constructed during the creation of the training and evaluation sets.
+ Since the fine-tuned model is designed to explain and, if possible, summarize newly learned data, ROUGE and BERTScore metrics were measured on a sample of 50 manually crafted questions. The reference answers were constructed during the creation of the training and evaluation sets.
 Given that GPT-4-turbo was already used in this context, I did not compare my model against it. Instead, I chose to compare it against the following models:
 
 | Metric | quantum-research-bot-v1.0 | Meta-Llama-3.1-8B | gemini-1.5-pro |
@@ -134,11 +131,15 @@ Given that GPT-4-turbo was already used in this context, I did not compare my mo
 | **ROUGE-L** | 0.5809 | 0.2902 | 0.4856 |
 
 
+ Most other metrics, such as TruthfulQA, MMLU, and similar benchmarks, are not applicable here because this model has been fine-tuned for a very specific domain of knowledge.
+
 [More Metrics Coming In Future]
 
 ### Results
 
- [More Information Needed]
+ While the model outperforms baselines and other general-purpose models on most tasks, it still faces challenges with certain edge cases, particularly those involving rare terms, as well as sentences that differ significantly in structure.
+ These results show the potential of fine-tuning large models for specialized tasks and suggest that further exploration of hybrid optimization techniques could yield even better performance.
+ Additionally, greater investment in creating more robust and comprehensive datasets could lead to further improvements in model accuracy and generalization.
 
 #### Summary
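For reproducibility, here is a sketch of the ROUGE/BERTScore measurement the Metrics hunk describes, assuming the Hugging Face `evaluate` library; the prediction and reference lists are placeholders for the 50 manually crafted question/answer pairs.

```python
# Sketch of the described evaluation: ROUGE and BERTScore over manually
# crafted question/reference pairs. The data lists are placeholders.
import evaluate

rouge = evaluate.load("rouge")
bertscore = evaluate.load("bertscore")

predictions = ["..."]  # model answers to the 50 crafted questions
references = ["..."]   # reference answers built alongside the train/eval sets

rouge_scores = rouge.compute(predictions=predictions, references=references)
bert_scores = bertscore.compute(predictions=predictions, references=references, lang="en")

print(rouge_scores["rouge1"], rouge_scores["rouge2"], rouge_scores["rougeL"])
print(sum(bert_scores["f1"]) / len(bert_scores["f1"]))  # mean BERTScore F1
```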