---
library_name: transformers
tags:
- trl
- sft
datasets:
- nenad1002/quantum_science_research_dataset
language:
- en
metrics:
- rouge
- bertscore
base_model: meta-llama/Meta-Llama-3.1-8B-Instruct
license: mit
---

# Model Card for quantum-research-bot-v1.0

Quantum Research Bot is a chat model fine-tuned on the latest research data in quantum science. It incorporates material from the second half of 2024, making it more accurate on these topics than general-purpose models.

## Model Details

### Model Description

- **Developed by:** Nenad Banfic
- **Language(s) (NLP):** English
- **License:** MIT
- **Finetuned from model [optional]:**  meta-llama/Meta-Llama-3.1-8B-Instruct

## Uses

You can use the model to ask questions about the latest developments in quantum science. Below are examples of questions that general-purpose models may answer incorrectly or inadequately, but that this model should answer accurately.


| Question | Expected answer |
|:---------------------|:--------------|
| On top of what platform is TensorKrowch built, and where was it created? | TensorKrowch is built on top of the PyTorch framework and was created at the University of Madrid. |
| What algorithms does the post-quantum FIPS 205 standard deal with? | FIPS 205 deals with the stateless hash-based digital signature algorithm (SLH-DSA). |
| What is the variance you can get with polynomial bond dimension in pure quantum states in one-dimensional systems? | The variance you can get with polynomial bond dimension in pure quantum states in one-dimensional systems is as small as ∝ 1 / log N. |
| As of September 2024, how many qubits has the quantum Krylov algorithm been demonstrated on experimentally? | The quantum Krylov algorithm has been demonstrated on up to 56 qubits experimentally. |
| In the analysis of noise effects in controlled-swap gate circuits, what percentage of errors was eliminated with a dephasing error probability of 10% when using two noisy copies of a quantum state? | 67% of errors were eliminated when using two copies of a quantum state with a dephasing error probability of 10%. |

  
### Out-of-Scope Use

Although this model should generalize well, quantum science terminology and context are very complex, so the model may struggle to simplify them correctly and should therefore not be used for that purpose.

## Bias, Risks, and Limitations

The model does hallucinate on certain edge cases (more coming soon).
<!-- This section is meant to convey both technical and sociotechnical limitations. -->

[More Information Needed]

Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model.

## How to Get Started with the Model

Please refer to the instructions for the Meta Instruct models; the principle is the same.
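
Below is a minimal usage sketch with the `transformers` library, following the same chat-template pattern as the Meta Instruct models. The repository id is an assumption; substitute the actual id of this model.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "nenad1002/quantum-research-bot-v1.0"  # assumed repository id

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

messages = [
    {"role": "user", "content": "What algorithms does the post-quantum FIPS 205 standard deal with?"},
]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output_ids = model.generate(input_ids, max_new_tokens=256, do_sample=False)
print(tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True))
```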

## Training Details

### Training Data

The model was initially trained on a bit less than 3k entries; the dataset was later expanded to 5k high-quality question-answer pairs to get the most out of supervised fine-tuning.

The dataset was generated by crawling the https://quantum-journal.org/ site and passing the scraped content to the OpenAI gpt-4-turbo model with various prompts to ensure high-quality question-answer generation.
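
A hedged sketch of the generation step, assuming the articles have already been scraped to plain text; the prompt wording and function name are illustrative, not the exact pipeline used.

```python
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

def make_qa_pair(article_text: str) -> str:
    """Ask gpt-4-turbo for one question/answer pair about a scraped article (illustrative prompt)."""
    response = client.chat.completions.create(
        model="gpt-4-turbo",
        temperature=0.3,
        messages=[
            {
                "role": "system",
                "content": "You write precise question/answer pairs about quantum research papers.",
            },
            {
                "role": "user",
                "content": f"Create one question and a factual answer based on:\n\n{article_text}",
            },
        ],
    )
    return response.choices[0].message.content
```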

### Training Procedure

Various training procedures were explored alongside multiple models.

Over time, several models and fine-tuning approaches were tested. The best performance was achieved with Llama 3.1 70B Instruct and qLoRA, but the training duration was extensive, and optimizing hyperparameters proved to be highly challenging.

In addition to the base model of this experiment, two other base models were tested: Mistral 7B v0.1 and Meta-Llama/Llama-2-7b-chat-hf.

I've performed the grid search with several optimization techniques such as [LoRA](https://arxiv.org/abs/2106.09685), [DoRA](https://arxiv.org/abs/2402.09353), [LoRA+](https://arxiv.org/abs/2402.12354), [(LO)ReFT](https://arxiv.org/abs/2404.03592), and [qLoRA](https://arxiv.org/abs/2305.14314). 
With LoRA, LoRA+, and DoRA, I found that a rank of 8 (with the paper-recommended alpha of double the rank, i.e. 16) achieved the best performance, particularly since my dataset was on the smaller side and a higher rank would have led to overfitting. LoRA dropout rates between 10% and 20% were tested, but with higher dropout the model began to jump over better local minima in all fine-tuning approaches, so I stuck with 10%.
After applying the [linear scaling rule](https://arxiv.org/pdf/1706.02677), I settled on a batch size of 8 and found that a starting learning rate of 10^-4 yielded the best results. There was no significant difference between using cosine or linear decay for the learning rate when employing the AdamW optimizer.

Regarding the target nodes, training only the attention nodes performed very poorly on both training and evaluation data. The results improved slightly with the addition of MLP projections, but none of the models or fine-tuning approaches achieved an evaluation cross-entropy below 0.5. However, when the embedding layer was included, despite the significant increase in the number of trainable parameters, the model began to generalize well. I assume this is due to the introduction of new terminology, which requires the model to adjust its embeddings slightly. I did not modify the LM head, as no significant performance improvements were observed.
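
As a sketch, the configuration described above could be expressed with a PEFT `LoraConfig` roughly as follows; the module names assume a Llama-style architecture and are not necessarily the exact list used in training.

```python
from peft import LoraConfig

peft_config = LoraConfig(
    r=8,               # rank that worked best on the small dataset
    lora_alpha=16,     # double the rank, as recommended in the LoRA paper
    lora_dropout=0.1,  # 10%; higher values skipped over better local minima
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",   # attention projections
        "gate_proj", "up_proj", "down_proj",      # MLP projections
        "embed_tokens",                           # embeddings, to absorb new terminology
    ],
    task_type="CAUSAL_LM",
)
```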

For ReFT, the attention nodes in the last 8 layers were unfrozen, allowing the model to retain its general knowledge while incorporating more specific domain knowledge about quantum research. Although the results were close to those obtained with LoRA, they were consistently slightly worse.

After 3 to 4 epochs, the model began to overfit regardless of the strategies employed. Increasing both batch size and the number of epochs resulted in higher final training and evaluation cross-entropy.

Following an extensive grid search, supervised fine-tuning of Llama 3.1-8B with LoRA+ and the parameters mentioned above yielded the best training and evaluation cross-entropy.

#### Preprocessing [optional]

[Coming soon]

#### Training Hyperparameters

- **Training regime:**
  - bfloat16 precision
  - LoRA rank: 8
  - LoRA alpha: 16
  - LoRA dropout: 0.1
  - Unfrozen nodes: attention, MLP, and embeddings
  - Optimizer: AdamW
  - LR: 1e-4
  - LR scheduler: cosine
  - NEFTune enabled: true
  - Batch size: 8
  - Number of epochs: 3
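
As a rough sketch, these hyperparameters map onto a TRL `SFTTrainer` run roughly as below. The dataset split and column names and the NEFTune alpha are assumptions, and LoRA+ additionally requires a custom optimizer with separate learning rates for the adapter matrices, which is omitted here.

```python
import torch
from datasets import load_dataset
from peft import LoraConfig
from transformers import AutoModelForCausalLM
from trl import SFTConfig, SFTTrainer

base_model_id = "meta-llama/Meta-Llama-3.1-8B-Instruct"
dataset = load_dataset("nenad1002/quantum_science_research_dataset")  # split names assumed

model = AutoModelForCausalLM.from_pretrained(base_model_id, torch_dtype=torch.bfloat16)

# Minimal adapter config; see the fuller LoraConfig sketch above for target modules.
peft_config = LoraConfig(r=8, lora_alpha=16, lora_dropout=0.1, task_type="CAUSAL_LM")

training_args = SFTConfig(
    output_dir="quantum-research-bot-v1.0",
    per_device_train_batch_size=8,
    num_train_epochs=3,
    learning_rate=1e-4,
    lr_scheduler_type="cosine",
    optim="adamw_torch",
    bf16=True,
    neftune_noise_alpha=5,      # NEFTune enabled; the alpha value is an assumption
    dataset_text_field="text",  # column name is an assumption
)

trainer = SFTTrainer(
    model=model,
    args=training_args,
    train_dataset=dataset["train"],
    eval_dataset=dataset.get("test"),
    peft_config=peft_config,
)
trainer.train()
```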


#### Speeds, Sizes, Times [optional]

Fine-tuning updated ~550 million trainable parameters; the training run lasted a bit more than 30 minutes and went through 4 epochs. GPU utilization was above 90% at all times during training.

## Evaluation

The training and evaluation cross-entropy curves are shown below:

<img src="https://i.ibb.co/SB4gyQf/crossentropy.png" alt="Training and evaluation cross-entropy" style="width:50%;"/>

The final evaluation cross-entropy ended up around 0.4.

#### Metrics

Since the fine-tuned model is designed to explain and, where possible, summarize newly learned material, ROUGE and BERTScore were measured on a sample of 50 manually crafted questions. The reference answers were constructed during the creation of the training and evaluation sets.
Given that GPT-4-turbo was already used to generate the dataset, I did not compare my model against it. Instead, I compared it against the following models:

| Metric | quantum-research-bot-v1.0 | Meta-Llama-3.1-8B  | gemini-1.5-pro   |
|:------------------|:---------------------------|:--------------------|:------------------|
| **BERTScore F1**     | 0.5821                    | 0.3305             |    0.4982        |
| **ROUGE-1** | 0.6045       | 0.3152    |0.5029  |
| **ROUGE-2**|  0.4098          | 0.1751    | 0.3104 |
| **ROUGE-L**| 0.5809          |  0.2902    | 0.4856  |
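
For reference, a minimal sketch of how such scores can be computed with the Hugging Face `evaluate` library, given lists of model outputs and reference answers for the 50 evaluation questions (the strings below are placeholders):

```python
import evaluate

rouge = evaluate.load("rouge")
bertscore = evaluate.load("bertscore")

predictions = ["FIPS 205 specifies the stateless hash-based digital signature algorithm (SLH-DSA)."]
references = ["FIPS 205 deals with the stateless hash-based digital signature algorithm (SLH-DSA)."]

rouge_scores = rouge.compute(predictions=predictions, references=references)
bert_scores = bertscore.compute(predictions=predictions, references=references, lang="en")

print(rouge_scores["rouge1"], rouge_scores["rouge2"], rouge_scores["rougeL"])
print(sum(bert_scores["f1"]) / len(bert_scores["f1"]))  # mean BERTScore F1
```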


Most standard benchmarks, such as TruthfulQA, MMLU, and similar, are not applicable here because this model has been fine-tuned for a very specific domain of knowledge.

[More Metrics Coming In Future]

### Results

While the model outperforms baselines and other general-purpose models on most tasks, it still faces challenges with certain edge cases, particularly those involving rare terms, as well as sentences that differ significantly in structure. 
These results show the potential of fine-tuning large models for specialized tasks and suggest that further exploration of hybrid optimization techniques could yield even better performance. 
Additionally, greater investment in creating more robust and comprehensive datasets could lead to further improvements in model accuracy and generalization.

#### Summary



## Model Examination [optional]

<!-- Relevant interpretability work for the model goes here -->

[More Information Needed]

## Environmental Impact

<!-- Total emissions (in grams of CO2eq) and additional considerations, such as electricity usage, go here. Edit the suggested text below accordingly -->

Carbon emissions are estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).

- **Hardware Type:** RTX A6000
- **Hours used:** ~20 h in total, although most training runs took a bit more than 30 minutes, with rare exceptions
- **Cloud Provider:** Runpod
- **Compute Region:** West US
- **Carbon Emitted:** 1.5 kg CO2

## Technical Specifications [optional]

### Model Architecture and Objective

[More Information Needed]

### Compute Infrastructure

For most workloads:

- 1x RTX A6000
- 16 vCPU, 62 GB RAM

However, when fine-tuning `meta-llama/Meta-Llama-3-70B-Instruct`, quantization was applied and 4x A100 GPUs were used. Since this did not yield much improvement and was very costly, I decided to stick with the model with fewer parameters.


#### Hardware

[More Information Needed]

#### Software

[More Information Needed]

## Citation [optional]

<!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. -->

**BibTeX:**

[More Information Needed]

**APA:**

[More Information Needed]

## Glossary [optional]

<!-- If relevant, include terms and calculations in this section that can help readers understand the model or model card. -->

[More Information Needed]

## More Information [optional]

[More Information Needed]

## Model Card Authors [optional]

[More Information Needed]

## Model Card Contact

[More Information Needed]