---
library_name: transformers
tags:
- trl
- sft
datasets:
- nenad1002/quantum_science_research_dataset
language:
- en
metrics:
- rouge
- bertscore
base_model: meta-llama/Meta-Llama-3.1-8B-Instruct
license: mit
---
# Model Card for quantum-research-bot-v1.0
Quantum Research Bot is a chat model fine-tuned on the latest quantum science research data. It includes data from the second half of 2024, making it more accurate and up-to-date than general-purpose models.
**Notice:** [v0.9](https://huggingface.co/nenad1002/quantum-research-bot-v0.9) might perform better on certain questions, since it reached a better overall loss on the evaluation set, but its benchmarking metrics were worse.
## Model Details
### Model Description
- **Developed by:** Nenad Banfic
- **Language(s) (NLP):** English
- **License:** MIT
- **Finetuned from model:** meta-llama/Meta-Llama-3.1-8B-Instruct
## Uses
You can use the model to ask questions about the latest developments in quantum science. Below are examples of questions that general-purpose models may answer incorrectly or inadequately, but this model should provide accurate responses.
| Question | Expected answer |
|:---------------------|:--------------|
| On top of what platform is TensorKrowch built, and where was it created? | TensorKrowch is built on top of the PyTorch framework and was created at the University of Madrid. |
| What algorithms does the quantum FIPS 205 deal with? | The FIPS 205 deals with the stateless hash-based digital signature algorithm (SLH-DSA). |
| What is the variance which you can get with polynomial bond dimension in pure quantum states in one dimensional systems? | The variance that you can get with polynomial bond dimension in pure quantum states in one dimensional systems is as small as ∝ 1 / log N.|
| As of September 2024, how many qubits has the quantum Krylov algorithm been demonstrated on experimentally? | The quantum Krylov algorithm has been demonstrated on up to 56 qubits experimentally. |
| In the analysis of noise effects in controlled-swap gate circuits, what percentage of errors were eliminated with a dephasing error probability of 10% when using two noisy copies of a quantum state? | 67% of errors were eliminated when using two copies of a quantum state with a dephasing error probability of 10%. |
### Out-of-Scope Use
Although this model should generalize well, quantum science terminology and context are very complex, so the model may struggle with simplification and should not be used for that purpose.
Because of a risk of overfitting in certain cases, the model may also answer incorrectly when questions are slightly rephrased.
## Bias, Risks, and Limitations
Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model.
The model does hallucinate on certain edge cases (more coming soon).
<!-- This section is meant to convey both technical and sociotechnical limitations. -->
## How to Get Started with the Model
Please refer to the instructions for the Meta Instruct models; the principle is the same.
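The snippet below is a minimal sketch of loading and querying the model with `transformers`; the repo id and generation settings are assumptions, not the only supported configuration.

```python
# Minimal inference sketch; assumes the model is published under this repo id
# and follows the standard Llama 3.1 chat template.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "nenad1002/quantum-research-bot-v1.0"  # assumed repo id

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

messages = [
    {"role": "user", "content": "What algorithms does FIPS 205 deal with?"}
]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(input_ids, max_new_tokens=256, do_sample=False)
print(tokenizer.decode(outputs[0][input_ids.shape[-1]:], skip_special_tokens=True))
```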
## Training Details
### Training Data
The dataset initially contained a bit less than 3k entries and was later expanded to 5k high-quality questions and answers to get the most out of supervised fine-tuning. The evaluation set consisted of roughly 200 entries in the final training round.
The dataset was generated by crawling the https://quantum-journal.org/ site and passing the data to the OpenAI gpt-4-turbo model with various prompts to ensure high-quality data generation.
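The resulting dataset is published on the Hub; a minimal sketch of inspecting it is shown below (the split and column names are assumptions about the schema).

```python
# Load and inspect the dataset; column names such as "question"/"answer"
# are assumptions, not guaranteed by this card.
from datasets import load_dataset

dataset = load_dataset("nenad1002/quantum_science_research_dataset", split="train")
print(len(dataset))
print(dataset[0])  # one question/answer pair
```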
### Training Procedure
Various training procedures were explored alongside multiple models, however, all of them were parameter efficient. The general idea was to freeze most of the original model's parameters and only allow a small subset of parameters to be trainable.
Over time, several base models and fine-tuning approaches were tested. The best accuracy was achieved with [Llama 3.1 70B Instruct](https://huggingface.co/meta-llama/Meta-Llama-3.1-70B-Instruct) and qLoRA, but the training duration was extensive, and optimizing hyperparameters proved to be highly challenging.
Other base models were also tested: [Mistral 7B v0.1](https://huggingface.co/mistralai/Mistral-7B-v0.1), [Meta-Llama/Llama-2-7b-chat-hf](https://huggingface.co/Meta-Llama/Llama-2-7b-chat-hf), and the base model of this experiment.
Since Bayesian methods for parameter search are prone to getting stuck in local maxima, I performed a semi-grid search with several optimization techniques such as [LoRA](https://arxiv.org/abs/2106.09685), [DoRA](https://arxiv.org/abs/2402.09353), [LoRA+](https://arxiv.org/abs/2402.12354), [(LO)ReFT](https://arxiv.org/abs/2404.03592), and [qLoRA](https://arxiv.org/abs/2305.14314).
With LoRA, LoRA+, and DoRA, I found that a rank of 8 (with the paper-recommended alpha of double the rank, i.e. 16) achieved the best performance, particularly since my dataset was on the smaller side, which otherwise would have led to overfitting even with additional regularization through gradient clipping. Various LoRA dropout rates between 10% and 20% were tested, but increasing the rate started to lead to underfitting, so I stuck with 10%.
After applying the [linear scaling rule](https://arxiv.org/pdf/1706.02677), I settled on a batch size of 8 and found that a starting learning rate of 10^-4 yielded the best results. There was no significant difference between using cosine or linear decay for the learning rate when employing the AdamW optimizer.
Regarding the target nodes, training only the attention nodes performed very poorly on both training and evaluation data. The results improved slightly with the addition of the MLP projections, but none of the models or fine-tuning approaches achieved an evaluation cross-entropy below 0.5. However, when including the embedding layer, despite the significant increase in the number of trainable parameters, the model began to generalize well. I assume this is due to the introduction of new terminology, requiring the model to adjust its embeddings slightly to capture the new semantics. I did not modify the LM head, as no significant performance improvements were observed.
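A PEFT configuration along these lines captures the module selection described above; the Llama module names are standard but should be treated as an assumption about the exact setup.

```python
# Sketch of a LoRA config with rank 8, alpha 16, dropout 0.1, targeting
# attention projections, MLP projections, and the embedding layer.
from peft import LoraConfig

lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    lora_dropout=0.1,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",  # attention projections
        "gate_proj", "up_proj", "down_proj",     # MLP projections
        "embed_tokens",                          # embedding layer
    ],
)
```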
DoRA training introduces an additional trainable magnitude parameter, which decomposes the update into magnitude and direction and can help steer the model in a new direction, but training was up to 4x longer, making it too costly for this purpose while yielding the same accuracy as LoRA+.
For ReFT, the attention nodes in the last 8 layers were unfrozen to allow the model to retain its general knowledge while incorporating more specific domain knowledge about quantum research. Although the results were close to those obtained with LoRA, they were consistently slightly worse.
After 3 to 4 epochs, the model began to overfit regardless of the strategies employed. Increasing both batch size and the number of epochs resulted in higher final training and evaluation cross-entropy.
Following an extensive grid search with a form of Bayesian optimization to reduce the search area, supervised fine-tuning of [Llama 3.1-8B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3.1-8B-Instruct) with LoRA+ and the parameters mentioned below yielded the best training and evaluation cross-entropy.
I chose a LoRA+ ratio of 8 between the B and A matrices. The A matrices were initialized with the He method, while the B matrices started at zero. Different Gaussian weight initializations were also considered but led to suboptimal results. Since a custom optimizer was built for this, I am sharing that [code](https://github.com/nenad1002/QuantumScienceBotModel-LLM/blob/main/lora_plus_optimizer.py) here. The rest of the code, including the pre-training script, CustomSFTTrainer, and the scoring scripts, is currently in a private repo and will become public as soon as it is ready.
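For illustration, the core LoRA+ idea of giving the B matrices a larger learning rate than the A matrices can be sketched as a parameter-grouped AdamW. This is a simplified sketch, not the linked custom optimizer, and it assumes PEFT's `lora_A`/`lora_B` parameter naming.

```python
# Simplified LoRA+-style optimizer sketch: B matrices get a higher learning
# rate than A matrices; other trainable params (e.g. embeddings) use the base LR.
import torch


def build_lora_plus_optimizer(model, base_lr=1e-4, lr_ratio=8, weight_decay=0.01):
    group_a, group_b, group_other = [], [], []
    for name, param in model.named_parameters():
        if not param.requires_grad:
            continue
        if "lora_A" in name:
            group_a.append(param)
        elif "lora_B" in name:
            group_b.append(param)
        else:
            group_other.append(param)
    return torch.optim.AdamW(
        [
            {"params": group_a, "lr": base_lr},
            {"params": group_b, "lr": base_lr * lr_ratio},  # higher LR for B
            {"params": group_other, "lr": base_lr},
        ],
        weight_decay=weight_decay,
    )
```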
#### Preprocessing [optional]
[Coming soon]
#### Training Hyperparameters
- **Training regime:**
- bfloat16 precision (nf4 for qLoRA)
- LoRA rank: 8
- LoRA alpha: 16
- LoRA dropout: 0.1
- Weight decay: 0.01 -> provided satisfying regularization
- Grad clipping: 0.3 -> various values were tried, but I settled on this one
- Unfrozen nodes: attention, MLP, and embeddings
- Optimizer: AdamW
- LR: 1e-4
- LR scheduler: cosine
- [NEFT noise](https://arxiv.org/pdf/2310.05914) enabled: true
- Batch size: 8
- Number of epochs: 4
- Padding: right, with an additional `<pad>` token added
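A rough equivalent of these settings in `trl`/`transformers` terms is sketched below; the exact field names and the NEFTune alpha value are assumptions, and the actual training used a private CustomSFTTrainer.

```python
# Sketch of training arguments mirroring the hyperparameter list above.
from transformers import AutoTokenizer
from trl import SFTConfig

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3.1-8B-Instruct")
tokenizer.add_special_tokens({"pad_token": "<pad>"})  # extra <pad> token
tokenizer.padding_side = "right"

training_args = SFTConfig(
    output_dir="quantum-research-bot",
    per_device_train_batch_size=8,
    num_train_epochs=4,
    learning_rate=1e-4,
    lr_scheduler_type="cosine",
    optim="adamw_torch",
    weight_decay=0.01,
    max_grad_norm=0.3,        # grad clipping
    bf16=True,
    neftune_noise_alpha=5,    # NEFT noise enabled; alpha value is an assumption
)
```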
#### Speeds, Sizes, Times
This model was trained with ~550 million trainable parameters in a run that lasted a bit more than 30 minutes over 4 epochs. GPU utilization was above 90% at all times during training.
## Evaluation
Please see the graph below:
<img src="https://i.ibb.co/SB4gyQf/crossentropy.png" alt="Training and evaluation cross-entropy" style="width:50%;"/>
The final evaluation cross-entropy ended around 0.4 for this model.
The table below shows the best evaluation cross-entropy (across all parameters) for each of the techniques applied. Without the embedding nodes included, the results were usually worse by up to 0.1.
| Technique | Evaluation loss (Llama 3.1 fine-tuning) | Notes |
|:------------------|:---------------------------|:-----------|
| **LoRA** | 0.4603 | |
| **LoRA+** | 0.4011 | The model uploaded here |
| **DoRA** | 0.4182 | |
| **qLoRA (70B model)** | 0.3694 | Best evaluation loss, but the model was too big to optimize further within my budget |
| **qLoRA (8B model)** | 0.5471 | |
| **(LO)ReFT** | 0.4824 | |
The loss mask was applied during training, but it wasn't particularly useful since the model doesn't involve function calling or external data fetching.
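For context, the loss mask here refers to computing the loss only on the answer tokens rather than on the prompt. With `trl` this is typically done with a completion-only collator, as sketched below; the Llama 3.1 response template is an assumption about the exact setup.

```python
# Sketch of prompt masking so that only assistant tokens contribute to the loss.
from transformers import AutoTokenizer
from trl import DataCollatorForCompletionOnlyLM

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3.1-8B-Instruct")
collator = DataCollatorForCompletionOnlyLM(
    response_template="<|start_header_id|>assistant<|end_header_id|>\n\n",  # assumed template
    tokenizer=tokenizer,
)
```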
#### Metrics
Since the fine-tuned model is designed to explain and, where possible, summarize newly learned data, ROUGE and BERTScore metrics were measured on a sample of 50 manually crafted questions. The reference answers were constructed during the creation of the training and evaluation sets.
Given that GPT-4-turbo was already used in this context to generate the reference questions, I did not compare my model against it. Instead, I chose to compare it against the following models:
| Metric (mean/avg) | quantum-research-bot-v1.0 | Meta-Llama-3.1-8B-Instruct | gemini-1.5-pro |
|:------------------|:---------------------------|:--------------------|:------------------|
| **BERTScore F1** | 0.5821 | 0.3305 | 0.4982 |
| **ROUGE-1** | 0.6045 | 0.3152 |0.5029 |
| **ROUGE-2**| 0.4098 | 0.1751 | 0.3104 |
| **ROUGE-L**| 0.5809 | 0.2902 | 0.4856 |
| **BLEU**| 0.2538 | 0.0736 | 0.1753 |
_quantum-research-bot-v1.0_ outperformed the other models on all metrics, although _Gemini_ came close on BERTScore precision, with a difference of only 0.001. The Gemini model is better at recognizing subtle differences in the input, but it lacks the latest knowledge, which makes it perform worse overall.
Most other metrics, such as TruthfulQA, MMLU, and similar benchmarks, are not applicable here because this model has been fine-tuned for a very specific domain of knowledge.
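The reported numbers can be computed with the Hugging Face `evaluate` library along these lines; the prediction and reference lists below are placeholders, and the actual scoring script lives in the private repo.

```python
# Sketch of computing ROUGE, BERTScore, and BLEU over the benchmark answers.
import evaluate

predictions = ["..."]  # model answers to the 50 benchmark questions (placeholder)
references = ["..."]   # manually crafted reference answers (placeholder)

rouge = evaluate.load("rouge").compute(predictions=predictions, references=references)
bertscore = evaluate.load("bertscore").compute(
    predictions=predictions, references=references, lang="en"
)
bleu = evaluate.load("bleu").compute(predictions=predictions, references=references)

print(rouge["rouge1"], rouge["rouge2"], rouge["rougeL"])
print(sum(bertscore["f1"]) / len(bertscore["f1"]))  # mean BERTScore F1
print(bleu["bleu"])
```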
[More Metrics Coming In Future]
### Results
Quantization might also be needed after training to enable the model to run more efficiently on memory-constrained devices. The model was also built modularly and can be extended easily.
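If needed, the fine-tuned model can be loaded in 4-bit NF4 for memory-constrained inference, as sketched below; the quantization settings and repo id are illustrative rather than a shipped configuration.

```python
# Sketch of 4-bit NF4 loading for memory-constrained inference.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "nenad1002/quantum-research-bot-v1.0",  # assumed repo id
    quantization_config=bnb_config,
    device_map="auto",
)
```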
While the model outperforms baselines and other general-purpose models on most tasks, it still faces challenges with certain edge cases, particularly those involving rare terms, as well as sentences that differ significantly in structure.
These results show the potential of fine-tuning large models for specialized tasks and suggest that further exploration of hybrid optimization techniques could yield even better performance.
Additionally, greater investment in creating more robust and comprehensive datasets could lead to further improvements in model accuracy and generalization.
#### Summary
## Model Examination [optional]
<!-- Relevant interpretability work for the model goes here -->
[More Information Needed]
## Environmental Impact
<!-- Total emissions (in grams of CO2eq) and additional considerations, such as electricity usage, go here. Edit the suggested text below accordingly -->
Carbon emissions are estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).
- **Hardware Type:** RTX A6000
- **Hours used:** ~20 hours in total, although most training runs took a bit more than 30 minutes, with rare exceptions
- **Cloud Provider:** Runpod
- **Compute Region:** West US
- **Carbon Emitted:** 1.5 kg CO2
## Technical Specifications [optional]
### Model Architecture and Objective
[More Information Needed]
### Compute Infrastructure
For most workloads:
- 1 x RTX A6000
- 16 vCPU, 62 GB RAM

However, when fine-tuning `meta-llama/Meta-Llama-3-70B-Instruct`, I applied quantization and used 4 x A100 GPUs. Since this did not yield much improvement and was very costly, I decided to stick to models with fewer parameters.
#### Hardware
[More Information Needed]
#### Software
[More Information Needed]
## Citation [optional]
<!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. -->
**BibTeX:**
[More Information Needed]
**APA:**
[More Information Needed]
## Glossary [optional]
<!-- If relevant, include terms and calculations in this section that can help readers understand the model or model card. -->
[More Information Needed]
## More Information [optional]
[More Information Needed]
## Model Card Authors [optional]
[More Information Needed]
## Model Card Contact
[More Information Needed] |