---
license: apache-2.0
base_model: mistralai/Mistral-7B-v0.1
datasets:
- abacusai/MetaMathFewshot
- shahules786/orca-chat
- anon8231489123/ShareGPT_Vicuna_unfiltered
---
|
|
|
```json
{
  "layer_map": [
    [0, 16],
    [8, 24],
    [16, 32]
  ]
}
```
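
To make the layer map concrete, here is a small illustrative sketch (not code from this repo; the `expand_layer_map` helper is hypothetical) that expands the three ranges into the sequence of base-model layers making up the enlarged model:

```python
# Illustrative sketch: expand the layer_map ranges into the ordered list of
# base-model layer indices used by the enlarged model.
def expand_layer_map(layer_map):
    layers = []
    for start, end in layer_map:
        layers.extend(range(start, end))  # each [start, end) range reuses base layers
    return layers

layer_map = [(0, 16), (8, 24), (16, 32)]
expanded = expand_layer_map(layer_map)
print(len(expanded))   # 48 layers built from only 32 distinct base layers
print(expanded[:20])   # layers 8-23 each appear twice because adjacent ranges overlap
```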
|
|
|
![image/png](https://cdn-uploads.huggingface.co/production/uploads/64c14f6b02e1f8f67c73bd05/pf4d6FA7DriRtVq5HCkxd.png) |
|
|
|
This model is a variation of [abacusai/Fewshot-Metamath-OrcaVicuna-Mistral](https://huggingface.co/abacusai/Fewshot-Metamath-OrcaVicuna-Mistral)
that builds on the idea of scaling up models by duplicating layers of the base model, in this case
[mistralai/Mistral-7B-v0.1](https://huggingface.co/mistralai/Mistral-7B-v0.1). It relies on the functionality added in this PR,
https://github.com/huggingface/peft/pull/1368, to train a model with replicated layers without much extra GPU memory. So although 48 layers
carry LoRA adapters, there are only 32 distinct base layers, and memory usage stays roughly the same as for the base 7B model.
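
As a rough sketch of how the replicated-layer setup might be configured (assuming peft>=0.10.0, where the PR above exposes a `layer_replication` argument on `LoraConfig`; the rank, alpha, and target modules shown are illustrative, not the exact training configuration):

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Load the 32-layer base model; the replicated layers share these weights.
base = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-v0.1")

lora_config = LoraConfig(
    r=8,
    lora_alpha=16,  # illustrative hyperparameters, not the card's exact setup
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    # Same map as the JSON above: three overlapping 16-layer ranges -> 48 adapted
    # layers, while only 32 distinct sets of base weights are held in memory.
    layer_replication=[(0, 16), (8, 24), (16, 32)],
    task_type="CAUSAL_LM",
)

model = get_peft_model(base, lora_config)
model.print_trainable_parameters()  # only the LoRA matrices are trainable
```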
|
|
|
This is just a demonstration model to show how the approach can be used; the goal is to apply it to much larger models. For example,
models like Goliath or MegaDolphin are effectively 120B models, but with this approach they would only require the memory of the 70B base model
plus a little extra for the LoRA adaptation layers.
|
|
|
In our training runs we did find a difference in the behavior of the eval loss: |
|
|
|
![image/png](https://cdn-uploads.huggingface.co/production/uploads/64c14f95cac5f9ba52bbcd7f/vszXUSmANBw6EFjn4sX1N.png) |
|
|
|
versus the loss curve for the original LoRA fine-tune of the 7B model:
|
|
|
![image/png](https://cdn-uploads.huggingface.co/production/uploads/64c14f95cac5f9ba52bbcd7f/dis1P2MD_Rsyw81aIVByS.png) |
|
|
|
The larger model achieved a best eval loss of 0.3915 vs 0.3971, and in far fewer steps.
|
|
|
Overall, we think this is a promising approach to accessing much larger models without significantly more resources. |
|
|
|
# Performance on Metrics |
|
|
|
To do a proper ablation we compared the performance of 4 models; the fine-tuned variants were trained for ~1 epoch on the combined datasets (MetaMath,
Orca, ShareGPT). Here are the results:
|
|
|
| Model | Trainable Params | Train Loss | Eval Loss | GSM8K | TruthfulQA |
| :-----| ------: | ---------: | -------: | ----: | ---------: |
| Mistral 7B | 0 | - | - | 0.374 | 0.426 |
| Mistral 10B | 0 | - | - | 0.290 | 0.407 |
| Mistral 7B + LoRA r=12 | 31M | 0.412 | 0.366 | 0.514 | 0.499 |
| Mistral 10B + LoRA r=8 | 31M | 0.401 | 0.363 | 0.663 | 0.540 |
|
|
|
This ablation compares the base model (Mistral 7B), the expanded model built with the layer map described above, a LoRA `r=12` fine-tune
of the base model, and a LoRA `r=8` fine-tune of the expanded model (ranks chosen to match trainable parameters). It demonstrates quite clearly
that fine-tuning the expanded model leads to a significant improvement in metrics, even with the same number of trainable parameters (and training steps).
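
As a back-of-the-envelope check on why the two fine-tunes have matching budgets: LoRA adds roughly r × (in_features + out_features) parameters per adapted projection, so the total scales with rank times the number of adapted layers, and 12 × 32 = 8 × 48. Assuming adapters on all seven linear projections of each Mistral decoder layer (an assumption, not stated above), this works out to roughly the 31M figure in the table:

```python
# Rough parameter-count check (assumes LoRA on all seven linear projections per layer).
hidden, inter, kv = 4096, 14336, 1024  # Mistral-7B-v0.1 dimensions
per_layer_per_rank = (
    (hidden + hidden)        # q_proj
    + (hidden + kv)          # k_proj
    + (hidden + kv)          # v_proj
    + (hidden + hidden)      # o_proj
    + 3 * (hidden + inter)   # gate_proj, up_proj, down_proj
)
print(12 * 32 * per_layer_per_rank / 1e6)  # ~31.5M: r=12 over 32 base layers
print(8 * 48 * per_layer_per_rank / 1e6)   # ~31.5M: r=8 over 48 replicated layers
```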