---
license: apache-2.0
base_model: mistralai/Mistral-7B-v0.1
datasets:
- abacusai/MetaMathFewshot
- shahules786/orca-chat
- anon8231489123/ShareGPT_Vicuna_unfiltered
---

![image/png](https://cdn-uploads.huggingface.co/production/uploads/64c14f6b02e1f8f67c73bd05/pf4d6FA7DriRtVq5HCkxd.png)

This model is a variation of [abacusai/Fewshot-Metamath-OrcaVicuna-Mistral](https://huggingface.co/abacusai/Fewshot-Metamath-OrcaVicuna-Mistral) that builds on the idea of scaling up models by duplicating layers of the base model, in this case [mistralai/Mistral-7B-v0.1](https://huggingface.co/mistralai/Mistral-7B-v0.1). It relies on the functionality added in this PEFT PR, https://github.com/huggingface/peft/pull/1368, to train a model with replicated layers without much extra GPU memory. Although 48 layers have LoRA adapters attached, there are only 32 original layers, so memory usage is roughly the same as for the base 7B model.
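
In current PEFT releases this functionality is exposed as the `layer_replication` option on `LoraConfig`. The sketch below shows how such a setup could look; the exact layer ranges and LoRA hyperparameters used for this model are not stated in this card, so the values shown are purely illustrative.

```python
# Sketch: build a 48-layer model from Mistral-7B's 32 layers via layer
# replication, then attach LoRA adapters to the expanded stack. The
# replicated layers share weights with the originals, so GPU memory use
# stays close to that of the plain 7B model.
import torch
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-v0.1",
    torch_dtype=torch.bfloat16,
)

lora_config = LoraConfig(
    r=16,                      # illustrative rank
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    # Overlapping ranges of the 32 base layers: 16 + 16 + 16 = 48 layers.
    # The actual ranges used for this model are not documented here.
    layer_replication=[[0, 16], [8, 24], [16, 32]],
    task_type="CAUSAL_LM",
)

model = get_peft_model(base, lora_config)
model.print_trainable_parameters()  # only the LoRA weights are trainable
```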

This is just a demonstration model, intended to show how the approach can be used; the goal is to apply it to much larger models. For example, models like Goliath or MegaDolphin are effectively 120B models, but with this approach they would only need the memory of the 70B base model plus a little extra for the LoRA adapter layers.

In our training runs we did see a difference in the behavior of the eval loss:

![image/png](https://cdn-uploads.huggingface.co/production/uploads/64c14f95cac5f9ba52bbcd7f/vszXUSmANBw6EFjn4sX1N.png)

versus the loss curve for the original LoRA finetune of the 7B model:

![image/png](https://cdn-uploads.huggingface.co/production/uploads/64c14f95cac5f9ba52bbcd7f/dis1P2MD_Rsyw81aIVByS.png)

The larger model achieved a best eval loss of 0.3915 (vs 0.3971 for the original) in far fewer steps.

Overall, we think this is a promising approach for getting access to much larger models without needing significantly more resources.