siddartha-abacus committed (verified) · Commit 2bf967b · Parent: 4ccd2d0

Update README.md

Files changed (1): README.md (+29, -0)

README.md CHANGED

---
license: apache-2.0
base_model: mistralai/Mistral-7B-v0.1
datasets:
- abacusai/MetaMathFewshot
- shahules786/orca-chat
- anon8231489123/ShareGPT_Vicuna_unfiltered
---

![image/png](https://cdn-uploads.huggingface.co/production/uploads/64c14f6b02e1f8f67c73bd05/pf4d6FA7DriRtVq5HCkxd.png)

This model is a variation of [abacusai/Fewshot-Metamath-OrcaVicuna-Mistral](https://huggingface.co/abacusai/Fewshot-Metamath-OrcaVicuna-Mistral)
that builds on the idea of scaling up models by duplicating layers of the base model, in this case
[mistralai/Mistral-7B-v0.1](https://huggingface.co/mistralai/Mistral-7B-v0.1). It relies on the functionality added in
https://github.com/huggingface/peft/pull/1368 to train a model with replicated layers without much extra GPU memory.
Although 48 layers carry LoRA adapters, only 32 distinct base layers are stored, so memory usage is roughly the same as
for the base 7B model.
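
In code, the setup looks roughly like the sketch below, assuming the `layer_replication` option introduced in that PR.
The replication ranges and LoRA hyperparameters here are illustrative choices, not necessarily the ones used to train
this model:

```python
# Sketch: expand Mistral-7B's 32 layers into 48 effective layers by reusing
# (not copying) blocks of base weights, then attach LoRA adapters.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-v0.1")

lora_config = LoraConfig(
    task_type="CAUSAL_LM",
    r=16,                      # illustrative rank/alpha, not the trained values
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    # Three overlapping [start, end) ranges over the 32 base layers give
    # 16 + 16 + 16 = 48 effective layers, while base weights for only
    # 32 layers are kept in GPU memory.
    layer_replication=[[0, 16], [8, 24], [16, 32]],
)

model = get_peft_model(base, lora_config)
model.print_trainable_parameters()  # only the LoRA weights on the 48 layers train
```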

This is just a demonstration model to show how the approach can be used; the goal is to apply it to much larger models.
For example, models like Goliath or MegaDolphin are effectively 120B models, but with this approach they would only need
the memory of the underlying 70B base model plus a little extra for the LoRA adaptation layers.
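
Purely for a sense of scale (these are made-up ranges, not the actual Goliath or MegaDolphin layer recipes), a
120B-class stack over an 80-layer 70B base model could be described with a config along these lines:

```python
from peft import LoraConfig

# Hypothetical ranges: seven overlapping 20-layer slices of an 80-layer base
# give 140 effective layers, while only the original 80 layers' weights
# occupy GPU memory.
config_120b_style = LoraConfig(
    task_type="CAUSAL_LM",
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    layer_replication=[[i * 10, i * 10 + 20] for i in range(7)],
)
```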

In our training runs we did find a difference in the behavior of the eval loss:

![image/png](https://cdn-uploads.huggingface.co/production/uploads/64c14f95cac5f9ba52bbcd7f/vszXUSmANBw6EFjn4sX1N.png)

versus the loss curve for the original LoRA finetune of the 7B model:

![image/png](https://cdn-uploads.huggingface.co/production/uploads/64c14f95cac5f9ba52bbcd7f/dis1P2MD_Rsyw81aIVByS.png)

The larger model reached a best eval loss of 0.3915, versus 0.3971 for the original, in far fewer steps.

Overall, we think this is a promising approach to accessing much larger models without needing significantly more resources.