---
license: apache-2.0
tags:
- moe
train: false
inference: false
pipeline_tag: text-generation
---
## Mixtral-8x7B-Instruct-v0.1-hf-attn-4bit-moe-3bit-metaoffload-HQQ

This is a version of the <a href="https://huggingface.co/mistralai/Mixtral-8x7B-Instruct-v0.1">Mixtral-8x7B-Instruct-v0.1 model</a> quantized with a mix of 4-bit and 3-bit precision via Half-Quadratic Quantization (HQQ).

More specifically, the attention layers are quantized to 4-bit and the experts are quantized to 3-bit; the exact configs are given in the Quantization section below.

Unlike the <a href="https://huggingface.co/mobiuslabsgmbh/Mixtral-8x7B-Instruct-v0.1-hf-attn-4bit-moe-2bitgs8-metaoffload-HQQ">2bitgs8 model</a>, which was designed to minimize GPU memory use, this version uses about 22 GB of VRAM and targets users who want better quality and can dedicate most of the memory of a 24 GB GPU.
![image/gif](https://cdn-uploads.huggingface.co/production/uploads/636b945ef575d3705149e982/-gwGOZHDb9l5VxLexIhkM.gif)

----------------------------------------------------------------------------------------------------------------------------------
## Performance

| Models              | Mixtral Original | HQQ quantized   |
|---------------------|------------------|-----------------|
| Runtime VRAM        | 94 GB            | <b>22.3 GB</b>  |
| ARC (25-shot)       | 70.22            | 69.62           |
| Hellaswag (10-shot) | 87.63            |                 |
| MMLU (5-shot)       | 71.16            |                 |
| TruthfulQA-MC2      | 64.58            | 62.63           |
| Winogrande (5-shot) | 81.37            | 81.06           |
| GSM8K (5-shot)      | 60.73            |                 |
| Average             | 72.62            |                 |

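The scores above follow the usual few-shot settings listed in the table. As a rough sketch (not the exact evaluation setup used to produce these numbers), one task can be re-run with the lm-evaluation-harness Python API, assuming `lm-eval >= 0.4` is installed; the model is loaded the same way as in the Basic Usage section below, and you may need to adapt the wrapping depending on your hqq/lm-eval versions:

``` Python
#Reproduction sketch only: assumes lm-eval >= 0.4; not the exact harness configuration used for the table above
import lm_eval
from lm_eval.models.huggingface import HFLM
from hqq.engine.hf import HQQModelForCausalLM, AutoTokenizer

model_id  = 'mobiuslabsgmbh/Mixtral-8x7B-Instruct-v0.1-hf-attn-4bit-moe-3bit-metaoffload-HQQ'
tokenizer = AutoTokenizer.from_pretrained(model_id)
model     = HQQModelForCausalLM.from_quantized(model_id)

#Wrap the already-loaded quantized model so the harness can evaluate it directly
lm = HFLM(pretrained=model, tokenizer=tokenizer, batch_size=1)

#ARC-Challenge with 25-shot, matching the table above
results = lm_eval.simple_evaluate(model=lm, tasks=["arc_challenge"], num_fewshot=25)
print(results["results"])
```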
### Basic Usage

To run the model, install the HQQ library from https://github.com/mobiusml/hqq and use it as follows:
``` Python
import transformers
from threading import Thread

model_id = 'mobiuslabsgmbh/Mixtral-8x7B-Instruct-v0.1-hf-attn-4bit-moe-3bit-metaoffload-HQQ'

#Load the quantized model and its tokenizer
from hqq.engine.hf import HQQModelForCausalLM, AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained(model_id)
model     = HQQModelForCausalLM.from_quantized(model_id)

#Optional: set the backend/compile
#You will need to install the CUDA kernels beforehand:
# git clone https://github.com/mobiusml/hqq/
# cd hqq/kernels && python setup_cuda.py install
from hqq.core.quantize import *
HQQLinear.set_backend(HQQBackend.ATEN_BACKPROP)


def chat_processor(chat, max_new_tokens=100, do_sample=True):
    tokenizer.use_default_system_prompt = False
    streamer = transformers.TextIteratorStreamer(tokenizer, timeout=10.0, skip_prompt=True, skip_special_tokens=True)

    #Wrap the prompt in Mixtral's [INST] chat template and set the generation parameters
    generate_params = dict(
        tokenizer("<s> [INST] " + chat + " [/INST] ", return_tensors="pt").to('cuda'),
        streamer=streamer,
        max_new_tokens=max_new_tokens,
        do_sample=do_sample,
        top_p=0.90,
        top_k=50,
        temperature=0.6,
        num_beams=1,
        repetition_penalty=1.2,
    )

    #Run generation in a background thread and stream tokens as they are produced
    t = Thread(target=model.generate, kwargs=generate_params)
    t.start()
    outputs = []
    for text in streamer:
        outputs.append(text)
        print(text, end="", flush=True)

    return outputs

################################################################################################
#Generation
outputs = chat_processor("How do I build a car?", max_new_tokens=1000, do_sample=False)
```
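
If you do not need token streaming, you can call `model.generate` directly with the same chat template. The sketch below reuses the `model` and `tokenizer` loaded above (the prompt is just an example) and prints the peak GPU memory PyTorch has allocated, which roughly tracks, but does not exactly match, the Runtime VRAM figure in the table:

``` Python
import torch

#Non-streaming generation with the same [INST] chat template as above (example prompt)
inputs = tokenizer("<s> [INST] How do I build a car? [/INST] ", return_tensors="pt").to('cuda')

with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=256, do_sample=False)

print(tokenizer.decode(output_ids[0], skip_special_tokens=True))

#Peak memory allocated by PyTorch so far (allocator overhead means nvidia-smi will report a bit more)
print(f"Peak VRAM: {torch.cuda.max_memory_allocated() / 1024**3:.1f} GB")
```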

### Quantization

You can reproduce the quantized model using the following quantization configs:

``` Python
import torch
from hqq.engine.hf import HQQModelForCausalLM, AutoTokenizer

model_id  = "mistralai/Mixtral-8x7B-Instruct-v0.1"
#hf_auth: your Hugging Face access token, cache_path: your local cache directory
model     = HQQModelForCausalLM.from_pretrained(model_id, use_auth_token=hf_auth, cache_dir=cache_path)

#Quantization params: 4-bit attention, 3-bit experts, with quantization metadata offloaded
from hqq.core.quantize import *
attn_params    = BaseQuantizeConfig(nbits=4, group_size=64, offload_meta=True)
experts_params = BaseQuantizeConfig(nbits=3, group_size=64, offload_meta=True)

zero_scale_group_size = 128
attn_params['scale_quant_params']['group_size']    = zero_scale_group_size
attn_params['zero_quant_params']['group_size']     = zero_scale_group_size
experts_params['scale_quant_params']['group_size'] = zero_scale_group_size
experts_params['zero_quant_params']['group_size']  = zero_scale_group_size

quant_config = {}
#Attention
quant_config['self_attn.q_proj'] = attn_params
quant_config['self_attn.k_proj'] = attn_params
quant_config['self_attn.v_proj'] = attn_params
quant_config['self_attn.o_proj'] = attn_params
#Experts
quant_config['block_sparse_moe.experts.w1'] = experts_params
quant_config['block_sparse_moe.experts.w2'] = experts_params
quant_config['block_sparse_moe.experts.w3'] = experts_params

#Quantize
model.quantize_model(quant_config=quant_config, compute_dtype=torch.float16)
model.eval()
```
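
After quantization finishes, a quick generation is a useful sanity check before saving or benchmarking. This sketch reuses the imports and variables from the block above and assumes the quantized model ended up on the GPU (the default); the prompt is just an example:

``` Python
#Quick sanity check on the freshly quantized model (example prompt)
tokenizer = AutoTokenizer.from_pretrained(model_id, use_auth_token=hf_auth, cache_dir=cache_path)
inputs    = tokenizer("<s> [INST] How do I build a car? [/INST] ", return_tensors="pt").to('cuda')

with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=64, do_sample=False)

print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```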