mobicham committed on
Commit 0ae6d1e · verified · 1 Parent(s): 0b5e26c

Create README.md

  1. README.md +120 -0
README.md ADDED
@@ -0,0 +1,120 @@
+ ---
+ license: apache-2.0
+ tags:
+ - moe
+ train: false
+ inference: false
+ pipeline_tag: text-generation
+ ---
+ ## Mixtral-8x7B-Instruct-v0.1-hf-attn-4bit-moe-3bit-metaoffload-HQQ
+ This is a version of the
+ <a href="https://huggingface.co/mistralai/Mixtral-8x7B-Instruct-v0.1">Mixtral-8x7B-Instruct-v0.1 model</a> quantized with a mix of 4-bit and 3-bit precision via Half-Quadratic Quantization (HQQ).
+
+ More specifically, the attention layers are quantized to 4-bit and the experts are quantized to 3-bit.
+
+ Unlike the <a href="https://huggingface.co/mobiuslabsgmbh/Mixtral-8x7B-Instruct-v0.1-hf-attn-4bit-moe-2bitgs8-metaoffload-HQQ">2bitgs8 model</a>, which was designed to use less GPU memory, this one uses about 22 GB of VRAM. It is intended for users who want better quality and can dedicate most of the VRAM available on a 24 GB GPU.
+
+ ![image/gif](https://cdn-uploads.huggingface.co/production/uploads/636b945ef575d3705149e982/-gwGOZHDb9l5VxLexIhkM.gif)
+
+ ----------------------------------------------------------------------------------------------------------------------------------
+ </p>
+
+ ## Performance
+ | Models              | Mixtral Original | HQQ quantized  |
+ |---------------------|------------------|----------------|
+ | Runtime VRAM        | 94 GB            | <b>22.3 GB</b> |
+ | ARC (25-shot)       | 70.22            | 69.62          |
+ | Hellaswag (10-shot) | 87.63            |                |
+ | MMLU (5-shot)       | 71.16            |                |
+ | TruthfulQA-MC2      | 64.58            | 62.63          |
+ | Winogrande (5-shot) | 81.37            | 81.06          |
+ | GSM8K (5-shot)      | 60.73            |                |
+ | Average             | 72.62            |                |
+
+ ### Basic Usage
+ To run the model, install the HQQ library from https://github.com/mobiusml/hqq and use it as follows:
+ ``` Python
+ import transformers
+ from threading import Thread
+
+ model_id = 'mobiuslabsgmbh/Mixtral-8x7B-Instruct-v0.1-hf-attn-4bit-moe-3bit-metaoffload-HQQ'
+ # Load the quantized model and its tokenizer
+ from hqq.engine.hf import HQQModelForCausalLM, AutoTokenizer
+ tokenizer = AutoTokenizer.from_pretrained(model_id)
+ model = HQQModelForCausalLM.from_quantized(model_id)
+
+ # Optional: set the backend/compile
+ # The CUDA kernels must be installed beforehand:
+ # git clone https://github.com/mobiusml/hqq/
+ # cd hqq/kernels && python setup_cuda.py install
+ from hqq.core.quantize import HQQLinear, HQQBackend
+ HQQLinear.set_backend(HQQBackend.ATEN_BACKPROP)
+
+
+ def chat_processor(chat, max_new_tokens=100, do_sample=True):
+     tokenizer.use_default_system_prompt = False
+     streamer = transformers.TextIteratorStreamer(tokenizer, timeout=10.0, skip_prompt=True, skip_special_tokens=True)
+     # Tokenized prompt plus generation settings, passed to model.generate as keyword arguments
+     generate_params = dict(
+         tokenizer("<s> [INST] " + chat + " [/INST] ", return_tensors="pt").to('cuda'),
+         streamer=streamer,
+         max_new_tokens=max_new_tokens,
+         do_sample=do_sample,
+         top_p=0.90,
+         top_k=50,
+         temperature=0.6,
+         num_beams=1,
+         repetition_penalty=1.2,
+     )
+     # Generate in a background thread so the streamer can be consumed here
+     t = Thread(target=model.generate, kwargs=generate_params)
+     t.start()
+     outputs = []
+     for text in streamer:
+         outputs.append(text)
+         print(text, end="", flush=True)
+
+     return outputs
+
+ ################################################################################################
+ # Generation
+ outputs = chat_processor("How do I build a car?", max_new_tokens=1000, do_sample=False)
+ ```
+
+
+ ### Quantization
+
+ You can reproduce the model using the following quantization configs (`hf_auth` and `cache_path` are placeholders for your Hugging Face access token and cache directory):
+
+ ``` Python
+ import torch
+ from hqq.engine.hf import HQQModelForCausalLM, AutoTokenizer
+ from hqq.core.quantize import BaseQuantizeConfig
+
+ model_id = "mistralai/Mixtral-8x7B-Instruct-v0.1"
+ model = HQQModelForCausalLM.from_pretrained(model_id, use_auth_token=hf_auth, cache_dir=cache_path)
+
+ # Quantization parameters: 4-bit attention, 3-bit experts
+ attn_params = BaseQuantizeConfig(nbits=4, group_size=64, offload_meta=True)
+ experts_params = BaseQuantizeConfig(nbits=3, group_size=64, offload_meta=True)
+
+ zero_scale_group_size = 128
+ for params in (attn_params, experts_params):
+     params['scale_quant_params']['group_size'] = zero_scale_group_size
+     params['zero_quant_params']['group_size'] = zero_scale_group_size
+
+ quant_config = {}
+ # Attention
+ quant_config['self_attn.q_proj'] = attn_params
+ quant_config['self_attn.k_proj'] = attn_params
+ quant_config['self_attn.v_proj'] = attn_params
+ quant_config['self_attn.o_proj'] = attn_params
+ # Experts
+ quant_config['block_sparse_moe.experts.w1'] = experts_params
+ quant_config['block_sparse_moe.experts.w2'] = experts_params
+ quant_config['block_sparse_moe.experts.w3'] = experts_params
+
+ # Quantize
+ model.quantize_model(quant_config=quant_config, compute_dtype=torch.float16)
+ model.eval()
+ ```
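
As a rough sanity check on the memory figure, the weight footprint implied by the 4-bit attention / 3-bit expert split can be estimated from layer shapes alone. The sketch below is not part of the committed README. It assumes the layer dimensions of the public Mixtral-8x7B config (hidden size 4096, expert intermediate size 14336, 32 layers, 8 experts, 8 key/value heads of dimension 128, vocabulary 32000), none of which are stated in this README. It ignores quantization metadata (scales and zero-points), bit-packing overhead, the small router and norm weights, activations, and the KV cache, so it should be read as a lower bound, not as a prediction of runtime VRAM.

```python
# Back-of-the-envelope weight footprint for the mixed 4-bit / 3-bit setup.
# All dimensions below are ASSUMED from the public Mixtral-8x7B config,
# not taken from this README.
HIDDEN = 4096          # model (hidden) size
INTERMEDIATE = 14336   # expert feed-forward size
N_LAYERS = 32
N_EXPERTS = 8
KV_DIM = 8 * 128       # 8 key/value heads x head dimension 128
VOCAB = 32000

# q_proj and o_proj are HIDDEN x HIDDEN; k_proj and v_proj are HIDDEN x KV_DIM.
ATTN_PARAMS = N_LAYERS * (2 * HIDDEN * HIDDEN + 2 * HIDDEN * KV_DIM)
# Each expert holds w1, w2 and w3, each with HIDDEN * INTERMEDIATE weights.
EXPERT_PARAMS = N_LAYERS * N_EXPERTS * 3 * HIDDEN * INTERMEDIATE
# Input embeddings and output head, assumed to stay in float16.
FP16_PARAMS = 2 * VOCAB * HIDDEN


def weight_gb(n_params: int, nbits: int) -> float:
    """Ideal storage in GB (10**9 bytes) for n_params weights at nbits each."""
    return n_params * nbits / 8 / 1e9


attn_gb = weight_gb(ATTN_PARAMS, 4)       # attention layers at 4-bit
experts_gb = weight_gb(EXPERT_PARAMS, 3)  # experts at 3-bit
fp16_gb = weight_gb(FP16_PARAMS, 16)      # unquantized embeddings and head
total_gb = attn_gb + experts_gb + fp16_gb

print(f"attention (4-bit): {attn_gb:.2f} GB")
print(f"experts   (3-bit): {experts_gb:.2f} GB")
print(f"fp16 remainder   : {fp16_gb:.2f} GB")
print(f"total (weights)  : {total_gb:.2f} GB")
```

Under these assumptions the experts dominate the total, which comes out at roughly 18 GB of ideal weight storage. The reported 22.3 GB runtime figure also covers everything the sketch leaves out.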