How to quantize a 70B model so it will fit on 2x 4090 GPUs:

I tried EXL2, AutoAWQ, and SqueezeLLM, and they all failed for different reasons (issues opened).

HQQ worked:

I rented a 4x GPU, 1TB RAM instance ($19/hr) on RunPod with a 1024GB container and a 1024GB workspace disk. I think you only need 2x GPUs with 80GB VRAM and 512GB+ of system RAM, so I probably overpaid.
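
As a rough sanity check on why this fits (back-of-envelope arithmetic of mine, not a measurement): 70B weights at 4 bits are about 33 GiB packed, and the group-wise scales/zeros add well under a bit per weight, so the quantized model sits inside the 48GB of combined VRAM on 2x 4090s:

```python
# Back-of-envelope VRAM estimate -- illustrative arithmetic, not measured
params = 70e9
bits_per_weight = 4

weights_gib = params * bits_per_weight / 8 / 1024**3  # ~32.6 GiB of packed weights

# With group_size=64, each group of 64 weights stores a scale and a zero point;
# if those are kept at 8 bits per group, that's 16 bits / 64 weights = 0.25
# bits per weight, plus a little second-level metadata on top.
overhead_bits = 0.5  # generous rough guess for scales/zeros
total_gib = params * (bits_per_weight + overhead_bits) / 8 / 1024**3

print(f"weights only:  {weights_gib:.1f} GiB")  # ~32.6
print(f"with metadata: {total_gib:.1f} GiB")    # ~36.7, under the 48 GB of 2x 4090
```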

Note: you need to fill in the form to get access to the 70B Meta weights.

You can copy/paste this into the console and it will set everything up automatically:

```bash
apt update
apt install vim -y

# Install Miniconda
mkdir -p ~/miniconda3
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -O ~/miniconda3/miniconda.sh
bash ~/miniconda3/miniconda.sh -b -u -p ~/miniconda3
~/miniconda3/bin/conda init bash
source ~/.bashrc

conda create -n hqq python=3.10 -y && conda activate hqq

# Build HQQ from source
git lfs install
git clone https://github.com/mobiusml/hqq.git
cd hqq

pip install torch
pip install .

# Faster downloads from the Hugging Face Hub
pip install "huggingface_hub[hf_transfer]"
export HF_HUB_ENABLE_HF_TRANSFER=1

huggingface-cli login
```
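
Before quantizing, it can be worth checking that PyTorch actually sees the GPUs. A minimal check of my own, not part of the original setup:

```python
# check_gpus.py -- optional sanity check, not from the original instructions
import torch

print("torch", torch.__version__, "| CUDA available:", torch.cuda.is_available())
for i in range(torch.cuda.device_count()):
    props = torch.cuda.get_device_properties(i)
    print(f"GPU {i}: {props.name}, {props.total_memory / 1024**3:.1f} GiB")
```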

Create the `quantize.py` file by copy/pasting this into the console:

```bash
echo "
import torch

model_id = 'meta-llama/Meta-Llama-3-70B-Instruct'
save_dir = 'cat-llama-3-70b-hqq'
compute_dtype = torch.bfloat16

# 4-bit weights with group size 64; offload quantization metadata to CPU to save VRAM
from hqq.core.quantize import *
quant_config = BaseQuantizeConfig(nbits=4, group_size=64, offload_meta=True)
zero_scale_group_size = 128
quant_config['scale_quant_params']['group_size'] = zero_scale_group_size
quant_config['zero_quant_params']['group_size'] = zero_scale_group_size

from hqq.engine.hf import HQQModelForCausalLM, AutoTokenizer
model = HQQModelForCausalLM.from_pretrained(model_id)

from hqq.models.hf.base import AutoHQQHFModel
AutoHQQHFModel.quantize_model(model, quant_config=quant_config,
                              compute_dtype=compute_dtype)

# Save the quantized model, then reload it to make sure it loads cleanly
AutoHQQHFModel.save_quantized(model, save_dir)
model = AutoHQQHFModel.from_quantized(save_dir)

model.eval()
" > quantize.py
```

Run the script:

```bash
python quantize.py
```
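
Once it finishes, the quantized model in `cat-llama-3-70b-hqq` can be loaded back for inference. A minimal sketch of my own, assuming `from_quantized` places the model on the GPU by default and that the HQQ-wrapped model exposes the usual transformers `generate()` API (the prompt and settings are just illustrative):

```python
# inference.py -- a sketch of my own, not from the original README
import torch
from hqq.engine.hf import AutoTokenizer
from hqq.models.hf.base import AutoHQQHFModel

model_id = 'meta-llama/Meta-Llama-3-70B-Instruct'
save_dir = 'cat-llama-3-70b-hqq'

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoHQQHFModel.from_quantized(save_dir)
model.eval()

# Assumes the patched model still supports the standard transformers generate() API
inputs = tokenizer('Explain 4-bit quantization in one sentence.', return_tensors='pt').to('cuda')
with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```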