How to quantize a 70B model so it will fit on 2x 4090 GPUs:

I tried EXL2, AutoAWQ, and SqueezeLLM, and they all failed for different reasons (issues opened).

HQQ worked:

I rented a 4x GPU, 1TB RAM instance ($19/hr) on RunPod with a 1024GB container and a 1024GB workspace disk. I think you only need 2x GPUs with 80GB VRAM and 512GB+ of system RAM, so I probably overpaid.
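
As a rough sanity check on why this fits (back-of-envelope arithmetic of mine, not a measurement): 70B weights at 4 bits are about 33 GiB packed, and the group-wise scales/zeros add well under a bit per weight, so the quantized model sits inside the 48GB of combined VRAM on 2x 4090s:

```python
# Back-of-envelope VRAM estimate -- illustrative arithmetic, not measured
params = 70e9
bits_per_weight = 4

weights_gib = params * bits_per_weight / 8 / 1024**3  # ~32.6 GiB of packed weights

# With group_size=64, each group of 64 weights stores a scale and a zero point;
# if those are kept at 8 bits per group, that's 16 bits / 64 weights = 0.25
# bits per weight, plus a little second-level metadata on top.
overhead_bits = 0.5  # generous rough guess for scales/zeros
total_gib = params * (bits_per_weight + overhead_bits) / 8 / 1024**3

print(f"weights only:  {weights_gib:.1f} GiB")  # ~32.6
print(f"with metadata: {total_gib:.1f} GiB")    # ~36.7, under the 48 GB of 2x 4090
```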

Note: you need to fill in the form to get access to the 70B Meta weights.

You can copy/paste this into the console and it will set everything up automatically:

```bash
apt update
apt install vim -y

# Install Miniconda
mkdir -p ~/miniconda3
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -O ~/miniconda3/miniconda.sh
bash ~/miniconda3/miniconda.sh -b -u -p ~/miniconda3
~/miniconda3/bin/conda init bash
source ~/.bashrc

conda create -n hqq python=3.10 -y && conda activate hqq

# Build HQQ from source
git lfs install
git clone https://github.com/mobiusml/hqq.git
cd hqq

pip install torch
pip install .

# Faster downloads from the Hugging Face Hub
pip install "huggingface_hub[hf_transfer]"
export HF_HUB_ENABLE_HF_TRANSFER=1

huggingface-cli login
```
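
Before quantizing, it can be worth checking that PyTorch actually sees the GPUs. A minimal check of my own, not part of the original setup:

```python
# check_gpus.py -- optional sanity check, not from the original instructions
import torch

print("torch", torch.__version__, "| CUDA available:", torch.cuda.is_available())
for i in range(torch.cuda.device_count()):
    props = torch.cuda.get_device_properties(i)
    print(f"GPU {i}: {props.name}, {props.total_memory / 1024**3:.1f} GiB")
```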

Create the `quantize.py` file by copy/pasting this into the console:

```bash
echo "
import torch

model_id = 'meta-llama/Meta-Llama-3-70B-Instruct'
save_dir = 'cat-llama-3-70b-hqq'
compute_dtype = torch.bfloat16

# 4-bit weights with group size 64; offload quantization metadata to CPU to save VRAM
from hqq.core.quantize import *
quant_config = BaseQuantizeConfig(nbits=4, group_size=64, offload_meta=True)
zero_scale_group_size = 128
quant_config['scale_quant_params']['group_size'] = zero_scale_group_size
quant_config['zero_quant_params']['group_size'] = zero_scale_group_size

from hqq.engine.hf import HQQModelForCausalLM, AutoTokenizer
model = HQQModelForCausalLM.from_pretrained(model_id)

from hqq.models.hf.base import AutoHQQHFModel
AutoHQQHFModel.quantize_model(model, quant_config=quant_config,
                              compute_dtype=compute_dtype)

# Save the quantized model, then reload it to make sure it loads cleanly
AutoHQQHFModel.save_quantized(model, save_dir)
model = AutoHQQHFModel.from_quantized(save_dir)

model.eval()
" > quantize.py
```

Run the script:

```bash
python quantize.py
```
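
Once it finishes, the quantized model in `cat-llama-3-70b-hqq` can be loaded back for inference. A minimal sketch of my own, assuming `from_quantized` places the model on the GPU by default and that the HQQ-wrapped model exposes the usual transformers `generate()` API (the prompt and settings are just illustrative):

```python
# inference.py -- a sketch of my own, not from the original README
import torch
from hqq.engine.hf import AutoTokenizer
from hqq.models.hf.base import AutoHQQHFModel

model_id = 'meta-llama/Meta-Llama-3-70B-Instruct'
save_dir = 'cat-llama-3-70b-hqq'

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoHQQHFModel.from_quantized(save_dir)
model.eval()

# Assumes the patched model still supports the standard transformers generate() API
inputs = tokenizer('Explain 4-bit quantization in one sentence.', return_tensors='pt').to('cuda')
with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```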