Llama-2-70b-hf-2bit_g16_s128-HQQ

This is a version of the Llama-2-70b-hf model quantized to 2-bit via Half-Quadratic Quantization (HQQ): https://mobiusml.github.io/hqq_blog/

This model outperforms an fp16 Llama-2-13B (perplexity 4.13 vs. 4.63) while taking up a comparable amount of memory (~26 GB).
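
For reference, a 2-bit, group-size-16 quantization configuration like the one in this model's name can be expressed with the hqq library roughly as follows. This is only a sketch against the hqq 0.1.x API (BaseQuantizeConfig, quantize_model); the exact settings used to produce this checkpoint, including the scale grouping implied by "s128", are assumptions, and quantizing the 70B base model yourself is not required to use this repository.

# Sketch: how a similar HQQ config could be applied to the base model (hqq 0.1.x API; settings assumed)
from hqq.engine.hf import HQQModelForCausalLM
from hqq.core.quantize import BaseQuantizeConfig

base_model = HQQModelForCausalLM.from_pretrained('meta-llama/Llama-2-70b-hf')
quant_config = BaseQuantizeConfig(nbits=2, group_size=16, quant_scale=True)  # 2-bit weights, group size 16; quantized scales assumed
base_model.quantize_model(quant_config=quant_config)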

To run the model, install the HQQ library and a compatible transformers version:

# Note: this model is deprecated and requires older versions of these libraries
pip install hqq==0.1.8
pip install transformers==4.46.0

and use it as follows:

model_id = 'mobiuslabsgmbh/Llama-2-70b-hf-2bit_g16_s128-HQQ'

from hqq.engine.hf import HQQModelForCausalLM, AutoTokenizer

# Load the tokenizer and the pre-quantized model weights from the Hub
tokenizer = AutoTokenizer.from_pretrained(model_id)
model     = HQQModelForCausalLM.from_quantized(model_id)
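
Once loaded, the model can be used with the standard transformers generation API. A minimal sketch follows; the prompt, device placement, and generation parameters are illustrative assumptions.

# Minimal generation sketch (assumes a CUDA device and the usual transformers generate() API)
import torch

prompt = "Explain half-quadratic quantization in one sentence."
inputs = tokenizer(prompt, return_tensors='pt').to('cuda')
with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=64, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))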

Limitations:
- Only supports single-GPU runtime.
- Not compatible with HuggingFace's PEFT.
