stablelm-tuned-alpha-3b-gptq-4bit-128g
This is a quantized model saved with auto-gptq. At the time of writing, auto-gptq cannot load quantized models directly from the hub; you will need to clone this repo and load it locally.
git lfs install
git clone https://huggingface.co/ethzanalytics/stablelm-tuned-alpha-3b-gptq-4bit-128g
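Once cloned, point .from_quantized at the local folder. The following is a minimal sketch, assuming the weights are stored as a safetensors file (drop use_safetensors=True otherwise) and that the tokenizer files are included in the clone:
from transformers import AutoTokenizer, TextGenerationPipeline
from auto_gptq import AutoGPTQForCausalLM

model_dir = "stablelm-tuned-alpha-3b-gptq-4bit-128g"  # path to the local clone

tokenizer = AutoTokenizer.from_pretrained(model_dir)
model = AutoGPTQForCausalLM.from_quantized(model_dir, use_safetensors=True, use_triton=False)

pipeline = TextGenerationPipeline(model=model, tokenizer=tokenizer, device="cuda:0")
print(pipeline("Hello, how are you?")[0]["generated_text"])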
See the excerpt from the auto-gptq tutorial below for detailed instructions.
Auto-GPTQ Quick Start
Quick Installation
Starting from v0.0.4, you can install auto-gptq directly from PyPI using pip:
pip install auto-gptq
AutoGPTQ supports using triton to speed up inference, but it currently only supports Linux. To integrate with triton, use:
pip install auto-gptq[triton]
If you want to try the newly supported llama type models in 🤗 Transformers without updating it to the latest version, use:
pip install auto-gptq[llama]
By default, the CUDA extension will be built at installation time if CUDA and pytorch are already installed.
To disable building the CUDA extension, you can use the following commands:
For Linux
BUILD_CUDA_EXT=0 pip install auto-gptq
For Windows
set BUILD_CUDA_EXT=0 && pip install auto-gptq
Basic Usage
The full script for the basic usage demonstrated here is examples/quantization/basic_usage.py in the auto-gptq repository.
The two main classes currently used in AutoGPTQ are AutoGPTQForCausalLM and BaseQuantizeConfig.
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
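For context, the quantization side of basic_usage.py (the step that produces a directory like the one loaded below) looks roughly like the following sketch; the source model name, directory names, and the single calibration example are illustrative:
from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig

pretrained_model_dir = "facebook/opt-125m"   # illustrative source model
quantized_model_dir = "opt-125m-4bit-128g"   # where the quantized model is saved

tokenizer = AutoTokenizer.from_pretrained(pretrained_model_dir, use_fast=True)
# calibration data; real runs should use a larger, representative set of examples
examples = [tokenizer("auto-gptq is an easy-to-use model quantization library with user-friendly apis, based on GPTQ algorithm.")]

quantize_config = BaseQuantizeConfig(bits=4, group_size=128)  # 4-bit weights, 128-column groups

# load the full-precision model, quantize it with the calibration examples, and save
model = AutoGPTQForCausalLM.from_pretrained(pretrained_model_dir, quantize_config)
model.quantize(examples)
model.save_quantized(quantized_model_dir, use_safetensors=True)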
Load quantized model and do inference
Instead of .from_pretrained, you should use .from_quantized to load a quantized model.
device = "cuda:0"
model = AutoGPTQForCausalLM.from_quantized(quantized_model_dir, use_triton=False, use_safetensors=True)
This will first read quantize_config.json in the opt-125m-4bit-128g directory, then, based on the values of bits and group_size in it, load the gptq_model-4bit-128g.bin model file (or its .safetensors equivalent when use_safetensors=True) onto the first GPU.
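For reference, the quantize_config.json of a 4-bit, group-size-128 model like this one contains at least the two fields mentioned above; other fields may also be present depending on the auto-gptq version:
{
  "bits": 4,
  "group_size": 128
}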
Then you can initialize 🤗 Transformers' TextGenerationPipeline
and do inference.
from transformers import AutoTokenizer, TextGenerationPipeline

tokenizer = AutoTokenizer.from_pretrained(pretrained_model_dir)  # the tokenizer of the original, unquantized model
pipeline = TextGenerationPipeline(model=model, tokenizer=tokenizer, device=device)
print(pipeline("auto-gptq is")[0]["generated_text"])
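If you prefer not to use a pipeline, you can also call generate on the model directly; a minimal sketch:
# tokenize the prompt, move it to the model's device, generate, and decode
inputs = tokenizer("auto-gptq is", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=32)[0]))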
Conclusion
Congrats! You have learned how to quickly install auto-gptq and integrate with it. The next chapter of the tutorial covers advanced loading strategies for pretrained or quantized models and some best practices for different situations.