How to run the Int4 quantized model?
#10
by CharlesLincoln - opened
Same as the title
In the GitHub code, in basic_demo/trans_cli_vision_demo.py, comment out the default bfloat16 loading block and use the INT4 block below it instead:
# model = AutoModel.from_pretrained(
#     MODEL_PATH,
#     trust_remote_code=True,
#     # attn_implementation="flash_attention_2",  # Use Flash Attention
#     torch_dtype=torch.bfloat16,
#     device_map="auto",
# ).eval()
## For INT4 inference
import torch  # these imports are already at the top of trans_cli_vision_demo.py
from transformers import AutoModel, BitsAndBytesConfig

model = AutoModel.from_pretrained(
    MODEL_PATH,
    trust_remote_code=True,
    quantization_config=BitsAndBytesConfig(load_in_4bit=True),  # 4-bit weights via bitsandbytes
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
).eval()
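If you want more control over the 4-bit quantization, BitsAndBytesConfig also accepts NF4 and double-quantization options, and you can check the memory footprint after loading. A minimal sketch, assuming the same MODEL_PATH as in the demo script; the extra bnb_4bit_* settings are optional and not part of the original demo:

import torch
from transformers import AutoModel, BitsAndBytesConfig

# Optional: NF4 quantization with double quantization and bfloat16 compute.
# These settings are a common way to improve 4-bit accuracy at a small speed cost.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModel.from_pretrained(
    MODEL_PATH,  # defined earlier in trans_cli_vision_demo.py
    trust_remote_code=True,
    quantization_config=bnb_config,
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
).eval()

# Rough check that the 4-bit model fits your GPU.
print(f"Memory footprint: {model.get_memory_footprint() / 1e9:.2f} GB")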
Note that you can also quantize the model yourself and run it with vLLM using this branch and its examples.
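For reference, here is a minimal sketch of what loading a quantized checkpoint with vLLM's offline API generally looks like. The model path, quantization method, and prompt below are placeholders, and vision inputs for GLM-4V need the multimodal handling from the linked branch and examples rather than this plain text call:

from vllm import LLM, SamplingParams

# Hypothetical local path to a checkpoint you quantized yourself.
llm = LLM(
    model="/path/to/quantized-model",
    trust_remote_code=True,
    quantization="gptq",  # or whichever method you used when quantizing
    dtype="bfloat16",
)

sampling_params = SamplingParams(temperature=0.7, max_tokens=256)
outputs = llm.generate(["Describe the scene in one sentence."], sampling_params)
print(outputs[0].outputs[0].text)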