How to run the Int4 quantized model?

#10
by CharlesLincoln - opened

Same as the title


In the GitHub code, in basic_demo/trans_cli_vision_demo.py, comment out the default BF16 loading block and uncomment the INT4 block, so the model-loading section looks like this:

# These need to be imported at the top of the script
# (and bitsandbytes must be installed: pip install bitsandbytes):
import torch
from transformers import AutoModel, BitsAndBytesConfig

# Default BF16 load (leave this commented out):
# model = AutoModel.from_pretrained(
#     MODEL_PATH,
#     trust_remote_code=True,
#     # attn_implementation="flash_attention_2",  # Use Flash Attention
#     torch_dtype=torch.bfloat16,
#     device_map="auto",
# ).eval()

## For INT4 inference (MODEL_PATH is defined earlier in the script)
model = AutoModel.from_pretrained(
    MODEL_PATH,
    trust_remote_code=True,
    quantization_config=BitsAndBytesConfig(load_in_4bit=True),
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
).eval()
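
Once it loads, a quick sanity check (not part of the demo script, just an illustrative line) is to print the model's memory footprint; a 4-bit load should report roughly a quarter of what the BF16 load would use:

# Optional sanity check: report the loaded model size in GiB
print(f"Memory footprint: {model.get_memory_footprint() / 1024**3:.2f} GiB")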

Note that you can also quantize the model yourself and run it with vLLM using this branch and its examples.
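
For reference, here is a minimal text-only sketch of vLLM's offline API for a checkpoint you have already quantized yourself. The model path and quantization method below are placeholders, and vision-input handling is covered by the examples in the branch mentioned above, not by this snippet:

from vllm import LLM, SamplingParams

# Placeholder path to your locally quantized checkpoint; the quantization
# method ("awq" here) depends on how you actually quantized the model.
llm = LLM(
    model="/path/to/your-quantized-model",
    quantization="awq",
    trust_remote_code=True,
)

sampling = SamplingParams(temperature=0.2, max_tokens=128)
outputs = llm.generate(["Hello, who are you?"], sampling)
print(outputs[0].outputs[0].text)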
