Quantized versions?

#7 opened by SouthpawIN

Inference is unusably slow on CPU. Any chance of a GGUF/quantized version in the future?

Show Lab org

@SouthpawIN Hi, feel free to use 8-bit (or 4-bit NF4) quantization to speed up ShowUI and reduce memory usage:

import torch
from transformers import BitsAndBytesConfig, Qwen2VLForConditionalGeneration

# 8-bit quantization config
quantization = BitsAndBytesConfig(load_in_8bit=True)

# 4-bit NF4 quantization config (smaller memory footprint than 8-bit)
nf4_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = Qwen2VLForConditionalGeneration.from_pretrained(
    "./showui-2b",
    # "showlab/ShowUI-2B",
    torch_dtype=torch.float16,  # dtype of the non-quantized layers
    # device_map="cuda",
    # device_map="cpu",
    quantization_config=nf4_config,  # or quantization_config=quantization for 8-bit
)
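For completeness, here is a minimal inference sketch with the quantized model, following the standard Qwen2-VL processing flow. The image path and instruction are placeholders; check the ShowUI model card for the exact grounding prompt and coordinate-output format.

from transformers import AutoProcessor
from qwen_vl_utils import process_vision_info

# Processor handles chat templating and image preprocessing
processor = AutoProcessor.from_pretrained("./showui-2b")

# Placeholder screenshot and instruction -- adapt to your task
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "screenshot.png"},
            {"type": "text", "text": "Click the search button."},
        ],
    }
]

text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
).to(model.device)

# Generate, then drop the prompt tokens before decoding
generated_ids = model.generate(**inputs, max_new_tokens=128)
output = processor.batch_decode(
    generated_ids[:, inputs.input_ids.shape[1]:], skip_special_tokens=True
)[0]
print(output)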
