Quantized versions?

#7 opened by SouthpawIN

Inference is unusably slow on CPU. Any chance of a GGUF/quantized version in the future?

Show Lab org

@SouthpawIN Hi, feel free to use 8-bit (or 4-bit NF4) quantization to speed up ShowUI and reduce memory usage:

import torch
from transformers import BitsAndBytesConfig, Qwen2VLForConditionalGeneration

# 8-bit quantization config
quantization = BitsAndBytesConfig(load_in_8bit=True)

# 4-bit NF4 quantization config (smaller memory footprint than 8-bit)
nf4_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = Qwen2VLForConditionalGeneration.from_pretrained(
    "./showui-2b",
    # "showlab/ShowUI-2B",
    torch_dtype=torch.float16,  # dtype of the non-quantized layers
    # device_map="cuda",
    # device_map="cpu",
    quantization_config=nf4_config,  # or quantization_config=quantization for 8-bit
)
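For completeness, here is a minimal inference sketch with the quantized model, following the standard Qwen2-VL processing flow. The image path and instruction are placeholders; check the ShowUI model card for the exact grounding prompt and coordinate-output format.

from transformers import AutoProcessor
from qwen_vl_utils import process_vision_info

# Processor handles chat templating and image preprocessing
processor = AutoProcessor.from_pretrained("./showui-2b")

# Placeholder screenshot and instruction -- adapt to your task
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "screenshot.png"},
            {"type": "text", "text": "Click the search button."},
        ],
    }
]

text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
).to(model.device)

# Generate, then drop the prompt tokens before decoding
generated_ids = model.generate(**inputs, max_new_tokens=128)
output = processor.batch_decode(
    generated_ids[:, inputs.input_ids.shape[1]:], skip_special_tokens=True
)[0]
print(output)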
