How would one serve this model using vllm?
Hi! I'm new to the world of LLMs, so I apologize beforehand if there is some silly misunderstanding on my part here, but I tried to host this model (the 4-bit quant) locally in an OCI container on a machine with an RTX 3090 (24 GB VRAM).
I passed these flags to vllm: --model unsloth/Mistral-Small-24B-Instruct-2501-unsloth-bnb-4bit --dtype bfloat16 --load_format bitsandbytes --quantization bitsandbytes
But I got an assertion error about mismatching shapes of param_data and loaded_weight (a vLLM implementation detail). Upon googling the issue, I found a similar report on the vLLM GitHub issues page:
https://github.com/vllm-project/vllm/issues/12682
Currently, dynamic quants aren't supported in vLLM, but they will be soon. You can serve the standard BnB one instead: https://huggingface.co/unsloth/Mistral-Small-24B-Instruct-2501-bnb-4bit
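Something along these lines should work as a starting point (a minimal sketch using the vllm serve entrypoint; the --max-model-len and --gpu-memory-utilization values are just suggestions for fitting a 24 GB card, not requirements):

```bash
# Sketch: serve the standard (non-dynamic) BnB 4-bit quant on a 24 GB GPU.
# Context length and memory utilization are suggested values; tune as needed.
vllm serve unsloth/Mistral-Small-24B-Instruct-2501-bnb-4bit \
  --quantization bitsandbytes \
  --load-format bitsandbytes \
  --dtype bfloat16 \
  --max-model-len 8192 \
  --gpu-memory-utilization 0.90
```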