Storage format differs from other w4a16 models
The other w4a16 checkpoints have files for the scales and so forth, while this checkpoint does not. I imagine these are contained in the packed weights, but this difference in format seems to cause issues with loading the 405B model into some frameworks (SGLang). Any plans to upload the 405B weights in the other format?
The scales are included in the safetensors files (you can see the preview here).
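If you want to verify this locally, a quick sketch like the one below will list the relevant tensor names from any downloaded shard (the path and the "scale"/"packed" key filter are illustrative assumptions, not an exact spec of the naming):

```python
# Quick sketch: list tensor names in a downloaded shard to confirm that the
# scales are stored alongside the packed weights. The "scale"/"packed" filter
# is illustrative; print all keys if the names differ.
from safetensors import safe_open

shard = "path/to/any-downloaded-shard.safetensors"  # placeholder path

with safe_open(shard, framework="pt", device="cpu") as f:
    for name in f.keys():
        if "scale" in name or "packed" in name:
            print(name)
```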
We used llm-compressor, our new repo for applying various quantization algorithms, to create the model. We have a few features implemented there related to CPU offloading with GPTQ, which are needed to work with 405B in any reasonable amount of time. llm-compressor exports to the compressed-tensors format, which is our generic safetensors-based format for all types of quantization (W8A8-int8, W8A8-fp8, W4A8-int8, W4A8-fp8, W4A16, ...).
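For reference, the overall llm-compressor flow looks roughly like the sketch below. A smaller Llama 3.1 model is used as a stand-in, and the calibration dataset, sample count, and sequence length are illustrative placeholders rather than the exact settings used for the 405B run:

```python
# Rough sketch of the llm-compressor one-shot GPTQ flow for a W4A16 export.
# The calibration dataset, sample count, and sequence length are placeholders.
from llmcompressor.modifiers.quantization import GPTQModifier
from llmcompressor.transformers import oneshot

# Quantize all Linear layers to 4-bit weights with 16-bit activations,
# leaving the lm_head in full precision.
recipe = GPTQModifier(targets="Linear", scheme="W4A16", ignore=["lm_head"])

oneshot(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",  # stand-in for the sketch
    dataset="open_platypus",                        # placeholder calibration set
    recipe=recipe,
    output_dir="Meta-Llama-3.1-8B-Instruct-W4A16",
    max_seq_length=2048,
    num_calibration_samples=512,
)
```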
compressed-tensors models can be loaded directly into vLLM, and we are almost done landing support in transformers (https://github.com/huggingface/transformers/pull/31704). I am not sure of the development roadmap for SGLang; they already use many layers and kernels from vLLM, so I expect them to ultimately support this format, but I am not sure of the timeline.
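For example, loading this checkpoint in vLLM looks like the sketch below (tensor_parallel_size=8 is just an assumption about available hardware; size it to however many GPUs you are sharding the 405B weights across):

```python
# Sketch of loading the compressed-tensors checkpoint directly in vLLM.
from vllm import LLM, SamplingParams

llm = LLM(
    model="neuralmagic/Meta-Llama-3.1-405B-Instruct-quantized.w4a16",
    tensor_parallel_size=8,  # assumption; match your GPU count
)

outputs = llm.generate(
    ["What is quantization?"],
    SamplingParams(max_tokens=128, temperature=0.0),
)
print(outputs[0].outputs[0].text)
```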
For the 405B model, we have not been able to use AutoGPTQ because we need some of the features implemented in llm-compressor for dealing with models of this scale. That being said, I can help make a bespoke conversion script into the AutoGPTQ format if you need one.
We plan to release all future models in the compressed-tensors format.
If you want to load the model in transformers prior to #31704 landing, you can currently use:
```bash
pip install llmcompressor
```

```python
from llmcompressor.transformers import SparseAutoModelForCausalLM

MODEL_ID = "neuralmagic/Meta-Llama-3.1-405B-Instruct-quantized.w4a16"

model = SparseAutoModelForCausalLM.from_pretrained(
    MODEL_ID, device_map="auto", torch_dtype="auto"
)
```
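From there, generation works the same as with any transformers model; a minimal sketch (the prompt and generation settings are purely illustrative):

```python
# Simple generation example once the model is loaded.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

inputs = tokenizer("What is quantization?", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```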