Storage format differs from other w4a16 models
The other w4a16 checkpoints have files for the scales and so forth, while this checkpoint does not. I imagine these are contained in the packed weights, but this difference in format seems to cause issues with loading the 405B model into some frameworks (SGLang). Any plans to upload the 405B weights in the other format?
The scales are included in the safetensors files (you can see the preview here).
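If you want to verify this locally, a quick sketch like the one below will list the relevant tensor names from any downloaded shard (the path and the "scale"/"packed" key filter are illustrative assumptions, not an exact spec of the naming):

```python
# Quick sketch: list tensor names in a downloaded shard to confirm that the
# scales are stored alongside the packed weights. The "scale"/"packed" filter
# is illustrative; print all keys if the names differ.
from safetensors import safe_open

shard = "path/to/any-downloaded-shard.safetensors"  # placeholder path

with safe_open(shard, framework="pt", device="cpu") as f:
    for name in f.keys():
        if "scale" in name or "packed" in name:
            print(name)
```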
We used llm-compressor, our new repo for applying various quantization algorithms, to create the model. We have a few features implemented there related to CPU offloading with GPTQ, which are needed to work with 405B in any reasonable amount of time. llm-compressor exports to the compressed-tensors format, which is our generic safetensors-based format for all types of quantization (W8A8-int8, W8A8-fp8, W4A8-int8, W4A8-fp8, W4A16, ...).
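For reference, the overall llm-compressor flow looks roughly like the sketch below. A smaller Llama 3.1 model is used as a stand-in, and the calibration dataset, sample count, and sequence length are illustrative placeholders rather than the exact settings used for the 405B run:

```python
# Rough sketch of the llm-compressor one-shot GPTQ flow for a W4A16 export.
# The calibration dataset, sample count, and sequence length are placeholders.
from llmcompressor.modifiers.quantization import GPTQModifier
from llmcompressor.transformers import oneshot

# Quantize all Linear layers to 4-bit weights with 16-bit activations,
# leaving the lm_head in full precision.
recipe = GPTQModifier(targets="Linear", scheme="W4A16", ignore=["lm_head"])

oneshot(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",  # stand-in for the sketch
    dataset="open_platypus",                        # placeholder calibration set
    recipe=recipe,
    output_dir="Meta-Llama-3.1-8B-Instruct-W4A16",
    max_seq_length=2048,
    num_calibration_samples=512,
)
```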
compressed-tensors models can be loaded directly into vLLM, and we are almost done landing support in transformers (https://github.com/huggingface/transformers/pull/31704). I am not sure of the development roadmap for SGLang; they already use many layers and kernels from vLLM, so I expect them to ultimately support this format, but I am not sure of the timeline.
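For example, loading this checkpoint in vLLM looks like the sketch below (tensor_parallel_size=8 is just an assumption about available hardware; size it to however many GPUs you are sharding the 405B weights across):

```python
# Sketch of loading the compressed-tensors checkpoint directly in vLLM.
from vllm import LLM, SamplingParams

llm = LLM(
    model="neuralmagic/Meta-Llama-3.1-405B-Instruct-quantized.w4a16",
    tensor_parallel_size=8,  # assumption; match your GPU count
)

outputs = llm.generate(
    ["What is quantization?"],
    SamplingParams(max_tokens=128, temperature=0.0),
)
print(outputs[0].outputs[0].text)
```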
For the 405B model, we have not been able to use AutoGPTQ because we need some of the features implemented in llm-compressor for dealing with models of this scale. That being said, I can help make a bespoke conversion script into the AutoGPTQ format if you need one.
We plan to release all future models in the compressed-tensors format.
If you want to load the model in transformers prior to #31704 landing, you can currently use:
```bash
pip install llmcompressor
```

```python
from llmcompressor.transformers import SparseAutoModelForCausalLM

MODEL_ID = "neuralmagic/Meta-Llama-3.1-405B-Instruct-quantized.w4a16"

model = SparseAutoModelForCausalLM.from_pretrained(
    MODEL_ID, device_map="auto", torch_dtype="auto"
)
```
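From there, generation works the same as with any transformers model; a minimal sketch (the prompt and generation settings are purely illustrative):

```python
# Simple generation example once the model is loaded.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

inputs = tokenizer("What is quantization?", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```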