Hermes 3 - Llama-3.2 3B (AWQ 4-bit)

Model Description

This is a 4-bit AWQ (Activation-aware Weight Quantization) version of Hermes 3 - Llama-3.2 3B, a fine-tuned LLM developed by Nous Research. Quantizing the weights to 4 bits reduces the model's memory footprint and can speed up inference while keeping output quality close to the original, which makes the model suitable for memory-constrained devices.

For details on the original model, please see the Hermes 3 Technical Report.

Base Model Information

Hermes 3 3B is a generalist language model fine-tuned from Llama-3.2 3B, with improvements in:

Reasoning
Roleplaying
Function calling & structured outputs
Multi-turn conversation
Long-context coherence

This quantized version is intended to retain these capabilities while using less memory.

Performance Benchmarks

The original Hermes 3 3B model achieved strong performance on various benchmarks. No separate benchmark results are reported here for the AWQ quantized version; its scores are expected to stay close to the original, with minor variations due to the quantization process. For reference numbers, see the original model's results.

Prompt Format

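This model uses the ChatML prompt format, in which each turn is wrapped in <|im_start|> and <|im_end|> markers. The role-based message structure is similar to that of OpenAI's chat API. Example: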

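messages = [
    {"role": "system", "content": "You are Hermes 3."},
    {"role": "user", "content": "Hello, who are you?"}
]
# add_generation_prompt appends the opening assistant turn; return_dict
# returns input_ids and attention_mask so the result can be unpacked below.
gen_input = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt", return_dict=True
)
# If the model is on a GPU, move gen_input to the same device first.
model.generate(**gen_input)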
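
To make the format concrete, here is a minimal sketch of how a message list maps onto a ChatML string. The helper name `render_chatml` is hypothetical and introduced only for illustration, and the sketch assumes the standard ChatML layout of `<|im_start|>{role}`, a newline, the content, then `<|im_end|>`. The chat template bundled with the tokenizer remains the authoritative source and may differ in details such as a leading BOS token.

```python
def render_chatml(messages, add_generation_prompt=True):
    """Render a list of {"role", "content"} dicts as a ChatML string.

    Hypothetical helper for illustration only; in practice prefer
    tokenizer.apply_chat_template, which uses the model's own template.
    """
    parts = []
    for message in messages:
        parts.append(
            f"<|im_start|>{message['role']}\n{message['content']}<|im_end|>\n"
        )
    if add_generation_prompt:
        # Open the assistant turn so the model continues from here.
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)


messages = [
    {"role": "system", "content": "You are Hermes 3."},
    {"role": "user", "content": "Hello, who are you?"},
]
print(render_chatml(messages))
```

Seeing the rendered string is mainly useful for debugging hand-written prompts, for example checking that the assistant turn is opened with a trailing newline.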

For more details, see the Hermes 3 documentation.

Inference with AWQ 4-bit Model

To run this quantized model, load the weights with AutoAWQ and the tokenizer with transformers:

from transformers import AutoTokenizer
from awq import AutoAWQForCausalLM  # installed by the `autoawq` package

model_path = "your_model_path"
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
# Load the 4-bit weights; device placement is left to AutoAWQ's defaults.
model = AutoAWQForCausalLM.from_quantized(model_path)

# ChatML prompt; the newline after "assistant" opens the reply turn.
prompt = "<|im_start|>user\nHello! How are you?<|im_end|>\n<|im_start|>assistant\n"
input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to("cuda")
output = model.generate(input_ids, max_new_tokens=256)
# Decode only the newly generated tokens, not the echoed prompt.
response = tokenizer.decode(output[0][input_ids.shape[1]:], skip_special_tokens=True)
print(response)

Quantized Model Use Cases

Running LLMs on lower-end consumer GPUs (e.g., RTX 3060, 4060, etc.)
Faster inference with minimal degradation in quality
Edge computing & on-device AI with constrained resources
Cloud inference with optimized performance/cost ratio
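
As a rough guide to which GPUs can hold the model, weight memory can be estimated from the parameter count and the bits stored per weight. The sketch below assumes roughly 3.2 billion parameters, which is an approximation for Llama-3.2 3B rather than an exact figure, and treats every weight as quantized. Real checkpoints are somewhat larger because AWQ stores per-group scales and zero points and typically leaves the embeddings in 16-bit, and the KV cache and activations need additional memory at run time. The function name `weight_memory_gib` is illustrative.

```python
def weight_memory_gib(num_params, bits_per_weight):
    """Approximate weight storage in GiB, ignoring quantization metadata."""
    total_bytes = num_params * bits_per_weight / 8
    return total_bytes / (1024 ** 3)


NUM_PARAMS = 3.2e9  # assumed approximate parameter count, not an exact figure

fp16 = weight_memory_gib(NUM_PARAMS, 16)
int4 = weight_memory_gib(NUM_PARAMS, 4)
print(f"FP16 weights:  ~{fp16:.1f} GiB")
print(f"4-bit weights: ~{int4:.1f} GiB")
print(f"Reduction:     ~{fp16 / int4:.0f}x")
```

Under these assumptions the weights shrink from roughly 6 GiB to roughly 1.5 GiB, which is why the quantized model leaves headroom on GPUs with 8 to 12 GB of memory.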

Limitations & Considerations

Minor accuracy loss due to 4-bit quantization (slightly less precise responses in rare cases)
Lower memory and compute cost, at the expense of some numerical precision in the weights
Best suited for inference, rather than fine-tuning or continued training
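
To illustrate where the precision loss comes from, the sketch below round-trips one group of weights through 4-bit storage. It uses plain round-to-nearest group quantization, not AWQ's activation-aware scaling, so it shows only the basic rounding effect that AWQ is designed to mitigate. The group size of 128, the synthetic Gaussian weights, and the function names are all assumptions made for illustration; the quantization settings of this checkpoint are not stated here.

```python
import random


def quantize_group(weights, bits=4):
    """Asymmetric round-to-nearest quantization of one weight group.

    Returns the integer codes plus the scale and minimum needed to decode them.
    """
    levels = 2 ** bits - 1  # 15 steps between 16 representable values
    w_min, w_max = min(weights), max(weights)
    # Fall back to a scale of 1.0 when all weights in the group are equal.
    scale = (w_max - w_min) / levels or 1.0
    codes = [round((w - w_min) / scale) for w in weights]
    return codes, scale, w_min


def dequantize_group(codes, scale, w_min):
    """Map integer codes back to approximate floating-point weights."""
    return [c * scale + w_min for c in codes]


random.seed(0)
# One synthetic group of 128 weights (assumed group size, illustrative values).
group = [random.gauss(0.0, 0.02) for _ in range(128)]
codes, scale, w_min = quantize_group(group)
restored = dequantize_group(codes, scale, w_min)

max_error = max(abs(a - b) for a, b in zip(group, restored))
print(f"step size: {scale:.5f}, max round-trip error: {max_error:.5f}")
```

Each restored weight differs from the original by at most half a quantization step, so the error per weight is small but nonzero, which is consistent with the minor accuracy loss noted above.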

Citation

If you use this model, please cite the original Hermes 3 Technical Report:

@misc{teknium2024hermes3technicalreport,
      title={Hermes 3 Technical Report},
      author={Ryan Teknium and Jeffrey Quesnelle and Chen Guang},
      year={2024},
      eprint={2408.11857},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2408.11857},
}

Acknowledgments

This quantization was performed using the AWQ method for LLM optimization. The base model was developed by Nous Research, and quantization was applied to enhance deployment efficiency while preserving model quality.

For further details, see the Nous Research and Hermes 3 model pages.
