Quantized Llama 3.1 8B Instruct Model

This is a 4-bit quantized version of the Llama 3.1 8B Instruct model.

Quantization Details

  • Method: 4-bit quantization using bitsandbytes
  • Quantization type: nf4
  • Compute dtype: float16
  • Double quantization: True
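
These settings correspond to the following transformers BitsAndBytesConfig. This is a reconstruction for reference, not necessarily the exact script used to produce the checkpoint:

import torch
from transformers import BitsAndBytesConfig

# 4-bit NF4 quantization with float16 compute and nested (double)
# quantization, matching the settings listed above.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=True,
)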

Performance Metrics

  • Average generation speed: 22.766 tokens/second
  • Total tokens generated: 5000
  • Total time: 219.63 seconds
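
The benchmark script is not included here; the sketch below shows one way such a throughput figure can be measured. The prompt and generation length are illustrative assumptions:

import time

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "glouriousgautam/llama-3-8b-instruct-bnb-4bit"
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto", torch_dtype=torch.float16)
tokenizer = AutoTokenizer.from_pretrained(model_id)

inputs = tokenizer("Explain quantization in one paragraph.", return_tensors="pt").to(model.device)

start = time.perf_counter()
outputs = model.generate(**inputs, max_new_tokens=500, do_sample=False)
elapsed = time.perf_counter() - start

# Count only newly generated tokens, excluding the prompt.
new_tokens = outputs.shape[-1] - inputs["input_ids"].shape[-1]
print(f"{new_tokens / elapsed:.3f} tokens/second")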

Usage

This model can be loaded with:

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_id = "glouriousgautam/llama-3-8b-instruct-bnb-4bit"

# The quantization config is stored with the checkpoint, so the model
# loads in 4-bit automatically; device_map="auto" places it on available GPUs.
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto", torch_dtype=torch.float16)
tokenizer = AutoTokenizer.from_pretrained(model_id)
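
Once loaded, generation works as with any instruct model. The prompt and sampling settings below are illustrative:

messages = [{"role": "user", "content": "Summarize the benefits of 4-bit quantization."}]
input_ids = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt").to(model.device)

outputs = model.generate(input_ids, max_new_tokens=200, do_sample=True, temperature=0.7)
# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(outputs[0][input_ids.shape[-1]:], skip_special_tokens=True))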

Model Files

  • Format: Safetensors
  • Model size: 4.65B params
  • Tensor types: FP16, F32, U8