# Quantized Llama 3.1 8B Instruct Model
This is a 4-bit quantized version of the Llama 3.1 8B Instruct model.

## Quantization Details

- Method: 4-bit quantization using bitsandbytes
- Quantization type: nf4
- Compute dtype: float16
- Double quantization: True
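
These settings correspond to the following `BitsAndBytesConfig` in Transformers. This is a minimal sketch reconstructed from the list above, not necessarily the exact script used to produce the checkpoint:

```python
import torch
from transformers import BitsAndBytesConfig

# Reconstructed from the settings listed above (assumed, not the original script).
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                     # 4-bit quantization via bitsandbytes
    bnb_4bit_quant_type="nf4",             # NF4 quantization type
    bnb_4bit_compute_dtype=torch.float16,  # compute dtype
    bnb_4bit_use_double_quant=True,        # double quantization
)
```

Passing this object as `quantization_config` to `from_pretrained` quantizes a full-precision checkpoint on load; the repository here already stores the quantized weights, so it is not needed for the loading code in the Usage section.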

## Performance Metrics

- Average performance: 22.766 tokens/second
- Total tokens generated: 5000
- Total time: 219.63 seconds
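
The average follows directly from the totals: 5000 tokens / 219.63 s ≈ 22.77 tokens/second. Below is a minimal sketch of how such a throughput number can be measured; the prompt and `max_new_tokens` value are illustrative assumptions, and single-prompt timings will vary with hardware and generation settings.

```python
import time

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "glouriousgautam/llama-3-8b-instruct-bnb-4bit"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto", torch_dtype=torch.float16)

inputs = tokenizer("Explain 4-bit quantization in one paragraph.", return_tensors="pt").to(model.device)

start = time.perf_counter()
outputs = model.generate(**inputs, max_new_tokens=256)
elapsed = time.perf_counter() - start

# Throughput = newly generated tokens / wall-clock generation time.
new_tokens = outputs.shape[-1] - inputs["input_ids"].shape[-1]
print(f"{new_tokens / elapsed:.2f} tokens/second")
```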

## Usage

This model can be loaded with:
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# The quantization configuration is stored with the checkpoint,
# so the 4-bit weights load without extra quantization arguments.
model = AutoModelForCausalLM.from_pretrained(
    "glouriousgautam/llama-3-8b-instruct-bnb-4bit",
    device_map="auto",
    torch_dtype=torch.float16,
)
tokenizer = AutoTokenizer.from_pretrained("glouriousgautam/llama-3-8b-instruct-bnb-4bit")
```
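
For chat-style generation, Llama 3.1 Instruct checkpoints ship a chat template that `apply_chat_template` applies automatically. A short example continuing from the snippet above (the prompt is illustrative):

```python
messages = [{"role": "user", "content": "Give me three facts about model quantization."}]

# Apply the Llama 3.1 chat template and move the token ids to the model's device.
input_ids = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(input_ids, max_new_tokens=128)

# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(outputs[0][input_ids.shape[-1]:], skip_special_tokens=True))
```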