Model Card for Meta-Llama-3-8B-Instruct-GPTQ-4bit-gs32
This model is a 4-bit GPTQ quantization of Meta-Llama-3-8B-Instruct. It was quantized to reduce memory usage and speed up inference, with the aim of keeping accuracy close to that of the original model. Quantization was performed with the GPTQConfig class from the transformers library.
Original Model: Meta-Llama-3-8B-Instruct
Model creator: Meta
Quantization Configuration
- Bits: 4
- Data Type: INT4
- GPTQ group size: 32
- Act Order: True
- GPTQ Calibration Dataset: C4
- Model size: 6.14GB
For more details, see quantization_config.json
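To see how the settings above relate to the listed model size, the following back-of-the-envelope estimate may help. It assumes the publicly documented Llama 3 8B dimensions (not stated in this card), that only the linear layers inside the transformer blocks are quantized, that each group of 32 weights stores one 16-bit scale and one 4-bit zero point, and that the embedding and output layers stay in 16-bit precision. Normalization weights and act-order index tensors are ignored as negligible. Under those assumptions the estimate lands at roughly 6.1 GB, close to the figure listed above.

```python
# Llama 3 8B dimensions (assumed from the public model config, not from this card).
HIDDEN = 4096
INTERMEDIATE = 14336
LAYERS = 32
KV_DIM = 1024      # 8 key/value heads x 128 head dimension
VOCAB = 128256

# Quantization settings from this card, plus assumed storage widths.
BITS = 4
GROUP_SIZE = 32
SCALE_BITS = 16    # assumption: one fp16 scale per group
ZERO_BITS = 4      # assumption: one 4-bit zero point per group


def linear_params_per_layer():
    """Weights in the attention and MLP projections of one transformer block."""
    attention = 2 * HIDDEN * HIDDEN + 2 * HIDDEN * KV_DIM  # q, o and k, v projections
    mlp = 3 * HIDDEN * INTERMEDIATE                        # gate, up, down projections
    return attention + mlp


def effective_bits_per_weight(bits=BITS, group_size=GROUP_SIZE):
    """Stored bits per weight once per-group scale and zero point are amortized."""
    return bits + (SCALE_BITS + ZERO_BITS) / group_size


def estimated_size_bytes():
    """Quantized block weights plus 16-bit embedding and output layers."""
    quantized_params = LAYERS * linear_params_per_layer()
    fp16_params = 2 * VOCAB * HIDDEN  # input embedding + output projection
    return quantized_params * effective_bits_per_weight() / 8 + fp16_params * 2


if __name__ == "__main__":
    print(f"effective bits per weight: {effective_bits_per_weight()}")
    print(f"estimated size: {estimated_size_bytes() / 1e9:.2f} GB")
```

The small group size is the main design trade-off here: with 32 weights per group the per-group metadata adds 0.625 bits per weight, whereas a group size of 128 would add only about 0.16 bits, at the cost of coarser quantization.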
Usage
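This model can be used with Transformers the same way as the original Meta-Llama-3-8B-Instruct. Loading GPTQ checkpoints additionally requires the optimum package and a GPTQ backend such as auto-gptq or gptqmodel.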
Transformers pipeline
import transformers
import torch

model_id = "marinarosell/Meta-Llama-3-8B-Instruct-GPTQ-4bit-gs32"

# Build a text-generation pipeline; the quantized weights are loaded from the Hub.
pipeline = transformers.pipeline(
    "text-generation",
    model=model_id,
    model_kwargs={"torch_dtype": torch.bfloat16},
    device_map="auto",
)

messages = [
    {"role": "system", "content": "You are a pirate chatbot who always responds in pirate speak!"},
    {"role": "user", "content": "Who are you?"},
]

# Stop on either the regular EOS token or Llama 3's end-of-turn token.
terminators = [
    pipeline.tokenizer.eos_token_id,
    pipeline.tokenizer.convert_tokens_to_ids("<|eot_id|>"),
]

outputs = pipeline(
    messages,
    max_new_tokens=256,
    eos_token_id=terminators,
    do_sample=True,
    temperature=0.6,
    top_p=0.9,
)

# The last entry of the returned conversation is the assistant's reply.
print(outputs[0]["generated_text"][-1])
Transformers AutoModelForCausalLM
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_id = "marinarosell/Meta-Llama-3-8B-Instruct-GPTQ-4bit-gs32"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

messages = [
    {"role": "system", "content": "You are a pirate chatbot who always responds in pirate speak!"},
    {"role": "user", "content": "Who are you?"},
]

# Apply the Llama 3 chat template and append the assistant header.
input_ids = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt",
).to(model.device)

# Stop on either the regular EOS token or Llama 3's end-of-turn token.
terminators = [
    tokenizer.eos_token_id,
    tokenizer.convert_tokens_to_ids("<|eot_id|>"),
]

outputs = model.generate(
    input_ids,
    max_new_tokens=256,
    eos_token_id=terminators,
    do_sample=True,
    temperature=0.6,
    top_p=0.9,
)

# Drop the prompt tokens so only the newly generated reply is decoded.
response = outputs[0][input_ids.shape[-1]:]
print(tokenizer.decode(response, skip_special_tokens=True))
Example Applications
Chatbots: Lightweight conversational agents.