Model Card for Meta-Llama-3-8B-Instruct-GPTQ-4bit-gs32

This model is a 4-bit GPTQ quantization of Meta-Llama-3-8B-Instruct that reduces memory usage and improves inference efficiency without significantly compromising accuracy. The quantization was performed with GPTQ using the GPTQConfig class from the transformers library.

Original Model: Meta-Llama-3-8B-Instruct

Model creator: Meta

Quantization Configuration

  • Bits: 4
  • Data Type: INT4
  • GPTQ group size: 32
  • Act Order: True
  • GPTQ Calibration Dataset: C4
  • Model size: 6.14 GB

For the complete set of parameters, see quantization_config.json.

Usage

This model can be used with the transformers library in the same way as the original Meta-Llama-3-8B-Instruct:

Transformers pipeline

import transformers
import torch

model_id = "marinarosell/Meta-Llama-3-8B-Instruct-GPTQ-4bit-gs32"

pipeline = transformers.pipeline(
    "text-generation",
    model=model_id,
    model_kwargs={"torch_dtype": torch.bfloat16},
    device_map="auto",
)

messages = [
    {"role": "system", "content": "You are a pirate chatbot who always responds in pirate speak!"},
    {"role": "user", "content": "Who are you?"},
]

terminators = [
    pipeline.tokenizer.eos_token_id,
    pipeline.tokenizer.convert_tokens_to_ids("<|eot_id|>")
]

outputs = pipeline(
    messages,
    max_new_tokens=256,
    eos_token_id=terminators,
    do_sample=True,
    temperature=0.6,
    top_p=0.9,
)
print(outputs[0]["generated_text"][-1])  # last element is the assistant's reply

Transformers AutoModelForCausalLM

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_id = "marinarosell/Meta-Llama-3-8B-Instruct-GPTQ-4bit-gs32"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,  # GPTQ kernels generally expect fp16
    device_map="auto",
)

messages = [
    {"role": "system", "content": "You are a pirate chatbot who always responds in pirate speak!"},
    {"role": "user", "content": "Who are you?"},
]

input_ids = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt"
).to(model.device)

terminators = [
    tokenizer.eos_token_id,
    tokenizer.convert_tokens_to_ids("<|eot_id|>")
]

outputs = model.generate(
    input_ids,
    max_new_tokens=256,
    eos_token_id=terminators,
    do_sample=True,
    temperature=0.6,
    top_p=0.9,
)
response = outputs[0][input_ids.shape[-1]:]  # drop the prompt tokens
print(tokenizer.decode(response, skip_special_tokens=True))

Example Applications

Chatbots: lightweight conversational agents; see the sketch below.
