The efficiency results are obtained after a hardware warmup. The smashed model is compared directly to the original base model. Efficiency results may vary in other settings (e.g., other hardware, image size, batch size, ...). We recommend running the models directly under your own use-case conditions to find out whether the smashed model benefits you.

You can run the smashed model with these steps:
```shell
pip install hqq
```

```python
from transformers import AutoTokenizer

MODEL_ID = "PrunaAI/facebook-opt-125m-HQQ-4bit-smashed"

try:
    # This Hugging Face engine wrapper may not exist in every hqq release.
    from hqq.engine.hf import HQQModelForCausalLM
    model = HQQModelForCausalLM.from_quantized(MODEL_ID, device_map='auto')
except Exception:
    # Fall back to the generic HQQ loader for Hugging Face models.
    from hqq.models.hf.base import AutoHQQHFModel
    model = AutoHQQHFModel.from_quantized(MODEL_ID)

# The tokenizer is unchanged by smashing, so load it from the base model.
tokenizer = AutoTokenizer.from_pretrained("facebook/opt-125m")
input_ids = tokenizer("What is the color of prunes?", return_tensors='pt').to(model.device)["input_ids"]

outputs = model.generate(input_ids, max_new_tokens=216)
print(tokenizer.decode(outputs[0]))
```
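The efficiency results are measured after a hardware warmup, and the card recommends benchmarking under your own conditions. The following is a minimal timing sketch for doing that yourself. It is not Pruna's benchmarking method: `benchmark`, its `warmup`/`runs` defaults, and the optional `sync` hook are hypothetical names chosen here, and the stand-in workload only keeps the example runnable without a GPU or a model download.

```python
import statistics
import time
from typing import Callable, Dict, Optional


def benchmark(
    fn: Callable[[], object],
    warmup: int = 3,
    runs: int = 10,
    sync: Optional[Callable[[], None]] = None,
) -> Dict[str, float]:
    """Time `fn` after untimed warmup calls (hypothetical helper, not a Pruna API).

    Warmup calls absorb one-off costs such as lazy initialization and cache
    population. If given, `sync` (e.g. torch.cuda.synchronize) is called
    before and after each timed run so pending GPU work is included.
    """
    for _ in range(warmup):
        fn()
    timings = []
    for _ in range(runs):
        if sync is not None:
            sync()
        start = time.perf_counter()
        fn()
        if sync is not None:
            sync()
        timings.append(time.perf_counter() - start)
    return {
        "mean_s": statistics.mean(timings),
        "median_s": statistics.median(timings),
        "min_s": min(timings),
    }


if __name__ == "__main__":
    # Stand-in CPU workload so the sketch runs without a GPU or model download.
    print(benchmark(lambda: sum(i * i for i in range(10_000)), warmup=2, runs=5))
```

With the loading snippet above, you could pass `lambda: model.generate(input_ids, max_new_tokens=216)` as `fn` (and `torch.cuda.synchronize` as `sync` on a CUDA device), then repeat the measurement with the base facebook/opt-125m model so both are compared under identical settings.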
The configuration details are in `smash_config.json`.
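To inspect the configuration programmatically, a small reader like the following could be used. This is a sketch: `load_smash_config` and `describe` are hypothetical helpers, and the demo keys (`quantizer`, `weight_bits`) are invented placeholders, not the actual contents of the file.

```python
import json
import tempfile
from pathlib import Path
from typing import Any, Dict


def load_smash_config(path: str) -> Dict[str, Any]:
    """Read a smash_config.json file into a dictionary (hypothetical helper)."""
    with Path(path).open(encoding="utf-8") as f:
        config = json.load(f)
    if not isinstance(config, dict):
        raise ValueError(f"expected a JSON object in {path}, got {type(config).__name__}")
    return config


def describe(config: Dict[str, Any]) -> str:
    """Render the config as sorted 'key: value' lines for quick inspection."""
    return "\n".join(f"{key}: {config[key]!r}" for key in sorted(config))


if __name__ == "__main__":
    # Placeholder keys for demonstration only; the real file's keys may differ.
    demo = {"quantizer": "hqq", "weight_bits": 4}
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "smash_config.json"
        path.write_text(json.dumps(demo), encoding="utf-8")
        print(describe(load_smash_config(str(path))))
```

To read the real file, `huggingface_hub.hf_hub_download(repo_id="PrunaAI/facebook-opt-125m-HQQ-4bit-smashed", filename="smash_config.json")` returns a local path that can be passed to `load_smash_config`, assuming the file sits at the repository root (adjust `filename` otherwise).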
The smashed model follows the license of the original model. Before using this model, please check the license of the original model, facebook/opt-125m, which provided the base model. The license of the pruna-engine is available on PyPI.
Base model
facebook/opt-125m