---
library_name: transformers
license: mit
tags:
- torchao
---

# Quantization Recipe

We used the following code to produce the quantized model:

```
from transformers import (
    AutoModelForCausalLM,
    AutoProcessor,
    AutoTokenizer,
    TorchAoConfig,
)
from torchao.quantization.quant_api import (
    Int8DynamicActivationIntxWeightConfig,
)
from torchao.quantization.granularity import PerGroup
import torch

model_id = "microsoft/Phi-4-mini-instruct"

# 8dq4w: int4 weights quantized per group of 32, with dynamically quantized int8 activations
linear_config = Int8DynamicActivationIntxWeightConfig(
    weight_dtype=torch.int4,
    weight_granularity=PerGroup(32),
)
quantization_config = TorchAoConfig(quant_type=linear_config)
quantized_model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",
    device_map="auto",
    quantization_config=quantization_config,
)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Push to hub
USER_ID = "YOUR_USER_ID"
save_to = f"{USER_ID}/phi4-mini-8dq4w"
quantized_model.push_to_hub(save_to, safe_serialization=False)
tokenizer.push_to_hub(save_to)

# Manual testing
prompt = "Hey, are you conscious? Can you talk to me?"
messages = [
    {
        "role": "system",
        "content": "",
    },
    {"role": "user", "content": prompt},
]
templated_prompt = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
print("Prompt:", prompt)
print("Templated prompt:", templated_prompt)
inputs = tokenizer(
    templated_prompt,
    return_tensors="pt",
).to("cuda")
generated_ids = quantized_model.generate(**inputs, max_new_tokens=128)
output_text = tokenizer.batch_decode(
    generated_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print("Response:", output_text[0][len(prompt):])

# Save the quantized state dict to disk (used below for the ExecuTorch export)
state_dict = quantized_model.state_dict()
torch.save(state_dict, "phi4-mini-8dq4w.pt")
```

# Model Quality

We rely on [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness) to evaluate the quality of the quantized model.
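
If lm-evaluation-harness is not installed yet, it is typically available from PyPI; the line below is a minimal sketch, and the lm-evaluation-harness README also documents installing from source.
```
pip install lm-eval
```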

## baseline
```
lm_eval --model hf --model_args pretrained=microsoft/Phi-4-mini-instruct --tasks hellaswag --device cuda:0 --batch_size 8
```

## 8dq4w

The `quantized_model` here is the in-memory model produced by the quantization recipe above.
```
import lm_eval
from lm_eval import evaluator
from lm_eval.utils import (
    make_table,
)

# Wrap the quantized model for lm-eval and run HellaSwag
lm_eval_model = lm_eval.models.huggingface.HFLM(pretrained=quantized_model, batch_size=8)
results = evaluator.simple_evaluate(
    lm_eval_model, tasks=["hellaswag"], device="cuda:0", batch_size="auto"
)
print(make_table(results))
```
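
Alternatively, if the quantized checkpoint has been pushed to the Hub as above, the same task can be run through the `lm_eval` CLI. This is a sketch assuming the hypothetical repo id `YOUR_USER_ID/phi4-mini-8dq4w` from the recipe and a transformers/torchao install recent enough to load the serialized torchao weights.
```
lm_eval --model hf --model_args pretrained=YOUR_USER_ID/phi4-mini-8dq4w --tasks hellaswag --device cuda:0 --batch_size 8
```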

| Benchmark | Phi-4 mini-Ins | phi4-mini-8dq4w |
|----------------------------------|----------------|-----------------|
| **Popular aggregated benchmark** | | |
| **Reasoning** | | |
| HellaSwag | 54.57 | 53.19 |
| **Multilingual** | | |
| **Math** | | |
| **Overall** | **TODO** | **TODO** |

# Exporting to ExecuTorch

Exporting to ExecuTorch requires you to clone and install [ExecuTorch](https://github.com/pytorch/executorch).
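
A rough sketch of that setup is below; the exact install script and branch vary between ExecuTorch releases, so treat the script name as an assumption and follow the ExecuTorch README for your checkout.
```
git clone https://github.com/pytorch/executorch.git
cd executorch
# Recent install scripts handle submodules themselves; otherwise sync them manually
git submodule sync && git submodule update --init
# Script name is an assumption: recent releases ship ./install_executorch.sh,
# older ones ./install_requirements.sh
./install_executorch.sh
```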

## Convert quantized checkpoint to ExecuTorch's format
```
python -m executorch.examples.models.phi_4_mini.convert_weights phi4-mini-8dq4w.pt phi4-mini-8dq4w-converted.pt
```

## Export to an ExecuTorch *.pte with XNNPACK
```
PARAMS="executorch/examples/models/phi_4_mini/config.json"
python -m executorch.examples.models.llama.export_llama \
  --model "phi_4_mini" \
  --checkpoint "phi4-mini-8dq4w-converted.pt" \
  --params "$PARAMS" \
  -kv \
  --use_sdpa_with_kv_cache \
  -X \
  --metadata '{"get_bos_id":128000, "get_eos_ids":[128009, 128001]}' \
  --output_name="phi4-mini-8dq4w.pte"
```

## Run the model with pybindings
```
export TOKENIZER="/path/to/tokenizer.json"
export TOKENIZER_CONFIG="/path/to/tokenizer_config.json"
export PROMPT="<|system|><|end|><|user|>Hey, are you conscious? Can you talk to me?<|end|><|assistant|>"
# PARAMS is the config.json path set in the export step above
python -m executorch.examples.models.llama.runner.native \
  --model phi_4_mini \
  --pte phi4-mini-8dq4w.pte \
  -kv \
  --tokenizer ${TOKENIZER} \
  --tokenizer_config ${TOKENIZER_CONFIG} \
  --prompt "${PROMPT}" \
  --params "${PARAMS}" \
  --max_len 128 \
  --temperature 0
```