Update README.md
README.md CHANGED
@@ -17,8 +17,10 @@ base_model:
 pipeline_tag: text-generation
 ---
 
-[Phi4-mini](https://huggingface.co/microsoft/Phi-4-mini-instruct) is quantized by the PyTorch team
-
+[Phi4-mini](https://huggingface.co/microsoft/Phi-4-mini-instruct) is quantized by the PyTorch team using [torchao](https://huggingface.co/docs/transformers/main/en/quantization/torchao) with 8-bit embeddings and 8-bit dynamic activations with 4-bit weight linears (8da4w).
+The model is suitable for mobile deployment with [ExecuTorch](https://github.com/pytorch/executorch).
+
+The quantized model can be exported to an ExecuTorch pte file, see [Exporting to ExecuTorch](#exporting-to-executorch). We also provide the [quantized pte](https://huggingface.co/pytorch/Phi-4-mini-instruct-8da4w/blob/main/phi4-mini-8da4w.pte) for direct use.
 
 # Running in a mobile app
 The [PTE file](https://huggingface.co/pytorch/Phi-4-mini-instruct-8da4w/blob/main/phi4-mini-8da4w.pte) can be run with ExecuTorch on a mobile phone. See the [instructions](https://pytorch.org/executorch/main/llm/llama-demo-ios.html) for doing this in iOS.
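If you just want to try the quantized weights on a desktop CPU/GPU rather than on device, a minimal loading sketch (not part of the diff) is below. It assumes `torchao` and a recent `transformers` are installed, and that the repo loads directly with `from_pretrained`, as the `lm_eval` invocation further down the README implies.

```python
# Sketch: load the published torchao-quantized checkpoint for direct use.
# Assumes `torchao` and a recent `transformers` release are installed.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "pytorch/Phi-4-mini-instruct-8da4w"
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float32,  # the quantization snippet below also uses float32
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(model_id)

inputs = tokenizer("What is 2 + 2?", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=32)[0]))
```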
@@ -102,9 +104,9 @@ linear_config = Int8DynamicActivationIntxWeightConfig(
     weight_granularity=PerGroup(32),
     weight_scale_dtype=torch.bfloat16,
 )
-
 quant_config = AOPerModuleConfig({"_default": linear_config, "model.embed_tokens": embedding_config})
 quantization_config = TorchAoConfig(quant_type=quant_config, include_embedding=True, untie_embedding_weights=True, modules_to_not_convert=[])
+
 quantized_model = AutoModelForCausalLM.from_pretrained(untied_model_id, torch_dtype=torch.float32, device_map="auto", quantization_config=quantization_config)
 tokenizer = AutoTokenizer.from_pretrained(model_id)
 
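The hunk above shows only the tail of the quantization snippet. For orientation, here is a sketch of the setup that surrounds it; the `embedding_config` definition and the import paths are assumptions (the README defines them above this hunk, and names can shift between torchao versions), while the `linear_config` arguments mirror the lines visible in the hunk.

```python
# Sketch of the configs around the lines shown in the hunk; not copied from the README.
import torch
from torchao.quantization.granularity import PerAxis, PerGroup
from torchao.quantization.quant_api import (
    Int8DynamicActivationIntxWeightConfig,
    IntxWeightOnlyConfig,
)

# Assumed 8-bit weight-only config for the embedding table (model.embed_tokens).
embedding_config = IntxWeightOnlyConfig(
    weight_dtype=torch.int8,
    granularity=PerAxis(0),
)

# 8-bit dynamic activations with 4-bit grouped weights (8da4w) for the linear layers;
# granularity and scale dtype match the hunk, weight_dtype=torch.int4 is assumed.
linear_config = Int8DynamicActivationIntxWeightConfig(
    weight_dtype=torch.int4,
    weight_granularity=PerGroup(32),
    weight_scale_dtype=torch.bfloat16,
)
```

These two configs are what `AOPerModuleConfig` maps onto `model.embed_tokens` and the default (linear) modules in the hunk.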
@@ -139,11 +141,6 @@ output_text = tokenizer.batch_decode(
     generated_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
 )
 print("Response:", output_text[0][len(prompt):])
-
-# Save to disk
-state_dict = quantized_model.state_dict()
-torch.save(state_dict, "phi4-mini-8da4w.bin")
-
 ```
 
 The response from the manual testing is:
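This hunk drops the explicit save-to-disk lines from the example. If you re-quantize the model yourself, the ExecuTorch conversion step in the next hunk still expects a local checkpoint file, so you would persist the state dict roughly as the removed lines did (a sketch; the filename here is chosen to match the conversion command below, whereas the removed lines used `phi4-mini-8da4w.bin`):

```python
# Sketch based on the removed lines: save the quantized weights so the
# ExecuTorch conversion script in the next section can read them.
state_dict = quantized_model.state_dict()
torch.save(state_dict, "pytorch_model.bin")
```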
@@ -196,8 +193,8 @@ lm_eval --model hf --model_args pretrained=pytorch/Phi-4-mini-instruct-8da4w --t
 We can run the quantized model on a mobile phone using [ExecuTorch](https://github.com/pytorch/executorch).
 Once ExecuTorch is [set-up](https://pytorch.org/executorch/main/getting-started.html), exporting and running the model on device is a breeze.
 
-We first convert the quantized checkpoint to one ExecuTorch's LLM export script expects by renaming some of the checkpoint keys.
-The following script does this for you. We have uploaded phi4-mini-8da4w-converted.bin
+We first convert the [quantized checkpoint](https://huggingface.co/pytorch/Phi-4-mini-instruct-8da4w/blob/main/pytorch_model.bin) to one ExecuTorch's LLM export script expects by renaming some of the checkpoint keys.
+The following script does this for you. We have uploaded the converted checkpoint [phi4-mini-8da4w-converted.bin](https://huggingface.co/pytorch/Phi-4-mini-instruct-8da4w/blob/main/phi4-mini-8da4w-converted.bin) for convenience.
 ```
 python -m executorch.examples.models.phi_4_mini.convert_weights pytorch_model.bin phi4-mini-8da4w-converted.bin
 ```
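If you use the uploaded checkpoint rather than one you quantized locally, `pytorch_model.bin` first has to be on disk. One way to fetch it (an assumption, not part of the README) is `huggingface_hub`:

```python
# Sketch: download the quantized checkpoint from this repo before running
# the convert_weights command above. Assumes `huggingface_hub` is installed.
from huggingface_hub import hf_hub_download

ckpt_path = hf_hub_download(
    repo_id="pytorch/Phi-4-mini-instruct-8da4w",
    filename="pytorch_model.bin",
)
print(ckpt_path)  # pass this path to convert_weights
```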
@@ -217,7 +214,7 @@ python -m executorch.examples.models.llama.export_llama \
     --output_name="phi4-mini-8da4w.pte"
 ```
 
-After that you can run the model in a mobile app (see start of README).
+After that you can run the model in a mobile app (see [Running in a mobile app](#running-in-a-mobile-app) at the start of the README).
 
 # Disclaimer
 PyTorch has not performed safety evaluations or red teamed the quantized models. Performance characteristics, outputs, and behaviors may differ from the original models. Users are solely responsible for selecting appropriate use cases, evaluating and mitigating for accuracy, safety, and fairness, ensuring security, and complying with all applicable laws and regulations.