metascroy committed on
Commit 5a39c0e · verified · 1 Parent(s): 75a64a1

Update README.md

Files changed (1)
  1. README.md +8 -11
README.md CHANGED
@@ -17,8 +17,10 @@ base_model:
 pipeline_tag: text-generation
 ---
 
-[Phi4-mini](https://huggingface.co/microsoft/Phi-4-mini-instruct) is quantized by the PyTorch team with [torchao](https://huggingface.co/docs/transformers/main/en/quantization/torchao) with 8-bit embeddings, and 8-bit dynamic activation with int4 weights (8da4w) linears.
-You can export the quantized model to an [ExecuTorch](https://github.com/pytorch/executorch) pte file, or use the [quantized pte](https://huggingface.co/pytorch/Phi-4-mini-instruct-8da4w/blob/main/phi4-mini-8da4w.pte) file directly to run on a mobile device.
+[Phi4-mini](https://huggingface.co/microsoft/Phi-4-mini-instruct) is quantized by the PyTorch team using [torchao](https://huggingface.co/docs/transformers/main/en/quantization/torchao) with 8-bit embeddings and 8-bit dynamic activation with 4-bit weight (8da4w) linears.
+The model is suitable for mobile deployment with [ExecuTorch](https://github.com/pytorch/executorch).
+
+The quantized model can be exported to an ExecuTorch pte file; see [Exporting to ExecuTorch](#exporting-to-executorch). We also provide the [quantized pte](https://huggingface.co/pytorch/Phi-4-mini-instruct-8da4w/blob/main/phi4-mini-8da4w.pte) for direct use.
 
 # Running in a mobile app
 The [PTE file](https://huggingface.co/pytorch/Phi-4-mini-instruct-8da4w/blob/main/phi4-mini-8da4w.pte) can be run with ExecuTorch on a mobile phone. See the [instructions](https://pytorch.org/executorch/main/llm/llama-demo-ios.html) for doing this in iOS.
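
Because the torchao quantization config is stored with the uploaded checkpoint, the pre-quantized model can also be loaded straight from the Hub without re-running the recipe in the README. A minimal sketch, assuming torchao is installed and transformers applies the stored config automatically:

```
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "pytorch/Phi-4-mini-instruct-8da4w"
# No quantization_config is passed here: the assumption is that transformers
# picks up the torchao config serialized with the checkpoint.
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(model_id)
```
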
@@ -102,9 +104,9 @@ linear_config = Int8DynamicActivationIntxWeightConfig(
     weight_granularity=PerGroup(32),
     weight_scale_dtype=torch.bfloat16,
 )
-
 quant_config = AOPerModuleConfig({"_default": linear_config, "model.embed_tokens": embedding_config})
 quantization_config = TorchAoConfig(quant_type=quant_config, include_embedding=True, untie_embedding_weights=True, modules_to_not_convert=[])
+
 quantized_model = AutoModelForCausalLM.from_pretrained(untied_model_id, torch_dtype=torch.float32, device_map="auto", quantization_config=quantization_config)
 tokenizer = AutoTokenizer.from_pretrained(model_id)
 
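
A quick, illustrative sanity check that the config above was applied (not part of the README): after `from_pretrained` returns, quantized linear weights are torchao tensor subclasses rather than plain tensors.

```
import torch

# Assumes `quantized_model` from the snippet above.
for name, module in quantized_model.named_modules():
    if isinstance(module, torch.nn.Linear):
        # Quantized linears report a torchao weight tensor subclass here.
        print(name, type(module.weight).__name__)
        break
```
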
 
@@ -139,11 +141,6 @@ output_text = tokenizer.batch_decode(
     generated_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
 )
 print("Response:", output_text[0][len(prompt):])
-
-# Save to disk
-state_dict = quantized_model.state_dict()
-torch.save(state_dict, "phi4-mini-8da4w.bin")
-
 ```
 
 The response from the manual testing is:
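
If you want to persist the quantized model locally instead of raw-saving the state dict, transformers' own serialization can be used. A hedged sketch, assuming torchao tensor subclasses are not representable in safetensors and therefore need `safe_serialization=False`:

```
# Sketch, assuming `quantized_model` and `tokenizer` from the snippets above.
output_dir = "phi4-mini-8da4w"
quantized_model.save_pretrained(output_dir, safe_serialization=False)
tokenizer.save_pretrained(output_dir)
```
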
@@ -196,8 +193,8 @@ lm_eval --model hf --model_args pretrained=pytorch/Phi-4-mini-instruct-8da4w --t
 We can run the quantized model on a mobile phone using [ExecuTorch](https://github.com/pytorch/executorch).
 Once ExecuTorch is [set-up](https://pytorch.org/executorch/main/getting-started.html), exporting and running the model on device is a breeze.
 
-We first convert the quantized checkpoint to one ExecuTorch's LLM export script expects by renaming some of the checkpoint keys.
-The following script does this for you. We have uploaded phi4-mini-8da4w-converted.bin here for convenience.
+We first convert the [quantized checkpoint](https://huggingface.co/pytorch/Phi-4-mini-instruct-8da4w/blob/main/pytorch_model.bin) to the format ExecuTorch's LLM export script expects by renaming some of the checkpoint keys.
+The following script does this for you. We have uploaded the converted checkpoint [phi4-mini-8da4w-converted.bin](https://huggingface.co/pytorch/Phi-4-mini-instruct-8da4w/blob/main/phi4-mini-8da4w-converted.bin) for convenience.
 ```
 python -m executorch.examples.models.phi_4_mini.convert_weights pytorch_model.bin phi4-mini-8da4w-converted.bin
 ```
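
The conversion step above only remaps checkpoint key names into the layout the export script expects. A purely illustrative sketch of that kind of remapping (the real mapping lives in `executorch.examples.models.phi_4_mini.convert_weights`; the prefix used below is hypothetical):

```
import torch

# Load the quantized checkpoint; weights_only=False because torchao
# tensor subclasses are not plain tensors.
state_dict = torch.load("pytorch_model.bin", map_location="cpu", weights_only=False)

# Hypothetical rename; the actual script applies its own key mapping.
converted = {key.removeprefix("model."): value for key, value in state_dict.items()}

torch.save(converted, "phi4-mini-8da4w-converted.bin")
```
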
@@ -217,7 +214,7 @@ python -m executorch.examples.models.llama.export_llama \
     --output_name="phi4-mini-8da4w.pte"
 ```
 
-After that you can run the model in a mobile app (see start of README).
+After that you can run the model in a mobile app (see [Running in a mobile app](#running-in-a-mobile-app) at the start of the README).
 
 # Disclaimer
 PyTorch has not performed safety evaluations or red teamed the quantized models. Performance characteristics, outputs, and behaviors may differ from the original models. Users are solely responsible for selecting appropriate use cases, evaluating and mitigating for accuracy, safety, and fairness, ensuring security, and complying with all applicable laws and regulations.
 