Update README.md
README.md CHANGED
@@ -17,8 +17,10 @@ base_model:
 pipeline_tag: text-generation
 ---
 
-[Phi4-mini](https://huggingface.co/microsoft/Phi-4-mini-instruct) is quantized by the PyTorch team
-
+[Phi4-mini](https://huggingface.co/microsoft/Phi-4-mini-instruct) is quantized by the PyTorch team using [torchao](https://huggingface.co/docs/transformers/main/en/quantization/torchao) with 8-bit embeddings and 8-bit dynamic activations with 4-bit weight linears (8da4w).
+The model is suitable for mobile deployment with [ExecuTorch](https://github.com/pytorch/executorch).
+
+The quantized model can be exported to an ExecuTorch pte file, see [Exporting to ExecuTorch](#exporting-to-executorch). We also provide the [quantized pte](https://huggingface.co/pytorch/Phi-4-mini-instruct-8da4w/blob/main/phi4-mini-8da4w.pte) for direct use.
 
 # Running in a mobile app
 The [PTE file](https://huggingface.co/pytorch/Phi-4-mini-instruct-8da4w/blob/main/phi4-mini-8da4w.pte) can be run with ExecuTorch on a mobile phone. See the [instructions](https://pytorch.org/executorch/main/llm/llama-demo-ios.html) for doing this in iOS.
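If you just want to try the quantized weights on a desktop CPU/GPU rather than on device, a minimal loading sketch (not part of the diff) is below. It assumes `torchao` and a recent `transformers` are installed, and that the repo loads directly with `from_pretrained`, as the `lm_eval` invocation further down the README implies.

```python
# Sketch: load the published torchao-quantized checkpoint for direct use.
# Assumes `torchao` and a recent `transformers` release are installed.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "pytorch/Phi-4-mini-instruct-8da4w"
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float32,  # the quantization snippet below also uses float32
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(model_id)

inputs = tokenizer("What is 2 + 2?", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=32)[0]))
```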
@@ -102,9 +104,9 @@ linear_config = Int8DynamicActivationIntxWeightConfig(
     weight_granularity=PerGroup(32),
     weight_scale_dtype=torch.bfloat16,
 )
-
 quant_config = AOPerModuleConfig({"_default": linear_config, "model.embed_tokens": embedding_config})
 quantization_config = TorchAoConfig(quant_type=quant_config, include_embedding=True, untie_embedding_weights=True, modules_to_not_convert=[])
+
 quantized_model = AutoModelForCausalLM.from_pretrained(untied_model_id, torch_dtype=torch.float32, device_map="auto", quantization_config=quantization_config)
 tokenizer = AutoTokenizer.from_pretrained(model_id)
 
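The hunk above shows only the tail of the quantization snippet. For orientation, here is a sketch of the setup that surrounds it; the `embedding_config` definition and the import paths are assumptions (the README defines them above this hunk, and names can shift between torchao versions), while the `linear_config` arguments mirror the lines visible in the hunk.

```python
# Sketch of the configs around the lines shown in the hunk; not copied from the README.
import torch
from torchao.quantization.granularity import PerAxis, PerGroup
from torchao.quantization.quant_api import (
    Int8DynamicActivationIntxWeightConfig,
    IntxWeightOnlyConfig,
)

# Assumed 8-bit weight-only config for the embedding table (model.embed_tokens).
embedding_config = IntxWeightOnlyConfig(
    weight_dtype=torch.int8,
    granularity=PerAxis(0),
)

# 8-bit dynamic activations with 4-bit grouped weights (8da4w) for the linear layers;
# granularity and scale dtype match the hunk, weight_dtype=torch.int4 is assumed.
linear_config = Int8DynamicActivationIntxWeightConfig(
    weight_dtype=torch.int4,
    weight_granularity=PerGroup(32),
    weight_scale_dtype=torch.bfloat16,
)
```

These two configs are what `AOPerModuleConfig` maps onto `model.embed_tokens` and the default (linear) modules in the hunk.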
@@ -139,11 +141,6 @@ output_text = tokenizer.batch_decode(
     generated_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
 )
 print("Response:", output_text[0][len(prompt):])
-
-# Save to disk
-state_dict = quantized_model.state_dict()
-torch.save(state_dict, "phi4-mini-8da4w.bin")
-
 ```
 
 The response from the manual testing is:
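This hunk drops the explicit save-to-disk lines from the example. If you re-quantize the model yourself, the ExecuTorch conversion step in the next hunk still expects a local checkpoint file, so you would persist the state dict roughly as the removed lines did (a sketch; the filename here is chosen to match the conversion command below, whereas the removed lines used `phi4-mini-8da4w.bin`):

```python
# Sketch based on the removed lines: save the quantized weights so the
# ExecuTorch conversion script in the next section can read them.
state_dict = quantized_model.state_dict()
torch.save(state_dict, "pytorch_model.bin")
```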
@@ -196,8 +193,8 @@ lm_eval --model hf --model_args pretrained=pytorch/Phi-4-mini-instruct-8da4w --t
 We can run the quantized model on a mobile phone using [ExecuTorch](https://github.com/pytorch/executorch).
 Once ExecuTorch is [set-up](https://pytorch.org/executorch/main/getting-started.html), exporting and running the model on device is a breeze.
 
-We first convert the quantized checkpoint to one ExecuTorch's LLM export script expects by renaming some of the checkpoint keys.
-The following script does this for you. We have uploaded phi4-mini-8da4w-converted.bin
+We first convert the [quantized checkpoint](https://huggingface.co/pytorch/Phi-4-mini-instruct-8da4w/blob/main/pytorch_model.bin) to one ExecuTorch's LLM export script expects by renaming some of the checkpoint keys.
+The following script does this for you. We have uploaded the converted checkpoint [phi4-mini-8da4w-converted.bin](https://huggingface.co/pytorch/Phi-4-mini-instruct-8da4w/blob/main/phi4-mini-8da4w-converted.bin) for convenience.
 ```
 python -m executorch.examples.models.phi_4_mini.convert_weights pytorch_model.bin phi4-mini-8da4w-converted.bin
 ```
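If you use the uploaded checkpoint rather than one you quantized locally, `pytorch_model.bin` first has to be on disk. One way to fetch it (an assumption, not part of the README) is `huggingface_hub`:

```python
# Sketch: download the quantized checkpoint from this repo before running
# the convert_weights command above. Assumes `huggingface_hub` is installed.
from huggingface_hub import hf_hub_download

ckpt_path = hf_hub_download(
    repo_id="pytorch/Phi-4-mini-instruct-8da4w",
    filename="pytorch_model.bin",
)
print(ckpt_path)  # pass this path to convert_weights
```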
@@ -217,7 +214,7 @@ python -m executorch.examples.models.llama.export_llama \
     --output_name="phi4-mini-8da4w.pte"
 ```
 
-After that you can run the model in a mobile app (see start of README).
+After that you can run the model in a mobile app (see [Running in a mobile app](#running-in-a-mobile-app) at the start of the README).
 
 # Disclaimer
 PyTorch has not performed safety evaluations or red teamed the quantized models. Performance characteristics, outputs, and behaviors may differ from the original models. Users are solely responsible for selecting appropriate use cases, evaluating and mitigating for accuracy, safety, and fairness, ensuring security, and complying with all applicable laws and regulations.