
microsoft/phi-4 Quantized Models

Overview

This repository provides GPTQ-quantized versions of microsoft/phi-4. Japanese text is used as the calibration data, with the aim of preserving performance in Japanese-language use.


Quantization Parameters (🐝 Link to W&B)

  • bits: 4 or 8
  • group_size: 128
  • perc_damp: 0.01
  • desc_act: True
  • use_exllama: False
  • model_seqlen: 2048
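
As a rough illustration of how the parameters above might be passed to a quantization run, the sketch below maps them onto the `GPTQConfig` class from Hugging Face `transformers`. This is an assumption: the model card does not say which tool was used, the mapping of `perc_damp` to `damp_percent` is a guess based on the name, and `calibration_texts` is a hypothetical stand-in for the Japanese calibration data.

```python
from typing import Any, Dict, List


def build_gptq_kwargs(bits: int) -> Dict[str, Any]:
    """Collect the parameters listed above as keyword arguments."""
    if bits not in (4, 8):
        raise ValueError("this model card lists only 4-bit and 8-bit variants")
    return {
        "bits": bits,
        "group_size": 128,
        # Assumption: `perc_damp` in the list corresponds to `damp_percent`.
        "damp_percent": 0.01,
        "desc_act": True,
        "use_exllama": False,
        "model_seqlen": 2048,
    }


def quantize(bits: int, calibration_texts: List[str]):
    """Quantize microsoft/phi-4 (needs transformers, optimum, a GPTQ backend, and a GPU)."""
    # Imported lazily so the parameter mapping can be inspected without these packages.
    from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig

    tokenizer = AutoTokenizer.from_pretrained("microsoft/phi-4")
    config = GPTQConfig(
        dataset=calibration_texts,  # hypothetical: a list of Japanese text samples
        tokenizer=tokenizer,
        **build_gptq_kwargs(bits),
    )
    return AutoModelForCausalLM.from_pretrained(
        "microsoft/phi-4", quantization_config=config, device_map="auto"
    )
```

Calling `build_gptq_kwargs(4)` or `build_gptq_kwargs(8)` returns the settings for the two variants; `quantize` is only meaningful on a machine with the required packages and hardware.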

Performance Evaluation

Evaluation results from Nejumi LLM Leaderboard 3 (W&B).

[Figure: leaderboard comparison chart. Blue: Original, Orange: 8bit, Green: 4bit]

Benchmark Overall Results

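| Model          | GLP Average | ALT Average | Overall Average |
|----------------|-------------|-------------|-----------------|
| phi-4 Int4     | 0.5815      | 0.6953      | 0.6384          |
| phi-4 Int8     | 0.5948      | 0.7015      | 0.6482          |
| phi-4 Original | 0.5950      | 0.7005      | 0.6477          |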
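
The overall averages above appear to be the unweighted mean of the GLP and ALT averages. That relationship is inferred from the published numbers, not taken from the leaderboard's documentation. The sketch below recomputes it and also expresses each variant's overall score as a fraction of the original's; the helper names are illustrative.

```python
# Scores copied from the Benchmark Overall Results table.
results = {
    "phi-4 Int4": {"glp": 0.5815, "alt": 0.6953, "overall": 0.6384},
    "phi-4 Int8": {"glp": 0.5948, "alt": 0.7015, "overall": 0.6482},
    "phi-4 Original": {"glp": 0.5950, "alt": 0.7005, "overall": 0.6477},
}


def mean_of_glp_and_alt(row):
    """Unweighted mean of the two category averages (assumed definition)."""
    return (row["glp"] + row["alt"]) / 2


def retention(model, baseline="phi-4 Original"):
    """Overall score of `model` as a fraction of the baseline's overall score."""
    return results[model]["overall"] / results[baseline]["overall"]


for name, row in results.items():
    recomputed = mean_of_glp_and_alt(row)
    print(
        f"{name}: recomputed={recomputed:.4f} "
        f"reported={row['overall']:.4f} retention={retention(name):.2%}"
    )
```

Small differences in the last digit are expected, since the table values are already rounded to four decimal places.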

General Language Performance (GLP) Details

| Subcategory            | Int4   | Int8   | Original |
|------------------------|--------|--------|----------|
| Expression             | 0.8567 | 0.8717 | 0.8583   |
| Translation            | 0.8458 | 0.8480 | 0.8457   |
| Information Retrieval  | 0.8780 | 0.8806 | 0.8809   |
| Reasoning              | 0.6400 | 0.5850 | 0.6550   |
| Mathematical Reasoning | 0.5400 | 0.5967 | 0.5817   |
| Extraction             | 0.3304 | 0.3408 | 0.3470   |
| Knowledge & QA         | 0.5587 | 0.5735 | 0.5685   |
| MMLU_en                | 0.3035 | 0.2351 | 0.2158   |
| Semantic Analysis      | 0.4220 | 0.5200 | 0.5070   |
| Syntax Analysis        | 0.4399 | 0.4967 | 0.4903   |

Note: The low MMLU_en scores are due to the model's inability to strictly follow the required answer format for this benchmark, rather than reflecting its actual knowledge or reasoning capabilities.
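
To illustrate how a format mismatch can depress a score independently of knowledge, the sketch below compares a strict exact-match scorer with a more lenient one. It assumes a multiple-choice task answered with a single letter from A to D; the actual answer format and scoring logic used by the benchmark are not stated here, so both functions are hypothetical.

```python
import re


def strict_score(output: str, gold: str) -> int:
    """Score 1 only if the whole output is exactly the gold choice letter."""
    return int(output.strip() == gold)


def lenient_score(output: str, gold: str) -> int:
    """Score 1 if the first standalone A-D letter in the output matches gold."""
    match = re.search(r"\b([A-D])\b", output)
    return int(match is not None and match.group(1) == gold)


# A correct but verbose answer: right choice, wrong format.
verbose_answer = "The correct answer is B, because ..."
print("strict: ", strict_score(verbose_answer, "B"))
print("lenient:", lenient_score(verbose_answer, "B"))
```

Under the strict scorer the verbose answer counts as wrong even though it names the right choice, which is the kind of effect the note above describes.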

Alignment (ALT) Details

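| Subcategory       | Int4   | Int8   | Original |
|-------------------|--------|--------|----------|
| Controllability   | 0.6908 | 0.6949 | 0.6938   |
| Ethics & Morality | 0.8800 | 0.9100 | 0.9000   |
| Toxicity          | 0.8143 | 0.8121 | 0.8007   |
| Bias              | 0.8858 | 0.8730 | 0.8650   |
| Robustness        | 0.3717 | 0.4208 | 0.4226   |
| Truthfulness      | 0.5292 | 0.4983 | 0.5206   |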

Benchmark Scores

| Benchmark       | Int4   | Int8   | Original |
|-----------------|--------|--------|----------|
| JASTER (0-shot) | 0.3880 | 0.4262 | 0.4186   |
| JASTER (2-shot) | 0.6136 | 0.6441 | 0.6398   |
| MT-Bench        | 8.2438 | 8.2000 | 8.1313   |
| LCTG            | 0.6860 | 0.6670 | 0.6750   |

Model Characteristics & Evaluation

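  • High Stability: Standard GPTQ quantization holds up well for this 14B-class model. The Int8 overall average (0.6482) matches the original (0.6477), and Int4 (0.6384) stays within about 0.01 of it.
  • Basic Tasks: All three variants score above 0.84 in expression, translation, and information retrieval. MT-Bench scores for the quantized models (8.2438 for Int4, 8.2000 for Int8) are at or slightly above the original's 8.1313, which is very high for this model size.
  • Alignment: Scores are particularly high for ethics & morality (0.88 to 0.91) and bias (0.865 to 0.886) across all three variants.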

License

This model follows the license of its base model, microsoft/phi-4. Please refer to the base model's license for details.