microsoft/phi-4 Quantized Models

Overview

These models apply GPTQ quantization to the base model microsoft/phi-4. Japanese text was used as the calibration data, with the aim of preserving performance in Japanese-language use.

Model Variants:
- nejumi/phi-4-GPTQ-Int4-calib-ja-1k
- nejumi/phi-4-GPTQ-Int8-calib-ja-1k
Base Model: microsoft/phi-4
Model Size: 14,659,507,200 parameters
Category: 10B ≤ parameters < 30B
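
A minimal loading sketch follows. It assumes the Hugging Face `transformers` library, `accelerate`, and a GPTQ backend (for example `optimum` with `auto-gptq` or `gptqmodel`) are installed; none of these requirements are stated in this card. The helper names `repo_for_bits` and `load_quantized` are hypothetical and exist only for illustration.

```python
# Repository ids are taken from the "Model Variants" list above.
VARIANTS = {
    4: "nejumi/phi-4-GPTQ-Int4-calib-ja-1k",
    8: "nejumi/phi-4-GPTQ-Int8-calib-ja-1k",
}


def repo_for_bits(bits: int) -> str:
    """Return the repository id for the requested bit width (4 or 8)."""
    if bits not in VARIANTS:
        raise ValueError(f"bits must be one of {sorted(VARIANTS)}, got {bits}")
    return VARIANTS[bits]


def load_quantized(bits: int = 4):
    """Load tokenizer and model (assumes transformers plus a GPTQ backend)."""
    # Imported lazily so repo_for_bits works without transformers installed.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    repo_id = repo_for_bits(bits)
    tokenizer = AutoTokenizer.from_pretrained(repo_id)
    # device_map="auto" relies on accelerate to place weights on available devices.
    model = AutoModelForCausalLM.from_pretrained(repo_id, device_map="auto")
    return tokenizer, model


if __name__ == "__main__":
    print(repo_for_bits(4))
```

Calling `load_quantized(8)` would select the Int8 variant instead. Whether a given backend version loads these checkpoints without extra settings has not been verified here.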

Quantization Parameters

🐝 Link to W&B

bits: 4 or 8
group_size: 128
perc_damp: 0.01
desc_act: True
use_exllama: False
model_seqlen: 2048
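
The `bits` and `group_size` values largely determine how much storage the weights need. The sketch below gives a rough estimate only. It assumes that every parameter is quantized (in practice, layers such as embeddings often stay in 16-bit) and that each group of 128 weights stores one 16-bit scale and one `bits`-wide zero point. Both are assumptions about a typical GPTQ layout, not details taken from this card.

```python
PARAMS = 14_659_507_200  # "Model Size" from the overview
GROUP_SIZE = 128         # group_size from the quantization parameters


def approx_quantized_gib(bits: int, params: int = PARAMS,
                         group_size: int = GROUP_SIZE) -> float:
    """Rough quantized weight storage in GiB (assumed layout, see lead-in)."""
    # Per-weight cost: the quantized value plus its share of the group's
    # 16-bit scale and `bits`-wide zero point (assumption).
    bits_per_weight = bits + (16 + bits) / group_size
    return params * bits_per_weight / 8 / 2**30


def fp16_gib(params: int = PARAMS) -> float:
    """Unquantized 16-bit weight storage in GiB, for comparison."""
    return params * 2 / 2**30


for bits in (4, 8):
    print(f"Int{bits}: ~{approx_quantized_gib(bits):.1f} GiB")
print(f"16-bit: ~{fp16_gib():.1f} GiB")
```

Actual checkpoint sizes and runtime memory will differ, since activations, the KV cache, and any unquantized layers are not counted.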

Performance Evaluation

Evaluation results are from Nejumi LLM Leaderboard 3 (W&B). Chart legend: blue = Original, orange = 8-bit, green = 4-bit.

Benchmark Overall Results

Model	GLP Average	ALT Average	Overall Average
phi-4 Int4	0.5815	0.6953	0.6384
phi-4 Int8	0.5948	0.7015	0.6482
phi-4 Original	0.5950	0.7005	0.6477
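
The reported overall averages are consistent with a plain mean of the GLP and ALT averages. The sketch below checks this against the table and expresses each overall score as a fraction of the original model's score. The equal-weight formula is inferred from the numbers, not from documented leaderboard rules, and the function name `retention` is hypothetical.

```python
# (GLP average, ALT average, reported overall average) copied from the table above.
RESULTS = {
    "phi-4 Int4": (0.5815, 0.6953, 0.6384),
    "phi-4 Int8": (0.5948, 0.7015, 0.6482),
    "phi-4 Original": (0.5950, 0.7005, 0.6477),
}


def retention(name: str, baseline: str = "phi-4 Original") -> float:
    """Overall average of `name` divided by the baseline's overall average."""
    return RESULTS[name][2] / RESULTS[baseline][2]


for name, (glp, alt, overall) in RESULTS.items():
    recomputed = (glp + alt) / 2
    # Reported values are rounded to four decimals, so allow a small tolerance.
    assert abs(recomputed - overall) < 1e-4
    print(f"{name}: overall {overall:.4f}, {retention(name):.1%} of Original")
```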

General Language Performance (GLP) Details

Subcategory	Int4	Int8	Original
Expression	0.8567	0.8717	0.8583
Translation	0.8458	0.8480	0.8457
Information Retrieval	0.8780	0.8806	0.8809
Reasoning	0.6400	0.5850	0.6550
Mathematical Reasoning	0.5400	0.5967	0.5817
Extraction	0.3304	0.3408	0.3470
Knowledge & QA	0.5587	0.5735	0.5685
MMLU_en	0.3035	0.2351	0.2158
Semantic Analysis	0.4220	0.5200	0.5070
Syntax Analysis	0.4399	0.4967	0.4903

Note: The low MMLU_en scores are due to the model's inability to strictly follow the required answer format for this benchmark, rather than reflecting its actual knowledge or reasoning capabilities.
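
Because MMLU_en is affected by answer formatting, it can be useful to see how much it moves the GLP average. The sketch below recomputes the average as an unweighted mean of the subcategories, which matches the reported GLP averages, and then repeats the calculation without MMLU_en. Excluding a subcategory is only an illustration and not part of the leaderboard methodology. The function name `glp_average` is hypothetical.

```python
# Subcategory scores copied from the GLP table, ordered (Int4, Int8, Original).
GLP = {
    "Expression": (0.8567, 0.8717, 0.8583),
    "Translation": (0.8458, 0.8480, 0.8457),
    "Information Retrieval": (0.8780, 0.8806, 0.8809),
    "Reasoning": (0.6400, 0.5850, 0.6550),
    "Mathematical Reasoning": (0.5400, 0.5967, 0.5817),
    "Extraction": (0.3304, 0.3408, 0.3470),
    "Knowledge & QA": (0.5587, 0.5735, 0.5685),
    "MMLU_en": (0.3035, 0.2351, 0.2158),
    "Semantic Analysis": (0.4220, 0.5200, 0.5070),
    "Syntax Analysis": (0.4399, 0.4967, 0.4903),
}
VARIANT_NAMES = ("Int4", "Int8", "Original")


def glp_average(exclude=()):
    """Unweighted mean per variant over subcategories not named in `exclude`."""
    rows = [scores for name, scores in GLP.items() if name not in exclude]
    return {
        variant: sum(row[i] for row in rows) / len(rows)
        for i, variant in enumerate(VARIANT_NAMES)
    }


print("all subcategories:", glp_average())
print("without MMLU_en:  ", glp_average(exclude=("MMLU_en",)))
```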

Alignment (ALT) Details

Subcategory	Int4	Int8	Original
Controllability	0.6908	0.6949	0.6938
Ethics & Morality	0.8800	0.9100	0.9000
Toxicity	0.8143	0.8121	0.8007
Bias	0.8858	0.8730	0.8650
Robustness	0.3717	0.4208	0.4226
Truthfulness	0.5292	0.4983	0.5206

Benchmark Scores

Benchmark	Int4	Int8	Original
JASTER (0-shot)	0.3880	0.4262	0.4186
JASTER (2-shot)	0.6136	0.6441	0.6398
MT-Bench	8.2438	8.2000	8.1313
LCTG	0.6860	0.6670	0.6750
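
MT-Bench is reported on a different scale from the other benchmarks, so absolute differences are not directly comparable across rows. The sketch below expresses each quantized score as a percentage change relative to the original model. The function name `relative_change` is hypothetical, and small differences may fall within run-to-run noise, which this card does not quantify.

```python
# Scores copied from the table above, ordered (Int4, Int8, Original).
BENCHMARKS = {
    "JASTER (0-shot)": (0.3880, 0.4262, 0.4186),
    "JASTER (2-shot)": (0.6136, 0.6441, 0.6398),
    "MT-Bench": (8.2438, 8.2000, 8.1313),
    "LCTG": (0.6860, 0.6670, 0.6750),
}


def relative_change(quantized: float, original: float) -> float:
    """Percentage change of a quantized score relative to the original score."""
    return (quantized - original) / original * 100


for name, (int4, int8, original) in BENCHMARKS.items():
    print(
        f"{name}: Int4 {relative_change(int4, original):+.1f}%, "
        f"Int8 {relative_change(int8, original):+.1f}%"
    )
```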

Model Characteristics & Evaluation

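- High Stability: Standard GPTQ quantization preserves performance well for this 14B-class model. The Int8 overall average matches the original (0.6482 vs. 0.6477), and Int4 trails it by about 0.009 (0.6384).
- Basic Tasks: All three variants score above 0.84 in expression, translation, and information retrieval. MT-Bench scores stay at the original model's level (8.13), which is very high for this model size; both quantized variants score slightly higher (8.24 for Int4, 8.20 for Int8).
- Alignment: Scores are particularly high for ethics & morality (0.88 to 0.91) and bias (0.8650 to 0.8858) across all three variants.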

License

These models follow the license of the base model, microsoft/phi-4. Please refer to the base model's license for details.