Phi-4 ZeroWw quantizations

  • For q4_k: output and embed tensors quantized to q8_0, all other tensors quantized to q4_k.
  • For q5_k, q6_k and q8_0: output and embed tensors kept at bf16, all other tensors quantized to q5_k, q6_k or q8_0 respectively.
  • For q8_0 --pure: all tensors quantized to q8_0 (no output/embed override).
  • The full BF16 conversion and imatrix variants of q5_k and q6_k are also available; a sketch for checking the resulting per-tensor types follows this list.
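The per-tensor layout can be checked directly from the resulting GGUF files. Below is a minimal sketch, assuming the gguf Python package that ships with llama.cpp (`pip install gguf`); the file name is one of the quants listed in the table below.

```python
# Minimal sketch: list every tensor and its quantization type, so the
# q8_0/bf16 output and embedding overrides can be verified.
# Assumption: the `gguf` Python package from llama.cpp is installed
# (`pip install gguf`).
from gguf import GGUFReader

reader = GGUFReader("phi-4.q8.q4.gguf")
for tensor in reader.tensors:
    # tensor_type is a GGMLQuantizationType enum (Q4_K, Q8_0, BF16, ...)
    print(f"{tensor.name:40s} {tensor.tensor_type.name}")
```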
| Quant | Type | File size | VRAM* |
|---|---|---|---|
| phi-4.q8.q4 | 4 bits per weight | 9.43 GB | 12.9 GB |
| phi-4.bf16.q5 | 5 bits per weight | 11.9 GB | 14.2 GB |
| phi-4.bf16.q5.im | 5 bits per weight | 11.9 GB | 14.2 GB |
| phi-4.bf16.q6 | 6 bits per weight | 13.2 GB | 15.5 GB |
| phi-4.bf16.q6.im | 6 bits per weight | 13.2 GB | 15.5 GB |
| phi-4.bf16.q8 | 8 bits per weight | 16.5 GB | 18.5 GB |
| phi-4.bf16.q8p | 8 bits per weight | 15.6 GB | 18.6 GB |
| phi-4.bf16 | 16 bits per weight | 29.3 GB | tbd |

*Approximate values at 16k context with an FP16 KV cache.
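As a rough cross-check of the table, the effective bits per weight can be derived from the file sizes and the 14.7B parameter count reported for the GGUF files further down. The sketch below assumes decimal gigabytes and ignores metadata overhead, so the numbers are approximate.

```python
# Back-of-the-envelope check: effective bits per weight from file size.
# Assumptions: 14.7e9 parameters, sizes in decimal GB as listed above.
# Effective bpw exceeds the nominal figure because output/embedding
# tensors are kept at q8_0 or bf16 and k-quants store per-block scales.
N_PARAMS = 14.7e9

sizes_gb = {
    "phi-4.q8.q4": 9.43,    # nominal 4 bpw
    "phi-4.bf16.q6": 13.2,  # nominal 6 bpw
    "phi-4.bf16.q8": 16.5,  # nominal 8 bpw
    "phi-4.bf16": 29.3,     # 16 bpw reference
}

for name, gb in sizes_gb.items():
    bpw = gb * 1e9 * 8 / N_PARAMS
    print(f"{name:15s} ~{bpw:.2f} effective bits per weight")
```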


ZeroWw quantization: huggingface.co/RobertSinclair

python convert_hf_to_gguf.py --outtype bf16 phi-4 --outfile phi-4.bf16.gguf
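The conversion command above assumes the original microsoft/phi-4 checkpoint is already present in a local phi-4 directory. One way to fetch it, shown as a sketch using huggingface_hub (not part of the original recipe):

```python
# Sketch only: download the base checkpoint into the phi-4/ directory
# expected by convert_hf_to_gguf.py. Assumes huggingface_hub is installed.
from huggingface_hub import snapshot_download

snapshot_download(repo_id="microsoft/phi-4", local_dir="phi-4")
```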

llama-quantize --allow-requantize --output-tensor-type q8_0 --token-embedding-type q8_0 phi-4.bf16.gguf phi-4.q8.q4.gguf q4_k
llama-quantize --allow-requantize --output-tensor-type bf16 --token-embedding-type bf16 phi-4.bf16.gguf phi-4.bf16.q5.gguf q5_k
llama-quantize --imatrix imatrix.dat --leave-output-tensor phi-4.bf16.gguf phi-4.bf16.q5.im.gguf q5_k
llama-quantize --allow-requantize --output-tensor-type bf16 --token-embedding-type bf16 phi-4.bf16.gguf phi-4.bf16.q6.gguf q6_k
llama-quantize --imatrix imatrix.dat --leave-output-tensor phi-4.bf16.gguf phi-4.bf16.q6.im.gguf q6_k
llama-quantize --allow-requantize --output-tensor-type bf16 --token-embedding-type bf16 phi-4.bf16.gguf phi-4.bf16.q8.gguf q8_0
llama-quantize --allow-requantize --pure phi-4.bf16.gguf phi-4.bf16.q8p.gguf q8_0
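The two *.im quants reference an imatrix.dat that is assumed to have been generated beforehand (llama.cpp's llama-imatrix tool with a calibration text is the usual route). For convenience, the sketch below simply replays the llama-quantize invocations above from Python; it assumes llama-quantize is on PATH and that phi-4.bf16.gguf and imatrix.dat exist in the working directory.

```python
# Sketch: replay the quantization commands above via subprocess.
# Assumptions: llama-quantize is on PATH; phi-4.bf16.gguf and imatrix.dat
# are in the current directory.
import subprocess

JOBS = [
    # (flags, output file, quant type) -- mirrors the commands above
    (["--allow-requantize", "--output-tensor-type", "q8_0", "--token-embedding-type", "q8_0"],
     "phi-4.q8.q4.gguf", "q4_k"),
    (["--allow-requantize", "--output-tensor-type", "bf16", "--token-embedding-type", "bf16"],
     "phi-4.bf16.q5.gguf", "q5_k"),
    (["--imatrix", "imatrix.dat", "--leave-output-tensor"],
     "phi-4.bf16.q5.im.gguf", "q5_k"),
    (["--allow-requantize", "--output-tensor-type", "bf16", "--token-embedding-type", "bf16"],
     "phi-4.bf16.q6.gguf", "q6_k"),
    (["--imatrix", "imatrix.dat", "--leave-output-tensor"],
     "phi-4.bf16.q6.im.gguf", "q6_k"),
    (["--allow-requantize", "--output-tensor-type", "bf16", "--token-embedding-type", "bf16"],
     "phi-4.bf16.q8.gguf", "q8_0"),
    (["--allow-requantize", "--pure"],
     "phi-4.bf16.q8p.gguf", "q8_0"),
]

for flags, outfile, qtype in JOBS:
    subprocess.run(["llama-quantize", *flags, "phi-4.bf16.gguf", outfile, qtype], check=True)
```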

Phi-4 Model Card

Phi-4 Technical Report

Model Summary

Developers: Microsoft Research

Description: phi-4 is a state-of-the-art open model built upon a blend of synthetic datasets, data from filtered public domain websites, and acquired academic books and Q&A datasets. The goal of this approach was to ensure that small capable models were trained with data focused on high quality and advanced reasoning. phi-4 underwent a rigorous enhancement and alignment process, incorporating both supervised fine-tuning and direct preference optimization to ensure precise instruction adherence and robust safety measures.

Architecture: 14B parameters, dense decoder-only Transformer model

Context length: 16384 tokens

Usage

Input Formats

Given the nature of the training data, phi-4 is best suited for prompts using the chat format as follows:

<|im_start|>system<|im_sep|>
You are a medieval knight and must provide explanations to modern people.<|im_end|>
<|im_start|>user<|im_sep|>
How should I explain the Internet?<|im_end|>
<|im_start|>assistant<|im_sep|>
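
A small illustration of assembling this prompt string programmatically. The helper name is hypothetical, and the exact whitespace handling in the official chat template may differ; this simply follows the layout shown above.

```python
# Sketch: build a phi-4 chat prompt following the format shown above.
# build_phi4_prompt is a hypothetical helper, not part of any library.
def build_phi4_prompt(messages: list[dict[str, str]]) -> str:
    parts = []
    for msg in messages:
        parts.append(f"<|im_start|>{msg['role']}<|im_sep|>\n{msg['content']}<|im_end|>\n")
    # Leave the assistant turn open so the model generates the reply.
    parts.append("<|im_start|>assistant<|im_sep|>\n")
    return "".join(parts)

prompt = build_phi4_prompt([
    {"role": "system", "content": "You are a medieval knight and must provide explanations to modern people."},
    {"role": "user", "content": "How should I explain the Internet?"},
])
print(prompt)
```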
GGUF: 14.7B params, phi3 architecture
Model tree for cmh/phi-4_ZeroWw

Base model: microsoft/phi-4