Model Card: Phi‑4-Thinking (14B)
Model Overview
- Base Model: Microsoft's Phi‑4 (14B)
- Architecture: Converted into a Llama-compatible architecture to improve accuracy and facilitate efficient finetuning
- Key Enhancements:
- Tokenizer and Generation Fixes:
• Corrected EOS and PAD token handling (using tokens such as <|im_end|> and <|dummy_87|>)
• Adjusted chat template behavior to avoid unintended assistant prompt insertions
- Dynamic Quantization:
• Implemented dynamic 4‑bit quantization that keeps selected layers in 16‑bit, boosting accuracy while reducing VRAM usage by roughly 70% compared with standard setups (see the loading sketch after this list)
- GRPO Training:
• Integrated Unsloth’s Efficient GRPO algorithm for long-context training, enabling up to 12× longer context lengths and 10× longer reasoning chains while using significantly less memory
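Below is a minimal loading sketch using Unsloth's FastLanguageModel API. The repository name and sequence length are illustrative placeholders rather than confirmed values; the final checks simply verify the tokenizer fixes described above.

```python
# Minimal loading sketch; assumes the `unsloth` package. The repository name
# and sequence length are placeholders, not confirmed values.
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/phi-4",  # placeholder: substitute the actual checkpoint
    max_seq_length=8192,         # raise for long-context experiments
    load_in_4bit=True,           # dynamic 4-bit: selected layers remain in 16-bit
)

# Sanity-check the tokenizer fixes described above: EOS should be <|im_end|>
# and PAD should be a distinct token (<|dummy_87|>), not an alias of EOS.
print(tokenizer.eos_token)  # expected: <|im_end|>
print(tokenizer.pad_token)  # expected: <|dummy_87|>
assert tokenizer.pad_token != tokenizer.eos_token
```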
Intended Use
- Primary Applications:
Natural language understanding and generation tasks, including conversation, question answering, creative text generation, and other scenarios where long context and reasoning capabilities are beneficial (a minimal inference sketch follows at the end of this section).
- Research & Experimental:
This model is intended for research and experimentation. Users should carefully evaluate its behavior before deploying it in production, especially in safety-critical applications.
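The following is a hypothetical chat inference sketch using Hugging Face transformers; the repository name is a placeholder for whichever finetuned checkpoint you use. It also illustrates the chat template behavior noted under Key Enhancements.

```python
# Hypothetical chat inference sketch with Hugging Face transformers; the
# repository name is a placeholder for whichever finetuned checkpoint you use.
from transformers import AutoModelForCausalLM, AutoTokenizer

repo = "unsloth/phi-4"  # placeholder
tokenizer = AutoTokenizer.from_pretrained(repo)
model = AutoModelForCausalLM.from_pretrained(repo, device_map="auto")

messages = [
    {"role": "user", "content": "Explain the difference between EOS and PAD tokens."}
]
# add_generation_prompt=True appends exactly one assistant header; the fixed
# chat template avoids inserting a second, unintended assistant prompt.
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output_ids = model.generate(input_ids, max_new_tokens=256)
print(tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True))
```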
Training Process
- Data & Methodology:
- Finetuning was performed using open-source datasets.
- Training employed QLoRA applied across the key linear layers (query, key, value, output, gate, up, and down projections) to maximize efficiency (see the LoRA sketch after this section).
- The GRPO method was leveraged to enhance reasoning over extended context lengths (see the training sketch after this section).
- Infrastructure Improvements:
- The model now supports context lengths beyond 128K tokens on a single 24GB GPU; with the efficiency gains above, finetuning fits within roughly 15GB of VRAM.
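A minimal sketch of the QLoRA configuration described above, assuming Unsloth's get_peft_model helper and continuing from the loading sketch in the Model Overview. The rank and alpha values are illustrative, not the actual training hyperparameters.

```python
# Illustrative QLoRA setup over the named projection layers; r and lora_alpha
# are placeholder values, not the hyperparameters used to train this model.
from unsloth import FastLanguageModel  # `model` comes from the loading sketch above

model = FastLanguageModel.get_peft_model(
    model,
    r=16,                                        # illustrative LoRA rank
    lora_alpha=16,                               # illustrative scaling factor
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",  # attention: query/key/value/output
        "gate_proj", "up_proj", "down_proj",     # MLP: gate/up/down
    ],
    use_gradient_checkpointing="unsloth",        # memory savings for long contexts
)
```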
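And a hedged sketch of a GRPO training loop using TRL's GRPOTrainer, which Unsloth patches for its memory-efficient long-context variant. The dataset, reward function, and hyperparameters below are placeholders for illustration only.

```python
# Hedged GRPO training sketch using TRL's GRPOTrainer (Unsloth patches TRL for
# its memory-efficient long-context variant). The dataset, reward function, and
# hyperparameters below are placeholders for illustration.
from datasets import Dataset
from trl import GRPOConfig, GRPOTrainer

# Toy dataset: GRPO expects a "prompt" column.
train_dataset = Dataset.from_dict({"prompt": ["Solve step by step: 12 * 7 = ?"]})

def brevity_reward(completions, **kwargs):
    """Toy reward that mildly prefers shorter completions; replace with a
    task-specific reward (e.g., answer correctness) in practice."""
    return [-len(c) / 1000.0 for c in completions]

training_args = GRPOConfig(
    output_dir="phi4-grpo",      # placeholder output path
    max_prompt_length=512,
    max_completion_length=4096,  # room for long reasoning chains
    num_generations=4,           # completions sampled per prompt
)

trainer = GRPOTrainer(
    model=model,                 # the PEFT model from the QLoRA sketch above
    reward_funcs=[brevity_reward],
    args=training_args,
    train_dataset=train_dataset,
)
trainer.train()
```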
Evaluation
- Performance Benchmarks:
- Community evaluations (e.g., on the OpenLLM Leaderboard) indicate that the finetuned model matches or exceeds Microsoft’s official Phi‑4, especially on multiple-choice reasoning and creative generation tasks.
- Notable improvements in generation quality were observed, such as more accurate token prediction and enhanced reasoning output.
Limitations and Ethical Considerations
- Model Limitations:
- Despite extensive improvements, the model may still reflect biases or limitations present in the original training data.
- Extended context handling may sometimes lead to unexpected behavior; users should monitor outputs carefully.
- Ethical Use:
- As with all generative models, outputs should be evaluated for potential biases, factual inaccuracies, or harmful content.
- Users are encouraged to apply appropriate safeguards when deploying the model in public-facing or safety-critical systems.
Responsible Use and Attribution
- Intended Audience:
Researchers, developers, and practitioners interested in efficient finetuning and long-context reasoning applications.
- Attribution:
- “Phi‑4 Finetuning + Bug Fixes by Unsloth” (unsloth.ai/blog/phi4)
- “Long-context GRPO” (unsloth.ai/blog/grpo)
Contact and Further Information
- Support: For questions or feedback, contact [email protected]