metadata

license: mit
license_link: https://huggingface.co/microsoft/phi-2/resolve/main/LICENSE
library: llama.cpp
library_link: https://github.com/ggerganov/llama.cpp
base_model:
  - microsoft/phi-2
language:
  - en
pipeline_tag: text-generation
tags:
  - nlp
  - code
  - gguf

Phi-2 Model Card

Model Summary

Phi-2 is a Transformer-based model with 2.7 billion parameters. It was trained on a variety of high-quality sources, including the datasets used for Phi-1.5 and additional filtered web content for improved safety and educational value. This model provides a compact option for exploring safety in language models, such as toxicity reduction and controllability, making it a strong choice for research into responsible AI usage.

Primary Specializations:

QA Format: Handles concise question-answering and analogy prompts.
Code Format: Demonstrates strong performance in Python code generation.
Chat Format: Limited support for multi-turn dialogues and chat-based tasks.

Model Information

Architecture: Transformer
Parameter Count: 2.7B
Training Data: Filtered text datasets including websites and synthetic educational resources.
Intended Use: QA and Python code generation. Not recommended for broad general-purpose NLP tasks without further evaluation.

Quantized Model Files

The following GGUF files are provided for use with llama.cpp:

f16 (16-bit float precision):
- Half-precision GGUF conversion with a lower memory footprint than 32-bit weights while retaining most accuracy.
- Suitable for GPUs that cannot hold full 32-bit weights but have room for half-precision ones.
- Recommended Use: Code generation and QA tasks requiring higher accuracy.

q8_0 (8-bit integer precision):
- Quantized for reduced storage requirements, optimized for lightweight inference.
- Recommended for CPU-based inference or setups with memory constraints.
- Recommended Use: Chat-style completions and small-scale experiments.
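
To make the size trade-off between the two files concrete, the following is a rough back-of-the-envelope sketch rather than a measurement. It assumes every weight is stored at the nominal width of its format (16 bits for f16; 8.5 bits for q8_0, which packs 32 int8 values plus one 16-bit scale per block). It ignores file metadata, any tensors kept at a different precision, and the KV cache. The names `PARAMS`, `BITS_PER_WEIGHT`, and `estimate_gb` are illustrative only.

```python
# Rough weight-storage estimate for a 2.7B-parameter model.
PARAMS = 2.7e9  # parameter count stated in this card

# Nominal bits stored per weight.
# q8_0: 32 int8 weights + one 16-bit scale per block = (32 * 8 + 16) / 32 = 8.5
BITS_PER_WEIGHT = {"f16": 16.0, "q8_0": 8.5}


def estimate_gb(params: float, bits_per_weight: float) -> float:
    """Return approximate weight storage in gigabytes (10**9 bytes)."""
    return params * bits_per_weight / 8 / 1e9


if __name__ == "__main__":
    for name, bits in BITS_PER_WEIGHT.items():
        print(f"{name}: ~{estimate_gb(PARAMS, bits):.1f} GB")
```

Under these assumptions, q8_0 needs a little over half the storage of f16; actual file sizes will differ somewhat.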

Example: llama.cpp CLI

For interactive usage, try starting a chat session using:

./build/bin/llama-cli -m /mnt/scsm/models/microsoft/phi-2/ggml-model-f32.gguf --color --seed 1337 --ctx-size 2048 --n-predict -1 --threads 8 --repeat-penalty 1.25 --n-gpu-layers 32 --conversation --prompt "My name is Phi. I am a supportive and helpful assistant." 2> /dev/null
<|im_start|>system
My name is Phi. I am a supportive and helpful assistant.<|im_end|>

> Hello! My name is Austin. What is your name?
My name is Phi. Nice to meet you, Austin.<|im_end|>

>
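
The transcript above shows the system prompt wrapped in `<|im_start|>` and `<|im_end|>` markers. As a minimal sketch of that layout, the helper below renders a list of messages in the same style. Only the system turn is visible in the transcript, so the formatting of user and assistant turns and the open trailing assistant turn are assumptions based on the usual ChatML convention. `render_chatml` is a hypothetical name, not part of llama.cpp.

```python
def render_chatml(messages: list[dict[str, str]]) -> str:
    """Render messages in a ChatML-style layout and leave the assistant turn open."""
    parts = [
        f"<|im_start|>{message['role']}\n{message['content']}<|im_end|>\n"
        for message in messages
    ]
    # Assumption: generation continues from an open assistant turn.
    parts.append("<|im_start|>assistant\n")
    return "".join(parts)


if __name__ == "__main__":
    print(render_chatml([
        {"role": "system", "content": "My name is Phi. I am a supportive and helpful assistant."},
        {"role": "user", "content": "Hello! My name is Austin. What is your name?"},
    ]))
```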

Example: llama.cpp Server

./build/bin/llama-server -m /mnt/scsm/models/microsoft/phi-2/ggml-model-f16.gguf --port 8080 --threads $(nproc) --n-gpu-layers 32 --ctx-size 2048
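
Once the server is running, it can be queried over HTTP. The sketch below assumes the server is reachable at `127.0.0.1` on the port given above and that the build exposes the `/completion` endpoint accepting `prompt`, `n_predict`, and `stop` fields and returning a `content` field. Endpoint details can vary between llama.cpp versions, so treat this as a starting point. `build_payload` and `complete` are illustrative names.

```python
import json
import urllib.request
from typing import Optional

# Assumption: server started locally with --port 8080 as in the command above.
SERVER_URL = "http://127.0.0.1:8080"


def build_payload(prompt: str, n_predict: int = 128,
                  stop: Optional[list[str]] = None) -> dict:
    """Build the JSON body for a completion request."""
    return {"prompt": prompt, "n_predict": n_predict, "stop": stop or []}


def complete(prompt: str, **kwargs) -> str:
    """Send the prompt to the server and return the generated text."""
    body = json.dumps(build_payload(prompt, **kwargs)).encode("utf-8")
    request = urllib.request.Request(
        f"{SERVER_URL}/completion",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=60) as response:
        return json.loads(response.read())["content"]


if __name__ == "__main__":
    # Requires a running llama-server instance.
    print(complete("Instruct: What is a GGUF file?\nOutput:", stop=["<|endoftext|>"]))
```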

Performance and Intended Use

Running the server with a separate client is typically more user-friendly than the example CLI application. For example:

python -m llama_cpp_client.client --stop "</s>,<|endoftext|>,<|im_end|>" --session-name "test-phi-2-00" --completions

Example: QA Completion

**user**
Instruct: Explain why mathematics is often called the "language of the universe."
Output:

**completion**
Instruct: Explain why mathematics is often called the "language of the universe."
Output: Mathematics is often referred to as the "language of the universe" because it provides a way to describe and understand the natural world. Just as different languages have their own grammar and vocabulary, mathematics has its own set of rules and symbols. These rules allow us to communicate complex ideas and relationships in a precise and universal manner. Mathematics provides a framework for describing and analyzing patterns, structures, and processes that exist in the universe. It allows us to make predictions, solve problems, and uncover the underlying principles that govern the world around us. From the movement of celestial bodies to the behavior of subatomic particles, mathematics provides a common language through which we can explore and understand the mysteries of the universe.
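
The QA prompt above follows a fixed `Instruct: ... Output:` layout. A small helper like the one below (a sketch; `format_qa_prompt` is a hypothetical name) can build such prompts consistently before they are sent to the model.

```python
def format_qa_prompt(instruction: str) -> str:
    """Wrap an instruction in the Instruct/Output layout shown above."""
    return f"Instruct: {instruction.strip()}\nOutput:"


if __name__ == "__main__":
    print(format_qa_prompt(
        'Explain why mathematics is often called the "language of the universe."'
    ))
```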

Example: Code Completion

**user**
def generate_fibonacci(n: int) -> list[int]:
    """

**completion**
def generate_fibonacci(n: int) -> list:
    """Generate the Fibonacci sequence up to n."""
    result = []
    a, b = 0, 1
    while a < n:
        result.append(a)
        a, b = b, a + b
    return result
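
Generated code is worth checking before use. The sketch below copies the completion above verbatim and adds a few assertions. Note that the function returns every Fibonacci number strictly below `n`, not the first `n` terms; the test values are chosen for illustration.

```python
def generate_fibonacci(n: int) -> list:
    """Generate the Fibonacci sequence up to n."""
    result = []
    a, b = 0, 1
    while a < n:
        result.append(a)
        a, b = b, a + b
    return result


if __name__ == "__main__":
    # Values strictly below 10.
    assert generate_fibonacci(10) == [0, 1, 1, 2, 3, 5, 8]
    # Nothing is below 0, so the result is empty.
    assert generate_fibonacci(0) == []
    print(generate_fibonacci(10))
```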

Known Limitations

Attention Overflow in FP16: Phi-2 may experience attention overflow issues in FP16 precision. If you encounter this, try toggling autocast in the attention mechanism (PhiAttention.forward()): enable it if it is currently off, or disable it if it is on.
Niche Specialization: The model has been trained for QA, chat, and Python code formats. It might not perform well on unrelated tasks or creative writing.
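
To illustrate why FP16 attention can overflow, the sketch below checks whether a dot product of two vectors fits in half precision. It is not Phi-2's attention code: the head size and activation values are hypothetical and chosen only to push the result past the largest finite FP16 value (65504). The helper names are illustrative.

```python
import struct

FP16_MAX = 65504.0  # largest finite IEEE 754 half-precision value


def fits_in_fp16(value: float) -> bool:
    """Return True if a finite value can be stored as a half-precision float."""
    try:
        struct.pack("<e", value)  # "e" is the half-precision format code
    except OverflowError:
        return False
    return True


def attention_logit(query: list[float], key: list[float]) -> float:
    """Unscaled dot product of query and key, computed in 64-bit floats."""
    return sum(q * k for q, k in zip(query, key))


if __name__ == "__main__":
    head_dim = 80                # hypothetical head size, for illustration only
    query = [30.0] * head_dim    # hypothetical large activations
    key = [30.0] * head_dim
    logit = attention_logit(query, key)  # 30 * 30 * 80
    print(logit, fits_in_fp16(logit))
```

Keeping the accumulation in a wider type, which is roughly what changing the autocast setting affects, avoids this kind of overflow.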

Safety and Responsible Use

Phi-2 is provided under the MIT license for research and educational purposes. Users should be mindful of its limitations and evaluate outputs carefully before using them in real-world applications. The model was designed to reduce harmful completions, but it may still produce biased or undesirable results in some scenarios.

For more details, please refer to the official license.