---
license: apache-2.0
---
# Model Card for Zamba2-1.2B-Instruct
Zamba2-1.2B-Instruct is obtained from Zamba2-1.2B by fine-tuning on instruction-following and chat datasets. Specifically:
1. SFT of the base [Zamba2-1.2B](https://huggingface.co/Zyphra/Zamba2-1.2B) model on [ultrachat_200k](https://huggingface.co/datasets/HuggingFaceH4/ultrachat_200k) and [Infinity-Instruct](https://huggingface.co/datasets/BAAI/Infinity-Instruct)
2. DPO of the SFT checkpoint on [ultrafeedback_binarized](https://huggingface.co/datasets/HuggingFaceH4/ultrafeedback_binarized), [orca_dpo_pairs](https://huggingface.co/datasets/Intel/orca_dpo_pairs), and [OpenHermesPreferences](https://huggingface.co/datasets/argilla/OpenHermesPreferences) (a rough sketch of this two-stage recipe is shown below)
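For illustration only, the recipe above can be approximated with Hugging Face TRL roughly as follows. This is a minimal sketch, assuming a recent TRL version: the dataset splits, trainer arguments, output directories, and hyperparameters are assumptions, not the exact configuration used to produce Zamba2-1.2B-Instruct.
```python
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import SFTConfig, SFTTrainer, DPOConfig, DPOTrainer

model_name = "Zyphra/Zamba2-1.2B"  # SFT starts from the base model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Stage 1: supervised fine-tuning on chat-formatted data
sft_dataset = load_dataset("HuggingFaceH4/ultrachat_200k", split="train_sft")
sft_trainer = SFTTrainer(
    model=model,
    args=SFTConfig(output_dir="zamba2-1.2b-sft"),  # placeholder hyperparameters
    train_dataset=sft_dataset,
    processing_class=tokenizer,
)
sft_trainer.train()

# Stage 2: DPO on preference pairs, starting from the SFT checkpoint
dpo_dataset = load_dataset("HuggingFaceH4/ultrafeedback_binarized", split="train_prefs")
dpo_trainer = DPOTrainer(
    model=sft_trainer.model,
    args=DPOConfig(output_dir="zamba2-1.2b-dpo"),  # placeholder hyperparameters
    train_dataset=dpo_dataset,
    processing_class=tokenizer,
)
dpo_trainer.train()
```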
Zamba2-1.2B-Instruct is a hybrid model composed of state-space ([Mamba2](https://github.com/state-spaces/mamba)) and transformer blocks.
## Quick start
### Prerequisites
To download Zamba2-1.2B-Instruct, clone Zyphra's fork of transformers:
1. `git clone https://github.com/Zyphra/transformers_zamba2.git`
2. `cd transformers_zamba2`
3. Install the repository: `pip install -e .`
4. `pip install accelerate`
You can run the model without the optimized Mamba2 kernels, but this is **not** recommended: it results in significantly higher latency and memory usage.
To run on CPU, specify `use_mamba_kernels=False` when loading the model with `AutoModelForCausalLM.from_pretrained`, as in the sketch below.
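For example, a minimal CPU-only load might look like the following (a sketch; `device_map="cpu"` is one reasonable choice, and generation on CPU will be slow):
```python
from transformers import AutoTokenizer, AutoModelForCausalLM

# CPU-only loading: disable the fused Mamba2 kernels, which require a CUDA device
tokenizer = AutoTokenizer.from_pretrained("Zyphra/Zamba2-1.2B-instruct")
model = AutoModelForCausalLM.from_pretrained(
    "Zyphra/Zamba2-1.2B-instruct",
    device_map="cpu",
    use_mamba_kernels=False,
)
```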
### Inference
```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch
# Instantiate model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("Zyphra/Zamba2-1.2B-instruct")
model = AutoModelForCausalLM.from_pretrained("Zyphra/Zamba2-1.2B-instruct", device_map="cuda", torch_dtype=torch.bfloat16)
# Format the input as a chat template
prompt = "What factors contributed to the fall of the Roman Empire?"
sample = [{'role': 'user', 'content': prompt}]
chat_sample = tokenizer.apply_chat_template(sample, tokenize=False)
# Tokenize input and generate output
input_ids = tokenizer(chat_sample, return_tensors='pt', add_special_tokens=False).to("cuda")
outputs = model.generate(**input_ids, max_new_tokens=150, return_dict_in_generate=False, output_scores=False, use_cache=True, num_beams=1, do_sample=False)
print(tokenizer.decode(outputs[0]))
```
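Multi-turn chat follows the same pattern. The snippet below is a minimal sketch that continues from the example above (it reuses `tokenizer`, `model`, `sample`, `input_ids`, and `outputs`); the follow-up prompt is illustrative.
```python
# Append the model's reply, then add a follow-up user turn
reply = tokenizer.decode(outputs[0][input_ids["input_ids"].shape[-1]:], skip_special_tokens=True)
sample.append({'role': 'assistant', 'content': reply})
sample.append({'role': 'user', 'content': "Which of those factors do historians consider most important?"})

# Re-apply the chat template to the full conversation and generate again
chat_sample = tokenizer.apply_chat_template(sample, tokenize=False)
input_ids = tokenizer(chat_sample, return_tensors='pt', add_special_tokens=False).to("cuda")
outputs = model.generate(**input_ids, max_new_tokens=150, use_cache=True, do_sample=False)
print(tokenizer.decode(outputs[0]))
```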
## Performance
Zamba2-1.2B-Instruct achieves leading instruction-following and multi-turn chat performance for a model of its size, and is competitive with strong models significantly larger than itself. For instance, Zamba2-1.2B-Instruct outperforms Gemma2-2B-Instruct, a very strong model over 2x its size, on MT-Bench.
| Model | Size | MT-Bench | IFEval |
|-------------|----|----|----|
| **Zamba2-1.2B-Instruct** | 1.2B | **59.53** | 41.45 |
| Gemma2-2B-Instruct | 2.7B | 51.69 | 42.20 |
| H2O-Danube-1.6B-Chat | 1.6B | 49.78 | 27.95 |
| StableLM-1.6B-Chat | 1.6B | 49.87 | 33.77 |
| SmolLM-1.7B-Instruct | 1.7B | 43.37 | 16.53 |
| Qwen2-1.5B-Instruct | 1.5B | N/A | 34.68 |
Moreover, due to its hybrid SSM-transformer architecture, Zamba2-1.2B-Instruct achieves low inference latency, fast generation, and a significantly smaller memory footprint than comparable transformer-based models.
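A quick way to check these properties on your own hardware is a throughput and peak-memory probe like the sketch below. The prompt, token counts, and reporting are illustrative only, not an official benchmark; results depend heavily on hardware, dtype, and sequence length.
```python
import time
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("Zyphra/Zamba2-1.2B-instruct")
model = AutoModelForCausalLM.from_pretrained(
    "Zyphra/Zamba2-1.2B-instruct", device_map="cuda", torch_dtype=torch.bfloat16
)

inputs = tokenizer("What factors contributed to the fall of the Roman Empire?", return_tensors="pt").to("cuda")

# Measure greedy decoding throughput and peak GPU memory
torch.cuda.reset_peak_memory_stats()
torch.cuda.synchronize()
start = time.time()
out = model.generate(**inputs, max_new_tokens=256, do_sample=False)
torch.cuda.synchronize()
elapsed = time.time() - start

new_tokens = out.shape[-1] - inputs["input_ids"].shape[-1]
print(f"{new_tokens / elapsed:.1f} tokens/s, "
      f"peak memory {torch.cuda.max_memory_allocated() / 1e9:.2f} GB")
```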