# Qwen-2.5-14B & Phi-4: Instruction-Tuned Models

## Overview

We fine-tuned large language models, including **Qwen-2.5-14B-Instruct** and **Phi-4**, on a mixed-language dataset [dataset placeholder](). The resulting models deliver state-of-the-art results across multiple benchmarks, excelling in both English and Hindi evaluation tasks.

## Benchmark Results

We evaluated our models on several well-known benchmarks and compared them against other leading LLMs.

### **English Benchmarks**

The following table presents the performance of our models and other LLMs on English benchmarks:

| Model | ARC-C | ARC-E | BoolQ | CMCQ | MMLU | Average* | MMLU-Pro | GPQA | MuSR | BBH | MATH |
|---------------------------------|-------|-------|-------|-------|-------|----------|----------|------|-------|-------|-------|
| AryaBhatta-GemmaUltra-8.5B | 22.70 | 25.04 | 22.95 | 62.23 | 23.70 | 31.32 | 22.66 | 25.34 | 42.72 | 41.12 | 2.95 |
| Airavata-7B | 25.09 | 30.47 | 25.31 | 62.17 | 33.20 | 35.25 | 16.35 | 27.43 | 37.57 | 36.00 | 13.60 |
| Nemotron-4-Mini-Hindi-Instruct | 55.80 | 71.63 | 62.11 | 68.10 | 43.20 | 60.17 | 25.95 | 30.87 | 41.53 | 40.11 | 2.04 |
| Llama-3-Nanda-10B-Chat | 65.36 | 80.64 | 82.29 | 67.60 | 50.61 | 69.30 | - | - | - | - | - |
| Krutrim-2-12b-instruct | 67.32 | 81.10 | 84.74 | 76.30 | 56.10 | 73.11 | - | - | - | - | - |
| aya-expanse-8b | 74.06 | 87.08 | 86.45 | 83.30 | 56.89 | 77.56 | 30.04 | 30.29 | 37.17 | 49.42 | 7.02 |
| aya-expanse-32B | 85.41 | 95.08 | 90.43 | 89.80 | 69.71 | 86.08 | 41.30 | 32.55 | 38.62 | 56.29 | 13.37 |
| **Our Qwen Model (14B)** | **90.61** | **94.82** | **88.53** | **90.70** | **75.00** | **87.93** | **52.63** | **36.24** | **44.84** | **64.97** | **25.08** |
| **Our Phi Model (14B)** | **97.39** | **92.24** | **87.65** | **87.40** | **75.59** | **88.05** | **52.39** | **39.77** | **49.07** | **66.97** | **23.11** |

\*The Average column is computed over the first five benchmarks only (ARC-C, ARC-E, BoolQ, CMCQ, MMLU).

### **Hindi Benchmarks**

The table below reports results of our models on the Hindi benchmarks:

| Model | ARC-C | ARC-E | BoolQ | CMCQ | MMLU | Average |
|------------------------------------|-------|-------|-------|-------|-------|---------|
| AryaBhatta-GemmaUltra-8.5B | 22.70 | 25.08 | 22.95 | 62.17 | 23.80 | 31.34 |
| Airavata-7B | 22.87 | 25.13 | 23.28 | 62.17 | 33.20 | 33.33 |
| Llama-3-Nanda-10B-Chat | 45.99 | 60.56 | 71.96 | 54.70 | 36.35 | 53.91 |
| aya-expanse-32B | 73.29 | 85.48 | 87.73 | 79.70 | 56.96 | 76.63 |
| **Our Qwen Model (14B)** | **74.06** | **81.23** | **84.07** | **78.20** | **53.85** | **74.82** |
| **Our Phi Model (14B)** | **81.74** | **89.06** | **86.02** | **78.70** | **56.39** | **78.38** |

### **Model Performance Improvements**

We measured the gains of our Qwen and Phi models over their base versions using log-likelihood-based evaluations (a minimal scoring sketch is provided after the conclusion):

| Benchmark | Lang | Qwen-2.5-14B-Instruct | Our Qwen | Change | Phi-4 | Our Phi | Change |
|----------------|------|----------------------|----------|--------|-------|---------|--------|
| ARC-Easy | En | 95.45 | 94.82 | 🔻 0.63 | 97.31 | 97.39 | 🔼 0.08 |
| BoolQ | Hi | 78.89 | 84.07 | 🔼 5.18 | 82.72 | 86.02 | 🔼 3.30 |
| MMLU-Pro | En | 49.04 | 52.63 | 🔼 3.59 | 53.78 | 52.39 | 🔻 1.39 |
| MATH hard | En | 0.00 | 25.08 | 🔼 25.08 | 12.31 | 23.11 | 🔼 10.80 |
| GPQA | En | 32.21 | 36.24 | 🔼 4.03 | 33.72 | 39.77 | 🔼 6.05 |

## Conclusion

Our fine-tuned **Qwen-2.5-14B-Instruct** and **Phi-4** models demonstrate strong performance across a variety of benchmarks, improving substantially upon existing models in both **English** and **Hindi**. These results validate the effectiveness of our **instruction-tuning methodology** on diverse datasets.
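### **Log-Likelihood Scoring (Illustrative Sketch)**

The improvement numbers above come from log-likelihood-based multiple-choice scoring: each answer option is appended to the question, the log-probabilities the model assigns to the option tokens are summed, and the highest-scoring option is taken as the prediction. The sketch below illustrates that idea; the function names and prompt format are assumptions for illustration, not our exact evaluation harness.

```python
# Sketch of log-likelihood multiple-choice scoring (the scheme used by
# harnesses such as lm-evaluation-harness). Names here are illustrative only.
import torch
import torch.nn.functional as F

@torch.no_grad()
def option_loglikelihood(model, tokenizer, context: str, option: str) -> float:
    """Sum of log-probs the model assigns to `option` tokens given `context`."""
    # Assumes the context tokenization is a prefix of the full tokenization,
    # which holds for most tokenizers when no EOS token is appended.
    ctx_ids = tokenizer(context, return_tensors="pt").input_ids.to(model.device)
    full_ids = tokenizer(context + option, return_tensors="pt").input_ids.to(model.device)

    logits = model(full_ids).logits                # (1, seq_len, vocab)
    log_probs = F.log_softmax(logits, dim=-1)

    # Score only the option tokens; logits at position t predict token t+1.
    n_ctx = ctx_ids.shape[-1]
    option_ids = full_ids[0, n_ctx:]
    option_logps = log_probs[0, n_ctx - 1 : full_ids.shape[-1] - 1, :]
    return option_logps.gather(-1, option_ids.unsqueeze(-1)).sum().item()

def predict_choice(model, tokenizer, question: str, options: list[str]) -> int:
    """Return the index of the option with the highest log-likelihood."""
    scores = [
        option_loglikelihood(model, tokenizer, f"Question: {question}\nAnswer:", f" {opt}")
        for opt in options
    ]
    return int(torch.tensor(scores).argmax())
```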
---

### Citation

If you use our models in your research, please cite:

```bibtex
@article{qwen_phi_2024,
  title={Instruction-Tuned Large Language Models},
  author={Your Name et al.},
  year={2024},
  publisher={Your Organization}
}
```
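### **Quick Start (Illustrative)**

For completeness, a minimal inference sketch with Hugging Face `transformers` is shown below. The repository ID is a placeholder rather than a released checkpoint name, and the generation settings are illustrative.

```python
# Minimal inference sketch with Hugging Face transformers.
# NOTE: "your-org/qwen-2.5-14b-hi-en-instruct" is a placeholder repo ID,
# not the actual published checkpoint.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "your-org/qwen-2.5-14b-hi-en-instruct"  # placeholder
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,   # ~28 GB of weights for a 14B model in bf16
    device_map="auto",
)

# Qwen-2.5-Instruct and Phi-4 both ship chat templates, so the same
# prompting pattern works for English or Hindi instructions.
messages = [{"role": "user", "content": "भारत की राजधानी क्या है?"}]  # "What is the capital of India?"
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output_ids = model.generate(input_ids, max_new_tokens=128, do_sample=False)
print(tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True))
```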