# Qwen-2.5-14B & Phi-4: Instruction-Tuned Models

## Overview

We fine-tuned large language models, including **Qwen-2.5-14B-Instruct** and **Phi-4**, on a mixed-language dataset [dataset placeholder](). The resulting models deliver state-of-the-art results across multiple benchmarks, excelling in both English and Hindi evaluation tasks.

## Benchmark Results

We evaluated our models on several well-known benchmarks and compared them against other leading LLMs.

### **English Benchmarks**

The following table presents the performance of our models and other LLMs on English benchmarks:

| Model | ARC-C | ARC-E | BoolQ | CMCQ | MMLU | Average* | MMLU-Pro | GPQA | MuSR | BBH | MATH |
|---------------------------------|-------|-------|-------|-------|-------|----------|----------|------|-------|-------|-------|
| AryaBhatta-GemmaUltra-8.5B | 22.70 | 25.04 | 22.95 | 62.23 | 23.70 | 31.32 | 22.66 | 25.34 | 42.72 | 41.12 | 2.95 |
| Airavata-7B | 25.09 | 30.47 | 25.31 | 62.17 | 33.20 | 35.25 | 16.35 | 27.43 | 37.57 | 36.00 | 13.60 |
| Nemotron-4-Mini-Hindi-Instruct | 55.80 | 71.63 | 62.11 | 68.10 | 43.20 | 60.17 | 25.95 | 30.87 | 41.53 | 40.11 | 2.04 |
| Llama-3-Nanda-10B-Chat | 65.36 | 80.64 | 82.29 | 67.60 | 50.61 | 69.30 | - | - | - | - | - |
| Krutrim-2-12b-instruct | 67.32 | 81.10 | 84.74 | 76.30 | 56.10 | 73.11 | - | - | - | - | - |
| aya-expanse-8b | 74.06 | 87.08 | 86.45 | 83.30 | 56.89 | 77.56 | 30.04 | 30.29 | 37.17 | 49.42 | 7.02 |
| aya-expanse-32B | 85.41 | 95.08 | 90.43 | 89.80 | 69.71 | 86.08 | 41.30 | 32.55 | 38.62 | 56.29 | 13.37 |
| **Our Qwen Model (14B)** | **90.61** | **94.82** | **88.53** | **90.70** | **75.00** | **87.93** | **52.63** | **36.24** | **44.84** | **64.97** | **25.08** |
| **Our Phi Model (14B)** | **97.39** | **92.24** | **87.65** | **87.40** | **75.59** | **88.05** | **52.39** | **39.77** | **49.07** | **66.97** | **23.11** |

\*The Average column is computed over the first five benchmarks only (ARC-C, ARC-E, BoolQ, CMCQ, MMLU).

### **Hindi Benchmarks**

The table below reports results of our models on the Hindi benchmarks:

| Model | ARC-C | ARC-E | BoolQ | CMCQ | MMLU | Average |
|------------------------------------|-------|-------|-------|-------|-------|---------|
| AryaBhatta-GemmaUltra-8.5B | 22.70 | 25.08 | 22.95 | 62.17 | 23.80 | 31.34 |
| Airavata-7B | 22.87 | 25.13 | 23.28 | 62.17 | 33.20 | 33.33 |
| Llama-3-Nanda-10B-Chat | 45.99 | 60.56 | 71.96 | 54.70 | 36.35 | 53.91 |
| aya-expanse-32B | 73.29 | 85.48 | 87.73 | 79.70 | 56.96 | 76.63 |
| **Our Qwen Model (14B)** | **74.06** | **81.23** | **84.07** | **78.20** | **53.85** | **74.82** |
| **Our Phi Model (14B)** | **81.74** | **89.06** | **86.02** | **78.70** | **56.39** | **78.38** |

### **Model Performance Improvements**

We measured the gains of our Qwen and Phi models over their base versions using log-likelihood-based evaluations (a minimal scoring sketch is provided after the conclusion):

| Benchmark | Lang | Qwen-2.5-14B-Instruct | Our Qwen | Change | Phi-4 | Our Phi | Change |
|----------------|------|----------------------|----------|--------|-------|---------|--------|
| ARC-Easy | En | 95.45 | 94.82 | 🔻 0.63 | 97.31 | 97.39 | 🔼 0.08 |
| BoolQ | Hi | 78.89 | 84.07 | 🔼 5.18 | 82.72 | 86.02 | 🔼 3.30 |
| MMLU-Pro | En | 49.04 | 52.63 | 🔼 3.59 | 53.78 | 52.39 | 🔻 1.39 |
| MATH hard | En | 0.00 | 25.08 | 🔼 25.08 | 12.31 | 23.11 | 🔼 10.80 |
| GPQA | En | 32.21 | 36.24 | 🔼 4.03 | 33.72 | 39.77 | 🔼 6.05 |

## Conclusion

Our fine-tuned **Qwen-2.5-14B-Instruct** and **Phi-4** models demonstrate strong performance across a variety of benchmarks, improving substantially upon existing models in both **English** and **Hindi**. These results validate the effectiveness of our **instruction-tuning methodology** on diverse datasets.
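### **Log-Likelihood Scoring (Illustrative Sketch)**

The improvement numbers above come from log-likelihood-based multiple-choice scoring: each answer option is appended to the question, the log-probabilities the model assigns to the option tokens are summed, and the highest-scoring option is taken as the prediction. The sketch below illustrates that idea; the function names and prompt format are assumptions for illustration, not our exact evaluation harness.

```python
# Sketch of log-likelihood multiple-choice scoring (the scheme used by
# harnesses such as lm-evaluation-harness). Names here are illustrative only.
import torch
import torch.nn.functional as F

@torch.no_grad()
def option_loglikelihood(model, tokenizer, context: str, option: str) -> float:
    """Sum of log-probs the model assigns to `option` tokens given `context`."""
    # Assumes the context tokenization is a prefix of the full tokenization,
    # which holds for most tokenizers when no EOS token is appended.
    ctx_ids = tokenizer(context, return_tensors="pt").input_ids.to(model.device)
    full_ids = tokenizer(context + option, return_tensors="pt").input_ids.to(model.device)

    logits = model(full_ids).logits                # (1, seq_len, vocab)
    log_probs = F.log_softmax(logits, dim=-1)

    # Score only the option tokens; logits at position t predict token t+1.
    n_ctx = ctx_ids.shape[-1]
    option_ids = full_ids[0, n_ctx:]
    option_logps = log_probs[0, n_ctx - 1 : full_ids.shape[-1] - 1, :]
    return option_logps.gather(-1, option_ids.unsqueeze(-1)).sum().item()

def predict_choice(model, tokenizer, question: str, options: list[str]) -> int:
    """Return the index of the option with the highest log-likelihood."""
    scores = [
        option_loglikelihood(model, tokenizer, f"Question: {question}\nAnswer:", f" {opt}")
        for opt in options
    ]
    return int(torch.tensor(scores).argmax())
```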
---

### Citation

If you use our models in your research, please cite:

```bibtex
@article{qwen_phi_2024,
  title={Instruction-Tuned Large Language Models},
  author={Your Name et al.},
  year={2024},
  publisher={Your Organization}
}
```
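### **Quick Start (Illustrative)**

For completeness, a minimal inference sketch with Hugging Face `transformers` is shown below. The repository ID is a placeholder rather than a released checkpoint name, and the generation settings are illustrative.

```python
# Minimal inference sketch with Hugging Face transformers.
# NOTE: "your-org/qwen-2.5-14b-hi-en-instruct" is a placeholder repo ID,
# not the actual published checkpoint.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "your-org/qwen-2.5-14b-hi-en-instruct"  # placeholder
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,   # ~28 GB of weights for a 14B model in bf16
    device_map="auto",
)

# Qwen-2.5-Instruct and Phi-4 both ship chat templates, so the same
# prompting pattern works for English or Hindi instructions.
messages = [{"role": "user", "content": "भारत की राजधानी क्या है?"}]  # "What is the capital of India?"
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output_ids = model.generate(input_ids, max_new_tokens=128, do_sample=False)
print(tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True))
```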