We instruction-tuned models such as Qwen-2.5-14B-Instruct and Phi-4 on a mixed-language dataset.
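As a rough illustration of this fine-tuning setup, here is a minimal sketch using Hugging Face `transformers` and `datasets`. This is not the exact training recipe: the dataset file, column names, chat-template rendering, and hyperparameters below are placeholders.

```python
# Minimal sketch (illustrative, not the actual recipe): supervised fine-tuning
# of Qwen-2.5-14B-Instruct on a mixed-language instruction dataset.
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

model_name = "Qwen/Qwen2.5-14B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype="bfloat16")

# Hypothetical JSON file with "instruction" / "response" columns in both languages.
dataset = load_dataset("json", data_files="mixed_language_instructions.json")["train"]

def to_chat_text(example):
    # Render each (instruction, response) pair with the model's chat template.
    messages = [{"role": "user", "content": example["instruction"]},
                {"role": "assistant", "content": example["response"]}]
    return {"text": tokenizer.apply_chat_template(messages, tokenize=False)}

def tokenize(example):
    return tokenizer(example["text"], truncation=True, max_length=2048)

tokenized = dataset.map(to_chat_text).map(
    tokenize, remove_columns=dataset.column_names + ["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="qwen-2.5-14b-mixed-sft",
        per_device_train_batch_size=1,
        gradient_accumulation_steps=16,
        num_train_epochs=1,
        learning_rate=1e-5,
        bf16=True,
        logging_steps=10,
    ),
    train_dataset=tokenized,
    # Causal-LM collator: labels are the input tokens shifted inside the model.
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```

The benchmark results for the resulting models follow.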
| Model | ARC-C | ARC-E | BoolQ | CMCQ | MMLU | Average* | MMLU-Pro | GPQA | MuSR | BBH | MATH |
|---------------------------------|-------|-------|-------|-------|-------|----------|----------|------|-------|-------|-------|
| AryaBhatta-GemmaUltra-8.5B | 22.70 | 25.04 | 22.95 | 62.23 | 23.70 | 31.32 | 22.66 | 25.34| 42.72 | 41.12 | 2.95 |
| Airavata-7B | 25.09 | 30.47 | 25.31 | 62.17 | 33.20 | 35.25 | 16.35 | 27.43| 37.57 | 36.00 | 13.60 |
| sarvam-1-2B | 30.03 | 33.25 | 62.17 | 42.80 | 27.90 | 39.23 | - | - | - | - | - |
| Nemotron-4-Mini-Hindi-Instruct | 55.80 | 71.63 | 62.11 | 68.10 | 43.20 | 60.17 | 25.95 | 30.87| 41.53 | 40.11 | 2.04 |
| Llama-3-Nanda-10B-Chat | 65.36 | 80.64 | 82.29 | 67.60 | 50.61 | 69.30 | - | - | - | - | - |
| Krutrim-2-12b-instruct | 67.32 | 81.10 | 84.74 | 76.30 | 56.10 | 73.11 | - | - | - | - | - |
| aya-expanse-8b | 74.06 | 87.08 | 86.45 | 83.30 | 56.89 | 77.56 | 30.04 | 30.29| 37.17 | 49.42 | 7.02 |
| aya-expanse-32B | 85.41 | 95.08 | 90.43 | 89.80 | 69.71 | 86.08 | 41.30 | 32.55| 38.62 | 56.29 | 13.37 |
| **Our Qwen Model (14b)** | **90.61** | **94.82** | **88.53** | **90.70** | **75.00** | **87.93** | **52.63** | **36.24** | **44.84** | **64.97** | **25.08** |
| **Our Phi Model (14b)** | **97.39** | **92.24** | **87.65** | **87.40** | **75.59** | **88.05** | **52.39** | **39.77** | **49.07** | **66.97** | **23.11** |
**Table 1: Scores (%) of our Qwen-2.5-14B and Phi-4 models and other LLMs on English benchmarks.** \*Average is taken over ARC-C, ARC-E, BoolQ, CMCQ, and MMLU only.
| Model | ARC-C | ARC-E | BoolQ | CMCQ | MMLU | Average |
|------------------------------------|-------|-------|-------|-------|-------|---------|
| AryaBhatta-GemmaUltra-8.5B | 22.70 | 25.08 | 22.95 | 62.17 | 23.80 | 31.34 |
| Airavata-7B | 22.87 | 25.13 | 23.28 | 62.17 | 33.20 | 33.33 |
| sarvam-1-2B | 32.76 | 35.06 | 62.16 | 47.10 | 24.22 | 40.26 |
| Llama-3-Nanda-10B-Chat | 45.99 | 60.56 | 71.96 | 54.70 | 36.35 | 53.91 |
| Nemotron-4-Mini-Hindi-4B-Instruct | 50.68 | 63.72 | 68.74 | 51.30 | 37.18 | 54.32 |
| Krutrim-2-12b-instruct | 56.83 | 70.66 | 78.86 | 64.10 | 46.51 | 63.39 |
| aya-expanse-8b | 57.42 | 72.90 | 80.42 | 69.00 | 43.39 | 64.63 |
| aya-expanse-32B | 73.29 | 85.48 | 87.73 | 79.70 | 56.96 | 76.63 |
| **Our Qwen Model (14b)** | **74.06** | **81.23** | **84.07** | **78.20** | **53.85** | **74.82** |
| **Our Phi Model (14b)** | **81.74** | **89.06** | **86.02** | **78.70** | **56.39** | **78.38** |
**Table 2: Scores (%) of our Qwen-2.5-14B and Phi-4 models and other LLMs on Hindi benchmarks.**
| Benchmark | Lang | Qwen-2.5-14B-Instruct | Our Qwen | Change | Phi-4 | Our Phi | Change |
|----------------|------|----------------------|----------|--------|-------|---------|--------|
| ARC-Easy       | En | 95.45 | 94.82 | 🔻 0.63 | 97.31 | 97.39 | 🔼 0.08 |
|                | Hi | 78.49 | 81.23 | 🔼 2.74 | 86.87 | 89.06 | 🔼 2.19 |
| ARC-Challenge  | En | 90.87 | 90.61 | 🔻 0.26 | 92.41 | 92.24 | 🔻 0.17 |
|                | Hi | 69.62 | 74.06 | 🔼 4.44 | 79.18 | 81.74 | 🔼 2.56 |
| BoolQ          | En | 86.09 | 88.53 | 🔼 2.44 | 86.30 | 87.65 | 🔼 1.35 |
|                | Hi | 78.89 | 84.07 | 🔼 5.18 | 82.72 | 86.02 | 🔼 3.30 |
| Context-MCQ    | En | 91.20 | 90.70 | 🔻 0.50 | 86.30 | 87.40 | 🔼 1.10 |
|                | Hi | 77.40 | 78.20 | 🔼 0.80 | 75.70 | 78.70 | 🔼 3.00 |
| MMLU           | En | 74.37 | 75.00 | 🔼 0.63 | 74.67 | 75.59 | 🔼 0.92 |
|                | Hi | 52.16 | 53.85 | 🔼 1.69 | 53.24 | 56.39 | 🔼 3.15 |
| **Average**    | En | **87.60** | **87.93** | 🔼 0.33 | **87.40** | **88.05** | 🔼 0.65 |
|                | Hi | **71.31** | **74.82** | 🔼 3.51 | **75.54** | **78.38** | 🔼 2.84 |
| **Overall**    |    | **79.46** | **81.38** | 🔼 1.92 | **81.47** | **83.22** | 🔼 1.75 |
**Table 3: Performance of our fine-tuned Qwen-2.5-14B and Phi-4 models compared to their base models on each benchmark (evaluated via log-likelihood scoring).**
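The log-likelihood evaluation behind Table 3 scores each answer option by the summed log-probability the model assigns to it given the question, and counts the prediction correct if the gold option scores highest. The snippet below is our own minimal sketch of this idea, not the exact evaluation code; benchmark prompts and length normalization may differ.

```python
# Minimal sketch of multiple-choice scoring via log-likelihoods (illustrative only).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen2.5-14B-Instruct"  # or a fine-tuned checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype="bfloat16").eval()

def option_logprob(question: str, option: str) -> float:
    """Sum of log-probabilities of the option tokens conditioned on the question."""
    prompt_ids = tokenizer(question, return_tensors="pt").input_ids
    option_ids = tokenizer(option, add_special_tokens=False, return_tensors="pt").input_ids
    input_ids = torch.cat([prompt_ids, option_ids], dim=1)
    with torch.no_grad():
        logits = model(input_ids).logits
    # Logits at position t predict token t+1, so shift by one to align with the option.
    option_len = option_ids.shape[1]
    shifted = logits[0, -option_len - 1:-1]
    log_probs = torch.log_softmax(shifted.float(), dim=-1)
    token_lp = log_probs.gather(1, option_ids[0].unsqueeze(1)).squeeze(1)
    return token_lp.sum().item()

question = "Question: Which planet is known as the Red Planet?\nAnswer:"
options = [" Venus", " Mars", " Jupiter", " Saturn"]
scores = [option_logprob(question, o) for o in options]
prediction = options[scores.index(max(scores))]  # highest-likelihood option
```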
| Benchmark | Lang | Qwen-2.5-14B-Instruct | Our Qwen | Change | Phi-4 | Our Phi | Change |
|----------------|------|----------------------|----------|---------|-------|---------|---------|
| MMLU-Pro       | En | 49.04 | 52.63 | 🔼 3.59 | 53.78 | 52.39 | 🔻 1.39 |
| MATH hard      | En | 0.00  | 25.08 | 🔷 N/A  | 12.31 | 23.11 | 🔼 10.80 |
| GPQA           | En | 32.21 | 36.24 | 🔼 4.03 | 33.72 | 39.77 | 🔼 6.05 |
| MuSR           | En | 40.87 | 44.84 | 🔼 3.97 | 41.01 | 49.07 | 🔼 8.06 |
| BigBench-Hard  | En | 63.74 | 64.97 | 🔼 1.23 | 68.60 | 66.97 | 🔻 1.63 |
| **Average**    |    | **37.17** | **44.75** | 🔼 7.58 | **41.88** | **46.26** | 🔼 4.38 |
**Table 4: Performance of our fine-tuned Qwen-2.5-14B and Phi-4 models compared to their base models on each benchmark (evaluated with eval-harness).**
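A harness-style run like the one behind Table 4 can be sketched with EleutherAI's lm-evaluation-harness Python API. This is an assumption-laden illustration: the task names, few-shot settings, and harness version actually used for Table 4 are not specified here.

```python
# Sketch of a harness-style evaluation, assuming EleutherAI's lm-evaluation-harness
# (the `lm_eval` package) is installed; task names below are assumed, not confirmed.
from lm_eval import evaluator

results = evaluator.simple_evaluate(
    model="hf",
    model_args="pretrained=Qwen/Qwen2.5-14B-Instruct,dtype=bfloat16",
    tasks=["leaderboard_mmlu_pro", "leaderboard_gpqa", "leaderboard_musr",
           "leaderboard_bbh", "leaderboard_math_hard"],  # assumed task names
    batch_size=4,
)
# Print per-task metric dictionaries.
for task, metrics in results["results"].items():
    print(task, metrics)
```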