fix swapped results of Mistral
README.md
CHANGED
@@ -131,7 +131,7 @@ The Open LLM Leaderboard evaluates models on various English language tasks, pro
 | Model                   | AVG   | arc_challenge | hellaswag | truthfulqa_mc2 | mmlu  | winogrande | gsm8k |
 |-------------------------|-------|---------------|-----------|----------------|-------|------------|-------|
 | **Bielik-11B-v2**       | **65.87** | 60.58     | 79.84     | 46.13          | 63.06 | 77.82      | 67.78 |
-| Mistral-7B-v0.2         | 60.37 | 60.84         | 83.08     |
+| Mistral-7B-v0.2         | 60.37 | 60.84         | 83.08     | 41.76          | 63.62 | 78.22      | 34.72 |
 | Bielik-7B-v0.1          | 49.98 | 45.22         | 67.92     | 47.16          | 43.20 | 66.85      | 29.49 |
 
 The results from the Open LLM Leaderboard demonstrate the impressive performance of Bielik-11B-v2 across various NLP tasks. With an average score of 65.87, it significantly outperforms its predecessor, Bielik-7B-v0.1, and even surpasses Mistral-7B-v0.2, which served as its initial weight basis.
@@ -139,8 +139,7 @@ The results from the Open LLM Leaderboard demonstrate the impressive performance
 Key observations:
 1. Bielik-11B-v2 shows substantial improvements in most categories compared to Bielik-7B-v0.1, highlighting the effectiveness of the model's enhancements.
 2. It performs exceptionally well in tasks like hellaswag (common sense reasoning), winogrande (commonsense reasoning), and gsm8k (mathematical problem-solving), indicating its versatility across different types of language understanding and generation tasks.
-3.
-4. While Mistral-7B-v0.2 outperforms in truthfulqa_mc2, Bielik-11B-v2 maintains competitive performance in this truth-discernment task.
+3. While Mistral-7B-v0.2 outperforms in truthfulqa_mc2, Bielik-11B-v2 maintains competitive performance in this truth-discernment task.
 
 Although Bielik-11B-v2 was primarily trained on Polish data, it has retained and even improved its ability to understand and operate in English, as evidenced by its strong performance across these English-language benchmarks. This suggests that the model has effectively leveraged cross-lingual transfer learning, maintaining its Polish language expertise while enhancing its English language capabilities.
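As a sanity check on the corrected Mistral row, the AVG column appears to be the plain arithmetic mean of the six benchmark scores; a minimal sketch, assuming that convention (small differences come from the displayed per-task scores themselves being rounded):

```python
# Hedged sanity check: treat AVG as the arithmetic mean of the six benchmark
# scores (arc_challenge, hellaswag, truthfulqa_mc2, mmlu, winogrande, gsm8k).
# The reported AVG may differ by ~0.01 because the displayed scores are rounded.
scores = {
    # model: ([per-task scores], reported AVG)
    "Bielik-11B-v2":   ([60.58, 79.84, 46.13, 63.06, 77.82, 67.78], 65.87),
    "Mistral-7B-v0.2": ([60.84, 83.08, 41.76, 63.62, 78.22, 34.72], 60.37),
    "Bielik-7B-v0.1":  ([45.22, 67.92, 47.16, 43.20, 66.85, 29.49], 49.98),
}

for model, (vals, reported) in scores.items():
    mean = sum(vals) / len(vals)
    print(f"{model}: mean={mean:.2f}, reported={reported}, diff={abs(mean - reported):.3f}")
```

With the fixed row, Mistral-7B-v0.2's mean comes out to its reported 60.37, which is what the truncated pre-fix row could not reproduce.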