djstrong committed
Commit 11f3f73
Parent: e0909da

fix swapped results of Mistral

Files changed (1)
1. README.md (+2 -3)
README.md CHANGED
@@ -131,7 +131,7 @@ The Open LLM Leaderboard evaluates models on various English language tasks, pro
 | Model | AVG | arc_challenge | hellaswag | truthfulqa_mc2 | mmlu | winogrande | gsm8k |
 |-------------------------|-------|---------------|-----------|----------------|-------|------------|-------|
 | **Bielik-11B-v2** | **65.87** | 60.58 | 79.84 | 46.13 | 63.06 | 77.82 | 67.78 |
-| Mistral-7B-v0.2 | 60.37 | 60.84 | 83.08 | 63.62 | 41.76 | 78.22 | 34.72 |
+| Mistral-7B-v0.2 | 60.37 | 60.84 | 83.08 | 41.76 | 63.62 | 78.22 | 34.72 |
 | Bielik-7B-v0.1 | 49.98 | 45.22 | 67.92 | 47.16 | 43.20 | 66.85 | 29.49 |
 
 The results from the Open LLM Leaderboard demonstrate the impressive performance of Bielik-11B-v2 across various NLP tasks. With an average score of 65.87, it significantly outperforms its predecessor, Bielik-7B-v0.1, and even surpasses Mistral-7B-v0.2, which served as its initial weight basis.
@@ -139,8 +139,7 @@ The results from the Open LLM Leaderboard demonstrate the impressive performance
 Key observations:
 1. Bielik-11B-v2 shows substantial improvements in most categories compared to Bielik-7B-v0.1, highlighting the effectiveness of the model's enhancements.
 2. It performs exceptionally well in tasks like hellaswag (common sense reasoning), winogrande (commonsense reasoning), and gsm8k (mathematical problem-solving), indicating its versatility across different types of language understanding and generation tasks.
-3. The model shows particular strength in MMLU (massive multitask language understanding), scoring 63.06 compared to Mistral-7B-v0.2's 41.76, demonstrating its broad knowledge base and understanding capabilities.
-4. While Mistral-7B-v0.2 outperforms in truthfulqa_mc2, Bielik-11B-v2 maintains competitive performance in this truth-discernment task.
+3. While Mistral-7B-v0.2 outperforms in truthfulqa_mc2, Bielik-11B-v2 maintains competitive performance in this truth-discernment task.
 
 Although Bielik-11B-v2 was primarily trained on Polish data, it has retained and even improved its ability to understand and operate in English, as evidenced by its strong performance across these English-language benchmarks. This suggests that the model has effectively leveraged cross-lingual transfer learning, maintaining its Polish language expertise while enhancing its English language capabilities.
 
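A note on why the transposition was easy to miss: the AVG column is the arithmetic mean of the six task scores, and swapping two values within a row leaves the mean unchanged, so Mistral-7B-v0.2's published average of 60.37 was correct both before and after this fix. Below is a minimal Python sketch (illustrative only, not part of the commit) that cross-checks each row of the corrected table against its AVG; the 0.05 tolerance is an assumption, allowing for the leaderboard computing averages from unrounded per-task results.

```python
# Sanity check for the leaderboard table: AVG should equal the mean of
# the six task scores. Values are copied from the corrected table above.
scores = {
    #                   arc_c, hellaswag, tqa_mc2, mmlu, winogrande, gsm8k
    "Bielik-11B-v2":   [60.58, 79.84, 46.13, 63.06, 77.82, 67.78],
    "Mistral-7B-v0.2": [60.84, 83.08, 41.76, 63.62, 78.22, 34.72],
    "Bielik-7B-v0.1":  [45.22, 67.92, 47.16, 43.20, 66.85, 29.49],
}
published_avg = {
    "Bielik-11B-v2": 65.87,
    "Mistral-7B-v0.2": 60.37,
    "Bielik-7B-v0.1": 49.98,
}

for model, tasks in scores.items():
    mean = sum(tasks) / len(tasks)
    # Swapping truthfulqa_mc2 and mmlu does not change this mean, which
    # is why the AVG column could not reveal the transposed columns.
    assert abs(mean - published_avg[model]) < 0.05, (model, mean)
    print(f"{model}: mean {mean:.2f} ~ published {published_avg[model]}")
```

A check like this confirms the rows are internally consistent, but catching a column swap requires comparing per-task values against the source leaderboard itself.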