djstrong committed
Commit 11f3f73
Parent: e0909da

fix swapped results of Mistral

Files changed (1)
1. README.md (+2 -3)
README.md CHANGED
@@ -131,7 +131,7 @@ The Open LLM Leaderboard evaluates models on various English language tasks, pro
 | Model | AVG | arc_challenge | hellaswag | truthfulqa_mc2 | mmlu | winogrande | gsm8k |
 |-------------------------|-------|---------------|-----------|----------------|-------|------------|-------|
 | **Bielik-11B-v2** | **65.87** | 60.58 | 79.84 | 46.13 | 63.06 | 77.82 | 67.78 |
-| Mistral-7B-v0.2 | 60.37 | 60.84 | 83.08 | 63.62 | 41.76 | 78.22 | 34.72 |
+| Mistral-7B-v0.2 | 60.37 | 60.84 | 83.08 | 41.76 | 63.62 | 78.22 | 34.72 |
 | Bielik-7B-v0.1 | 49.98 | 45.22 | 67.92 | 47.16 | 43.20 | 66.85 | 29.49 |
 
 The results from the Open LLM Leaderboard demonstrate the impressive performance of Bielik-11B-v2 across various NLP tasks. With an average score of 65.87, it significantly outperforms its predecessor, Bielik-7B-v0.1, and even surpasses Mistral-7B-v0.2, which served as its initial weight basis.
@@ -139,8 +139,7 @@ The results from the Open LLM Leaderboard demonstrate the impressive performance
 Key observations:
 1. Bielik-11B-v2 shows substantial improvements in most categories compared to Bielik-7B-v0.1, highlighting the effectiveness of the model's enhancements.
 2. It performs exceptionally well in tasks like hellaswag (common sense reasoning), winogrande (commonsense reasoning), and gsm8k (mathematical problem-solving), indicating its versatility across different types of language understanding and generation tasks.
-3. The model shows particular strength in MMLU (massive multitask language understanding), scoring 63.06 compared to Mistral-7B-v0.2's 41.76, demonstrating its broad knowledge base and understanding capabilities.
-4. While Mistral-7B-v0.2 outperforms in truthfulqa_mc2, Bielik-11B-v2 maintains competitive performance in this truth-discernment task.
+3. While Mistral-7B-v0.2 outperforms in truthfulqa_mc2, Bielik-11B-v2 maintains competitive performance in this truth-discernment task.
 
 Although Bielik-11B-v2 was primarily trained on Polish data, it has retained and even improved its ability to understand and operate in English, as evidenced by its strong performance across these English-language benchmarks. This suggests that the model has effectively leveraged cross-lingual transfer learning, maintaining its Polish language expertise while enhancing its English language capabilities.
 
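A note on why the transposition was easy to miss: the AVG column is the arithmetic mean of the six task scores, and swapping two values within a row leaves the mean unchanged, so Mistral-7B-v0.2's published average of 60.37 was correct both before and after this fix. Below is a minimal Python sketch (illustrative only, not part of the commit) that cross-checks each row of the corrected table against its AVG; the 0.05 tolerance is an assumption, allowing for the leaderboard computing averages from unrounded per-task results.

```python
# Sanity check for the leaderboard table: AVG should equal the mean of
# the six task scores. Values are copied from the corrected table above.
scores = {
    #                   arc_c, hellaswag, tqa_mc2, mmlu, winogrande, gsm8k
    "Bielik-11B-v2":   [60.58, 79.84, 46.13, 63.06, 77.82, 67.78],
    "Mistral-7B-v0.2": [60.84, 83.08, 41.76, 63.62, 78.22, 34.72],
    "Bielik-7B-v0.1":  [45.22, 67.92, 47.16, 43.20, 66.85, 29.49],
}
published_avg = {
    "Bielik-11B-v2": 65.87,
    "Mistral-7B-v0.2": 60.37,
    "Bielik-7B-v0.1": 49.98,
}

for model, tasks in scores.items():
    mean = sum(tasks) / len(tasks)
    # Swapping truthfulqa_mc2 and mmlu does not change this mean, which
    # is why the AVG column could not reveal the transposed columns.
    assert abs(mean - published_avg[model]) < 0.05, (model, mean)
    print(f"{model}: mean {mean:.2f} ~ published {published_avg[model]}")
```

A check like this confirms the rows are internally consistent, but catching a column swap requires comparing per-task values against the source leaderboard itself.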