---
license: apache-2.0
language:
- pl
library_name: transformers
inference:
  parameters:
    temperature: 0.9
---

<p align="center">
  <img src="https://huggingface.co/speakleash/Bielik-7B-v0.1/raw/main/speakleash_cyfronet.png">
</p>

# Bielik-11B-v2

Bielik-11B-v2 is a generative text model featuring 11 billion parameters. It is initialized from the weights of Mistral-7B-v0.2 and trained on 400 billion tokens.
The model stands as a testament to the unique collaboration between the open-science/open-source project SpeakLeash and the High Performance Computing (HPC) center: ACK Cyfronet AGH.
Developed and trained on Polish text corpora, carefully selected and processed by the SpeakLeash team, this endeavor leverages Polish large-scale computing infrastructure, specifically within the PLGrid environment,
and more precisely, the HPC center: ACK Cyfronet AGH. The creation and training of Bielik-11B-v2 were supported by computational grant number PLG/2024/016951, carried out on the Helios supercomputer,
enabling the use of cutting-edge technology and computational resources essential for large-scale machine learning. As a result, the model exhibits an exceptional ability to understand and process the Polish language,
providing accurate responses and performing a variety of linguistic tasks with high precision.

## Model

Bielik-11B-v2 has been trained with [Megatron-LM](https://github.com/NVIDIA/Megatron-LM) using various parallelization techniques.

The model training was conducted on the Helios Supercomputer at ACK Cyfronet AGH, utilizing 256 NVIDIA GH200 cards.

The training dataset was composed of Polish texts collected and made available through the [SpeakLeash](https://speakleash.org/) project, as well as a subset of [SlimPajama](https://huggingface.co/datasets/cerebras/SlimPajama-627B). We used 200 billion tokens for two epochs of training.

### Model description:

* **Developed by:** [SpeakLeash](https://speakleash.org/) & [ACK Cyfronet AGH](https://www.cyfronet.pl/)
* **Language:** Polish
* **Model type:** causal decoder-only
* **Initialized from:** [Mistral-7B-v0.2](https://models.mistralcdn.com/mistral-7b-v0-2/mistral-7B-v0.2.tar)
* **License:** Apache 2.0 (commercial use allowed)
* **Model ref:** speakleash:45b6efdb701991181a05968fc53d2a8e

### Quality evaluation

An XGBoost classification model was built to evaluate the quality of texts in native Polish. It is based on 93 features, such as the ratio of out-of-vocabulary words to all words (OOVs), the number of nouns and verbs, and the average sentence length. The model outputs the category of a given document (HIGH, MEDIUM, or LOW) along with a probability. This approach enables a dedicated document-selection pipeline; we used entries classified as HIGH quality with a probability exceeding 90%.

This filtering and careful selection of texts provide a condensed, high-quality corpus of Polish texts for training purposes.
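
The trained classifier and the exact feature extraction are not published in this card; the sketch below only illustrates the shape of such a filtering step, with `extract_features` and `quality_classifier.json` as hypothetical placeholders.

```python
# Illustrative sketch of quality-based document filtering (not the actual pipeline)
import numpy as np
import xgboost as xgb

LABELS = ["LOW", "MEDIUM", "HIGH"]

def extract_features(document: str) -> np.ndarray:
    # Hypothetical placeholder: the real setup computes 93 linguistic features,
    # e.g. OOV ratio, noun/verb counts, average sentence length.
    raise NotImplementedError

clf = xgb.XGBClassifier()
clf.load_model("quality_classifier.json")  # hypothetical file name

def keep_document(document: str, min_prob: float = 0.9) -> bool:
    """Keep only documents classified as HIGH quality with probability >= min_prob."""
    features = extract_features(document).reshape(1, -1)
    probs = clf.predict_proba(features)[0]
    best = int(np.argmax(probs))
    return LABELS[best] == "HIGH" and probs[best] >= min_prob
```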

### Quickstart

This model can be easily loaded using the `AutoModelForCausalLM` class:

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

model_name = "speakleash/Bielik-11B-v2"

# Load the tokenizer and the model weights from the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
```

To reduce memory usage, you can load the model in lower precision (`bfloat16`):

```python
import torch

# Half the memory footprint of full fp32 weights
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)
```
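
If the `accelerate` package is installed, you can additionally let Transformers place the weights across the available GPUs (offloading to CPU if necessary) with `device_map="auto"` (this option is not part of the original examples):

```python
# Requires the `accelerate` package; shards the weights across available devices
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
```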

You can then use the Hugging Face text-generation pipeline to generate text:

```python
import transformers

text = "Najważniejszym celem człowieka na ziemi jest"  # "The most important goal of a human being on earth is"

# Build a text-generation pipeline from the already loaded model and tokenizer
pipeline = transformers.pipeline("text-generation", model=model, tokenizer=tokenizer)
sequences = pipeline(text_inputs=text, max_new_tokens=100, do_sample=True, top_k=50, eos_token_id=tokenizer.eos_token_id)
for seq in sequences:
    print(f"Result: {seq['generated_text']}")
```
Generated output:
> Najważniejszym celem człowieka na ziemi jest życie w pokoju, harmonii i miłości. Dla każdego z nas bardzo ważne jest, aby otaczać się kochanymi osobami.

*(English: "The most important goal of a human being on earth is to live in peace, harmony, and love. For each of us, it is very important to surround ourselves with loved ones.")*
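
Alternatively (not shown in the original examples), you can call `model.generate` directly instead of the pipeline helper, with the same sampling settings:

```python
# Direct generation without the pipeline helper; mirrors the sampling settings above
inputs = tokenizer(text, return_tensors="pt").to(model.device)
output_ids = model.generate(
    **inputs,
    max_new_tokens=100,
    do_sample=True,
    top_k=50,
    eos_token_id=tokenizer.eos_token_id,
)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```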


## Evaluation

Models have been evaluated on the [Open PL LLM Leaderboard](https://huggingface.co/spaces/speakleash/open_pl_llm_leaderboard) in a 5-shot setting. The benchmark evaluates models on NLP tasks such as sentiment analysis, categorization, and text classification, but does not test conversational (chat) skills. The Average column is the mean score across all tasks, normalized by baseline scores.
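
The exact normalization is defined by the leaderboard itself; as a rough illustration only, assuming raw scores are rescaled against a random-guess baseline, it could look like this:

```python
def normalize(score: float, baseline: float) -> float:
    # Illustrative assumption: map the random-guess baseline to 0 and a
    # perfect score to 100. The actual formula is defined by the
    # Open PL LLM Leaderboard, not this card.
    return (score - baseline) / (100.0 - baseline) * 100.0

# Example: raw accuracy of 70% on a task whose random baseline is 25%
print(normalize(70.0, 25.0))  # 60.0
```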

### Open PL LLM Leaderboard

| Model                  | Parameters (B) | Average   |
|------------------------|----------------|-----------|
| Qwen2-72B              | 72             | 65.76     |
| Meta-Llama-3-70B       | 70             | 60.87     |
| Meta-Llama-3.1-70B     | 70             | 60.39     |
| Mixtral-8x22B-v0.1     | 141            | 59.95     |
| Qwen1.5-72B            | 72             | 59.94     |
| Qwen1.5-32B            | 32             | 57.34     |
| **Bielik-11B-v2**      | **11**         | **56.61** |
| Qwen2-7B               | 7              | 48.75     |
| Mistral-Nemo-Base-2407 | 12             | 46.15     |
| SOLAR-10.7B-v1.0       | 10.7           | 46.04     |
| internlm2-20b          | 20             | 45.98     |
| Meta-Llama-3.1-8B      | 8              | 42.79     |
| Meta-Llama-3-8B        | 8              | 42.40     |
| Mistral-7B-v0.2        | 7              | 37.20     |
| Bielik-7B-v0.1         | 7              | 33.78     |
| Qra-13b                | 13             | 33.71     |
| Qra-7b                 | 7              | 16.09     |

The results from the Open PL LLM Leaderboard show that Bielik-11B-v2, with 11 billion parameters, achieved an average score of 56.61. This makes it the best-performing model among those under 20B parameters, outperforming the second-best model in this category by nearly 8 percentage points. This lead places it well ahead of its predecessor, Bielik-7B-v0.1 (which scored 33.78), and ahead of several larger models, highlighting the advancements and optimizations made in this newer version.

Other Polish models listed include Qra-13b and Qra-7b, scoring 33.71 and 16.09 respectively; Bielik-11B-v2 outperforms both by a considerable margin.

Additionally, Bielik-11B-v2 was initialized from the weights of Mistral-7B-v0.2, which itself scored 37.20, further demonstrating the effectiveness of the enhancements incorporated into Bielik-11B-v2.


### Open LLM Leaderboard

| Model             | AVG       | arc_challenge | hellaswag | truthfulqa_mc2 | mmlu  | winogrande | gsm8k |
|-------------------|-----------|---------------|-----------|----------------|-------|------------|-------|
| **Bielik-11B-v2** | **65.87** | 60.58         | 79.84     | 46.13          | 63.06 | 77.82      | 67.78 |
| Mistral-7B-v0.2   | 60.37     | 60.84         | 83.08     | 63.62          | 41.76 | 78.22      | 34.72 |
| Bielik-7B-v0.1    | 49.98     | 45.22         | 67.92     | 47.16          | 43.20 | 66.85      | 29.49 |

The results from the Open LLM Leaderboard demonstrate the strong performance of Bielik-11B-v2 across various NLP tasks. With an average score of 65.87, it significantly outperforms its predecessor, Bielik-7B-v0.1, and even surpasses Mistral-7B-v0.2, which provided its initial weights.

Key observations:
1. Bielik-11B-v2 shows substantial improvements in most categories compared to Bielik-7B-v0.1, highlighting the effectiveness of the model's enhancements.
2. It performs exceptionally well on hellaswag and winogrande (commonsense reasoning) and gsm8k (mathematical problem solving), indicating its versatility across different types of language understanding and generation tasks.
3. The model shows particular strength in MMLU (massive multitask language understanding), scoring 63.06 compared to Mistral-7B-v0.2's 41.76, demonstrating its broad knowledge base.
4. While Mistral-7B-v0.2 scores higher on truthfulqa_mc2, Bielik-11B-v2 maintains competitive performance on this truth-discernment task.

Although Bielik-11B-v2 was primarily trained on Polish data, it has retained and even improved its ability to understand and operate in English, as evidenced by its strong performance across these English-language benchmarks. This suggests that the model has effectively leveraged cross-lingual transfer learning, maintaining its Polish-language expertise while enhancing its English-language capabilities.

## Limitations and Biases

Bielik-11B-v2 is not intended for deployment without fine-tuning. It should not be used for human-facing interactions without further guardrails and user consent.

Bielik-11B-v2 can produce factually incorrect output and should not be relied on to produce factually accurate information. Bielik-11B-v2 was trained on various public datasets. While great effort has been taken to clean the training data, it is possible that this model could generate lewd, false, biased, or otherwise offensive outputs.

## License

The model is licensed under Apache 2.0, which allows for commercial use.

## Citation
Please cite this model using the following format:

```
@misc{Bielik11Bv2,
    title   = {Bielik-11B-v2 model card},
    author  = {Ociepa, Krzysztof and Flis, Łukasz and Wróbel, Krzysztof and Gwoździej, Adrian and {SpeakLeash Team} and {Cyfronet Team}},
    year    = {2024},
    url     = {https://huggingface.co/speakleash/Bielik-11B-v2},
    note    = {Accessed: 2024-08-28},
    urldate = {2024-08-28}
}
```

## Responsible for training the model

* [Krzysztof Ociepa](https://www.linkedin.com/in/krzysztof-ociepa-44886550/)<sup>SpeakLeash</sup> - team leadership, conceptualizing, data preparation, process optimization and oversight of training
* [Łukasz Flis](https://www.linkedin.com/in/lukasz-flis-0a39631/)<sup>Cyfronet AGH</sup> - coordinating and supervising the training
* [Adrian Gwoździej](https://www.linkedin.com/in/adrgwo/)<sup>SpeakLeash</sup> - data cleaning and quality
* [Krzysztof Wróbel](https://www.linkedin.com/in/wrobelkrzysztof/)<sup>SpeakLeash</sup> - benchmarks

The model could not have been created without the commitment and work of the entire SpeakLeash team, whose contribution is invaluable. Thanks to the hard work of many individuals, it was possible to gather a large amount of content in Polish and establish collaboration between the open-science SpeakLeash project and the HPC center: ACK Cyfronet AGH. Individuals who contributed to the creation of the model:
[Grzegorz Urbanowicz](https://www.linkedin.com/in/grzegorz-urbanowicz-05823469/),
[Igor Ciuciura](https://www.linkedin.com/in/igor-ciuciura-1763b52a6/),
[Jacek Chwiła](https://www.linkedin.com/in/jacek-chwila/),
[Szymon Baczyński](https://www.linkedin.com/in/szymon-baczynski/),
[Paweł Kiszczak](https://www.linkedin.com/in/paveu-kiszczak/),
[Aleksander Smywiński-Pohl](https://www.linkedin.com/in/apohllo/).

Members of the ACK Cyfronet AGH team providing valuable support and expertise:
[Szymon Mazurek](https://www.linkedin.com/in/sz-mazurek-ai/),
[Marek Magryś](https://www.linkedin.com/in/magrys/).


## Contact Us

If you have any questions or suggestions, please use the discussion tab. If you want to contact us directly, join our [Discord SpeakLeash](https://discord.gg/3G9DVM39).