---
library_name: transformers
tags:
- gemma2
- instruct
- bggpt
- insait
license: gemma
language:
- bg
- en
base_model:
- google/gemma-2-2b-it
- google/gemma-2-2b
pipeline_tag: text-generation
---

[![QuantFactory Banner](https://lh7-rt.googleusercontent.com/docsz/AD_4nXeiuCm7c8lEwEJuRey9kiVZsRn2W-b4pWlu3-X534V3YmVuVc2ZL-NXg2RkzSOOS2JXGHutDuyyNAUtdJI65jGTo8jT9Y99tMi4H4MqL44Uc5QKG77B0d6-JfIkZHFaUA71-RtjyYZWVIhqsNZcx8-OMaA?key=xt3VSDoCbmTY7o-cwwOFwQ)](https://hf.co/QuantFactory)

# QuantFactory/BgGPT-Gemma-2-2.6B-IT-v1.0-GGUF
This is a quantized version of [INSAIT-Institute/BgGPT-Gemma-2-2.6B-IT-v1.0](https://huggingface.co/INSAIT-Institute/BgGPT-Gemma-2-2.6B-IT-v1.0) created using llama.cpp.

# Original Model Card

# INSAIT-Institute/BgGPT-Gemma-2-2.6B-IT-v1.0

![image/png](https://cdn-uploads.huggingface.co/production/uploads/637e1f8cf7e01589cc17bf7e/p6d0YFHjWCQ3S12jWqO1m.png)

INSAIT introduces **BgGPT-Gemma-2-2.6B-IT-v1.0**, a state-of-the-art Bulgarian language model based on **google/gemma-2-2b** and **google/gemma-2-2b-it**.
BgGPT-Gemma-2-2.6B-IT-v1.0 is **free to use** and distributed under the [Gemma Terms of Use](https://ai.google.dev/gemma/terms).
This model was created by [`INSAIT`](https://insait.ai/), part of Sofia University St. Kliment Ohridski, in Sofia, Bulgaria.

# Model description

The model was built on top of Google's Gemma 2 2B open models.
It was continuously pre-trained on around 100 billion tokens (85 billion in Bulgarian) using the Branch-and-Merge strategy INSAIT presented at [EMNLP'24](https://aclanthology.org/2024.findings-emnlp.1000/), allowing the model to gain outstanding Bulgarian cultural and linguistic capabilities while retaining its English performance.
During the pre-training stage, we used various datasets, including Bulgarian web-crawl data, freely available datasets such as Wikipedia, a range of specialized Bulgarian datasets sourced by the INSAIT Institute, and machine translations of popular English datasets.
The model was then instruction-fine-tuned on a newly constructed Bulgarian instruction dataset created using real-world conversations.
For more information, check our [blog post](https://models.bggpt.ai/blog/).

# Benchmarks and Results

![image/png](https://cdn-uploads.huggingface.co/production/uploads/65fefdc282708115868203aa/9pp8aD1yvoW-cJWzhbHXk.png)

![image/png](https://cdn-uploads.huggingface.co/production/uploads/65fefdc282708115868203aa/33CjjtmCeAcw5qq8DEtJj.png)

We evaluate our models on a set of standard English benchmarks, a translated version of them in Bulgarian, as well as Bulgarian-specific benchmarks we collected:

- **Winogrande challenge**: testing world knowledge and understanding
- **Hellaswag**: testing sentence completion
- **ARC Easy/Challenge**: testing logical reasoning
- **TriviaQA**: testing trivia knowledge
- **GSM-8k**: solving grade-school mathematics word problems
- **Exams**: solving high school problems from natural and social sciences
- **MON**: contains exams across various subjects for grades 4 to 12

These benchmarks test logical reasoning, mathematics, knowledge, language understanding and other skills of the models, and are provided at https://github.com/insait-institute/lm-evaluation-harness-bg.
The graphs above show the performance of BgGPT 2.6B compared to other small open language models such as Microsoft's Phi 3.5 and Alibaba's Qwen 2.5 3B.
The BgGPT model not only surpasses them, but also **retains English performance** inherited from the original Google Gemma 2 models upon which it is based.

# Use in 🤗 Transformers

First install the latest version of the transformers library:
```
pip install -U 'transformers[torch]'
```

Then load the model in transformers:
```python
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "INSAIT-Institute/BgGPT-Gemma-2-2.6B-IT-v1.0",
    torch_dtype=torch.bfloat16,
    attn_implementation="eager",
    device_map="auto",
)
```

# Recommended Parameters

For optimal performance, we recommend the following parameters for text generation, as we have extensively tested our model with them:

```python
from transformers import GenerationConfig

generation_params = GenerationConfig(
    max_new_tokens=2048,  # Choose maximum generation tokens
    temperature=0.1,
    top_k=25,
    top_p=1,
    repetition_penalty=1.1,
    eos_token_id=[1, 107]
)
```

In principle, increasing the temperature should work adequately as well.
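As a rough illustration of what temperature and top-k do to the next-token distribution (this helper is purely pedagogical and not part of the model's API), here is a minimal sketch using only the standard library:

```python
import math

def top_k_probs(logits, k=25, temperature=0.1):
    """Keep the k largest logits, scale by temperature, then softmax-normalize."""
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    scaled = {i: logits[i] / temperature for i in top}
    m = max(scaled.values())  # subtract the max for numerical stability
    exp = {i: math.exp(v - m) for i, v in scaled.items()}
    z = sum(exp.values())
    return {i: e / z for i, e in exp.items()}

# With temperature 0.1 the largest logit dominates almost completely,
# which is why low-temperature sampling behaves nearly greedily.
probs = top_k_probs([2.0, 1.0, 0.5, -1.0], k=2)
```

Raising the temperature flattens this distribution and makes sampling more diverse.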

# Instruction format

In order to leverage instruction fine-tuning, your prompt should begin with a beginning-of-sequence token `<bos>` and be formatted in the Gemma 2 chat template. `<bos>` should only be the first token in a chat sequence.

E.g.
```
<bos><start_of_turn>user
Кога е основан Софийският университет?<end_of_turn>
<start_of_turn>model
```
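For illustration only, the template above can be reproduced with a small helper (the `format_gemma_chat` name is hypothetical; the tokenizer's built-in chat template is the authoritative implementation):

```python
def format_gemma_chat(messages, add_generation_prompt=True):
    """Render a list of {"role", "content"} dicts in the Gemma 2 chat format.

    Illustrative sketch only; in practice use tokenizer.apply_chat_template().
    """
    prompt = "<bos>"
    for m in messages:
        prompt += f"<start_of_turn>{m['role']}\n{m['content']}<end_of_turn>\n"
    if add_generation_prompt:
        # Open the model turn so generation continues as the assistant.
        prompt += "<start_of_turn>model\n"
    return prompt

prompt = format_gemma_chat(
    [{"role": "user", "content": "Кога е основан Софийският университет?"}]
)
```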

This format is also available as a [chat template](https://huggingface.co/docs/transformers/main/chat_templating) via the `apply_chat_template()` method:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "INSAIT-Institute/BgGPT-Gemma-2-2.6B-IT-v1.0",
    use_default_system_prompt=False,
)

messages = [
    {"role": "user", "content": "Кога е основан Софийският университет?"},
]

input_ids = tokenizer.apply_chat_template(
    messages,
    return_tensors="pt",
    add_generation_prompt=True,
    return_dict=True
)

outputs = model.generate(
    **input_ids,
    generation_config=generation_params
)
print(tokenizer.decode(outputs[0]))
```

**Important Note:** Models based on Gemma 2, such as BgGPT-Gemma-2-2.6B-IT-v1.0, do not support flash attention. Using it results in degraded performance.

# Use with GGML / llama.cpp

The model and instructions for usage in GGUF format are available at [INSAIT-Institute/BgGPT-Gemma-2-2.6B-IT-v1.0-GGUF](https://huggingface.co/INSAIT-Institute/BgGPT-Gemma-2-2.6B-IT-v1.0-GGUF).
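For quick local testing, a llama.cpp invocation along these lines should work with the recommended sampling settings; the GGUF filename below is an assumption, so substitute the quantization you actually downloaded:

```shell
# Filename is hypothetical: replace with the GGUF file you downloaded
# from the GGUF repository. -e makes llama-cli interpret \n escapes.
./llama-cli \
  -m BgGPT-Gemma-2-2.6B-IT-v1.0.Q4_K_M.gguf \
  --temp 0.1 --top-k 25 --top-p 1 --repeat-penalty 1.1 \
  -e -p "<start_of_turn>user\nКога е основан Софийският университет?<end_of_turn>\n<start_of_turn>model\n"
```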

# Community Feedback

We welcome feedback from the community to help improve BgGPT. If you have suggestions, encounter any issues, or have ideas for improvements, please:
- Share your experience using the model through Hugging Face's community discussion feature, or
- Contact us at [[email protected]](mailto:[email protected])

Your real-world usage and insights are valuable in helping us optimize the model's performance and behaviour for various use cases.

# Summary
- **Finetuned from:** [google/gemma-2-2b-it](https://huggingface.co/google/gemma-2-2b-it); [google/gemma-2-2b](https://huggingface.co/google/gemma-2-2b)
- **Model type:** Causal decoder-only transformer language model
- **Language:** Bulgarian and English
- **Contact:** [[email protected]](mailto:[email protected])
- **License:** BgGPT is distributed under the [Gemma Terms of Use](https://huggingface.co/INSAIT-Institute/BgGPT-Gemma-2-2.6B-IT-v1.0/raw/main/LICENSE)