---
language:
- de
tags:
- german
- causal-lm
- text-generation
library_name: transformers
pipeline_tag: text-generation
license: apache-2.0
---

[![QuantFactory Banner](https://lh7-rt.googleusercontent.com/docsz/AD_4nXeiuCm7c8lEwEJuRey9kiVZsRn2W-b4pWlu3-X534V3YmVuVc2ZL-NXg2RkzSOOS2JXGHutDuyyNAUtdJI65jGTo8jT9Y99tMi4H4MqL44Uc5QKG77B0d6-JfIkZHFaUA71-RtjyYZWVIhqsNZcx8-OMaA?key=xt3VSDoCbmTY7o-cwwOFwQ)](https://hf.co/QuantFactory)

# QuantFactory/bueble-lm-2b-GGUF
This is a quantized version of [flair/bueble-lm-2b](https://huggingface.co/flair/bueble-lm-2b), created using llama.cpp.
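
One way to run a GGUF file from this repo locally is via the `llama-cpp-python` bindings (`pip install llama-cpp-python`). A minimal sketch; the quant filename below is a placeholder, so pick an actual file from this repo's file list:

```python
from llama_cpp import Llama

# "bueble-lm-2b.Q4_K_M.gguf" is a hypothetical filename; use a real one from the repo.
llm = Llama(model_path="bueble-lm-2b.Q4_K_M.gguf", n_ctx=2048)

# Base model: plain text completion, no chat template.
out = llm("Berlin ist eine Stadt, die", max_tokens=64)
print(out["choices"][0]["text"])
```
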
# Original Model Card

# BübleLM

<div align="center" style="margin-bottom: 2rem; margin-top: 2rem">
<img src="https://pieter.ai/resources/buble-logo.png" alt="BübleLM Logo" style="max-height: 450px; width: auto;"/>
<h1 style="margin-top: 1rem;">BübleLM</h1>
<p><em>A small German LM</em></p>
</div>

BübleLM is a German language model based on Gemma-2-2B, adapted using [trans-tokenization](https://pieter.ai/trans-tokenization/) with a custom German SentencePiece tokenizer. The model demonstrates how language-specific tokenization can significantly improve performance while preserving the base model's capabilities.

## Model Details

- **Architecture**: Based on the Gemma-2-2B decoder-only architecture
- **Parameters**: 2 billion
- **Tokenizer**: Custom German SentencePiece tokenizer (20k vocabulary; see the sketch after this list)
  - Fertility rate: 1.78 tokens per word
  - Optimized for German morphological structures
  - Trained on the same corpus as the model
- **Context Length**: 8192 tokens
- **Training Hardware**: Single node with 4x NVIDIA A100-SXM4-80GB GPUs
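
The fertility figure above (tokens per word) can be reproduced in spirit with a few lines. A minimal sketch, assuming `transformers` and `sentencepiece` are installed; the sample sentence is an arbitrary stand-in, so the printed value will only approximate the corpus-level 1.78:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("flair/bueble-lm-2b")

# Fertility = subword tokens per whitespace-separated word, here on one sample sentence.
text = "Die Bundesregierung hat heute neue Maßnahmen zur Digitalisierung beschlossen."
tokens = tokenizer.tokenize(text)
print(len(tokens) / len(text.split()))
```
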

## Training Data

Trained on 3.5B tokens from the Occiglot-FineWeb project, including:
- Contemporary web content (OSCAR 2015-2023)
- Legislative documents (EurLex, ParlamInt)
- News data (Tagesschau)
- Wiki sources

Data sampling weights (a toy illustration follows this list):
- Wikipedia: 4x
- News/Parliamentary: 2x
- Other sources: 1x
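
One way to read these weights is as relative upsampling factors when mixing sources. A toy sketch of that reading, ignoring the differing raw sizes of the corpora (which the real mixture would account for); this is not the actual training pipeline:

```python
import random

# Relative upsampling factors from the list above.
weights = {"wikipedia": 4, "news_parliamentary": 2, "other": 1}

# Under this reading, each source's sampling probability is its weight over the total.
total = sum(weights.values())
print({src: round(w / total, 2) for src, w in weights.items()})
# {'wikipedia': 0.57, 'news_parliamentary': 0.29, 'other': 0.14}

# Drawing a source per document with these weights reproduces the mix.
source = random.choices(list(weights), weights=list(weights.values()))[0]
```
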

## Performance

Key improvements over the Gemma-2-2B baseline:
- HellaSwag-DE: +71% (47.9% vs 28.0%)
- ARC-DE: +41% (32.3% vs 22.9%)
- Average zero-shot: +40% (35.8% vs 25.5%)

→ BübleLM-2B consistently outperforms both the base Gemma-2-2B and other German models like LLäMmlein-1B across most tasks.

<table class="model-comparison">
  <thead>
    <tr>
      <th align="left">Model</th>
      <th align="center" colspan="2">ARC-DE</th>
      <th align="center" colspan="2">HellaSwag-DE</th>
      <th align="center">TruthfulQA-DE</th>
      <th align="center">Average</th>
    </tr>
    <tr>
      <th></th>
      <th align="center">0-shot</th>
      <th align="center">3-shot</th>
      <th align="center">0-shot</th>
      <th align="center">3-shot</th>
      <th align="center">0-shot</th>
      <th align="center">0-shot</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td><a href="https://huggingface.co/google/gemma-2-2b" target="_blank">Gemma-2-2B</a></td>
      <td align="center">22.9</td>
      <td align="center">23.1</td>
      <td align="center">28.0</td>
      <td align="center">27.6</td>
      <td align="center">25.5</td>
      <td align="center">25.5</td>
    </tr>
    <tr>
      <td><a href="https://huggingface.co/LSX-UniWue/LLaMmlein_120M" target="_blank">LLäMmlein-120M</a></td>
      <td align="center">24.7 ↑+8%</td>
      <td align="center">-</td>
      <td align="center">32.0 ↑+14%</td>
      <td align="center">-</td>
      <td align="center">25.0 ↓-2%</td>
      <td align="center">27.2 ↑+7%</td>
    </tr>
    <tr>
      <td><a href="https://huggingface.co/LSX-UniWue/LLaMmlein_1B" target="_blank">LLäMmlein-1B</a></td>
      <td align="center">30.0 ↑+31%</td>
      <td align="center">-</td>
      <td align="center"><strong>48.5</strong> ↑+73%</td>
      <td align="center">-</td>
      <td align="center">23.4 ↓-8%</td>
      <td align="center">34.0 ↑+33%</td>
    </tr>
    <tr>
      <td><a href="https://huggingface.co/VAGOsolutions/SauerkrautLM-Gemma-2b" target="_blank">Sauerkraut-Gemma-2B</a></td>
      <td align="center">28.0 ↑+22%</td>
      <td align="center">34.6 ↑+50%</td>
      <td align="center">37.2 ↑+33%</td>
      <td align="center">44.1 ↑+60%</td>
      <td align="center"><strong>32.9</strong> ↑+29%</td>
      <td align="center">32.7 ↑+28%</td>
    </tr>
    <tr>
      <td><strong>BübleLM (Ours)</strong></td>
      <td align="center"><strong>32.3</strong> ↑+41%</td>
      <td align="center"><strong>35.2</strong> ↑+52%</td>
      <td align="center">47.9 ↑+71%</td>
      <td align="center"><strong>46.6</strong> ↑+69%</td>
      <td align="center">27.2 ↑+7%</td>
      <td align="center"><strong>35.8</strong> ↑+40%</td>
    </tr>
  </tbody>
</table>

*Performance evaluated on German versions of ARC (knowledge-based QA), HellaSwag (commonsense reasoning), and TruthfulQA (truthfulness). Values show accuracy in percent, with arrows indicating relative improvement over the Gemma-2-2B baseline. Best results shown in bold.*
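
Note that the arrows are relative, not absolute, changes: for the 0-shot average, (35.8 − 25.5) / 25.5 ≈ +40%. The same arithmetic in code:

```python
# Relative improvement over the Gemma-2-2B baseline, as marked by the arrows.
bueble, baseline = 35.8, 25.5
print(f"{(bueble - baseline) / baseline:+.0%}")  # +40%
```
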

## Safety & Ethics

### Toxicity
- Perplexity: 52.97 on the German TextDetox dataset
- Toxic content appears more out-of-distribution compared to the baseline

### Gender Bias
- Evaluated using perplexity differences between traditional and gender-inclusive forms
- Slight preference for gender-inclusive language (not statistically significant)
- Example: "Lehrer" vs. "Lehrer*innen" (∆PPL = -9.61); a measurement sketch follows this list
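
A minimal sketch of how such a ∆PPL comparison can be run with `transformers`; the sentence pair below is a hypothetical illustration, not the card's evaluation data:

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("flair/bueble-lm-2b")
model = AutoModelForCausalLM.from_pretrained("flair/bueble-lm-2b")
model.eval()

def perplexity(text: str) -> float:
    # Passing input_ids as labels yields the causal-LM loss; PPL = exp(loss).
    enc = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        loss = model(**enc, labels=enc["input_ids"]).loss
    return torch.exp(loss).item()

traditional = perplexity("Die Lehrer diskutieren im Lehrerzimmer.")
inclusive = perplexity("Die Lehrer*innen diskutieren im Lehrerzimmer.")
print(inclusive - traditional)  # negative ∆PPL favors the inclusive form
```
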

## Usage

**Note**: This is a base language model, not an instruction-tuned model. It is not optimized for chat or instruction following. For best results, use standard text completion rather than chat templates.

Also make sure you have the sentencepiece tokenizer installed:

```bash
pip install sentencepiece
```

```python
from transformers import pipeline

pipe = pipeline("text-generation", model="flair/bueble-lm-2b")
pipe("Ich bin")
```

Or with the full model API:

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("flair/bueble-lm-2b")
model = AutoModelForCausalLM.from_pretrained(
    "flair/bueble-lm-2b",
    device_map="auto",
    torch_dtype=torch.bfloat16
)

# Basic text completion
text = "Berlin ist eine Stadt, die"
inputs = tokenizer(text, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0]))
```

For instruction-tuning experiments or chat applications, we recommend fine-tuning the model first with appropriate German instruction datasets.
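
As a starting point, here is a minimal supervised fine-tuning sketch using the plain `transformers` Trainer; the dataset file and its `instruction`/`response` schema are placeholders, and the hyperparameters are illustrative only:

```python
import torch
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("flair/bueble-lm-2b")
model = AutoModelForCausalLM.from_pretrained("flair/bueble-lm-2b",
                                             torch_dtype=torch.bfloat16)

# Placeholder file: any German instruction data with these two fields works.
dataset = load_dataset("json", data_files="german_instructions.jsonl", split="train")

def format_and_tokenize(example):
    # Hypothetical prompt format; pick one and keep it consistent at inference.
    text = (f"### Anweisung:\n{example['instruction']}\n\n"
            f"### Antwort:\n{example['response']}")
    return tokenizer(text, truncation=True, max_length=1024)

tokenized = dataset.map(format_and_tokenize, remove_columns=dataset.column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="bueble-lm-2b-sft",
                           per_device_train_batch_size=2,
                           num_train_epochs=1,
                           bf16=True),
    train_dataset=tokenized,
    # mlm=False gives standard causal-LM label shifting.
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```
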

## Limitations

- Limited vocabulary size (20k tokens) compared to multilingual models (250k for Gemma)
- Performance may vary on specialized domains not well-represented in training data
- Higher fertility rate (1.78) due to smaller vocabulary size
- Inherits base limitations from the Gemma architecture

## Citation

```bibtex
@article{delobelle2024buble,
  title={BübleLM: A small German LM},
  author={Delobelle, Pieter and Akbik, Alan and others},
  year={2024}
}
```