---
language:
- de
tags:
- german
- causal-lm
- text-generation
library_name: transformers
pipeline_tag: text-generation
license: apache-2.0
---

[![QuantFactory Banner](https://lh7-rt.googleusercontent.com/docsz/AD_4nXeiuCm7c8lEwEJuRey9kiVZsRn2W-b4pWlu3-X534V3YmVuVc2ZL-NXg2RkzSOOS2JXGHutDuyyNAUtdJI65jGTo8jT9Y99tMi4H4MqL44Uc5QKG77B0d6-JfIkZHFaUA71-RtjyYZWVIhqsNZcx8-OMaA?key=xt3VSDoCbmTY7o-cwwOFwQ)](https://hf.co/QuantFactory)

# QuantFactory/bueble-lm-2b-GGUF

This is a quantized version of [flair/bueble-lm-2b](https://huggingface.co/flair/bueble-lm-2b), created using llama.cpp.
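
To run one of the GGUF files from this repo locally, a minimal sketch using `llama-cpp-python` is shown below; the quantization filename is a placeholder, so substitute whichever `.gguf` file you actually downloaded:

```python
from llama_cpp import Llama  # pip install llama-cpp-python

# The filename below is a placeholder -- point model_path at whichever
# .gguf quantization you downloaded from this repo.
llm = Llama(model_path="bueble-lm-2b.Q4_K_M.gguf", n_ctx=8192)

out = llm("Berlin ist eine Stadt, die", max_tokens=128)
print(out["choices"][0]["text"])
```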

# Original Model Card

# BübleLM

<div align="center" style="margin-bottom: 2rem; margin-top: 2rem">
<img src="https://pieter.ai/resources/buble-logo.png" alt="BübleLM Logo" style="max-height: 450px; width: auto;"/>
<h1 style="margin-top: 1rem;">BübleLM</h1>
<p><em>A small German LM</em></p>
</div>

BübleLM is a German language model based on Gemma-2-2B, adapted using [trans-tokenization](https://pieter.ai/trans-tokenization/) with a custom German SentencePiece tokenizer. The model demonstrates how language-specific tokenization can significantly improve performance while maintaining the base model's capabilities.

## Model Details

- **Architecture**: Based on Gemma-2B decoder-only architecture
- **Parameters**: 2 billion
- **Tokenizer**: Custom German SentencePiece tokenizer (20k vocabulary)
  - Fertility rate: 1.78 tokens per word (see the sketch after this list)
  - Optimized for German morphological structures
  - Trained on the same corpus as the model
- **Context Length**: 8192 tokens
- **Training Hardware**: Single node with 4x NVIDIA A100-SXM4-80GB GPUs
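
Fertility (average tokens per word) can be estimated directly from the tokenizer; a minimal sketch, where the sample sentence is an arbitrary stand-in for a larger German corpus:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("flair/bueble-lm-2b")

# Fertility = tokens produced per whitespace-delimited word. A real
# estimate averages over a large corpus; one sentence is illustrative.
text = "Die Bundesregierung hat heute neue Maßnahmen beschlossen."
n_tokens = len(tokenizer(text, add_special_tokens=False)["input_ids"])
n_words = len(text.split())
print(f"fertility ≈ {n_tokens / n_words:.2f} tokens/word")
```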

## Training Data

Trained on 3.5B tokens from the Occiglot-FineWeb project, including:
- Contemporary web content (OSCAR 2015-2023)
- Legislative documents (EurLex, ParlamInt)
- News data (Tagesschau)
- Wiki sources

Data sampling weights (see the sketch after this list):
- Wikipedia: 4x
- News/Parliamentary: 2x
- Other sources: 1x
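
A minimal sketch of how such multipliers turn into per-source sampling probabilities; the token counts below are hypothetical placeholders, not the actual corpus statistics:

```python
# Upsampling weights from the list above; the per-source token counts
# are made-up placeholders for illustration only.
sources = {
    "wikipedia": {"tokens": 0.3e9, "weight": 4},
    "news_parliamentary": {"tokens": 0.7e9, "weight": 2},
    "other": {"tokens": 2.5e9, "weight": 1},
}

total = sum(s["tokens"] * s["weight"] for s in sources.values())
for name, s in sources.items():
    share = s["tokens"] * s["weight"] / total
    print(f"{name}: sampling probability {share:.1%}")
```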

## Performance

Key improvements over Gemma-2-2B baseline:
- HellaSwag-DE: +71% (47.9% vs 28.0%)
- ARC-DE: +41% (32.3% vs 22.9%)
- Average zero-shot: +40% (35.8% vs 25.5%)

→ BübleLM-2B consistently outperforms both the base Gemma-2-2B and other German models like LLäMmlein-1B across most tasks.

<table class="model-comparison">
<thead>
<tr>
<th align="left">Model</th>
<th align="center" colspan="2">ARC-DE</th>
<th align="center" colspan="2">HellaSwag-DE</th>
<th align="center">TruthfulQA-DE</th>
<th align="center">Average</th>
</tr>
<tr>
<th></th>
<th align="center">0-shot</th>
<th align="center">3-shot</th>
<th align="center">0-shot</th>
<th align="center">3-shot</th>
<th align="center">0-shot</th>
<th align="center">0-shot</th>
</tr>
</thead>
<tbody>
<tr>
<td><a href="https://huggingface.co/google/gemma-2-2b" target="_blank">Gemma-2-2B</a></td>
<td align="center">22.9</td>
<td align="center">23.1</td>
<td align="center">28.0</td>
<td align="center">27.6</td>
<td align="center">25.5</td>
<td align="center">25.5</td>
</tr>
<tr>
<td><a href="https://huggingface.co/LSX-UniWue/LLaMmlein_120M" target="_blank">LLäMmlein-120M</a></td>
<td align="center">24.7 ↑+8%</td>
<td align="center">-</td>
<td align="center">32.0 ↑+14%</td>
<td align="center">-</td>
<td align="center">25.0 ↓-2%</td>
<td align="center">27.2 ↑+7%</td>
</tr>
<tr>
<td><a href="https://huggingface.co/LSX-UniWue/LLaMmlein_1B" target="_blank">LLäMmlein-1B</a></td>
<td align="center">30.0 ↑+31%</td>
<td align="center">-</td>
<td align="center"><strong>48.5</strong> ↑+73%</td>
<td align="center">-</td>
<td align="center">23.4 ↓-8%</td>
<td align="center">34.0 ↑+33%</td>
</tr>
<tr>
<td><a href="https://huggingface.co/VAGOsolutions/SauerkrautLM-Gemma-2b" target="_blank">Sauerkraut-Gemma-2B</a></td>
<td align="center">28.0 ↑+22%</td>
<td align="center">34.6 ↑+50%</td>
<td align="center">37.2 ↑+33%</td>
<td align="center">44.1 ↑+60%</td>
<td align="center"><strong>32.9</strong> ↑+29%</td>
<td align="center">32.7 ↑+28%</td>
</tr>
<tr>
<td><strong>BübleLM (Ours)</strong></td>
<td align="center"><strong>32.3</strong> ↑+41%</td>
<td align="center"><strong>35.2</strong> ↑+52%</td>
<td align="center">47.9 ↑+71%</td>
<td align="center"><strong>46.6</strong> ↑+69%</td>
<td align="center">27.2 ↑+7%</td>
<td align="center"><strong>35.8</strong> ↑+40%</td>
</tr>
</tbody>
</table>

*Performance evaluated on German versions of ARC (knowledge-based QA), HellaSwag (commonsense reasoning), and TruthfulQA (truthfulness). Values show accuracy in percentages, with arrows indicating relative improvement over the Gemma-2-2B baseline. Best results shown in bold.*
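
The arrows report simple relative improvement over the corresponding Gemma-2-2B score; a quick sanity check of the arithmetic:

```python
def relative_improvement(model_acc: float, baseline_acc: float) -> float:
    """Relative improvement in percent over the baseline accuracy."""
    return (model_acc - baseline_acc) / baseline_acc * 100

# HellaSwag-DE 0-shot: BübleLM 47.9 vs Gemma-2-2B 28.0
print(f"{relative_improvement(47.9, 28.0):+.0f}%")  # +71%
# ARC-DE 0-shot: BübleLM 32.3 vs Gemma-2-2B 22.9
print(f"{relative_improvement(32.3, 22.9):+.0f}%")  # +41%
```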

## Safety & Ethics

### Toxicity
- Perplexity: 52.97 on German TextDetox dataset
- Toxic content appears more out-of-distribution compared to baseline

### Gender Bias
- Evaluated using perplexity differences between traditional and gender-inclusive forms (see the sketch after this list)
- Slight preference for gender-inclusive language (not statistically significant)
- Example: "Lehrer" vs "Lehrer*innen" (∆PPL = -9.61)
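
A minimal sketch of such a perplexity comparison; the carrier sentences are hypothetical examples, not the evaluation setup behind the reported ∆PPL:

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("flair/bueble-lm-2b")
model = AutoModelForCausalLM.from_pretrained(
    "flair/bueble-lm-2b", torch_dtype=torch.bfloat16
)

def perplexity(text: str) -> float:
    # Causal-LM perplexity: exp of the mean next-token loss.
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        loss = model(**inputs, labels=inputs["input_ids"]).loss
    return torch.exp(loss).item()

# Hypothetical carrier sentences for the two forms.
ppl_trad = perplexity("Die Lehrer der Schule treffen sich heute.")
ppl_incl = perplexity("Die Lehrer*innen der Schule treffen sich heute.")
print(f"∆PPL = {ppl_incl - ppl_trad:.2f}")
```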

## Usage

**Note**: This is a base language model, not an instruction-tuned model. It is not optimized for chat or instruction following. For best results, use standard text completion rather than chat templates.

Also make sure you have the SentencePiece library installed:

```bash
pip install sentencepiece
```

```python
from transformers import pipeline

pipe = pipeline("text-generation", model="flair/bueble-lm-2b")
pipe("Ich bin")
```

Or with the full model API:

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("flair/bueble-lm-2b")
model = AutoModelForCausalLM.from_pretrained(
    "flair/bueble-lm-2b",
    device_map="auto",
    torch_dtype=torch.bfloat16,
)

# Basic text completion
text = "Berlin ist eine Stadt, die"
inputs = tokenizer(text, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0]))
```

For instruction-tuning experiments or chat applications, we recommend fine-tuning the model first with appropriate German instruction datasets; a minimal sketch follows.
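
A bare-bones fine-tuning sketch with the plain `transformers` Trainer; the dataset name and prompt format are hypothetical placeholders, not recommendations from the authors:

```python
import torch
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("flair/bueble-lm-2b")
model = AutoModelForCausalLM.from_pretrained(
    "flair/bueble-lm-2b", torch_dtype=torch.bfloat16
)

# Hypothetical dataset with "instruction" and "response" columns.
dataset = load_dataset("my-org/german-instructions", split="train")

def tokenize(example):
    # Naive prompt format; adapt the template to your data.
    text = f"{example['instruction']}\n{example['response']}"
    return tokenizer(text, truncation=True, max_length=1024)

tokenized = dataset.map(tokenize, remove_columns=dataset.column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="bueble-lm-2b-sft",
        per_device_train_batch_size=1,
        gradient_accumulation_steps=8,
        num_train_epochs=1,
        learning_rate=2e-5,
        bf16=True,
    ),
    train_dataset=tokenized,
    # mlm=False gives standard causal-LM (next-token) labels.
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```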

## Limitations

- Limited vocabulary size (20k tokens) compared to multilingual models (250k for Gemma)
- Performance may vary on specialized domains not well-represented in training data
- Higher fertility rate (1.78) due to smaller vocabulary size
- Inherits base limitations from Gemma architecture

## Citation

```bibtex
@article{delobelle2024buble,
  title={BübleLM: A small German LM},
  author={Delobelle, Pieter and Akbik, Alan and others},
  year={2024}
}
```