pdelobelle
committed on
Create README.md
README.md
ADDED
---
language:
- de
tags:
- german
- causal-lm
- text-generation
library_name: transformers
pipeline_tag: text-generation
license: apache-2.0
---

# BübleLM

<div align="center">
<img src="/api/placeholder/400/200" alt="BübleLM Logo" />
</div>

BübleLM is a German language model based on Gemma-2B, adapted using trans-tokenization with a German-specific SentencePiece tokenizer. This 2B-parameter model achieves state-of-the-art performance on German language tasks while maintaining strong safety properties.

## Model Details

- **Architecture**: Based on Gemma-2B
- **Parameters**: 2 billion
- **Training**: Trans-tokenization from Gemma-2B using a German SentencePiece tokenizer (vocab size: 20k); see the tokenizer sketch below
- **Context Length**: Same as Gemma-2B
- **Input**: Text (German)
- **Output**: Text (German)
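
The following is a minimal sketch of how the German tokenizer can be inspected; the repository id `flair/bueble-lm-2b` is taken from the Usage section below, and the example sentence is purely illustrative.

```python
from transformers import AutoTokenizer

# Load the German SentencePiece tokenizer that replaces Gemma-2B's original vocabulary.
tokenizer = AutoTokenizer.from_pretrained("flair/bueble-lm-2b")

# The trans-tokenized vocabulary is German-specific and comparatively small (roughly 20k entries).
print(len(tokenizer))

# Segment an illustrative German sentence ("Berlin is the capital of Germany.").
print(tokenizer.tokenize("Berlin ist die Hauptstadt von Deutschland."))
```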

## Training Data

Trained on 3.5B tokens from the Occiglot-FineWeb project, including:
- Contemporary web content (OSCAR 2015-2023)
- Legislative documents (EurLex, ParlamInt)
- News data (Tagesschau)
- Wiki sources

Data sampling weights (a sketch follows the list):
- Wikipedia: 4x
- News/Parliamentary: 2x
- Other sources: 1x
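
As an illustration of these weights, upsampling can be expressed as per-source repetition factors; the source names and data-loading code below are placeholders, not the actual training pipeline.

```python
import random

# Hypothetical per-source sampling weights mirroring the list above.
SOURCE_WEIGHTS = {
    "wikipedia": 4,      # Wiki sources: 4x
    "news": 2,           # Tagesschau: 2x
    "parliamentary": 2,  # EurLex / ParlamInt: 2x
    "web": 1,            # OSCAR web crawl: 1x
}

def upsample(documents_by_source: dict[str, list[str]]) -> list[str]:
    """Repeat each source's documents according to its weight and shuffle the result."""
    corpus: list[str] = []
    for source, docs in documents_by_source.items():
        corpus.extend(docs * SOURCE_WEIGHTS.get(source, 1))
    random.shuffle(corpus)
    return corpus
```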

## Performance

[INSERT FIGURE: Performance comparison across models]

Key improvements over the Gemma-2B baseline:
- HellaSwag-DE: +71% (47.9% vs 28.0%)
- ARC-DE: +41% (32.3% vs 22.9%)
- Average zero-shot: +40% (35.8% vs 25.5%)

## Safety & Ethics

### Toxicity
- Score: 52.97 on the German TextDetox dataset
- Toxic content appears more out-of-distribution compared to the baseline

### Gender Bias
- Evaluated using perplexity differences between traditional and gender-inclusive forms, as sketched below
- Slight preference for gender-inclusive language (not statistically significant)
- Example: "Lehrer" vs "Lehrer*innen" (∆PPL = -9.61)
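
A minimal sketch of this kind of perplexity comparison is given below; the sentence pair, the scoring helper, and the sign convention are illustrative, not the evaluation code behind the reported numbers.

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("flair/bueble-lm-2b")
model = AutoModelForCausalLM.from_pretrained("flair/bueble-lm-2b", torch_dtype=torch.bfloat16)
model.eval()

def perplexity(text: str) -> float:
    """Perplexity of a single sentence under the model."""
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        loss = model(**inputs, labels=inputs["input_ids"]).loss
    return torch.exp(loss).item()

# Illustrative pair ("The teachers discuss the lesson."): gender-inclusive vs. traditional form.
delta_ppl = perplexity("Die Lehrer*innen besprechen den Unterricht.") - perplexity("Die Lehrer besprechen den Unterricht.")
print(delta_ppl)
```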

## Usage

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("flair/bueble-lm-2b")
model = AutoModelForCausalLM.from_pretrained(
    "flair/bueble-lm-2b",
    device_map="auto",
    torch_dtype=torch.bfloat16
)

# Build the prompt with the chat template and move it to the model's device.
messages = [{"role": "user", "content": "Schreibe ein Gedicht über Berlin."}]  # "Write a poem about Berlin."
input_ids = tokenizer.apply_chat_template(messages, return_tensors="pt").to(model.device)

outputs = model.generate(input_ids, max_new_tokens=256)
print(tokenizer.decode(outputs[0]))
```
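
Alternatively, generation can go through the `transformers` text-generation pipeline. This is a generic sketch rather than an officially documented invocation, and the sampling parameters are illustrative.

```python
import torch
from transformers import pipeline

generator = pipeline(
    "text-generation",
    model="flair/bueble-lm-2b",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# Sampling parameters are illustrative defaults, not tuned recommendations.
result = generator(
    "Berlin ist bekannt für",  # "Berlin is known for"
    max_new_tokens=128,
    do_sample=True,
    temperature=0.7,
)
print(result[0]["generated_text"])
```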

## Limitations

- Limited vocabulary size (20k tokens) compared to multilingual models
- Performance may vary on specialized domains not well-represented in training data
- Model inherits base limitations from Gemma architecture

## Citation

```bibtex

```