pdelobelle committed
Commit c3e5846 · verified · 1 Parent(s): 8e5e123

Create README.md

Files changed (1): README.md (+92, -0)

README.md (added):
---
language:
- de
tags:
- german
- causal-lm
- text-generation
library_name: transformers
pipeline_tag: text-generation
license: apache-2.0
---

# BübleLM

<div align="center">
  <img src="/api/placeholder/400/200" alt="BübleLM Logo" />
</div>

BübleLM is a German language model based on Gemma-2B, adapted using trans-tokenization with a German-specific SentencePiece tokenizer. This 2B parameter model achieves state-of-the-art performance on German language tasks while maintaining strong safety properties.

## Model Details

- **Architecture**: Based on Gemma-2B
- **Parameters**: 2 billion
- **Training**: Trans-tokenized from Gemma-2B using a German-specific SentencePiece tokenizer (vocab size: 20k); see the tokenizer sketch after this list
- **Context Length**: Same as Gemma-2B
- **Input**: Text (German)
- **Output**: Text (German)

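To make the tokenizer setup concrete, the snippet below shows how a 20k-vocabulary German SentencePiece tokenizer could be trained with the `sentencepiece` library. This is a minimal sketch: the corpus file, model prefix, and option values are assumptions, not the exact configuration used for BübleLM.

```python
import sentencepiece as spm

# Train a 20k-vocabulary unigram SentencePiece model on a German text corpus.
# "german_corpus.txt" (one document per line) and the option values below are
# illustrative assumptions, not the actual BübleLM settings.
spm.SentencePieceTrainer.train(
    input="german_corpus.txt",
    model_prefix="bueble_de",   # writes bueble_de.model and bueble_de.vocab
    vocab_size=20_000,
    model_type="unigram",
    character_coverage=0.9995,
)

# Load the trained model and tokenize a sample sentence into subword pieces.
sp = spm.SentencePieceProcessor(model_file="bueble_de.model")
print(sp.encode("Die Würde des Menschen ist unantastbar.", out_type=str))
```

In the trans-tokenization setup, the embeddings for this new German vocabulary are then initialized from the original Gemma-2B embeddings rather than trained from scratch.
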
## Training Data

Trained on 3.5B tokens from the Occiglot-FineWeb project, including:
- Contemporary web content (OSCAR 2015-2023)
- Legislative documents (EurLex, ParlamInt)
- News data (Tagesschau)
- Wiki sources

Data sampling weights (one way to apply them is sketched below):
- Wikipedia: 4x
- News/Parliamentary: 2x
- Other sources: 1x

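As an illustration of how such oversampling weights can be applied when mixing corpora, the sketch below interleaves three sources with the Hugging Face `datasets` library. The file names are placeholders, and converting the 4x/2x/1x weights directly into sampling probabilities is a simplifying assumption, not the actual preprocessing pipeline.

```python
from datasets import interleave_datasets, load_dataset

# Placeholder JSONL files standing in for the Wikipedia, news/parliamentary,
# and remaining web portions of the corpus.
wiki = load_dataset("json", data_files="wiki_de.jsonl", split="train", streaming=True)
news = load_dataset("json", data_files="news_parl_de.jsonl", split="train", streaming=True)
web = load_dataset("json", data_files="web_de.jsonl", split="train", streaming=True)

# Turn the 4x / 2x / 1x upsampling weights into interleaving probabilities
# (this assumes roughly comparable source sizes).
weights = [4, 2, 1]
probabilities = [w / sum(weights) for w in weights]

mixed = interleave_datasets(
    [wiki, news, web],
    probabilities=probabilities,
    seed=42,
    stopping_strategy="all_exhausted",
)

for example in mixed.take(3):
    print(example)
```
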
## Performance

[INSERT FIGURE: Performance comparison across models]

Key improvements over Gemma-2B baseline:
- HellaSwag-DE: +71% (47.9% vs 28.0%)
- ARC-DE: +41% (32.3% vs 22.9%)
- Average zero-shot: +40% (35.8% vs 25.5%)

## Safety & Ethics

### Toxicity
- Score: 52.97 on the German TextDetox dataset
- Toxic content appears more out-of-distribution for this model than for the baseline

### Gender Bias
- Evaluated using perplexity differences between traditional and gender-inclusive forms (see the sketch below)
- Slight preference for gender-inclusive language (not statistically significant)
- Example: "Lehrer" vs "Lehrer*innen" (∆PPL = -9.61)

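The perplexity comparison can be reproduced in spirit with a few lines of code. The sketch below computes a ∆PPL for a hypothetical traditional/gender-inclusive sentence pair; the sentences and the exact scoring setup are assumptions, not the evaluation protocol behind the reported numbers.

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("flair/bueble-lm-2b")
model = AutoModelForCausalLM.from_pretrained("flair/bueble-lm-2b", torch_dtype=torch.bfloat16)
model.eval()

def perplexity(text: str) -> float:
    # Perplexity = exp(mean negative log-likelihood) over the token sequence.
    enc = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        loss = model(**enc, labels=enc["input_ids"]).loss
    return torch.exp(loss).item()

# Hypothetical sentence pair contrasting the traditional and the
# gender-inclusive form.
traditional = "Die Lehrer treffen sich nach dem Unterricht."
inclusive = "Die Lehrer*innen treffen sich nach dem Unterricht."

delta_ppl = perplexity(inclusive) - perplexity(traditional)
print(f"∆PPL (inclusive - traditional): {delta_ppl:.2f}")
```

A negative ∆PPL here means the gender-inclusive variant is the lower-perplexity, i.e. more expected, form for the model.
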
## Usage

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("flair/bueble-lm-2b")
model = AutoModelForCausalLM.from_pretrained(
    "flair/bueble-lm-2b",
    device_map="auto",
    torch_dtype=torch.bfloat16,
)

# "Schreibe ein Gedicht über Berlin." = "Write a poem about Berlin."
messages = [{"role": "user", "content": "Schreibe ein Gedicht über Berlin."}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(input_ids, max_new_tokens=256)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
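
For quick experiments without handling tokenization and generation manually, the checkpoint can also be driven through the high-level `pipeline` API; the prompt and generation parameters below are only illustrative.

```python
from transformers import pipeline

# Plain text-completion usage (no chat template applied).
generator = pipeline(
    "text-generation",
    model="flair/bueble-lm-2b",
    device_map="auto",
)

result = generator("Berlin ist", max_new_tokens=64, do_sample=True, top_p=0.95)
print(result[0]["generated_text"])
```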

## Limitations

- Limited vocabulary size (20k tokens) compared to multilingual models
- Performance may vary on specialized domains not well-represented in training data
- Model inherits base limitations from Gemma architecture

## Citation

```bibtex

```