pdelobelle
committed on
Create README.md
README.md
ADDED
---
language:
- de
tags:
- german
- causal-lm
- text-generation
library_name: transformers
pipeline_tag: text-generation
license: apache-2.0
---

# BübleLM

<div align="center">
<img src="/api/placeholder/400/200" alt="BübleLM Logo" />
</div>

BübleLM is a German language model based on Gemma-2B, adapted using trans-tokenization with a German-specific SentencePiece tokenizer. This 2B-parameter model achieves state-of-the-art performance on German language tasks while maintaining strong safety properties.

## Model Details

- **Architecture**: Based on Gemma-2B
- **Parameters**: 2 billion
- **Training**: Trans-tokenization from Gemma-2B using a German SentencePiece tokenizer (vocab size: 20k); see the tokenizer sketch below
- **Context Length**: Same as Gemma-2B
- **Input**: Text (German)
- **Output**: Text (German)
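
The following is a minimal sketch of how the German tokenizer can be inspected; the repository id `flair/bueble-lm-2b` is taken from the Usage section below, and the example sentence is purely illustrative.

```python
from transformers import AutoTokenizer

# Load the German SentencePiece tokenizer that replaces Gemma-2B's original vocabulary.
tokenizer = AutoTokenizer.from_pretrained("flair/bueble-lm-2b")

# The trans-tokenized vocabulary is German-specific and comparatively small (roughly 20k entries).
print(len(tokenizer))

# Segment an illustrative German sentence ("Berlin is the capital of Germany.").
print(tokenizer.tokenize("Berlin ist die Hauptstadt von Deutschland."))
```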

## Training Data

Trained on 3.5B tokens from the Occiglot-FineWeb project, including:
- Contemporary web content (OSCAR 2015-2023)
- Legislative documents (EurLex, ParlamInt)
- News data (Tagesschau)
- Wiki sources

Data sampling weights (a sketch follows the list):
- Wikipedia: 4x
- News/Parliamentary: 2x
- Other sources: 1x
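
As an illustration of these weights, upsampling can be expressed as per-source repetition factors; the source names and data-loading code below are placeholders, not the actual training pipeline.

```python
import random

# Hypothetical per-source sampling weights mirroring the list above.
SOURCE_WEIGHTS = {
    "wikipedia": 4,      # Wiki sources: 4x
    "news": 2,           # Tagesschau: 2x
    "parliamentary": 2,  # EurLex / ParlamInt: 2x
    "web": 1,            # OSCAR web crawl: 1x
}

def upsample(documents_by_source: dict[str, list[str]]) -> list[str]:
    """Repeat each source's documents according to its weight and shuffle the result."""
    corpus: list[str] = []
    for source, docs in documents_by_source.items():
        corpus.extend(docs * SOURCE_WEIGHTS.get(source, 1))
    random.shuffle(corpus)
    return corpus
```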

## Performance

[INSERT FIGURE: Performance comparison across models]

Key improvements over the Gemma-2B baseline:
- HellaSwag-DE: +71% (47.9% vs 28.0%)
- ARC-DE: +41% (32.3% vs 22.9%)
- Average zero-shot: +40% (35.8% vs 25.5%)

## Safety & Ethics

### Toxicity
- Score: 52.97 on the German TextDetox dataset
- Toxic content appears more out-of-distribution compared to the baseline

### Gender Bias
- Evaluated using perplexity differences between traditional and gender-inclusive forms, as sketched below
- Slight preference for gender-inclusive language (not statistically significant)
- Example: "Lehrer" vs "Lehrer*innen" (∆PPL = -9.61)
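
A minimal sketch of this kind of perplexity comparison is given below; the sentence pair, the scoring helper, and the sign convention are illustrative, not the evaluation code behind the reported numbers.

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("flair/bueble-lm-2b")
model = AutoModelForCausalLM.from_pretrained("flair/bueble-lm-2b", torch_dtype=torch.bfloat16)
model.eval()

def perplexity(text: str) -> float:
    """Perplexity of a single sentence under the model."""
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        loss = model(**inputs, labels=inputs["input_ids"]).loss
    return torch.exp(loss).item()

# Illustrative pair ("The teachers discuss the lesson."): gender-inclusive vs. traditional form.
delta_ppl = perplexity("Die Lehrer*innen besprechen den Unterricht.") - perplexity("Die Lehrer besprechen den Unterricht.")
print(delta_ppl)
```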

## Usage

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("flair/bueble-lm-2b")
model = AutoModelForCausalLM.from_pretrained(
    "flair/bueble-lm-2b",
    device_map="auto",
    torch_dtype=torch.bfloat16
)

# Build the prompt with the chat template and move it to the model's device.
messages = [{"role": "user", "content": "Schreibe ein Gedicht über Berlin."}]  # "Write a poem about Berlin."
input_ids = tokenizer.apply_chat_template(messages, return_tensors="pt").to(model.device)

outputs = model.generate(input_ids, max_new_tokens=256)
print(tokenizer.decode(outputs[0]))
```
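
Alternatively, generation can go through the `transformers` text-generation pipeline. This is a generic sketch rather than an officially documented invocation, and the sampling parameters are illustrative.

```python
import torch
from transformers import pipeline

generator = pipeline(
    "text-generation",
    model="flair/bueble-lm-2b",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# Sampling parameters are illustrative defaults, not tuned recommendations.
result = generator(
    "Berlin ist bekannt für",  # "Berlin is known for"
    max_new_tokens=128,
    do_sample=True,
    temperature=0.7,
)
print(result[0]["generated_text"])
```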

## Limitations

- Limited vocabulary size (20k tokens) compared to multilingual models
- Performance may vary on specialized domains not well-represented in training data
- Model inherits base limitations from Gemma architecture

## Citation

```bibtex

```