---
language:
- de
tags:
- german
- causal-lm
- text-generation
library_name: transformers
pipeline_tag: text-generation
license: apache-2.0
---
# BübleLM SFT (WIP)

*A small German language model*
BübleLM is a German language model based on Gemma-2-2B, adapted using [trans-tokenization](https://pieter.ai/trans-tokenization/) with a custom German SentencePiece tokenizer. The model demonstrates how language-specific tokenization can significantly improve performance while maintaining the base model's capabilities.
This is an experimental version that has received supervised finetuning on several German instruction datasets.
A DPO-tuned version will follow soon.
## Model Details
- **Architecture**: Based on the Gemma-2-2B decoder-only architecture
- **Parameters**: 2 billion
- **Tokenizer**: Custom German SentencePiece tokenizer (20k vocabulary)
  - Fertility rate: 1.78 tokens per word (see the measurement sketch after this list)
  - Optimized for German morphological structures
  - Trained on the same corpus as the model
- **Context Length**: 8192 tokens
- **Training Hardware**: Single node with 4x NVIDIA A100-SXM4-80GB GPUs
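The fertility figure can be checked informally with a snippet like the one below. The model ID is a placeholder for this repository's path on the Hub, and the sample sentence is far too short for a reliable estimate; this is only meant to show how the metric is computed.

```python
from transformers import AutoTokenizer

# Placeholder ID: replace with this repository's model ID on the Hugging Face Hub.
tokenizer = AutoTokenizer.from_pretrained("your-org/bueble-lm-2b")

text = (
    "Die Würde des Menschen ist unantastbar. "
    "Sie zu achten und zu schützen ist Verpflichtung aller staatlichen Gewalt."
)

# Fertility = average number of tokens produced per whitespace-separated word.
fertility = len(tokenizer.tokenize(text)) / len(text.split())
print(f"Fertility: {fertility:.2f} tokens per word")
```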
## Training Data
Trained on 3.5B tokens from the Occiglot-FineWeb project, including:
- Contemporary web content (OSCAR 2015-2023)
- Legislative documents (EurLex, ParlaMint)
- News data (Tagesschau)
- Wiki sources
Data sampling weights:
- Wikipedia: 4x
- News/Parliamentary: 2x
- Other sources: 1x
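As an illustration of how such relative weights translate into a sampling mix, here is a minimal sketch using `datasets.interleave_datasets`. The stand-in corpora and the equal-corpus-size assumption are illustrative only and do not reflect the actual training pipeline.

```python
from datasets import Dataset, interleave_datasets

# Illustrative stand-in corpora; the real mix uses Occiglot-FineWeb sources.
corpora = {
    "wikipedia": Dataset.from_dict({"text": ["Wikipedia-Beispieltext."] * 10}),
    "news_parliamentary": Dataset.from_dict({"text": ["Nachrichten-Beispieltext."] * 10}),
    "web": Dataset.from_dict({"text": ["Web-Beispieltext."] * 10}),
}
weights = {"wikipedia": 4.0, "news_parliamentary": 2.0, "web": 1.0}

# Normalize relative upsampling weights into sampling probabilities
# (assuming equally sized corpora, which is a simplification).
total = sum(weights.values())
probabilities = [weights[name] / total for name in corpora]

mixed = interleave_datasets(list(corpora.values()), probabilities=probabilities, seed=42)
print(mixed[0])
```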
## Finetuning
Additional supervised finetuning was performed via LoRA using German translations of alpaca-gpt4, openschnabeltier, evol_instruct, dolphin, airoboros, slimorca, hermes, and synthia.
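The exact LoRA hyperparameters are not documented here. The snippet below is a minimal, assumed configuration with `peft` that shows the general shape of such a setup; the model ID is a placeholder and the rank, alpha, dropout, and target modules are guesses rather than the values actually used.

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

# Placeholder model ID; hyperparameters below are assumptions, not the actual settings.
model = AutoModelForCausalLM.from_pretrained("your-org/bueble-lm-2b")

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
```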
## Performance
To be determined after DPO training.
## Usage
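A minimal generation example with the `transformers` text-generation pipeline; the model ID is a placeholder, so substitute this repository's ID on the Hub.

```python
import torch
from transformers import pipeline

# Placeholder: replace with this repository's model ID on the Hugging Face Hub.
model_id = "your-org/bueble-lm-2b"

generator = pipeline(
    "text-generation",
    model=model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

output = generator(
    "Erkläre in einfachen Worten, was ein Sprachmodell ist.",
    max_new_tokens=128,
    do_sample=True,
    temperature=0.7,
)
print(output[0]["generated_text"])
```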
## Source
```bibtex
@article{delobelle2024buble,
  title={BübleLM: A small German LM},
  author={Delobelle, Pieter and Akbik, Alan and others},
  year={2024}
}
```