BübleLM SFT WIP

A small German LM

BübleLM is a German language model based on Gemma-2-2B, adapted using trans-tokenization with a custom German SentencePiece tokenizer. The model demonstrates how language-specific tokenization can significantly improve performance while maintaining the base model's capabilities.

This is an experimental version that has received supervised finetuning on several German datasets; a DPO version will follow soon.

Model Details

  • Architecture: Based on the Gemma-2-2B decoder-only architecture
  • Parameters: 2 billion
  • Tokenizer: Custom German SentencePiece tokenizer (20k vocabulary)
    • Fertility rate: 1.78 tokens per word (see the measurement sketch after this list)
    • Optimized for German morphological structures
    • Trained on the same corpus as the model
  • Context Length: 8192 tokens
  • Training Hardware: Single node with 4x NVIDIA A100-SXM4-80GB GPUs
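
The fertility figure above can be reproduced with a few lines of `transformers` code. The snippet below is only a minimal sketch: the repository id and the sample sentences are placeholders, and fertility is computed here as tokens per whitespace-separated word, which is the usual definition but may differ in detail from the exact evaluation used for this card.

```python
from transformers import AutoTokenizer

# Placeholder repository id -- replace with the actual BübleLM checkpoint.
tokenizer = AutoTokenizer.from_pretrained("path/or/repo-id-of-bueblelm")

# Small, hypothetical German sample; a real measurement should use a held-out corpus.
sentences = [
    "Die Bundesregierung hat heute ein neues Gesetz verabschiedet.",
    "Maschinelles Lernen verändert die Sprachverarbeitung grundlegend.",
]

n_tokens = sum(len(tokenizer.tokenize(s)) for s in sentences)
n_words = sum(len(s.split()) for s in sentences)

# Fertility = average number of subword tokens per whitespace word.
print(f"Fertility: {n_tokens / n_words:.2f} tokens per word")
```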

Training Data

Trained on 3.5B tokens from the Occiglot-FineWeb project, including:

  • Contemporary web content (OSCAR 2015-2023)
  • Legislative documents (EurLex, ParlamInt)
  • News data (Tagesschau)
  • Wiki sources

Data sampling weights (see the sketch after this list for how they translate into sampling proportions):

  • Wikipedia: 4x
  • News/Parliamentary: 2x
  • Other sources: 1x
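
As an illustration of how such upsampling weights are commonly applied, the sketch below converts per-source token counts into sampling probabilities. The token counts are purely hypothetical placeholders, and the card does not state the exact corpus sizes or the precise sampling procedure used.

```python
# Hypothetical per-source token counts (the real sizes are not given in this card).
source_tokens = {
    "wikipedia": 0.4e9,
    "news_parliamentary": 0.9e9,
    "other_web": 2.2e9,
}

# Upsampling weights from the card.
weights = {"wikipedia": 4.0, "news_parliamentary": 2.0, "other_web": 1.0}

# One common scheme: sample each source proportionally to (size * weight).
scores = {name: source_tokens[name] * weights[name] for name in source_tokens}
total = sum(scores.values())
probs = {name: score / total for name, score in scores.items()}

for name, p in probs.items():
    print(f"{name}: {p:.1%} of sampled tokens")
```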

Finetuning

Additional supervised finetuning via LoRA was done using German translations of alpaca-gpt4, openschnabeltier, evol_instruct, dolphin, airoboros, slimorca, hermes, and synthia.
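
A rough outline of such a LoRA setup with `peft` is sketched below. The rank, alpha, target modules, and other hyperparameters are assumptions for illustration only; the card does not specify the values actually used, and the repository id is a placeholder.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base_id = "path/or/repo-id-of-bueblelm-base"  # placeholder, not the official id

tokenizer = AutoTokenizer.from_pretrained(base_id)
model = AutoModelForCausalLM.from_pretrained(base_id, torch_dtype=torch.bfloat16)

# Hypothetical LoRA configuration; the actual rank/alpha/target modules are not documented here.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()

# The adapted model can then be trained on the German instruction datasets
# listed above with any standard SFT loop or trainer.
```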

Performance

Evaluation results will be added after DPO training.

Usage
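
A minimal text-generation example with `transformers` is sketched below. The repository id is a placeholder and the generation settings are illustrative only.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "path/or/repo-id-of-bueblelm-sft"  # placeholder, not the official id

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

prompt = "Erkläre in einfachen Worten, was ein Sprachmodell ist."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

outputs = model.generate(
    **inputs,
    max_new_tokens=200,
    do_sample=True,
    temperature=0.7,
    top_p=0.9,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```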

Citation

@article{delobelle2024buble,
    title={BübleLM: A small German LM},
    author={Delobelle, Pieter and Akbik, Alan and others},
    year={2024}
}