---
license: apache-2.0
datasets:
- Alfaxad/Inkuba-Mono-Swahili
language:
- sw
pipeline_tag: text-generation
library_name: transformers
tags:
- gemma2
- text-2-text
- text-generation
- llms
base_model:
- google/gemma-2-2b
---

# Gemma2-2B-Swahili-Preview

Gemma2-2B-Swahili-Preview is a Swahili adaptation of the Gemma2 2B base model, fine-tuned on the Inkuba-Mono Swahili dataset to strengthen Swahili language understanding through monolingual training.

## Model Details

- **Developer:** Alfaxad Eyembe
- **Base Model:** google/gemma-2-2b
- **Model Type:** Decoder-only transformer
- **Language:** Swahili
- **License:** Apache 2.0
- **Fine-tuning Approach:** Low-Rank Adaptation (LoRA)

## Training Data

The model was fine-tuned on a focused subset of the Inkuba-Mono dataset:

- 1,000,000 randomly selected examples
- Total tokens: 60,831,073
- Average text length: 101.33 characters
- Diverse Swahili text sources, including news, social media, and other domains

## Training Details

- **Fine-tuning Method:** LoRA
- **Training Steps:** 2,500
- **Batch Size:** 2
- **Gradient Accumulation Steps:** 32
- **Learning Rate:** 2e-4
- **Training Time:** ~7.5 hours

An illustrative reproduction of this setup is sketched in Appendix A below.

![image/png](https://cdn-uploads.huggingface.co/production/uploads/6375af60e3413701a9f01c0f/8fVULkKb92JTk8-65KE5R.png)

## Model Capabilities

This model is designed for:

- Swahili text continuation
- Natural language understanding
- Contextual text generation
- Base language modeling for Swahili

A quick perplexity check for the base language-modeling use case is sketched in Appendix B below.

## Usage

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

tokenizer = AutoTokenizer.from_pretrained("alfaxadeyembe/gemma2-2b-swahili-preview")
model = AutoModelForCausalLM.from_pretrained(
    "alfaxadeyembe/gemma2-2b-swahili-preview",
    device_map="auto",
    torch_dtype=torch.bfloat16
)

# Set to evaluation mode
model.eval()

# Example prompt: "In the Kariakoo market, new technology has enabled..."
prompt = "Katika soko la Kariakoo, teknolojia mpya imewezesha"
# Move inputs onto the model's device so generation works with device_map="auto"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

outputs = model.generate(
    **inputs,
    max_new_tokens=500,
    do_sample=True,
    temperature=0.7,
    top_p=0.95
)

response = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(response)
```

## Key Features

- Natural Swahili text continuation
- Strong understanding of cultural context
- Efficient parameter updates through LoRA
- Knowledge drawn from diverse Swahili text domains

## Limitations

- Not instruction-tuned: the model continues text rather than following instructions
- Provides base language-modeling capabilities only
- Performance varies across text domains

## Citation

```bibtex
@misc{gemma2-2b-swahili-preview,
  author = {Alfaxad Eyembe},
  title = {Gemma2-2B-Swahili-Preview: Swahili Variation of Gemma2 2B},
  year = {2025},
  publisher = {Hugging Face},
  journal = {Hugging Face Model Hub}
}
```

## Contact

For questions or feedback, please reach out through:

- Hugging Face: [@alfaxadeyembe](https://huggingface.co/alfaxad)
- X: [@alfxad](https://twitter.com/alfxad)
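
## Appendix A: Fine-Tuning Sketch

The card does not include the training script. The sketch below shows a comparable LoRA setup with the Hugging Face `transformers`, `datasets`, and `peft` libraries. The step count, batch size, gradient accumulation, learning rate, and 1M-example subset mirror the Training Details and Training Data sections above; the LoRA rank and alpha, target modules, sequence length, shuffle seed, and the `text` column name are illustrative assumptions, not documented values.

```python
import torch
from datasets import load_dataset
from peft import LoraConfig, get_peft_model
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

base = "google/gemma-2-2b"
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base, torch_dtype=torch.bfloat16)

# LoRA hyperparameters are assumptions; only the method itself is documented.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)

# 1,000,000 randomly selected examples, per the Training Data section
# (seed and `text` column name are assumptions).
dataset = load_dataset("Alfaxad/Inkuba-Mono-Swahili", split="train")
dataset = dataset.shuffle(seed=42).select(range(1_000_000))

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = dataset.map(tokenize, batched=True, remove_columns=dataset.column_names)

# These values mirror the Training Details section.
args = TrainingArguments(
    output_dir="gemma2-2b-swahili-preview",
    max_steps=2500,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=32,
    learning_rate=2e-4,
    bf16=True,
    logging_steps=50,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```

Note that with a batch size of 2 and 32 gradient-accumulation steps, each optimizer step sees an effective batch of 64 sequences, so 2,500 steps cover roughly 160,000 examples per epoch-equivalent pass.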
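
## Appendix B: Measuring Perplexity

Because this is a base model rather than an instruction-tuned one, the most direct quality signal is language-model perplexity on held-out Swahili text. The snippet below is a minimal sketch; the sample sentence ("Good morning, today we will learn about technology.") is illustrative and not from the training data.

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("alfaxadeyembe/gemma2-2b-swahili-preview")
model = AutoModelForCausalLM.from_pretrained(
    "alfaxadeyembe/gemma2-2b-swahili-preview",
    device_map="auto",
    torch_dtype=torch.bfloat16,
)
model.eval()

text = "Habari za asubuhi, leo tutajifunza kuhusu teknolojia."
inputs = tokenizer(text, return_tensors="pt").to(model.device)

with torch.no_grad():
    # With labels set, the model returns the mean cross-entropy loss,
    # so perplexity is simply exp(loss).
    loss = model(**inputs, labels=inputs["input_ids"]).loss

print(f"Perplexity: {torch.exp(loss).item():.2f}")
```

Lower perplexity indicates a better fit to the text; comparing scores against the base google/gemma-2-2b on the same Swahili passages is one way to gauge the effect of the monolingual fine-tuning.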