---
library_name: transformers
tags:
- low resource
- trans
language:
- ban
- min
- en
- id
base_model:
- Yellow-AI-NLP/komodo-7b-base
---

# Model Card for NusaMT-7B

NusaMT-7B is a large language model fine-tuned for machine translation of low-resource Indonesian languages, with a focus on Balinese and Minangkabau. Built on LLaMA2-7B via the Komodo-7B-base checkpoint, it combines continued pre-training on non-English monolingual data, supervised fine-tuning, data preprocessing to clean parallel sentences, and synthetic data generation.

## Model Details

### Model Description

- **Developed by:** William Tan
- **Model type:** Decoder-only Large Language Model
- **Language(s) (NLP):** Balinese, Minangkabau, Indonesian, English
- **Finetuned from model:** Yellow-AI-NLP/komodo-7b-base

### Model Sources

- **Repository:** https://github.com/williammtan/nusamt
- **Paper:** https://arxiv.org/abs/2410.07830
- **Demo:** https://indonesiaku.com/translate

## Uses

The model is designed for:

- Bidirectional translation between English/Indonesian and low-resource Indonesian languages (currently Balinese and Minangkabau)
- Language preservation and documentation
- Cross-cultural communication
- Educational purposes and language learning

### Direct Use

The model can be:

- Integrated into translation applications
- Used for data augmentation in low-resource language tasks
- Adapted to other Indonesian regional languages
- Used as a foundation for developing language learning tools

### Out-of-Scope Use

The model is not suitable for:

- Translation of languages outside its trained scope
- General text generation or chat functionality
- Real-time translation requiring minimal latency
- Critical applications where translation errors could cause harm

## Bias, Risks, and Limitations

- Limited to specific language pairs (English/Indonesian ↔ Balinese/Minangkabau)
- Performance varies by translation direction, with better results when translating into the low-resource languages
- Underperforms larger models (e.g., NLLB-3.3B) when translating into high-resource languages
- May not capture all dialectal variations or cultural nuances
- Uses significantly more parameters (7 billion) than traditional NMT models
- Limited by the quality and quantity of available training data

### Recommendations

Users (both direct and downstream) should be made aware of the risks, biases, and limitations of the model.

## How to Get Started with the Model

Use the code below to get started with the model.
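The snippet below is a minimal usage sketch with the `transformers` library. The repository id and the translation prompt are illustrative assumptions rather than the exact format used during fine-tuning; see the paper and the GitHub repository for the prompt template actually used.

```python
# Minimal usage sketch (assumptions: repository id and prompt format are placeholders).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "williamhtan/NusaMT-7B"  # placeholder repository id, adjust to the published checkpoint

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,   # the model was trained in bfloat16 mixed precision
    device_map="auto",
)

# An ALMA-style translation prompt; the exact wording is an assumption.
prompt = (
    "Translate this from English to Balinese:\n"
    "English: Good morning, how are you?\n"
    "Balinese:"
)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=128, num_beams=5)

# Decode only the newly generated tokens (the translation).
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```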
## Training Details

### Training Data

NusaMT dataset: https://huggingface.co/datasets/williamhtan/NusaMT

Total parallel sentences after cleaning:

- Balinese ↔ English: 35.6k sentences
- Balinese ↔ Indonesian: 44.9k sentences
- Minangkabau ↔ English: 16.6k sentences
- Minangkabau ↔ Indonesian: 22.4k sentences

Data sources:

- NLLB Mined corpus (ODC-BY license)
- NLLB SEED dataset (CC-BY-SA license)
- BASAbaliWiki (CC-BY-SA license)
- Bible verses from Alkitab.mobi (for non-profit scholarly use)
- NusaX dataset (CC-BY-SA license)

#### Preprocessing

- Length filtering (15-500 characters)
- Maximum source/target word-length ratio of 2
- Removal of sentences containing words longer than 20 characters
- Deduplication
- Language identification with GlotLID V3 (threshold: 0.9)
- LASER3 similarity scoring (threshold: 1.09)
- GPT-4o mini-based data cleaning

#### Training Hyperparameters

- Training regime: bfloat16 mixed precision
- LoRA rank: 16
- Learning rate: 0.002
- Batch size: 10 per device
- Epochs: 3
- Data splits: 90% training, 5% validation, 5% testing
- Loss: causal language modeling (CLM)

## Evaluation

### Testing Data, Factors & Metrics

#### Testing Data

- FLORES-200 multilingual translation benchmark
- Internal test set (5% of the parallel data)

#### Metrics

- spBLEU (SentencePiece-tokenized BLEU)

### Results

Performance highlights:

- Outperforms SOTA models by up to +6.69 spBLEU in translations into Balinese
- Underperforms by up to -3.38 spBLEU in translations into higher-resource languages
- Consistently outperforms zero-shot GPT-3.5, GPT-4, and GPT-4o

### Table 2: spBLEU Score Comparison of the LLaMA2-7B SFT Model with Various Enhancements

| Models                        | ban → en | en → ban | ban → id | id → ban |
|-------------------------------|----------|----------|----------|----------|
| LLaMA2-7B SFT                 | 27.63    | 13.94    | 27.90    | 13.68    |
| + Monolingual Pre-training    | 31.28    | 18.92    | 28.75    | 20.11    |
| + Mono + Backtranslation      | 33.97    | 20.27    | 29.62    | 20.67    |
| + Mono + LLM Cleaner          | 33.23    | 19.75    | 29.02    | 21.16    |
| + Mono + Cleaner + Backtrans. | **35.42**| **22.15**| **31.56**| **22.95**|

This table reports spBLEU scores for successive configurations of the LLaMA2-7B SFT model, showing the impact of monolingual pre-training, backtranslation, and LLM-based cleaning across language pairs.

### Table 3: spBLEU Scores of NusaMT-7B Compared Against SoTA Models and Large GPT Models

| Models                        | ban → en | en → ban | ban → id | id → ban | min → en | en → min | min → id | id → min |
|-------------------------------|----------|----------|----------|----------|----------|----------|----------|----------|
| GPT-3.5-turbo, zero-shot      | 27.17    | 11.63    | 28.17    | 13.14    | 28.75    | 11.07    | 31.06    | 11.05    |
| GPT-4o, zero-shot             | 27.11    | 11.45    | 27.89    | 13.08    | 28.63    | 11.00    | 31.27    | 11.00    |
| GPT-4, zero-shot              | 27.20    | 11.59    | 28.41    | 13.24    | 28.51    | 10.99    | 31.00    | 10.93    |
| NLLB-600M                     | 33.96    | 16.86    | 30.12    | 15.15    | 35.05    | 19.72    | 31.92    | 17.72    |
| NLLB-1.3B                     | 37.24    | 17.73    | 32.42    | 16.21    | 38.59    | 22.79    | 34.68    | 20.89    |
| NLLB-3.3B                     | **38.57**| 17.09    | **33.35**| 14.85    | **40.61**| **24.71**| **35.20**| 22.44    |
| NusaMT-7B (Ours)              | 35.42    | **22.15**| 31.56    | **22.95**| 37.23    | 24.32    | 34.29    | **23.27**|

This table compares NusaMT-7B with state-of-the-art NMT models and large GPT models in terms of spBLEU across multiple language pairs. NusaMT-7B shows its largest gains in translations into the low-resource languages.
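The spBLEU scores above follow the FLORES evaluation convention of computing BLEU over SentencePiece-tokenized text. Below is a minimal sketch of how such a score can be computed with `sacrebleu`, assuming a release (>= 2.2) that ships the `flores200` tokenizer; the hypothesis and reference strings are placeholders, not data from the model.

```python
# Minimal sketch: spBLEU (SentencePiece-tokenized BLEU) with sacrebleu.
# Assumes sacrebleu >= 2.2, which provides the "flores200" tokenizer;
# the sentences below are illustrative placeholders only.
import sacrebleu

hypotheses = ["model translation goes here"]        # system outputs, one string per segment
references = [["reference translation goes here"]]  # one reference stream, aligned with hypotheses

bleu = sacrebleu.corpus_bleu(
    hypotheses,
    references,
    tokenize="flores200",  # SentencePiece tokenization -> spBLEU
)
print(f"spBLEU: {bleu.score:.2f}")
```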
## Environmental Impact

- **Hardware Type:** 2x NVIDIA RTX 4090
- **Hours used:** 1250
- **Cloud Provider:** Runpod.io
- **Carbon Emitted:** 210 kg CO2e

## Citation

If you find this model useful, please cite the following work:

```
@misc{tan2024nusamt7bmachinetranslationlowresource,
      title={NusaMT-7B: Machine Translation for Low-Resource Indonesian Languages with Large Language Models},
      author={William Tan and Kevin Zhu},
      year={2024},
      eprint={2410.07830},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2410.07830},
}
```