Model Card for NusaMT-7B
NusaMT-7B is a large language model fine-tuned for machine translation of low-resource Indonesian languages, currently Balinese and Minangkabau. Built on LLaMA2-7B via the Komodo-7B-base checkpoint, it combines continued pre-training on non-English monolingual data, supervised fine-tuning, data preprocessing for cleaning parallel sentences, and synthetic data generation.
Model Details
Model Description
- Developed by: William Tan
- Model type: Decoder-only Large Language Model
- Language(s) (NLP): Balinese, Minangkabau, Indonesian, English
- Finetuned from model: Yellow-AI-NLP/komodo-7b-base
Model Sources
- Repository: https://github.com/williammtan/nusamt
- Paper: https://arxiv.org/abs/2410.07830
- Demo: https://indonesiaku.com/translate
Uses
The model is designed for:
- Bidirectional translation between English/Indonesian and low-resource Indonesian languages (currently Balinese and Minangkabau)
- Language preservation and documentation
- Cross-cultural communication
- Educational purposes and language learning
Direct Use
- Integrated into translation applications
- Used for data augmentation in low-resource language tasks
- Adapted for other Indonesian regional languages
- Used as a foundation for developing language learning tools
Out-of-Scope Use
The model is not suitable for:
- Translation of languages outside its trained scope
- General text generation or chat functionality
- Real-time translation requiring minimal latency
- Critical applications where translation errors could cause harm
Bias, Risks, and Limitations
- Limited to specific language pairs (English/Indonesian ↔ Balinese/Minangkabau)
- Performance varies between translation directions, with better results for translations into low-resource languages
- Underperforms larger models (NLLB-3.3B) in translations into high-resource languages
- May not capture all dialectal variations or cultural nuances
- Uses significantly more parameters (7 billion) compared to traditional NMT models
- Limited by the quality and quantity of available training data
Recommendations
Users (both direct and downstream) should be made aware of the risks, biases, and limitations of the model.
How to Get Started with the Model
Use the code below to get started with the model.
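A minimal inference sketch using the `transformers` library is shown below. The ALMA-style prompt template in `build_prompt` is an assumption, not a documented interface; check the repository's inference scripts for the exact format the model was trained with.

```python
def build_prompt(text, src_lang, tgt_lang):
    """ALMA-style translation prompt. The exact template NusaMT-7B expects
    is an assumption here; verify against the repository before relying on it."""
    return (
        f"Translate this from {src_lang} to {tgt_lang}:\n"
        f"{src_lang}: {text}\n"
        f"{tgt_lang}:"
    )


def translate(text, src_lang="English", tgt_lang="Balinese",
              model_id="williamhtan/NusaMT-7B", max_new_tokens=256):
    """Load the checkpoint and generate a translation.
    Requires torch and transformers, plus enough memory for a 7B model."""
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id, torch_dtype=torch.bfloat16, device_map="auto"
    )
    inputs = tokenizer(
        build_prompt(text, src_lang, tgt_lang), return_tensors="pt"
    ).to(model.device)
    output = model.generate(**inputs, max_new_tokens=max_new_tokens)
    # Keep only the generated continuation, not the echoed prompt.
    generated = output[0][inputs["input_ids"].shape[1]:]
    return tokenizer.decode(generated, skip_special_tokens=True).strip()
```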
Training Details
Training Data
NusaMT: https://huggingface.co/datasets/williamhtan/NusaMT
Total parallel sentences after cleaning:
- Balinese ↔ English: 35.6k sentences
- Balinese ↔ Indonesian: 44.9k sentences
- Minangkabau ↔ English: 16.6k sentences
- Minangkabau ↔ Indonesian: 22.4k sentences
Data sources:
- NLLB Mined corpus (ODC-BY license)
- NLLB SEED dataset (CC-BY-SA license)
- BASAbaliWiki (CC-BY-SA license)
- Bible verses from Alkitab.mobi (for non-profit scholarly use)
- NusaX dataset (CC-BY-SA license)
Preprocessing
- Length filtering (15-500 characters)
- Source-to-target word-count ratio capped at 2
- Removal of sentences with words >20 characters
- Deduplication
- Language identification with GlotLid V3 (threshold: 0.9)
- LASER3 similarity scoring (threshold: 1.09)
- GPT-4o mini-based data cleaning
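The rule-based filters above can be sketched as a single pass over the parallel corpus. This is an illustrative reconstruction, not the repository's actual pipeline; the GlotLID, LASER3, and GPT-4o mini stages are noted as comments because they require external models.

```python
def clean_pairs(pairs):
    """Apply the rule-based filters from the card to (source, target) pairs:
    15-500 character length, word-count ratio <= 2, no word over 20
    characters, and exact deduplication. GlotLID language ID, LASER3
    similarity scoring, and LLM-based cleaning would follow as further
    stages and are omitted here."""
    seen, kept = set(), []
    for src, tgt in pairs:
        # Length filtering: both sides must be 15-500 characters.
        if not (15 <= len(src) <= 500 and 15 <= len(tgt) <= 500):
            continue
        src_words, tgt_words = src.split(), tgt.split()
        # Word-count ratio between the two sides must not exceed 2.
        ratio = max(len(src_words), len(tgt_words)) / max(
            1, min(len(src_words), len(tgt_words))
        )
        if ratio > 2:
            continue
        # Drop pairs containing any word longer than 20 characters.
        if any(len(w) > 20 for w in src_words + tgt_words):
            continue
        # Deduplicate on a normalized key.
        key = (src.strip().lower(), tgt.strip().lower())
        if key in seen:
            continue
        seen.add(key)
        kept.append((src, tgt))
    return kept
```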
Training Hyperparameters
- Training regime: bfloat16 mixed precision
- LoRA rank: 16
- Learning rate: 0.002
- Batch size: 10 per device
- Epochs: 3
- Data splits: 90% training, 5% validation, 5% testing
- Loss: Causal Language Modeling (CLM)
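The hyperparameters and the 90/5/5 split above can be captured in a small stdlib-only sketch. The config dictionary keys and the `split_dataset` helper (including its seed) are illustrative names, not taken from the training code.

```python
import random

# Values transcribed from the hyperparameter list above; key names are
# illustrative, not the training script's actual argument names.
HYPERPARAMS = {
    "precision": "bfloat16",
    "lora_rank": 16,
    "learning_rate": 2e-3,
    "per_device_batch_size": 10,
    "epochs": 3,
}


def split_dataset(examples, seed=0):
    """Shuffle and split examples into 90% train / 5% validation / 5% test.
    The seed is a hypothetical default for reproducibility."""
    rng = random.Random(seed)
    items = list(examples)
    rng.shuffle(items)
    n = len(items)
    n_train = int(0.90 * n)
    n_val = int(0.05 * n)
    return (
        items[:n_train],
        items[n_train:n_train + n_val],
        items[n_train + n_val:],
    )
```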
Evaluation
Testing Data, Factors & Metrics
Testing Data
- FLORES-200 multilingual translation benchmark
- Internal test set (5% of parallel data)
Metrics
- spBLEU (SentencePiece tokenized BLEU)
Results
Performance highlights:
- Outperforms SOTA models by up to +6.69 spBLEU in translations into Balinese
- Underperforms by up to -3.38 spBLEU in translations into higher-resource languages
- Consistently outperforms GPT-3.5, GPT-4, and GPT-4o in zero-shot translation
Table 2: spBLEU Score Comparison of the LLaMA2-7B SFT Model with Various Enhancements
| Models | ban → en | en → ban | ban → id | id → ban |
|---|---|---|---|---|
| LLaMA2-7B SFT | 27.63 | 13.94 | 27.90 | 13.68 |
| + Monolingual Pre-training | 31.28 | 18.92 | 28.75 | 20.11 |
| + Mono + Backtranslation | 33.97 | 20.27 | 29.62 | 20.67 |
| + Mono + LLM Cleaner | 33.23 | 19.75 | 29.02 | 21.16 |
| + Mono + Cleaner + Backtrans. | 35.42 | 22.15 | 31.56 | 22.95 |
This table presents spBLEU scores for various configurations of the LLaMA2-7B model, showing the impact of monolingual pre-training, backtranslation, and LLM cleaning on translation performance across different language pairs.
Table 3: spBLEU Scores of NusaMT-7B Compared Against SoTA Models and Large GPT Models
| Models | ban → en | en → ban | ban → id | id → ban | min → en | en → min | min → id | id → min |
|---|---|---|---|---|---|---|---|---|
| GPT-3.5-turbo, zero-shot | 27.17 | 11.63 | 28.17 | 13.14 | 28.75 | 11.07 | 31.06 | 11.05 |
| GPT-4o, zero-shot | 27.11 | 11.45 | 27.89 | 13.08 | 28.63 | 11.00 | 31.27 | 11.00 |
| GPT-4, zero-shot | 27.20 | 11.59 | 28.41 | 13.24 | 28.51 | 10.99 | 31.00 | 10.93 |
| NLLB-600M | 33.96 | 16.86 | 30.12 | 15.15 | 35.05 | 19.72 | 31.92 | 17.72 |
| NLLB-1.3B | 37.24 | 17.73 | 32.42 | 16.21 | 38.59 | 22.79 | 34.68 | 20.89 |
| NLLB-3.3B | 38.57 | 17.09 | 33.35 | 14.85 | 40.61 | 24.71 | 35.20 | 22.44 |
| NusaMT-7B (Ours) | 35.42 | 22.15 | 31.56 | 22.95 | 37.23 | 24.32 | 34.29 | 23.27 |
This table compares the performance of NusaMT-7B with state-of-the-art models and large GPT models in terms of spBLEU scores across multiple language pairs. NusaMT-7B shows significant improvements, particularly in translations into low-resource languages.
Environmental Impact
- Hardware Type: 2x NVIDIA RTX 4090
- Hours used: 1250
- Cloud Provider: Runpod.io
- Carbon Emitted: 210 kg CO2e
Citation
If you find this model useful, please cite the following work:
@misc{tan2024nusamt7bmachinetranslationlowresource,
title={NusaMT-7B: Machine Translation for Low-Resource Indonesian Languages with Large Language Models},
author={William Tan and Kevin Zhu},
year={2024},
eprint={2410.07830},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2410.07830},
}