Model Card for NusaMT-7B

NusaMT-7B is a large language model fine-tuned for machine translation of low-resource Indonesian languages, with a focus on Balinese and Minangkabau. Built on LLaMA2-7B through the Komodo-7B-base checkpoint, it combines continued pre-training on non-English monolingual data, supervised fine-tuning, data preprocessing to clean parallel sentences, and synthetic data generation via back-translation.

Model Details

Model Description

  • Developed by: William Tan
  • Model type: Decoder-only Large Language Model
  • Language(s) (NLP): Balinese, Minangkabau, Indonesian, English
  • Finetuned from model: Yellow-AI-NLP/komodo-7b-base

Model Sources

  • Paper: https://arxiv.org/abs/2410.07830
  • Dataset: https://huggingface.co/datasets/williamhtan/NusaMT

Uses

The model is designed for:

  • Bidirectional translation between English/Indonesian and low-resource Indonesian languages (currently Balinese and Minangkabau)
  • Language preservation and documentation
  • Cross-cultural communication
  • Educational purposes and language learning

Direct Use

The model can be:

  • Integrated into translation applications
  • Used for data augmentation in low-resource language tasks
  • Adapted for other Indonesian regional languages
  • Used as a foundation for developing language learning tools

Out-of-Scope Use

The model is not suitable for:

  • Translation of languages outside its trained scope
  • General text generation or chat functionality
  • Real-time translation requiring minimal latency
  • Critical applications where translation errors could cause harm

Bias, Risks, and Limitations

  • Limited to specific language pairs (English/Indonesian ↔ Balinese/Minangkabau)
  • Performance varies by translation direction, with the largest gains when translating into the low-resource languages
  • Underperforms larger NMT models (e.g., NLLB-3.3B) when translating into higher-resource languages
  • May not capture all dialectal variations or cultural nuances
  • Requires significantly more parameters (7 billion) than traditional NMT models, increasing compute and memory cost
  • Limited by the quality and quantity of available training data

Recommendations

Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model.

How to Get Started with the Model

Use the code below to get started with the model.
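
A minimal inference sketch, assuming the checkpoint loads through the standard transformers causal-LM API. The prompt template, generation settings, and exact instruction format are illustrative assumptions and may differ from what was used during fine-tuning.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "williamhtan/NusaMT-7B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

# Hypothetical translation prompt (English -> Balinese); the real template may differ.
prompt = "Translate this from English to Balinese:\nEnglish: How are you today?\nBalinese:"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=128, do_sample=False)

# Decode only the newly generated tokens.
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```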

Training Details

Training Data

NusaMT: https://huggingface.co/datasets/williamhtan/NusaMT

Total parallel sentences after cleaning:

  • Balinese ↔ English: 35.6k sentences
  • Balinese ↔ Indonesian: 44.9k sentences
  • Minangkabau ↔ English: 16.6k sentences
  • Minangkabau ↔ Indonesian: 22.4k sentences

Data sources:

  • NLLB Mined corpus (ODC-BY license)
  • NLLB SEED dataset (CC-BY-SA license)
  • BASAbaliWiki (CC-BY-SA license)
  • Bible verses from Alkitab.mobi (for non-profit scholarly use)
  • NusaX dataset (CC-BY-SA license)
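
The parallel data linked above can presumably be loaded with the datasets library; this is a minimal sketch, and the configuration name passed to load_dataset is a hypothetical placeholder since the card does not document the dataset's configs or splits.

```python
from datasets import load_dataset

# "ban-en" is an assumed configuration name; check the dataset page for the actual layout.
ds = load_dataset("williamhtan/NusaMT", "ban-en")
print(ds)
print(ds["train"][0])
```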

Preprocessing

  • Length filtering (15–500 characters)
  • Source/target word-count ratio capped at 2
  • Removal of sentences containing words longer than 20 characters
  • Deduplication
  • Language identification with GlotLID V3 (threshold: 0.9)
  • LASER3 similarity scoring (threshold: 1.09)
  • Data cleaning with GPT-4o mini
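
The rule-based filters above can be expressed roughly as follows. This is an illustrative sketch, not the project's actual cleaning script; the GlotLID V3, LASER3, and GPT-4o mini steps require their own models or APIs and are omitted here.

```python
def keep_pair(src: str, tgt: str) -> bool:
    """Apply the rule-based filters to one parallel sentence pair."""
    for sent in (src, tgt):
        if not (15 <= len(sent) <= 500):                 # length filtering (15-500 characters)
            return False
        if any(len(word) > 20 for word in sent.split()): # drop sentences with words >20 characters
            return False
    n_src, n_tgt = len(src.split()), len(tgt.split())
    # Interpreting the word-count-ratio rule as: counts may differ by at most 2x.
    if max(n_src, n_tgt) > 2 * max(1, min(n_src, n_tgt)):
        return False
    return True


def clean_corpus(pairs):
    """Deduplicate and filter (src, tgt) pairs."""
    seen, kept = set(), []
    for src, tgt in pairs:
        key = (src.strip(), tgt.strip())
        if key in seen:                                  # deduplication
            continue
        seen.add(key)
        if keep_pair(src, tgt):
            kept.append((src, tgt))
    return kept
```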

Training Hyperparameters

  • Training regime: bfloat16 mixed precision
  • LoRA rank: 16
  • Learning rate: 0.002
  • Batch size: 10 per device
  • Epochs: 3
  • Data splits: 90% training, 5% validation, 5% testing
  • Loss: Causal Language Modeling (CLM)
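
A configuration sketch matching the hyperparameters listed above (LoRA rank 16, learning rate 0.002, batch size 10 per device, 3 epochs, bfloat16, causal-LM objective), using peft and transformers. The LoRA alpha, dropout, target modules, and output directory are assumptions not stated in this card, and the Trainer/data pipeline is omitted.

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM, TrainingArguments

base = AutoModelForCausalLM.from_pretrained("Yellow-AI-NLP/komodo-7b-base")

lora_config = LoraConfig(
    r=16,                                  # LoRA rank from the card
    lora_alpha=32,                         # assumed
    lora_dropout=0.05,                     # assumed
    target_modules=["q_proj", "v_proj"],   # assumed; not stated in the card
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora_config)

training_args = TrainingArguments(
    output_dir="nusamt-7b-sft",            # placeholder
    per_device_train_batch_size=10,
    learning_rate=2e-3,
    num_train_epochs=3,
    bf16=True,
)
```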

Evaluation

Testing Data, Factors & Metrics

Testing Data

  • FLORES-200 multilingual translation benchmark
  • Internal test set (5% of parallel data)

Metrics

  • spBLEU (SentencePiece tokenized BLEU)
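
spBLEU can be computed with sacrebleu, assuming a release that ships the Flores-200 SentencePiece tokenizer (tokenize="flores200"); the hypothesis and reference strings below are placeholders.

```python
import sacrebleu

hypotheses = ["Titiang seneng malajah basa Bali."]    # system outputs (placeholder)
references = [["Titiang seneng malajah basa Bali."]]  # one reference stream (placeholder)

score = sacrebleu.corpus_bleu(hypotheses, references, tokenize="flores200")
print(f"spBLEU: {score.score:.2f}")
```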

Results

Performance highlights:

  • Outperforms state-of-the-art (SoTA) NMT models by up to +6.69 spBLEU in translations into Balinese
  • Underperforms SoTA models by up to -3.38 spBLEU in translations into higher-resource languages
  • Consistently outperforms GPT-3.5, GPT-4, and GPT-4o in zero-shot translation

Table 2: spBLEU Score Comparison of the LLaMA2-7B SFT Model with Various Enhancements

| Models                        | ban → en | en → ban | ban → id | id → ban |
|-------------------------------|----------|----------|----------|----------|
| LLaMA2-7B SFT                 | 27.63    | 13.94    | 27.90    | 13.68    |
| + Monolingual Pre-training    | 31.28    | 18.92    | 28.75    | 20.11    |
| + Mono + Backtranslation      | 33.97    | 20.27    | 29.62    | 20.67    |
| + Mono + LLM Cleaner          | 33.23    | 19.75    | 29.02    | 21.16    |
| + Mono + Cleaner + Backtrans. | 35.42    | 22.15    | 31.56    | 22.95    |

This table presents spBLEU scores for various configurations of the LLaMA2-7B model, showing the impact of monolingual pre-training, backtranslation, and LLM cleaning on translation performance across different language pairs.
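
The back-translation step referenced above can be sketched as follows: monolingual sentences in the low-resource language are translated into the high-resource side with the current model, and the resulting synthetic pairs are added to the fine-tuning data. The translate() helper and the dictionary keys are hypothetical placeholders.

```python
def backtranslate(monolingual_ban, translate):
    """Build synthetic (en, ban) pairs from monolingual Balinese sentences."""
    synthetic_pairs = []
    for ban_sentence in monolingual_ban:
        # Synthetic source comes from the model; the human-written sentence stays as target.
        en_sentence = translate(ban_sentence, src="ban", tgt="en")
        synthetic_pairs.append({"en": en_sentence, "ban": ban_sentence})
    return synthetic_pairs
```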

Table 3: spBLEU Scores of NusaMT-7B Compared Against SoTA Models and Large GPT Models

| Models                   | ban → en | en → ban | ban → id | id → ban | min → en | en → min | min → id | id → min |
|--------------------------|----------|----------|----------|----------|----------|----------|----------|----------|
| GPT-3.5-turbo, zero-shot | 27.17    | 11.63    | 28.17    | 13.14    | 28.75    | 11.07    | 31.06    | 11.05    |
| GPT-4o, zero-shot        | 27.11    | 11.45    | 27.89    | 13.08    | 28.63    | 11.00    | 31.27    | 11.00    |
| GPT-4, zero-shot         | 27.20    | 11.59    | 28.41    | 13.24    | 28.51    | 10.99    | 31.00    | 10.93    |
| NLLB-600M                | 33.96    | 16.86    | 30.12    | 15.15    | 35.05    | 19.72    | 31.92    | 17.72    |
| NLLB-1.3B                | 37.24    | 17.73    | 32.42    | 16.21    | 38.59    | 22.79    | 34.68    | 20.89    |
| NLLB-3.3B                | 38.57    | 17.09    | 33.35    | 14.85    | 40.61    | 24.71    | 35.20    | 22.44    |
| NusaMT-7B (Ours)         | 35.42    | 22.15    | 31.56    | 22.95    | 37.23    | 24.32    | 34.29    | 23.27    |

This table compares the performance of NusaMT-7B with state-of-the-art models and large GPT models in terms of spBLEU scores across multiple language pairs. NusaMT-7B shows significant improvements, particularly in translations into low-resource languages.

Environmental Impact

  • Hardware Type: 2x NVIDIA RTX 4090
  • Hours used: 1250
  • Cloud Provider: Runpod.io
  • Carbon Emitted: 210 kg CO2e

Citation

If you find this model useful, please cite the following work:

@misc{tan2024nusamt7bmachinetranslationlowresource,
      title={NusaMT-7B: Machine Translation for Low-Resource Indonesian Languages with Large Language Models}, 
      author={William Tan and Kevin Zhu},
      year={2024},
      eprint={2410.07830},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2410.07830}, 
}