Model Card for NusaMT-7B

NusaMT-7B is a large language model fine-tuned for machine translation of low-resource Indonesian languages, with a focus on Balinese and Minangkabau. Built on LLaMA2-7B through the Komodo-7B-base checkpoint, it combines continued pre-training on non-English monolingual data, supervised fine-tuning, data preprocessing to clean parallel sentences, and synthetic data generation via back-translation.

Model Details

Model Description

  • Developed by: William Tan
  • Model type: Decoder-only Large Language Model
  • Language(s) (NLP): Balinese, Minangkabau, Indonesian, English
  • Finetuned from model: Yellow-AI-NLP/komodo-7b-base

Model Sources

  • Paper: https://arxiv.org/abs/2410.07830
  • Dataset: https://huggingface.co/datasets/williamhtan/NusaMT

Uses

The model is designed for:

  • Bidirectional translation between English/Indonesian and low-resource Indonesian languages (currently Balinese and Minangkabau)
  • Language preservation and documentation
  • Cross-cultural communication
  • Educational purposes and language learning

Direct Use

The model can be:

  • Integrated into translation applications
  • Used for data augmentation in low-resource language tasks
  • Adapted for other Indonesian regional languages
  • Used as a foundation for developing language learning tools

Out-of-Scope Use

The model is not suitable for:

  • Translation of languages outside its trained scope
  • General text generation or chat functionality
  • Real-time translation requiring minimal latency
  • Critical applications where translation errors could cause harm

Bias, Risks, and Limitations

  • Limited to specific language pairs (English/Indonesian ↔ Balinese/Minangkabau)
  • Performance varies by translation direction, with the largest gains when translating into the low-resource languages
  • Underperforms larger NMT models (e.g., NLLB-3.3B) when translating into higher-resource languages
  • May not capture all dialectal variations or cultural nuances
  • Requires significantly more parameters (7 billion) than traditional NMT models, increasing compute and memory cost
  • Limited by the quality and quantity of available training data

Recommendations

Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model.

How to Get Started with the Model

Use the code below to get started with the model.
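
A minimal inference sketch, assuming the checkpoint loads through the standard transformers causal-LM API. The prompt template, generation settings, and exact instruction format are illustrative assumptions and may differ from what was used during fine-tuning.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "williamhtan/NusaMT-7B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

# Hypothetical translation prompt (English -> Balinese); the real template may differ.
prompt = "Translate this from English to Balinese:\nEnglish: How are you today?\nBalinese:"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=128, do_sample=False)

# Decode only the newly generated tokens.
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```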

Training Details

Training Data

NusaMT: https://huggingface.co/datasets/williamhtan/NusaMT

Total parallel sentences after cleaning:

  • Balinese ↔ English: 35.6k sentences
  • Balinese ↔ Indonesian: 44.9k sentences
  • Minangkabau ↔ English: 16.6k sentences
  • Minangkabau ↔ Indonesian: 22.4k sentences

Data sources:

  • NLLB Mined corpus (ODC-BY license)
  • NLLB SEED dataset (CC-BY-SA license)
  • BASAbaliWiki (CC-BY-SA license)
  • Bible verses from Alkitab.mobi (for non-profit scholarly use)
  • NusaX dataset (CC-BY-SA license)
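
The parallel data linked above can presumably be loaded with the datasets library; this is a minimal sketch, and the configuration name passed to load_dataset is a hypothetical placeholder since the card does not document the dataset's configs or splits.

```python
from datasets import load_dataset

# "ban-en" is an assumed configuration name; check the dataset page for the actual layout.
ds = load_dataset("williamhtan/NusaMT", "ban-en")
print(ds)
print(ds["train"][0])
```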

Preprocessing

  • Length filtering (15–500 characters)
  • Source/target word-count ratio capped at 2
  • Removal of sentences containing words longer than 20 characters
  • Deduplication
  • Language identification with GlotLID V3 (threshold: 0.9)
  • LASER3 similarity scoring (threshold: 1.09)
  • Data cleaning with GPT-4o mini
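
The rule-based filters above can be expressed roughly as follows. This is an illustrative sketch, not the project's actual cleaning script; the GlotLID V3, LASER3, and GPT-4o mini steps require their own models or APIs and are omitted here.

```python
def keep_pair(src: str, tgt: str) -> bool:
    """Apply the rule-based filters to one parallel sentence pair."""
    for sent in (src, tgt):
        if not (15 <= len(sent) <= 500):                 # length filtering (15-500 characters)
            return False
        if any(len(word) > 20 for word in sent.split()): # drop sentences with words >20 characters
            return False
    n_src, n_tgt = len(src.split()), len(tgt.split())
    # Interpreting the word-count-ratio rule as: counts may differ by at most 2x.
    if max(n_src, n_tgt) > 2 * max(1, min(n_src, n_tgt)):
        return False
    return True


def clean_corpus(pairs):
    """Deduplicate and filter (src, tgt) pairs."""
    seen, kept = set(), []
    for src, tgt in pairs:
        key = (src.strip(), tgt.strip())
        if key in seen:                                  # deduplication
            continue
        seen.add(key)
        if keep_pair(src, tgt):
            kept.append((src, tgt))
    return kept
```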

Training Hyperparameters

  • Training regime: bfloat16 mixed precision
  • LoRA rank: 16
  • Learning rate: 0.002
  • Batch size: 10 per device
  • Epochs: 3
  • Data splits: 90% training, 5% validation, 5% testing
  • Loss: Causal Language Modeling (CLM)
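
A configuration sketch matching the hyperparameters listed above (LoRA rank 16, learning rate 0.002, batch size 10 per device, 3 epochs, bfloat16, causal-LM objective), using peft and transformers. The LoRA alpha, dropout, target modules, and output directory are assumptions not stated in this card, and the Trainer/data pipeline is omitted.

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM, TrainingArguments

base = AutoModelForCausalLM.from_pretrained("Yellow-AI-NLP/komodo-7b-base")

lora_config = LoraConfig(
    r=16,                                  # LoRA rank from the card
    lora_alpha=32,                         # assumed
    lora_dropout=0.05,                     # assumed
    target_modules=["q_proj", "v_proj"],   # assumed; not stated in the card
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora_config)

training_args = TrainingArguments(
    output_dir="nusamt-7b-sft",            # placeholder
    per_device_train_batch_size=10,
    learning_rate=2e-3,
    num_train_epochs=3,
    bf16=True,
)
```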

Evaluation

Testing Data, Factors & Metrics

Testing Data

  • FLORES-200 multilingual translation benchmark
  • Internal test set (5% of parallel data)

Metrics

  • spBLEU (SentencePiece tokenized BLEU)
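
spBLEU can be computed with sacrebleu, assuming a release that ships the Flores-200 SentencePiece tokenizer (tokenize="flores200"); the hypothesis and reference strings below are placeholders.

```python
import sacrebleu

hypotheses = ["Titiang seneng malajah basa Bali."]    # system outputs (placeholder)
references = [["Titiang seneng malajah basa Bali."]]  # one reference stream (placeholder)

score = sacrebleu.corpus_bleu(hypotheses, references, tokenize="flores200")
print(f"spBLEU: {score.score:.2f}")
```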

Results

Performance highlights:

  • Outperforms state-of-the-art (SoTA) NMT models by up to +6.69 spBLEU in translations into Balinese
  • Underperforms SoTA models by up to -3.38 spBLEU in translations into higher-resource languages
  • Consistently outperforms GPT-3.5, GPT-4, and GPT-4o in zero-shot translation

Table 2: spBLEU Score Comparison of the LLaMA2-7B SFT Model with Various Enhancements

| Models                        | ban → en | en → ban | ban → id | id → ban |
|-------------------------------|----------|----------|----------|----------|
| LLaMA2-7B SFT                 | 27.63    | 13.94    | 27.90    | 13.68    |
| + Monolingual Pre-training    | 31.28    | 18.92    | 28.75    | 20.11    |
| + Mono + Backtranslation      | 33.97    | 20.27    | 29.62    | 20.67    |
| + Mono + LLM Cleaner          | 33.23    | 19.75    | 29.02    | 21.16    |
| + Mono + Cleaner + Backtrans. | 35.42    | 22.15    | 31.56    | 22.95    |

This table presents spBLEU scores for various configurations of the LLaMA2-7B model, showing the impact of monolingual pre-training, backtranslation, and LLM cleaning on translation performance across different language pairs.
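
The back-translation step referenced above can be sketched as follows: monolingual sentences in the low-resource language are translated into the high-resource side with the current model, and the resulting synthetic pairs are added to the fine-tuning data. The translate() helper and the dictionary keys are hypothetical placeholders.

```python
def backtranslate(monolingual_ban, translate):
    """Build synthetic (en, ban) pairs from monolingual Balinese sentences."""
    synthetic_pairs = []
    for ban_sentence in monolingual_ban:
        # Synthetic source comes from the model; the human-written sentence stays as target.
        en_sentence = translate(ban_sentence, src="ban", tgt="en")
        synthetic_pairs.append({"en": en_sentence, "ban": ban_sentence})
    return synthetic_pairs
```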

Table 3: spBLEU Scores of NusaMT-7B Compared Against SoTA Models and Large GPT Models

| Models                   | ban → en | en → ban | ban → id | id → ban | min → en | en → min | min → id | id → min |
|--------------------------|----------|----------|----------|----------|----------|----------|----------|----------|
| GPT-3.5-turbo, zero-shot | 27.17    | 11.63    | 28.17    | 13.14    | 28.75    | 11.07    | 31.06    | 11.05    |
| GPT-4o, zero-shot        | 27.11    | 11.45    | 27.89    | 13.08    | 28.63    | 11.00    | 31.27    | 11.00    |
| GPT-4, zero-shot         | 27.20    | 11.59    | 28.41    | 13.24    | 28.51    | 10.99    | 31.00    | 10.93    |
| NLLB-600M                | 33.96    | 16.86    | 30.12    | 15.15    | 35.05    | 19.72    | 31.92    | 17.72    |
| NLLB-1.3B                | 37.24    | 17.73    | 32.42    | 16.21    | 38.59    | 22.79    | 34.68    | 20.89    |
| NLLB-3.3B                | 38.57    | 17.09    | 33.35    | 14.85    | 40.61    | 24.71    | 35.20    | 22.44    |
| NusaMT-7B (Ours)         | 35.42    | 22.15    | 31.56    | 22.95    | 37.23    | 24.32    | 34.29    | 23.27    |

This table compares the performance of NusaMT-7B with state-of-the-art models and large GPT models in terms of spBLEU scores across multiple language pairs. NusaMT-7B shows significant improvements, particularly in translations into low-resource languages.

Environmental Impact

  • Hardware Type: 2x NVIDIA RTX 4090
  • Hours used: 1250
  • Cloud Provider: Runpod.io
  • Carbon Emitted: 210 kg CO2e

Citation

If you find this model useful, please cite the following work:

@misc{tan2024nusamt7bmachinetranslationlowresource,
      title={NusaMT-7B: Machine Translation for Low-Resource Indonesian Languages with Large Language Models}, 
      author={William Tan and Kevin Zhu},
      year={2024},
      eprint={2410.07830},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2410.07830}, 
}