---
library_name: transformers
tags:
- low resource
- trans
language:
- ban
- min
- en
- id
base_model:
- Yellow-AI-NLP/komodo-7b-base
---

# Model Card for NusaMT-7B

NusaMT-7B is a large language model fine-tuned for machine translation of low-resource Indonesian languages, with a focus on Balinese and Minangkabau. Built on LLaMA2-7B via the Komodo-7B-base checkpoint, it combines continued pre-training on non-English monolingual data, supervised fine-tuning, data preprocessing to clean parallel sentences, and synthetic data generation.

## Model Details

### Model Description

- **Developed by:** William Tan
- **Model type:** Decoder-only Large Language Model
- **Language(s) (NLP):** Balinese, Minangkabau, Indonesian, English
- **Finetuned from model:** Yellow-AI-NLP/komodo-7b-base

### Model Sources

- **Repository:** https://github.com/williammtan/nusamt
- **Paper:** https://arxiv.org/abs/2410.07830
- **Demo:** https://indonesiaku.com/translate

## Uses

The model is designed for:

- Bidirectional translation between English/Indonesian and low-resource Indonesian languages (currently Balinese and Minangkabau)
- Language preservation and documentation
- Cross-cultural communication
- Educational purposes and language learning

### Direct Use

The model can be:

- Integrated into translation applications
- Used for data augmentation in low-resource language tasks
- Adapted to other Indonesian regional languages
- Used as a foundation for developing language learning tools

### Out-of-Scope Use

The model is not suitable for:

- Translation of languages outside its trained scope
- General text generation or chat functionality
- Real-time translation requiring minimal latency
- Critical applications where translation errors could cause harm

## Bias, Risks, and Limitations

- Limited to specific language pairs (English/Indonesian ↔ Balinese/Minangkabau)
- Performance varies by translation direction, with better results when translating into the low-resource languages
- Underperforms larger models (e.g., NLLB-3.3B) when translating into high-resource languages
- May not capture all dialectal variations or cultural nuances
- Uses significantly more parameters (7 billion) than traditional NMT models
- Limited by the quality and quantity of available training data

### Recommendations

Users (both direct and downstream) should be made aware of the risks, biases, and limitations of the model.

## How to Get Started with the Model

Use the code below to get started with the model.
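The snippet below is a minimal usage sketch with the `transformers` library. The repository id and the translation prompt are illustrative assumptions rather than the exact format used during fine-tuning; see the paper and the GitHub repository for the prompt template actually used.

```python
# Minimal usage sketch (assumptions: repository id and prompt format are placeholders).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "williamhtan/NusaMT-7B"  # placeholder repository id, adjust to the published checkpoint

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,   # the model was trained in bfloat16 mixed precision
    device_map="auto",
)

# An ALMA-style translation prompt; the exact wording is an assumption.
prompt = (
    "Translate this from English to Balinese:\n"
    "English: Good morning, how are you?\n"
    "Balinese:"
)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=128, num_beams=5)

# Decode only the newly generated tokens (the translation).
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```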
## Training Details

### Training Data

NusaMT dataset: https://huggingface.co/datasets/williamhtan/NusaMT

Total parallel sentences after cleaning:

- Balinese ↔ English: 35.6k sentences
- Balinese ↔ Indonesian: 44.9k sentences
- Minangkabau ↔ English: 16.6k sentences
- Minangkabau ↔ Indonesian: 22.4k sentences

Data sources:

- NLLB Mined corpus (ODC-BY license)
- NLLB SEED dataset (CC-BY-SA license)
- BASAbaliWiki (CC-BY-SA license)
- Bible verses from Alkitab.mobi (for non-profit scholarly use)
- NusaX dataset (CC-BY-SA license)

#### Preprocessing

- Length filtering (15-500 characters)
- Maximum source/target word-length ratio of 2
- Removal of sentences containing words longer than 20 characters
- Deduplication
- Language identification with GlotLID V3 (threshold: 0.9)
- LASER3 similarity scoring (threshold: 1.09)
- GPT-4o mini-based data cleaning

#### Training Hyperparameters

- Training regime: bfloat16 mixed precision
- LoRA rank: 16
- Learning rate: 0.002
- Batch size: 10 per device
- Epochs: 3
- Data splits: 90% training, 5% validation, 5% testing
- Loss: causal language modeling (CLM)

## Evaluation

### Testing Data, Factors & Metrics

#### Testing Data

- FLORES-200 multilingual translation benchmark
- Internal test set (5% of the parallel data)

#### Metrics

- spBLEU (SentencePiece-tokenized BLEU)

### Results

Performance highlights:

- Outperforms SOTA models by up to +6.69 spBLEU in translations into Balinese
- Underperforms by up to -3.38 spBLEU in translations into higher-resource languages
- Consistently outperforms zero-shot GPT-3.5, GPT-4, and GPT-4o

### Table 2: spBLEU Score Comparison of the LLaMA2-7B SFT Model with Various Enhancements

| Models                        | ban → en | en → ban | ban → id | id → ban |
|-------------------------------|----------|----------|----------|----------|
| LLaMA2-7B SFT                 | 27.63    | 13.94    | 27.90    | 13.68    |
| + Monolingual Pre-training    | 31.28    | 18.92    | 28.75    | 20.11    |
| + Mono + Backtranslation      | 33.97    | 20.27    | 29.62    | 20.67    |
| + Mono + LLM Cleaner          | 33.23    | 19.75    | 29.02    | 21.16    |
| + Mono + Cleaner + Backtrans. | **35.42**| **22.15**| **31.56**| **22.95**|

This table reports spBLEU scores for successive configurations of the LLaMA2-7B SFT model, showing the impact of monolingual pre-training, backtranslation, and LLM-based cleaning across language pairs.

### Table 3: spBLEU Scores of NusaMT-7B Compared Against SoTA Models and Large GPT Models

| Models                        | ban → en | en → ban | ban → id | id → ban | min → en | en → min | min → id | id → min |
|-------------------------------|----------|----------|----------|----------|----------|----------|----------|----------|
| GPT-3.5-turbo, zero-shot      | 27.17    | 11.63    | 28.17    | 13.14    | 28.75    | 11.07    | 31.06    | 11.05    |
| GPT-4o, zero-shot             | 27.11    | 11.45    | 27.89    | 13.08    | 28.63    | 11.00    | 31.27    | 11.00    |
| GPT-4, zero-shot              | 27.20    | 11.59    | 28.41    | 13.24    | 28.51    | 10.99    | 31.00    | 10.93    |
| NLLB-600M                     | 33.96    | 16.86    | 30.12    | 15.15    | 35.05    | 19.72    | 31.92    | 17.72    |
| NLLB-1.3B                     | 37.24    | 17.73    | 32.42    | 16.21    | 38.59    | 22.79    | 34.68    | 20.89    |
| NLLB-3.3B                     | **38.57**| 17.09    | **33.35**| 14.85    | **40.61**| **24.71**| **35.20**| 22.44    |
| NusaMT-7B (Ours)              | 35.42    | **22.15**| 31.56    | **22.95**| 37.23    | 24.32    | 34.29    | **23.27**|

This table compares NusaMT-7B with state-of-the-art NMT models and large GPT models in terms of spBLEU across multiple language pairs. NusaMT-7B shows its largest gains in translations into the low-resource languages.
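The spBLEU scores above follow the FLORES evaluation convention of computing BLEU over SentencePiece-tokenized text. Below is a minimal sketch of how such a score can be computed with `sacrebleu`, assuming a release (>= 2.2) that ships the `flores200` tokenizer; the hypothesis and reference strings are placeholders, not data from the model.

```python
# Minimal sketch: spBLEU (SentencePiece-tokenized BLEU) with sacrebleu.
# Assumes sacrebleu >= 2.2, which provides the "flores200" tokenizer;
# the sentences below are illustrative placeholders only.
import sacrebleu

hypotheses = ["model translation goes here"]        # system outputs, one string per segment
references = [["reference translation goes here"]]  # one reference stream, aligned with hypotheses

bleu = sacrebleu.corpus_bleu(
    hypotheses,
    references,
    tokenize="flores200",  # SentencePiece tokenization -> spBLEU
)
print(f"spBLEU: {bleu.score:.2f}")
```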
## Environmental Impact

- **Hardware Type:** 2x NVIDIA RTX 4090
- **Hours used:** 1250
- **Cloud Provider:** Runpod.io
- **Carbon Emitted:** 210 kg CO2e

## Citation

If you find this model useful, please cite the following work:

```
@misc{tan2024nusamt7bmachinetranslationlowresource,
      title={NusaMT-7B: Machine Translation for Low-Resource Indonesian Languages with Large Language Models},
      author={William Tan and Kevin Zhu},
      year={2024},
      eprint={2410.07830},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2410.07830},
}
```