---
library_name: transformers
tags:
- low resource
- translation
language:
- ban
- min
- en
- id
base_model:
- Yellow-AI-NLP/komodo-7b-base
---
# Model Card for NusaMT-7B
NusaMT-7B is a large language model fine-tuned for machine translation of low-resource Indonesian languages, with a focus on Balinese and Minangkabau. Built on LLaMA2-7B and leveraging the Komodo-7B-base model, it incorporates continued pre-training on non-English monolingual data, supervised fine-tuning, data preprocessing for cleaning parallel sentences, and synthetic data generation.
## Model Details
### Model Description
- **Developed by:** William Tan
- **Model type:** Decoder-only Large Language Model
- **Language(s) (NLP):** Balinese, Minangkabau, Indonesian, English
<!-- - **License:** [More Information Needed] -->
- **Finetuned from model:** Yellow-AI-NLP/komodo-7b-base
### Model Sources
<!-- Provide the basic links for the model. -->
- **Repository:** https://github.com/williammtan/nusamt
- **Paper:** https://arxiv.org/abs/2410.07830
- **Demo:** https://indonesiaku.com/translate
## Uses
The model is designed for:
- Bidirectional translation between English/Indonesian and low-resource Indonesian languages (currently Balinese and Minangkabau)
- Language preservation and documentation
- Cross-cultural communication
- Educational purposes and language learning
### Direct Use
<!-- This section is for the model use without fine-tuning or plugging into a larger ecosystem/app. -->
The model can be:
- Integrated into translation applications
- Used for data augmentation in low-resource language tasks
- Adapted to other Indonesian regional languages
- Used as a foundation for developing language learning tools
### Out-of-Scope Use
<!-- This section addresses misuse, malicious use, and uses that the model will not work well for. -->
The model is not suitable for:
- Translation of languages outside its trained scope
- General text generation or chat functionality
- Real-time translation requiring minimal latency
- Critical applications where translation errors could cause harm
## Bias, Risks, and Limitations
<!-- This section is meant to convey both technical and sociotechnical limitations. -->
- Limited to specific language pairs (English/Indonesian ↔ Balinese/Minangkabau)
- Performance varies between translation directions, with better results for translations into low-resource languages
- Underperforms larger models (NLLB-3.3B) in translations into high-resource languages
- May not capture all dialectal variations or cultural nuances
- Uses significantly more parameters (7 billion) than traditional NMT models
- Limited by the quality and quantity of available training data
### Recommendations
<!-- This section is meant to convey recommendations with respect to the bias, risk, and technical limitations. -->
Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model.
## How to Get Started with the Model
Use the code below to get started with the model.
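Below is a minimal loading and inference sketch using 🤗 Transformers. The Hub repository id and the translation prompt template are assumptions for illustration; the training repository linked above documents the exact prompt format used during fine-tuning.

```python
# Minimal inference sketch. The repo id and prompt template below are
# assumptions, not confirmed by this card; adjust them to the released setup.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "williamhtan/NusaMT-7B"  # hypothetical Hub path

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # the model was trained in bfloat16
    device_map="auto",
)

# Assumed instruction-style translation prompt (Indonesian -> Balinese).
prompt = (
    "Translate this from Indonesian to Balinese:\n"
    "Indonesian: Selamat pagi, apa kabar?\n"
    "Balinese:"
)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=128, do_sample=False)

# Decode only the newly generated tokens.
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```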
## Training Details
### Training Data
NusaMT: https://huggingface.co/datasets/williamhtan/NusaMT
Total parallel sentences after cleaning:
- Balinese ↔ English: 35.6k sentences
- Balinese ↔ Indonesian: 44.9k sentences
- Minangkabau ↔ English: 16.6k sentences
- Minangkabau ↔ Indonesian: 22.4k sentences
Data sources:
- NLLB Mined corpus (ODC-BY license)
- NLLB SEED dataset (CC-BY-SA license)
- BASAbaliWiki (CC-BY-SA license)
- Bible verses from Alkitab.mobi (for non-profit scholarly use)
- NusaX dataset (CC-BY-SA license)
#### Preprocessing
- Length filtering (15-500 characters)
- Maximum source-to-target word-length ratio of 2
- Removal of sentences containing words longer than 20 characters
- Deduplication
- Language identification with GlotLid V3 (threshold: 0.9)
- LASER3 similarity scoring (threshold: 1.09)
- GPT-4o mini-based data cleaning
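As a rough illustration, the rule-based portion of this pipeline can be expressed as a single per-pair filter plus deduplication. This is a sketch only; the GlotLid V3 and LASER3 steps are shown as placeholder callables, not real library calls.

```python
# Illustrative sketch of the rule-based cleaning filters described above.
# lang_id_confidence and laser_similarity are placeholder callables that you
# would back with GlotLid V3 and LASER3 respectively; they are not library APIs.

def keep_pair(src: str, tgt: str,
              lang_id_confidence=None, laser_similarity=None) -> bool:
    # Length filtering: both sides must be 15-500 characters.
    if not (15 <= len(src) <= 500 and 15 <= len(tgt) <= 500):
        return False
    # Word-length ratio: neither side may have more than 2x the words of the other.
    n_src, n_tgt = len(src.split()), len(tgt.split())
    if max(n_src, n_tgt) > 2 * max(min(n_src, n_tgt), 1):
        return False
    # Drop pairs containing implausibly long tokens (>20 characters).
    if any(len(w) > 20 for w in src.split() + tgt.split()):
        return False
    # Language identification (GlotLid V3) at a 0.9 confidence threshold.
    if lang_id_confidence is not None and lang_id_confidence(src) < 0.9:
        return False
    # LASER3 similarity scoring at a 1.09 threshold.
    if laser_similarity is not None and laser_similarity(src, tgt) < 1.09:
        return False
    return True

def deduplicate(pairs):
    # Exact deduplication on (source, target) pairs, preserving order.
    seen, kept = set(), []
    for pair in pairs:
        if pair not in seen:
            seen.add(pair)
            kept.append(pair)
    return kept
```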
#### Training Hyperparameters
- Training regime: bfloat16 mixed precision
- LoRA rank: 16
- Learning rate: 0.002
- Batch size: 10 per device
- Epochs: 3
- Data splits: 90% training, 5% validation, 5% testing
- Loss: Causal Language Modeling (CLM)
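A sketch of an equivalent LoRA fine-tuning setup with `peft` and `transformers` is shown below; the target modules, LoRA alpha, and output directory are assumptions not stated in this card.

```python
# Illustrative LoRA fine-tuning configuration matching the hyperparameters above.
# Any argument not listed in this card (lora_alpha, target_modules, output_dir)
# is an assumption.
import torch
from transformers import AutoModelForCausalLM, TrainingArguments
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained(
    "Yellow-AI-NLP/komodo-7b-base", torch_dtype=torch.bfloat16
)

lora_config = LoraConfig(
    r=16,                                  # LoRA rank as reported above
    lora_alpha=32,                         # assumption
    target_modules=["q_proj", "v_proj"],   # assumption; typical LLaMA-style targets
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora_config)

training_args = TrainingArguments(
    output_dir="nusamt-7b-sft",            # assumption
    learning_rate=2e-3,                    # 0.002 as reported above
    per_device_train_batch_size=10,
    num_train_epochs=3,
    bf16=True,                             # bfloat16 mixed precision
)
```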
## Evaluation
<!-- This section describes the evaluation protocols and provides the results. -->
### Testing Data, Factors & Metrics
#### Testing Data
- FLORES-200 multilingual translation benchmark
- Internal test set (5% of parallel data)
#### Metrics
- spBLEU (SentencePiece tokenized BLEU)
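spBLEU can be computed with sacreBLEU using its SentencePiece-based FLORES-200 tokenizer. The snippet below is a sketch and assumes a sacrebleu version that ships the `flores200` tokenizer.

```python
# Sketch of spBLEU scoring with sacreBLEU's SentencePiece (FLORES-200) tokenizer.
# Requires a sacrebleu version that includes the "flores200" tokenizer option.
import sacrebleu

hypotheses = ["Rahajeng semeng."]             # example system outputs
references = [["Rahajeng semeng, sameton."]]  # one reference stream, aligned to hypotheses

score = sacrebleu.corpus_bleu(hypotheses, references, tokenize="flores200")
print(f"spBLEU: {score.score:.2f}")
```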
### Results
Performance highlights:
- Outperforms SOTA models by up to +6.69 spBLEU in translations into Balinese
- Underperforms by up to -3.38 spBLEU in translations into higher-resource languages
- Consistently outperforms GPT-3.5, GPT-4, and GPT-4o in zero-shot translation
### Table 2: spBLEU Score Comparison of the LLaMA2-7B SFT Model with Various Enhancements
| Models | ban → en | en → ban | ban → id | id → ban |
|-------------------------------|----------|----------|----------|----------|
| LLaMA2-7B SFT | 27.63 | 13.94 | 27.90 | 13.68 |
| + Monolingual Pre-training | 31.28 | 18.92 | 28.75 | 20.11 |
| + Mono + Backtranslation | 33.97 | 20.27 | 29.62 | 20.67 |
| + Mono + LLM Cleaner | 33.23 | 19.75 | 29.02 | 21.16 |
| + Mono + Cleaner + Backtrans. | **35.42**| **22.15**| **31.56**| **22.95**|
This table presents spBLEU scores for various configurations of the LLaMA2-7B model, showing the impact of monolingual pre-training, backtranslation, and LLM cleaning on translation performance across different language pairs.
### Table 3: spBLEU Scores of NusaMT-7B Compared Against SoTA Models and Large GPT Models
| Models | ban → en | en → ban | ban → id | id → ban | min → en | en → min | min → id | id → min |
|-------------------------------|----------|----------|----------|----------|----------|----------|----------|----------|
| GPT-3.5-turbo, zero-shot | 27.17 | 11.63 | 28.17 | 13.14 | 28.75 | 11.07 | 31.06 | 11.05 |
| GPT-4o, zero-shot | 27.11 | 11.45 | 27.89 | 13.08 | 28.63 | 11.00 | 31.27 | 11.00 |
| GPT-4, zero-shot | 27.20 | 11.59 | 28.41 | 13.24 | 28.51 | 10.99 | 31.00 | 10.93 |
| NLLB-600M | 33.96 | 16.86 | 30.12 | 15.15 | 35.05 | 19.72 | 31.92 | 17.72 |
| NLLB-1.3B | 37.24 | 17.73 | 32.42 | 16.21 | 38.59 | 22.79 | 34.68 | 20.89 |
| NLLB-3.3B | **38.57**| 17.09 | **33.35**| 14.85 | **40.61**| **24.71**| **35.20**| 22.44 |
| NusaMT-7B (Ours) | 35.42 | **22.15**| 31.56 | **22.95**| 37.23 | 24.32 | 34.29 | **23.27**|
This table compares the performance of NusaMT-7B with state-of-the-art models and large GPT models in terms of spBLEU scores across multiple language pairs. NusaMT-7B shows significant improvements, particularly in translations into low-resource languages.
## Environmental Impact
- **Hardware Type:** 2x NVIDIA RTX 4090
- **Hours used:** 1250
- **Cloud Provider:** Runpod.io
- **Carbon Emitted:** 210 kg CO2e
## Citation
If you find this model useful, please cite the following work:
```bibtex
@misc{tan2024nusamt7bmachinetranslationlowresource,
title={NusaMT-7B: Machine Translation for Low-Resource Indonesian Languages with Large Language Models},
author={William Tan and Kevin Zhu},
year={2024},
eprint={2410.07830},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2410.07830},
}
```