---
library_name: transformers
tags:
- low resource
- trans
language:
- ban
- min
- en
- id
base_model:
- Yellow-AI-NLP/komodo-7b-base
---

# Model Card for NusaMT-7B

NusaMT-7B is a large language model fine-tuned for machine translation of low-resource Indonesian languages, with a focus on Balinese and Minangkabau. Built on the LLaMA2-7B architecture via the Komodo-7B-base checkpoint, it combines continued pre-training on non-English monolingual data, supervised fine-tuning, cleaning of parallel sentences, and synthetic data generation through backtranslation.


## Model Details

### Model Description


- **Developed by:** William Tan
- **Model type:** Decoder-only Large Language Model
- **Language(s) (NLP):** Balinese, Minangkabau, Indonesian, English
<!-- - **License:** [More Information Needed] -->
- **Finetuned from model:** Yellow-AI-NLP/komodo-7b-base

### Model Sources

<!-- Provide the basic links for the model. -->

- **Repository:** https://github.com/williammtan/nusamt
- **Paper:** https://arxiv.org/abs/2410.07830
- **Demo:** https://indonesiaku.com/translate

## Uses

The model is designed for:
- Bidirectional translation between English/Indonesian and low-resource Indonesian languages (currently Balinese and Minangkabau)
- Language preservation and documentation
- Cross-cultural communication
- Educational purposes and language learning

### Direct Use

<!-- This section is for the model use without fine-tuning or plugging into a larger ecosystem/app. -->

The model can be:

- Integrated into translation applications
- Used for data augmentation in low-resource language tasks
- Adapted for other Indonesian regional languages
- Used as a foundation for developing language learning tools


### Out-of-Scope Use

<!-- This section addresses misuse, malicious use, and uses that the model will not work well for. -->

The model is not suitable for:

- Translation of languages outside its trained scope
- General text generation or chat functionality
- Real-time translation requiring minimal latency
- Critical applications where translation errors could cause harm

## Bias, Risks, and Limitations

<!-- This section is meant to convey both technical and sociotechnical limitations. -->

- Limited to specific language pairs (English/Indonesian ↔ Balinese/Minangkabau)
- Performance varies between translation directions, with better results for translations into low-resource languages
- Underperforms larger models (NLLB-3.3B) in translations into high-resource languages
- May not capture all dialectal variations or cultural nuances
- Uses significantly more parameters (7 billion) compared to traditional NMT models
- Limited by the quality and quantity of available training data

### Recommendations

<!-- This section is meant to convey recommendations with respect to the bias, risk, and technical limitations. -->

Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model.

## How to Get Started with the Model

Use the code below to get started with the model.
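
A minimal inference sketch with Hugging Face Transformers is shown below. The repository id `williamhtan/NusaMT-7B` and the translation prompt template are assumptions made for illustration; adjust both to the released checkpoint and the prompt format used during fine-tuning.

```python
# Minimal inference sketch. Assumptions: hub id "williamhtan/NusaMT-7B" and a
# simple translation prompt; adapt both to the actual checkpoint and the
# prompt format used during training.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "williamhtan/NusaMT-7B"  # assumed repository id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

# Hypothetical prompt for Indonesian -> Balinese translation.
prompt = (
    "Translate this from Indonesian to Balinese:\n"
    "Indonesian: Selamat pagi, apa kabar?\n"
    "Balinese:"
)

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=128, num_beams=5)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```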


## Training Details

### Training Data

NusaMT: https://huggingface.co/datasets/williamhtan/NusaMT

Total parallel sentences after cleaning:
- Balinese ↔ English: 35.6k sentences
- Balinese ↔ Indonesian: 44.9k sentences
- Minangkabau ↔ English: 16.6k sentences
- Minangkabau ↔ Indonesian: 22.4k sentences

Data sources:
- NLLB Mined corpus (ODC-BY license)
- NLLB SEED dataset (CC-BY-SA license)
- BASAbaliWiki (CC-BY-SA license)
- Bible verses from Alkitab.mobi (for non-profit scholarly use)
- NusaX dataset (CC-BY-SA license)

#### Preprocessing

- Length filtering (15-500 characters)
- Maximum word-count ratio of 2 between source and target sentences
- Removal of sentences containing words longer than 20 characters
- Deduplication
- Language identification with GlotLID v3 (threshold: 0.9)
- LASER3 similarity scoring (threshold: 1.09)
- GPT-4o mini-based data cleaning
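
A rough sketch of the rule-based filters above (length, ratio, long-word, and duplicate removal) is given below. The GlotLID, LASER3, and GPT-4o mini steps are model-based and omitted, and the actual pipeline in the repository may differ in detail.

```python
# Rough sketch of the rule-based cleaning filters described above.
# The GlotLID, LASER3, and GPT-4o mini steps are separate, model-based
# stages and are not reproduced here; details may differ from the repo.
def keep_pair(src: str, tgt: str) -> bool:
    for sent in (src, tgt):
        if not 15 <= len(sent) <= 500:               # length filter (15-500 characters)
            return False
        if any(len(w) > 20 for w in sent.split()):   # drop sentences with words >20 chars
            return False
    n_src, n_tgt = len(src.split()), len(tgt.split())
    # ratio of 2, interpreted here as a word-count ratio between the two sides
    return max(n_src, n_tgt) <= 2 * max(min(n_src, n_tgt), 1)

def clean(pairs):
    seen = set()
    for src, tgt in pairs:
        key = (src.strip(), tgt.strip())
        if key in seen:                              # deduplication
            continue
        seen.add(key)
        if keep_pair(*key):
            yield key
```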

#### Training Hyperparameters

- Training regime: bfloat16 mixed precision
- LoRA rank: 16
- Learning rate: 0.002
- Batch size: 10 per device
- Epochs: 3
- Data splits: 90% training, 5% validation, 5% testing
- Loss: Causal Language Modeling (CLM)
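
As a rough illustration, these hyperparameters map onto a PEFT/Transformers setup like the sketch below; `lora_alpha`, dropout, target modules, and optimizer settings are assumptions not stated in this card (see the repository for the actual configuration).

```python
# Sketch of the fine-tuning configuration implied by the hyperparameters above.
# lora_alpha, dropout, target modules, and optimizer choices are assumptions.
import torch
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM, TrainingArguments

model = AutoModelForCausalLM.from_pretrained(
    "Yellow-AI-NLP/komodo-7b-base", torch_dtype=torch.bfloat16
)
model = get_peft_model(
    model,
    LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05, task_type="CAUSAL_LM"),
)

args = TrainingArguments(
    output_dir="nusamt-7b-sft",
    bf16=True,                       # bfloat16 mixed precision
    learning_rate=2e-3,              # 0.002
    per_device_train_batch_size=10,  # batch size 10 per device
    num_train_epochs=3,
)
```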


<!-- #### Speeds, Sizes, Times [optional] -->

<!-- This section provides information about throughput, start/end time, checkpoint size if relevant, etc. -->

<!-- [More Information Needed] -->

## Evaluation

<!-- This section describes the evaluation protocols and provides the results. -->

### Testing Data, Factors & Metrics

#### Testing Data

- FLORES-200 multilingual translation benchmark
- Internal test set (5% of parallel data)


#### Metrics

- spBLEU (SentencePiece tokenized BLEU)
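
For reference, spBLEU can be computed with sacrebleu's FLORES-200 SentencePiece tokenizer (available in sacrebleu >= 2.2); the paper's exact evaluation script may differ.

```python
# One way to compute spBLEU: sacrebleu with the FLORES-200 SentencePiece
# tokenizer (requires sacrebleu >= 2.2). The paper's exact setup may differ.
import sacrebleu

hypotheses = ["Rahajeng semeng."]                 # system outputs, one per segment
references = [["Rahajeng semeng sareng sami."]]   # one reference stream

score = sacrebleu.corpus_bleu(hypotheses, references, tokenize="flores200")
print(f"spBLEU: {score.score:.2f}")
```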

### Results

Performance highlights:
- Outperforms SOTA models by up to +6.69 spBLEU in translations into Balinese
- Underperforms by up to -3.38 spBLEU in translations into higher-resource languages
- Consistently outperforms GPT-3.5, GPT-4, and GPT-4o in zero-shot translation

### Table 2: spBLEU Score Comparison of the LLaMA2-7B SFT Model with Various Enhancements

| Models                        | ban → en | en → ban | ban → id | id → ban |
|-------------------------------|----------|----------|----------|----------|
| LLaMA2-7B SFT                 | 27.63    | 13.94    | 27.90    | 13.68    |
| + Monolingual Pre-training    | 31.28    | 18.92    | 28.75    | 20.11    |
| + Mono + Backtranslation      | 33.97    | 20.27    | 29.62    | 20.67    |
| + Mono + LLM Cleaner          | 33.23    | 19.75    | 29.02    | 21.16    |
| + Mono + Cleaner + Backtrans. | **35.42**| **22.15**| **31.56**| **22.95**|

This table presents spBLEU scores for various configurations of the LLaMA2-7B model, showing the impact of monolingual pre-training, backtranslation, and LLM cleaning on translation performance across different language pairs.

### Table 3: spBLEU Scores of NusaMT-7B Compared Against SoTA Models and Large GPT Models

| Models                        | ban → en | en → ban | ban → id | id → ban | min → en | en → min | min → id | id → min |
|-------------------------------|----------|----------|----------|----------|----------|----------|----------|----------|
| GPT-3.5-turbo, zero-shot      | 27.17    | 11.63    | 28.17    | 13.14    | 28.75    | 11.07    | 31.06    | 11.05    |
| GPT-4o, zero-shot             | 27.11    | 11.45    | 27.89    | 13.08    | 28.63    | 11.00    | 31.27    | 11.00    |
| GPT-4, zero-shot              | 27.20    | 11.59    | 28.41    | 13.24    | 28.51    | 10.99    | 31.00    | 10.93    |
| NLLB-600M                     | 33.96    | 16.86    | 30.12    | 15.15    | 35.05    | 19.72    | 31.92    | 17.72    |
| NLLB-1.3B                     | 37.24    | 17.73    | 32.42    | 16.21    | 38.59    | 22.79    | 34.68    | 20.89    |
| NLLB-3.3B                     | **38.57**| 17.09    | **33.35**| 14.85    | **40.61**| **24.71**| **35.20**| 22.44    |
| NusaMT-7B (Ours)              | 35.42    | **22.15**| 31.56    | **22.95**| 37.23    | 24.32    | 34.29    | **23.27**|

This table compares the performance of NusaMT-7B with state-of-the-art models and large GPT models in terms of spBLEU scores across multiple language pairs. NusaMT-7B shows significant improvements, particularly in translations into low-resource languages.



## Environmental Impact


- **Hardware Type:** 2x NVIDIA RTX 4090
- **Hours used:** 1250
- **Cloud Provider:** Runpod.io
- **Carbon Emitted:** 210 kg CO2e


## Citation

If you find this model useful, please cite the following work:

```
@misc{tan2024nusamt7bmachinetranslationlowresource,
      title={NusaMT-7B: Machine Translation for Low-Resource Indonesian Languages with Large Language Models}, 
      author={William Tan and Kevin Zhu},
      year={2024},
      eprint={2410.07830},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2410.07830}, 
}
```