---
license: cc-by-nc-4.0
base_model: Helsinki-NLP/opus-mt-tc-big-en-ar
metrics:
- bleu
datasets:
- atlasia/darija_english
model-index:
- name: Terjman-Large
  results: []
language:
- ar
- en
---

# Terjman-Large (240M params)

Terjman-Large is a Transformer-based machine translation model that translates English into Moroccan Darija written in Arabic script.
It is a fine-tuned version of [Helsinki-NLP/opus-mt-tc-big-en-ar](https://huggingface.co/Helsinki-NLP/opus-mt-tc-big-en-ar) on the [darija_english](https://huggingface.co/datasets/atlasia/darija_english) dataset, enhanced with curated corpora intended to improve translation quality.

It achieves the following results on the evaluation set:
- Loss: 3.2078
- Bleu: 8.3292
- Gen Len: 34.4959
  
Fine-tuning was conducted on an **A100-40GB** GPU and took **23 hours**.

Try it out on our dedicated [Terjman-Large Space](https://huggingface.co/spaces/atlasia/Terjman-Large) 🤗

## Usage

You can integrate the model into your projects or workflows via the Hugging Face Transformers library.
Here is a basic example of how to use the model in Python:

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Load the tokenizer and model
tokenizer = AutoTokenizer.from_pretrained("atlasia/Terjman-Large")
model = AutoModelForSeq2SeqLM.from_pretrained("atlasia/Terjman-Large")

# Define the English text to translate
input_text = "Your English text goes here."

# Tokenize the input text
input_tokens = tokenizer(input_text, return_tensors="pt", padding=True, truncation=True)

# Perform translation
output_tokens = model.generate(**input_tokens)

# Decode the output tokens
output_text = tokenizer.decode(output_tokens[0], skip_special_tokens=True)

print("Translation:", output_text)
```
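To translate several sentences at once, the same calls can be wrapped in a small batching helper. The following is a minimal sketch, not part of the released model code: the names `chunked`, `translate_batch`, and `main`, and the default values `batch_size=8` and `max_new_tokens=128`, are assumptions chosen for illustration.

```python
from typing import Iterator, List


def chunked(items: List[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `items` with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def translate_batch(texts, tokenizer, model, batch_size=8, max_new_tokens=128):
    """Translate a list of English sentences, one padded batch at a time."""
    translations = []
    for batch in chunked(texts, batch_size):
        # Pad to the longest sentence in the batch; truncate overly long inputs
        inputs = tokenizer(batch, return_tensors="pt", padding=True, truncation=True)
        # max_new_tokens is an illustrative cap, not a value from the model card
        outputs = model.generate(**inputs, max_new_tokens=max_new_tokens)
        translations.extend(tokenizer.batch_decode(outputs, skip_special_tokens=True))
    return translations


def main():
    # Imported here so the helpers above can be reused without loading the model
    from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

    tokenizer = AutoTokenizer.from_pretrained("atlasia/Terjman-Large")
    model = AutoModelForSeq2SeqLM.from_pretrained("atlasia/Terjman-Large")

    sentences = ["Good morning.", "Where is the train station?"]
    for source, target in zip(sentences, translate_batch(sentences, tokenizer, model)):
        print(source, "->", target)
```

Calling `main()` downloads the model on first use. A smaller `batch_size` may be needed on memory-constrained hardware.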

## Example

Here is an example of translating English into Moroccan Darija:

**Input**: "Hi my friend, can you tell me a joke in moroccan darija? I'd be happy to hear that from you!"

**Output**: "مرحبا صديقي، يمكن لك تقول لي نكتة في داريجا المغربية؟ سأكون سعيدا بسماعها منك!"

## Limitations

This version has some limitations, mainly due to the tokenizer.
We are currently collecting more data with the aim of continuous improvement.

## Feedback

We are continuously working to improve the model's performance and usability, and will release improvements incrementally.
If you have any feedback, suggestions, or encounter any issues, please don't hesitate to reach out to us.


## Training hyperparameters

The following hyperparameters were used during training:
- learning_rate: 3e-05
- train_batch_size: 22
- eval_batch_size: 22
- seed: 42
- gradient_accumulation_steps: 4
- total_train_batch_size: 88
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- lr_scheduler_warmup_ratio: 0.03
- num_epochs: 40
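
The relationship between some of these values can be checked with a few lines of arithmetic. This is a sketch under stated assumptions: it assumes a single device (the card reports one A100-40GB), takes the final step count (16280) from the training results reported in this card, and assumes warmup steps are derived by rounding up `total_steps * warmup_ratio`. The variable names are illustrative.

```python
import math

per_device_batch_size = 22       # train_batch_size
gradient_accumulation_steps = 4
num_devices = 1                  # assumption: single A100-40GB

# Number of samples that contribute to each optimizer update
effective_batch_size = per_device_batch_size * gradient_accumulation_steps * num_devices

total_optimizer_steps = 16280    # final step reported in the training results
warmup_ratio = 0.03              # lr_scheduler_warmup_ratio

# Assumption: warmup length is the ratio applied to the total step count, rounded up
warmup_steps = math.ceil(total_optimizer_steps * warmup_ratio)

print("effective batch size:", effective_batch_size)
print("approximate warmup steps:", warmup_steps)
```

The effective batch size matches the reported `total_train_batch_size` of 88.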

## Training results

| Training Loss | Epoch   | Step  | Validation Loss | Bleu   | Gen Len |
|:-------------:|:-------:|:-----:|:---------------:|:------:|:-------:|
| No log        | 0.9982  | 407   | 4.3938          | 4.6056 | 22.6033 |
| 5.1616        | 1.9988  | 815   | 3.7257          | 5.8319 | 30.9201 |
| 3.902         | 2.9994  | 1223  | 3.5214          | 6.7311 | 32.9091 |
| 3.5737        | 4.0     | 1631  | 3.4204          | 7.3684 | 32.1433 |
| 3.4576        | 4.9982  | 2038  | 3.3562          | 7.8632 | 34.5399 |
| 3.4576        | 5.9988  | 2446  | 3.3151          | 7.9739 | 35.3278 |
| 3.3833        | 6.9994  | 2854  | 3.2884          | 8.0825 | 35.8292 |
| 3.3358        | 8.0     | 3262  | 3.2681          | 8.2765 | 34.5427 |
| 3.3069        | 8.9982  | 3669  | 3.2517          | 8.1019 | 33.584  |
| 3.2769        | 9.9988  | 4077  | 3.2404          | 8.106  | 33.3802 |
| 3.2769        | 10.9994 | 4485  | 3.2342          | 8.3037 | 33.303  |
| 3.2777        | 12.0    | 4893  | 3.2284          | 8.0674 | 33.3967 |
| 3.2476        | 12.9982 | 5300  | 3.2226          | 8.2883 | 33.8154 |
| 3.2611        | 13.9988 | 5708  | 3.2189          | 8.3537 | 34.0413 |
| 3.2511        | 14.9994 | 6116  | 3.2159          | 8.1365 | 34.5014 |
| 3.2437        | 16.0    | 6524  | 3.2140          | 8.3549 | 34.0606 |
| 3.2437        | 16.9982 | 6931  | 3.2131          | 8.2507 | 34.303  |
| 3.2498        | 17.9988 | 7339  | 3.2116          | 8.2928 | 33.9945 |
| 3.2341        | 18.9994 | 7747  | 3.2105          | 8.337  | 33.7052 |
| 3.2403        | 20.0    | 8155  | 3.2098          | 8.3179 | 34.3526 |
| 3.2229        | 20.9982 | 8562  | 3.2094          | 8.3848 | 34.2039 |
| 3.2229        | 21.9988 | 8970  | 3.2090          | 8.2042 | 34.6529 |
| 3.2379        | 22.9994 | 9378  | 3.2086          | 8.4227 | 34.0275 |
| 3.2257        | 24.0    | 9786  | 3.2082          | 8.3515 | 34.3306 |
| 3.2526        | 24.9982 | 10193 | 3.2085          | 8.4089 | 34.4986 |
| 3.2206        | 25.9988 | 10601 | 3.2082          | 8.476  | 34.6226 |
| 3.2288        | 26.9994 | 11009 | 3.2083          | 8.4452 | 33.697  |
| 3.2288        | 28.0    | 11417 | 3.2080          | 8.29   | 34.0331 |
| 3.2251        | 28.9982 | 11824 | 3.2080          | 8.35   | 34.2948 |
| 3.2302        | 29.9988 | 12232 | 3.2078          | 8.4408 | 33.416  |
| 3.21          | 30.9994 | 12640 | 3.2079          | 8.2934 | 34.0854 |
| 3.2271        | 32.0    | 13048 | 3.2079          | 8.4573 | 33.3912 |
| 3.2271        | 32.9982 | 13455 | 3.2078          | 8.4055 | 34.2452 |
| 3.2428        | 33.9988 | 13863 | 3.2079          | 8.5107 | 34.5152 |
| 3.2303        | 34.9994 | 14271 | 3.2080          | 8.3734 | 34.2562 |
| 3.2129        | 36.0    | 14679 | 3.2079          | 8.3193 | 34.4628 |
| 3.2119        | 36.9982 | 15086 | 3.2082          | 8.4122 | 34.2121 |
| 3.2119        | 37.9988 | 15494 | 3.2078          | 8.3585 | 33.8843 |
| 3.2445        | 38.9994 | 15902 | 3.2079          | 8.3968 | 34.6722 |
| 3.2356        | 39.9264 | 16280 | 3.2078          | 8.3292 | 34.4959 |
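
Validation loss plateaus at about 3.208 from roughly epoch 22 onward, while BLEU keeps fluctuating, so the final checkpoint is not necessarily the one with the highest BLEU. The sketch below illustrates this on a hand-picked subset of rows copied from the table above; the tuple layout and variable names are illustrative only.

```python
# (step, validation_loss, bleu): a subset of rows copied from the table above
rows = [
    (407, 4.3938, 4.6056),
    (4077, 3.2404, 8.106),
    (8155, 3.2098, 8.3179),
    (12232, 3.2078, 8.4408),
    (13863, 3.2079, 8.5107),
    (16280, 3.2078, 8.3292),
]

# Checkpoint with the highest BLEU score among the selected rows
best_by_bleu = max(rows, key=lambda row: row[2])

# Checkpoint with the lowest validation loss (first one wins on ties)
best_by_loss = min(rows, key=lambda row: row[1])

print("best BLEU at step", best_by_bleu[0], "with BLEU", best_by_bleu[2])
print("lowest validation loss first reached at step", best_by_loss[0])
```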

### Framework versions

- Transformers 4.40.2
- PyTorch 2.2.1+cu121
- Datasets 2.19.1
- Tokenizers 0.19.1