Update README.md
README.md
CHANGED
@@ -6,131 +6,35 @@ language:
metrics:
- BLEU
- TER
---

## Hitz Center’s English-Basque machine translation model

## Model description

- **Source Language:** English
- **Target Language:** Basque
- **License:** apache-2.0

## How to Get Started with the Model

Use the code below to get started with the model.

```python
from transformers import MarianTokenizer, AutoModelForSeq2SeqLM

src_text = ["this is a test"]

model_name = "HiTZ/mt-hitz-en-eu"
tokenizer = MarianTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

translated = model.generate(**tokenizer(src_text, return_tensors="pt", padding=True))
print([tokenizer.decode(t, skip_special_tokens=True) for t in translated])
```

The recommended environments include the following transformers versions: 4.12.3, 4.15.0, 4.26.1

## Training Details

### Training Data

The English-Basque data collected from the web was a combination of the following datasets:

| Dataset       | Sentences before cleaning |
|---------------|--------------------------:|
| CCMatrix v1   |                 7,788,871 |
| EhuHac        |                   585,210 |
| Ehuskaratuak  |                   482,259 |
| Elhuyar       |                 1,176,529 |
| HPLT          |                 4,546,563 |
| OpenSubtitles |                   805,780 |
| PaCO_2012     |                   109,524 |
| PaCO_2013     |                    48,892 |
| WikiMatrix    |                   119,480 |
| **Total**     |            **15,653,108** |

In addition, 11,489,433 sentence pairs of synthetic parallel data were created by translating a compendium of ES-EU parallel corpora into English using the [ES-EN translator from Google Translate](https://translate.google.com/about/).

### Training Procedure

#### Preprocessing

After concatenation, all datasets are cleaned and deduplicated using [bifixer](https://github.com/bitextor/bifixer) [(Ramírez-Sánchez et al., 2020)](https://aclanthology.org/2020.eamt-1.31/) to identify repetitions and fix encoding problems, and LaBSE embeddings are used to filter out misaligned sentences. Any sentence pair with a LaBSE similarity score of less than 0.5 is removed. The filtered corpus is composed of 9,033,998 parallel sentences.
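The LaBSE filtering step can be sketched roughly as follows. This is an illustrative example rather than the authors' actual script; it assumes the `sentence-transformers/LaBSE` checkpoint, and the `filter_pairs` helper is hypothetical.

```python
# Illustrative sketch of LaBSE-based filtering (assumption, not the original pipeline).
from sentence_transformers import SentenceTransformer
import numpy as np

labse = SentenceTransformer("sentence-transformers/LaBSE")

def filter_pairs(pairs, threshold=0.5):
    """Keep (en, eu) pairs whose LaBSE embeddings have cosine similarity >= threshold."""
    en_emb = labse.encode([en for en, _ in pairs], normalize_embeddings=True)
    eu_emb = labse.encode([eu for _, eu in pairs], normalize_embeddings=True)
    sims = np.sum(en_emb * eu_emb, axis=1)  # cosine similarity of unit-normalized vectors
    return [pair for pair, sim in zip(pairs, sims) if sim >= threshold]

kept = filter_pairs([("this is a test", "hau proba bat da")])
```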

#### Tokenization

All data is tokenized using SentencePiece, with a 32,000-token SentencePiece model learned from the combination of all filtered training data. This model is included.
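For illustration, learning such a vocabulary might look like the sketch below; the input file name and model prefix are hypothetical, not the ones used for this release.

```python
# Illustrative sketch: train a 32,000-token SentencePiece model on the filtered corpus.
import sentencepiece as spm

spm.SentencePieceTrainer.train(
    input="filtered_corpus.en-eu.txt",  # hypothetical path to the combined filtered data
    model_prefix="mt-hitz-en-eu",       # hypothetical output prefix
    vocab_size=32000,
)

sp = spm.SentencePieceProcessor(model_file="mt-hitz-en-eu.model")
print(sp.encode("this is a test", out_type=str))
```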

## Evaluation

### Variables and metrics

We use BLEU and TER scores for evaluation on the following test sets: [Flores-200](https://github.com/facebookresearch/flores/tree/main/flores200), [TaCon](https://elrc-share.eu/repository/browse/tacon-spanish-constitution-mt-test-set/84a96138b98611ec9c1a00155d02670628f3e6857b0f422abd82abc3795ec8c2/) and [NTREX](https://github.com/MicrosoftTranslator/NTREX).
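Both metrics can be computed, for example, with sacreBLEU; the snippet below is a minimal sketch with placeholder hypotheses and references, not the exact evaluation setup used here.

```python
# Illustrative sketch: corpus-level BLEU and TER with sacreBLEU.
import sacrebleu

hypotheses = ["hau proba bat da"]    # system outputs (placeholder)
references = [["hau proba bat da"]]  # one list of references per reference set

bleu = sacrebleu.corpus_bleu(hypotheses, references)
ter = sacrebleu.corpus_ter(hypotheses, references)
print(f"BLEU: {bleu.score:.1f}  TER: {ter.score:.1f}")
```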

### Evaluation results

Below are the evaluation results for machine translation from English to Basque, compared to [Google Translate](https://translate.google.com/) and [NLLB 200 3.3B](https://huggingface.co/facebook/nllb-200-3.3B):

#### BLEU scores

| Test set           | Google Translate | NLLB 3.3 | mt-hitz-en-eu |
|--------------------|------------------|----------|---------------|
| Flores 200 devtest | **20.5**         | 13.3     | 19.2          |
| TaCON              | **12.1**         | 9.4      | 8.8           |
| NTREX              | **15.7**         | 8.0      | 14.5          |
| Average            | **16.1**         | 10.2     | 14.2          |

#### TER scores

| Test set           | Google Translate | NLLB 3.3 | mt-hitz-en-eu |
|--------------------|------------------|----------|---------------|
| Flores 200 devtest | **59.5**         | 70.4     | 65.0          |
| TaCON              | **69.5**         | 75.3     | 76.8          |
| NTREX              | **65.8**         | 81.6     | 66.7          |
| Average            | **64.9**         | 75.8     | 68.2          |

<!-- For now we do not have a paper. If something is produced within ILENIA, this will need to be updated. -->

<!--
## Citation [optional]

<!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. - ->

**BibTeX:**

[More Information Needed]

**APA:**

[More Information Needed]
-->

## Additional information

### Author

HiTZ Research Center & IXA Research group (University of the Basque Country UPV/EHU)

### Contact information

For further information, send an email to <[email protected]>

### Licensing information

This work is licensed under the [Apache License, Version 2.0](https://www.apache.org/licenses/LICENSE-2.0).

### Funding

This work is funded by the Ministerio para la Transformación Digital y de la Función Pública - Funded by EU – NextGenerationEU, within the framework of [project ILENIA](https://proyectoilenia.es/) with references 2022/TL22/00215337, 2022/TL22/00215336, 2022/TL22/00215335 and 2022/TL22/00215334.

### Disclaimer

<details>
<summary>Click to expand</summary>

metrics:
- BLEU
- TER
base_model:
- HiTZ/mt-hitz-en-eu
pipeline_tag: translation
tags:
- ctranslate2
- translation
- marian
---

## Hitz Center’s English-Basque machine translation model converted to CTranslate2

## Model description

- [Original model](https://huggingface.co/HiTZ/mt-hitz-en-eu)

# What is CTranslate2?

[CTranslate2](https://opennmt.net/CTranslate2/) is a C++ and Python library for efficient inference with Transformer models.

CTranslate2 implements a custom runtime that applies many performance optimization techniques, such as weights quantization, layers fusion and batch reordering, to accelerate and reduce the memory usage of Transformer models on CPU and GPU.

CTranslate2 is one of the most performant ways of hosting translation models at scale. Currently supported models include:

- Encoder-decoder models: Transformer base/big, M2M-100, NLLB, BART, mBART, Pegasus, T5, Whisper
- Decoder-only models: GPT-2, GPT-J, GPT-NeoX, OPT, BLOOM, MPT, Llama, Mistral, Gemma, CodeGen, GPTBigCode, Falcon
- Encoder-only models: BERT, DistilBERT, XLM-RoBERTa

The project is production-oriented and comes with backward compatibility guarantees, but it also includes experimental features related to model compression and inference acceleration.
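As a rough sketch, the original Marian checkpoint can be converted and run with CTranslate2 along the following lines; the output directory name is hypothetical, and the exact files shipped with this repository may differ.

```python
# Illustrative sketch: convert HiTZ/mt-hitz-en-eu to CTranslate2 and translate with it.
import ctranslate2
import transformers

output_dir = "mt-hitz-en-eu-ct2"  # hypothetical local directory for the converted model

# One-time conversion from the Transformers checkpoint (equivalent to the
# ct2-transformers-converter command-line tool).
ctranslate2.converters.TransformersConverter("HiTZ/mt-hitz-en-eu").convert(output_dir)

translator = ctranslate2.Translator(output_dir, device="cpu")
tokenizer = transformers.AutoTokenizer.from_pretrained("HiTZ/mt-hitz-en-eu")

# CTranslate2 operates on token strings, so tokenize with the original tokenizer.
source = tokenizer.convert_ids_to_tokens(tokenizer.encode("this is a test"))
results = translator.translate_batch([source])
target_tokens = results[0].hypotheses[0]
print(tokenizer.decode(tokenizer.convert_tokens_to_ids(target_tokens)))
```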

### Licensing information

This work is licensed under the [Apache License, Version 2.0](https://www.apache.org/licenses/LICENSE-2.0).

### Disclaimer

<details>
<summary>Click to expand</summary>