opus-mt-tc-bible-big-cel-deu_eng_fra_por_spa

Model Details
Uses
Risks, Limitations and Biases
How to Get Started With the Model
Training
Evaluation
Citation Information
Acknowledgements

Model Details

Neural machine translation model for translating from Celtic languages (cel) to German, English, French, Portuguese, and Spanish (deu+eng+fra+por+spa).

This model is part of the OPUS-MT project, an effort to make neural machine translation models widely available and accessible for many languages of the world. All models are originally trained with Marian NMT, an efficient NMT framework written in pure C++, and then converted to PyTorch using the Hugging Face transformers library. Training data is taken from OPUS, and the training pipelines follow the procedures of OPUS-MT-train.

Model Description:

Developed by: Language Technology Research Group at the University of Helsinki
Model Type: Translation (transformer-big)
Release: 2024-05-30
License: Apache-2.0
Language(s):
- Source Language(s): bre cor cym gla gle glv
- Target Language(s): deu eng fra por spa
- Valid Target Language Labels: >>deu<< >>eng<< >>fra<< >>por<< >>spa<< >>xxx<<
Original Model: opusTCv20230926max50+bt+jhubc_transformer-big_2024-05-30.zip
Resources for more information:

This is a multilingual translation model with multiple target languages. Each input sentence must begin with a language token of the form >>id<<, where id is a valid target language ID, e.g. >>deu<< for German.
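The token requirement can be handled with a small helper that prepends the token before text is passed to the tokenizer. The following is only a sketch: the function name `add_target_token` and the constant `VALID_TARGETS` are hypothetical and not part of the model or of transformers, and the placeholder label >>xxx<< is deliberately left out on the assumption that it is not a real target language.

```python
# Target language IDs listed in this model card (assumption: >>xxx<< is a
# placeholder label and is therefore not accepted here).
VALID_TARGETS = ("deu", "eng", "fra", "por", "spa")


def add_target_token(text: str, target: str) -> str:
    """Return `text` prefixed with the >>id<< token for `target`."""
    if target not in VALID_TARGETS:
        raise ValueError(f"unsupported target language: {target!r}")
    return f">>{target}<< {text.strip()}"


# Welsh and Irish greetings, tagged for German and Spanish output.
batch = [
    add_target_token("Bore da.", "deu"),
    add_target_token("Dia duit.", "spa"),
]
print(batch)
```

Validating the ID up front is one way to avoid silently sending an untagged or mistagged sentence to the model; the resulting list can be passed to the tokenizer as-is.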

Uses

This model can be used for translation and text-to-text generation.

Risks, Limitations and Biases

CONTENT WARNING: Readers should be aware that the model is trained on various public data sets that may contain content that is disturbing or offensive and that can propagate historical and current stereotypes.

Significant research has explored bias and fairness issues with language models (see, e.g., Sheng et al. (2021) and Bender et al. (2021)).

How to Get Started With the Model

A short code example:

from transformers import MarianMTModel, MarianTokenizer

# Each input starts with the target-language token (>>id<<).
src_text = [
    ">>deu<< Replace this with text in an accepted source language.",
    ">>spa<< This is the second sentence."
]

# Model ID on the Hugging Face Hub.
model_name = "Helsinki-NLP/opus-mt-tc-bible-big-cel-deu_eng_fra_por_spa"
tokenizer = MarianTokenizer.from_pretrained(model_name)
model = MarianMTModel.from_pretrained(model_name)

# Tokenize the batch with padding, then generate the translations.
translated = model.generate(**tokenizer(src_text, return_tensors="pt", padding=True))

# Decode each output sequence back to plain text.
for t in translated:
    print(tokenizer.decode(t, skip_special_tokens=True))

You can also use OPUS-MT models with the transformers pipelines, for example:

from transformers import pipeline
pipe = pipeline("translation", model="Helsinki-NLP/opus-mt-tc-bible-big-cel-deu_eng_fra_por_spa")
print(pipe(">>deu<< Replace this with text in an accepted source language."))

Training

Data: opusTCv20230926max50+bt+jhubc (source)
Pre-processing: SentencePiece (spm32k,spm32k)
Model Type: transformer-big
Original MarianNMT Model: opusTCv20230926max50+bt+jhubc_transformer-big_2024-05-30.zip
Training Scripts: GitHub Repo

Evaluation

Model scores at the OPUS-MT dashboard
test set translations: opusTCv20230926max50+bt+jhubc_transformer-big_2024-05-29.test.txt
test set scores: opusTCv20230926max50+bt+jhubc_transformer-big_2024-05-29.eval.txt
benchmark results: benchmark_results.txt
benchmark output: benchmark_translations.zip

langpair	testset	chr-F	BLEU	#sent	#words
bre-eng	tatoeba-test-v2021-08-07	0.53473	35.0	383	2065
bre-fra	tatoeba-test-v2021-08-07	0.49013	28.3	2494	13324
cym-eng	tatoeba-test-v2021-08-07	0.68892	52.4	818	5563
gla-eng	tatoeba-test-v2021-08-07	0.39607	23.2	955	6611
gla-spa	tatoeba-test-v2021-08-07	0.51208	26.1	289	1608
gle-eng	tatoeba-test-v2021-08-07	0.64268	50.7	1913	11190
cym-deu	flores101-devtest	0.52672	22.4	1012	25094
cym-fra	flores101-devtest	0.58299	31.3	1012	28343
cym-por	flores101-devtest	0.47733	18.4	1012	26519
gle-eng	flores101-devtest	0.64773	38.6	1012	24721
gle-fra	flores101-devtest	0.54559	26.5	1012	28343
cym-deu	flores200-devtest	0.52745	22.6	1012	25094
cym-eng	flores200-devtest	0.75234	55.5	1012	24721
cym-fra	flores200-devtest	0.58339	31.4	1012	28343
cym-por	flores200-devtest	0.47566	18.3	1012	26519
cym-spa	flores200-devtest	0.48834	19.9	1012	29199
gla-deu	flores200-devtest	0.41962	13.0	1012	25094
gla-eng	flores200-devtest	0.53374	26.4	1012	24721
gla-fra	flores200-devtest	0.44916	16.6	1012	28343
gla-spa	flores200-devtest	0.40375	12.9	1012	29199
gle-deu	flores200-devtest	0.49962	19.2	1012	25094
gle-eng	flores200-devtest	0.64866	38.9	1012	24721
gle-fra	flores200-devtest	0.54564	26.7	1012	28343
gle-por	flores200-devtest	0.44768	14.9	1012	26519
gle-spa	flores200-devtest	0.47347	18.7	1012	29199
cym-deu	ntrex128	0.46627	16.3	1997	48761
cym-eng	ntrex128	0.65343	40.0	1997	47673
cym-fra	ntrex128	0.51183	23.8	1997	53481
cym-por	ntrex128	0.42857	14.4	1997	51631
cym-spa	ntrex128	0.51542	25.0	1997	54107
gle-deu	ntrex128	0.46495	15.5	1997	48761
gle-eng	ntrex128	0.60913	33.5	1997	47673
gle-fra	ntrex128	0.49513	20.7	1997	53481
gle-por	ntrex128	0.41767	13.2	1997	51631
gle-spa	ntrex128	0.50755	23.6	1997	54107
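The tab-separated layout of the scores makes them easy to summarise with the standard library. The sketch below is illustrative only: it embeds four rows copied from the table above, and the helper names `load_scores`, `best_pair`, and `mean_bleu` are hypothetical, not part of any OPUS-MT tooling.

```python
import csv
import io

# Four rows copied verbatim from the table above (tab-separated).
SCORES_TSV = (
    "langpair\ttestset\tchr-F\tBLEU\t#sent\t#words\n"
    "cym-eng\tflores200-devtest\t0.75234\t55.5\t1012\t24721\n"
    "gle-eng\tflores200-devtest\t0.64866\t38.9\t1012\t24721\n"
    "cym-eng\tntrex128\t0.65343\t40.0\t1997\t47673\n"
    "gle-eng\tntrex128\t0.60913\t33.5\t1997\t47673\n"
)


def load_scores(tsv_text):
    """Parse the benchmark table into a list of dicts with numeric scores."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return [
        {
            "langpair": row["langpair"],
            "testset": row["testset"],
            "chrf": float(row["chr-F"]),
            "bleu": float(row["BLEU"]),
        }
        for row in reader
    ]


def best_pair(rows, testset):
    """Return the language pair with the highest chr-F on `testset`."""
    candidates = [r for r in rows if r["testset"] == testset]
    return max(candidates, key=lambda r: r["chrf"])["langpair"]


def mean_bleu(rows, testset):
    """Return the unweighted mean BLEU over all pairs on `testset`."""
    scores = [r["bleu"] for r in rows if r["testset"] == testset]
    return sum(scores) / len(scores)


rows = load_scores(SCORES_TSV)
for name in ("flores200-devtest", "ntrex128"):
    print(name, best_pair(rows, name), round(mean_bleu(rows, name), 2))
```

The mean here is unweighted across language pairs, which is a simplification; it says nothing about per-sentence or corpus-level significance.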

Citation Information

Publications: "Democratizing neural machine translation with OPUS-MT", "OPUS-MT – Building open translation services for the World", and "The Tatoeba Translation Challenge – Realistic Data Sets for Low Resource and Multilingual MT". Please cite them if you use this model.

@article{tiedemann2023democratizing,
  title={Democratizing neural machine translation with {OPUS-MT}},
  author={Tiedemann, J{\"o}rg and Aulamo, Mikko and Bakshandaeva, Daria and Boggia, Michele and Gr{\"o}nroos, Stig-Arne and Nieminen, Tommi and Raganato, Alessandro and Scherrer, Yves and Vazquez, Raul and Virpioja, Sami},
  journal={Language Resources and Evaluation},
  number={58},
  pages={713--755},
  year={2023},
  publisher={Springer Nature},
  issn={1574-0218},
  doi={10.1007/s10579-023-09704-w}
}

@inproceedings{tiedemann-thottingal-2020-opus,
    title = "{OPUS}-{MT} {--} Building open translation services for the World",
    author = {Tiedemann, J{\"o}rg  and Thottingal, Santhosh},
    booktitle = "Proceedings of the 22nd Annual Conference of the European Association for Machine Translation",
    month = nov,
    year = "2020",
    address = "Lisboa, Portugal",
    publisher = "European Association for Machine Translation",
    url = "https://aclanthology.org/2020.eamt-1.61",
    pages = "479--480",
}

@inproceedings{tiedemann-2020-tatoeba,
    title = "The Tatoeba Translation Challenge {--} Realistic Data Sets for Low Resource and Multilingual {MT}",
    author = {Tiedemann, J{\"o}rg},
    booktitle = "Proceedings of the Fifth Conference on Machine Translation",
    month = nov,
    year = "2020",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2020.wmt-1.139",
    pages = "1174--1182",
}

Acknowledgements

The work is supported by the HPLT project, funded by the European Union’s Horizon Europe research and innovation programme under grant agreement No 101070350. We are also grateful for the generous computational resources and IT infrastructure provided by CSC -- IT Center for Science, Finland, and the EuroHPC supercomputer LUMI.

Model conversion info

transformers version: 4.45.1
OPUS-MT git hash: a0ea3b3
port time: Mon Oct 7 23:09:42 EEST 2024
port machine: LM0-400-22516.local
