opus-mt-tc-bible-big-deu_eng_fra_por_spa-gmw

Model Details

Neural machine translation model for translating from German, English, French, Portuguese and Spanish (deu+eng+fra+por+spa) to West Germanic languages (gmw).

This model is part of the OPUS-MT project, an effort to make neural machine translation models widely available and accessible for many languages of the world. All models were originally trained with Marian NMT, an efficient NMT implementation written in pure C++, and have been converted to PyTorch using the Hugging Face transformers library. Training data is taken from OPUS, and the training pipelines follow the procedures of OPUS-MT-train.

Model Description:

  • Developed by: Language Technology Research Group at the University of Helsinki
  • Model Type: Translation (transformer-big)
  • Release: 2024-05-30
  • License: Apache-2.0
  • Language(s):
    • Source Language(s): deu eng fra por spa
    • Target Language(s): afr ang bar bis bzj deu djk drt eng enm frr fry gos gsw hrx hwc icr jam kri ksh lim ltz nds nld ofs pcm pdc pfl pih pis rop sco srm srn stq swg tcs tpi vls wae yid zea
    • Valid Target Language Labels: >>act<< >>afr<< >>afs<< >>aig<< >>ang<< >>ang_Latn<< >>bah<< >>bar<< >>bis<< >>bjs<< >>brc<< >>bzj<< >>bzk<< >>cim<< >>dcr<< >>deu<< >>djk<< >>djk_Latn<< >>drt<< >>drt_Latn<< >>dum<< >>eng<< >>enm<< >>enm_Latn<< >>fpe<< >>frk<< >>frr<< >>fry<< >>gcl<< >>gct<< >>geh<< >>gmh<< >>gml<< >>goh<< >>gos<< >>gpe<< >>gsw<< >>gul<< >>gyn<< >>hrx<< >>hrx_Latn<< >>hwc<< >>icr<< >>jam<< >>jvd<< >>kri<< >>ksh<< >>kww<< >>lim<< >>lng<< >>ltz<< >>mhn<< >>nds<< >>nld<< >>odt<< >>ofs<< >>ofs_Latn<< >>oor<< >>osx<< >>pcm<< >>pdc<< >>pdt<< >>pey<< >>pfl<< >>pih<< >>pih_Latn<< >>pis<< >>rop<< >>sco<< >>sdz<< >>skw<< >>sli<< >>srm<< >>srn<< >>stl<< >>stq<< >>svc<< >>swg<< >>sxu<< >>tch<< >>tcs<< >>tgh<< >>tpi<< >>trf<< >>twd<< >>uln<< >>vel<< >>vic<< >>vls<< >>vmf<< >>wae<< >>wep<< >>wes<< >>wym<< >>xxx<< >>yec<< >>yid<< >>zea<<
  • Original Model: opusTCv20230926max50+bt+jhubc_transformer-big_2024-05-30.zip
  • Resources for more information:

This is a multilingual translation model with multiple target languages. A sentence-initial language token is required in the form >>id<< (where id is a valid target language ID), e.g. >>afr<<.
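As a minimal illustration of this convention (the helper name below is hypothetical, not part of the model's API), the token can be prepended programmatically:

def add_target_token(text: str, target_lang: str) -> str:
    # prepend the required sentence-initial target-language token, e.g. ">>afr<<"
    return f">>{target_lang}<< {text}"

print(add_target_token("This is a test.", "afr"))  # >>afr<< This is a test.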

Uses

This model can be used for translation and text-to-text generation.

Risks, Limitations and Biases

CONTENT WARNING: Readers should be aware that the model is trained on various public data sets that may contain content that is disturbing, offensive, and can propagate historical and current stereotypes.

Significant research has explored bias and fairness issues with language models (see, e.g., Sheng et al. (2021) and Bender et al. (2021)).

How to Get Started With the Model

A short code example:

from transformers import MarianMTModel, MarianTokenizer

# Each source sentence must start with the target-language token (>>id<<).
src_text = [
    ">>afr<< Replace this with text in an accepted source language.",
    ">>zea<< This is the second sentence."
]

model_name = "Helsinki-NLP/opus-mt-tc-bible-big-deu_eng_fra_por_spa-gmw"
tokenizer = MarianTokenizer.from_pretrained(model_name)
model = MarianMTModel.from_pretrained(model_name)

# Tokenize, translate and decode
translated = model.generate(**tokenizer(src_text, return_tensors="pt", padding=True))
for t in translated:
    print(tokenizer.decode(t, skip_special_tokens=True))
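
Generation can also be tuned through the standard transformers generation arguments (continuing the example above); the values here are illustrative, not settings recommended by the card:

translated = model.generate(
    **tokenizer(src_text, return_tensors="pt", padding=True),
    num_beams=4,         # beam search width
    max_new_tokens=128,  # cap on generated tokens per sentence
)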

You can also use OPUS-MT models with the transformers pipelines, for example:

from transformers import pipeline
pipe = pipeline("translation", model="Helsinki-NLP/opus-mt-tc-bible-big-deu_eng_fra_por_spa-gmw")
print(pipe(">>afr<< Replace this with text in an accepted source language."))
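A minimal sketch of batched use, assuming the standard translation-pipeline output format (a list of dicts with a translation_text field); the batch size is illustrative:

outputs = pipe(
    [">>afr<< First sentence to translate.", ">>nld<< Second sentence to translate."],
    batch_size=8,
)
for out in outputs:
    print(out["translation_text"])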

Training

Evaluation

langpair testset chr-F BLEU #sent #words
deu-afr tatoeba-test-v2021-08-07 0.72039 56.7 1583 9507
deu-deu tatoeba-test-v2021-08-07 0.59545 33.7 2500 20806
deu-eng tatoeba-test-v2021-08-07 0.66015 48.6 17565 149462
deu-ltz tatoeba-test-v2021-08-07 0.53760 34.2 347 2206
deu-nds tatoeba-test-v2021-08-07 0.44534 20.1 9999 76137
deu-nld tatoeba-test-v2021-08-07 0.71276 54.4 10218 75235
eng-afr tatoeba-test-v2021-08-07 0.72087 56.6 1374 10317
eng-deu tatoeba-test-v2021-08-07 0.62971 41.4 17565 151568
eng-eng tatoeba-test-v2021-08-07 0.80306 58.0 12062 115106
eng-fry tatoeba-test-v2021-08-07 0.40324 13.8 220 1600
eng-ltz tatoeba-test-v2021-08-07 0.64423 45.8 293 1828
eng-nds tatoeba-test-v2021-08-07 0.46446 22.2 2500 18264
eng-nld tatoeba-test-v2021-08-07 0.71190 54.5 12696 91796
fra-deu tatoeba-test-v2021-08-07 0.68991 50.3 12418 100545
fra-eng tatoeba-test-v2021-08-07 0.72564 58.0 12681 101754
fra-nld tatoeba-test-v2021-08-07 0.67078 48.7 11548 82164
por-deu tatoeba-test-v2021-08-07 0.68437 48.7 10000 81246
por-eng tatoeba-test-v2021-08-07 0.77081 64.3 13222 105351
por-nds tatoeba-test-v2021-08-07 0.45864 20.7 207 1292
por-nld tatoeba-test-v2021-08-07 0.69865 52.8 2500 17816
spa-afr tatoeba-test-v2021-08-07 0.77148 63.3 448 3044
spa-deu tatoeba-test-v2021-08-07 0.68037 49.1 10521 86430
spa-eng tatoeba-test-v2021-08-07 0.74575 60.2 16583 138123
spa-nds tatoeba-test-v2021-08-07 0.43154 18.5 923 5941
spa-nld tatoeba-test-v2021-08-07 0.68988 51.1 10113 79162
deu-afr flores101-devtest 0.57287 26.0 1012 25740
deu-eng flores101-devtest 0.66660 40.9 1012 24721
deu-nld flores101-devtest 0.55423 23.6 1012 25467
eng-afr flores101-devtest 0.67793 40.0 1012 25740
eng-deu flores101-devtest 0.64295 37.2 1012 25094
eng-nld flores101-devtest 0.57690 26.2 1012 25467
fra-ltz flores101-devtest 0.49430 17.3 1012 25087
fra-nld flores101-devtest 0.54318 22.2 1012 25467
por-deu flores101-devtest 0.58851 29.8 1012 25094
por-nld flores101-devtest 0.54571 22.6 1012 25467
spa-nld flores101-devtest 0.50968 17.5 1012 25467
deu-afr flores200-devtest 0.57725 26.2 1012 25740
deu-eng flores200-devtest 0.67043 41.5 1012 24721
deu-ltz flores200-devtest 0.54626 21.6 1012 25087
deu-nld flores200-devtest 0.55679 24.0 1012 25467
eng-afr flores200-devtest 0.68115 40.2 1012 25740
eng-deu flores200-devtest 0.64561 37.4 1012 25094
eng-ltz flores200-devtest 0.54932 22.0 1012 25087
eng-nld flores200-devtest 0.58124 26.8 1012 25467
eng-tpi flores200-devtest 0.40338 15.9 1012 35240
fra-afr flores200-devtest 0.57320 26.4 1012 25740
fra-deu flores200-devtest 0.58974 29.5 1012 25094
fra-eng flores200-devtest 0.68106 43.7 1012 24721
fra-ltz flores200-devtest 0.49618 17.8 1012 25087
fra-nld flores200-devtest 0.54623 22.5 1012 25467
por-afr flores200-devtest 0.58408 27.6 1012 25740
por-deu flores200-devtest 0.59121 30.4 1012 25094
por-eng flores200-devtest 0.71418 48.3 1012 24721
por-nld flores200-devtest 0.54828 22.9 1012 25467
spa-afr flores200-devtest 0.51514 17.8 1012 25740
spa-deu flores200-devtest 0.53603 21.4 1012 25094
spa-eng flores200-devtest 0.58604 28.2 1012 24721
spa-nld flores200-devtest 0.51244 17.9 1012 25467
deu-eng generaltest2022 0.55777 30.6 1984 37634
eng-deu generaltest2022 0.60792 33.0 2037 38914
fra-deu generaltest2022 0.67039 44.5 2006 37696
deu-eng multi30k_test_2016_flickr 0.60981 40.1 1000 12955
eng-deu multi30k_test_2016_flickr 0.64153 34.9 1000 12106
fra-deu multi30k_test_2016_flickr 0.61781 32.1 1000 12106
fra-eng multi30k_test_2016_flickr 0.66703 47.9 1000 12955
deu-eng multi30k_test_2017_flickr 0.63624 41.0 1000 11374
eng-deu multi30k_test_2017_flickr 0.63423 34.6 1000 10755
fra-deu multi30k_test_2017_flickr 0.60084 29.7 1000 10755
fra-eng multi30k_test_2017_flickr 0.69254 50.4 1000 11374
deu-eng multi30k_test_2017_mscoco 0.55790 32.5 461 5231
eng-deu multi30k_test_2017_mscoco 0.57491 28.6 461 5158
fra-deu multi30k_test_2017_mscoco 0.56108 26.4 461 5158
fra-eng multi30k_test_2017_mscoco 0.68212 49.1 461 5231
deu-eng multi30k_test_2018_flickr 0.59322 36.6 1071 14689
eng-deu multi30k_test_2018_flickr 0.59858 30.0 1071 13703
fra-deu multi30k_test_2018_flickr 0.55667 24.7 1071 13703
fra-eng multi30k_test_2018_flickr 0.64702 43.4 1071 14689
fra-eng newsdiscusstest2015 0.61399 38.5 1500 26982
deu-eng newssyscomb2009 0.55180 28.8 502 11818
eng-deu newssyscomb2009 0.53676 22.9 502 11271
fra-deu newssyscomb2009 0.53733 23.9 502 11271
fra-eng newssyscomb2009 0.57219 31.1 502 11818
spa-deu newssyscomb2009 0.53056 22.0 502 11271
spa-eng newssyscomb2009 0.57225 30.8 502 11818
deu-eng newstest2008 0.54506 26.9 2051 49380
eng-deu newstest2008 0.53077 23.1 2051 47447
fra-deu newstest2008 0.53204 22.9 2051 47447
fra-eng newstest2008 0.54320 26.4 2051 49380
spa-deu newstest2008 0.52066 21.6 2051 47447
spa-eng newstest2008 0.55305 27.9 2051 49380
deu-eng newstest2009 0.53773 26.2 2525 65399
eng-deu newstest2009 0.53217 22.3 2525 62816
fra-deu newstest2009 0.52995 22.9 2525 62816
fra-eng newstest2009 0.56663 30.0 2525 65399
spa-deu newstest2009 0.52586 22.1 2525 62816
spa-eng newstest2009 0.56756 29.9 2525 65399
deu-eng newstest2010 0.58365 30.4 2489 61711
eng-deu newstest2010 0.54917 25.7 2489 61503
fra-deu newstest2010 0.53904 24.3 2489 61503
fra-eng newstest2010 0.59241 32.4 2489 61711
spa-deu newstest2010 0.55378 26.2 2489 61503
spa-eng newstest2010 0.61316 35.8 2489 61711
deu-eng newstest2011 0.54907 26.1 3003 74681
eng-deu newstest2011 0.52873 23.0 3003 72981
fra-deu newstest2011 0.52977 23.0 3003 72981
fra-eng newstest2011 0.59565 32.8 3003 74681
spa-deu newstest2011 0.53095 23.4 3003 72981
spa-eng newstest2011 0.59513 33.3 3003 74681
deu-eng newstest2012 0.56230 28.1 3003 72812
eng-deu newstest2012 0.52871 23.7 3003 72886
fra-deu newstest2012 0.53035 24.1 3003 72886
fra-eng newstest2012 0.59137 33.0 3003 72812
spa-deu newstest2012 0.53438 24.3 3003 72886
spa-eng newstest2012 0.62058 37.0 3003 72812
deu-eng newstest2013 0.57940 31.5 3000 64505
eng-deu newstest2013 0.55718 27.5 3000 63737
fra-deu newstest2013 0.54408 25.6 3000 63737
fra-eng newstest2013 0.59151 33.9 3000 64505
spa-deu newstest2013 0.55215 26.2 3000 63737
spa-eng newstest2013 0.60465 34.4 3000 64505
deu-eng newstest2014 0.59723 33.1 3003 67337
eng-deu newstest2014 0.59127 28.5 3003 62688
fra-eng newstest2014 0.63411 38.0 3003 70708
deu-eng newstest2015 0.59799 33.7 2169 46443
eng-deu newstest2015 0.59977 32.0 2169 44260
deu-eng newstest2016 0.65039 40.4 2999 64119
eng-deu newstest2016 0.64144 37.9 2999 62669
deu-eng newstest2017 0.60921 35.3 3004 64399
eng-deu newstest2017 0.59114 30.4 3004 61287
deu-eng newstest2018 0.66680 42.6 2998 67012
eng-deu newstest2018 0.69428 45.8 2998 64276
deu-eng newstest2019 0.63482 39.1 2000 39227
eng-deu newstest2019 0.66430 42.0 1997 48746
fra-deu newstest2019 0.60993 29.4 1701 36446
deu-eng newstest2020 0.60403 34.0 785 38220
eng-deu newstest2020 0.60255 32.3 1418 52383
fra-deu newstest2020 0.61470 29.2 1619 30265
deu-eng newstest2021 0.59738 31.9 1000 20180
eng-deu newstest2021 0.56399 26.1 1002 27970
fra-deu newstest2021 0.66155 40.0 1026 26077
deu-eng newstestALL2020 0.60403 34.0 785 38220
eng-deu newstestALL2020 0.60255 32.3 1418 52383
deu-eng newstestB2020 0.60520 34.2 785 37696
eng-deu newstestB2020 0.59226 31.6 1418 53092
deu-afr ntrex128 0.57109 27.9 1997 50050
deu-eng ntrex128 0.62043 34.5 1997 47673
deu-ltz ntrex128 0.47642 15.4 1997 49763
deu-nld ntrex128 0.56777 27.6 1997 51884
eng-afr ntrex128 0.68616 44.1 1997 50050
eng-deu ntrex128 0.58743 30.2 1997 48761
eng-ltz ntrex128 0.50083 18.0 1997 49763
eng-nld ntrex128 0.61041 33.8 1997 51884
fra-afr ntrex128 0.55607 26.5 1997 50050
fra-deu ntrex128 0.53269 23.6 1997 48761
fra-eng ntrex128 0.61058 34.4 1997 47673
fra-ltz ntrex128 0.41312 12.0 1997 49763
fra-nld ntrex128 0.54615 25.2 1997 51884
por-afr ntrex128 0.58296 29.2 1997 50050
por-deu ntrex128 0.54944 24.7 1997 48761
por-eng ntrex128 0.65002 39.6 1997 47673
por-nld ntrex128 0.56384 28.1 1997 51884
spa-afr ntrex128 0.57772 27.7 1997 50050
spa-deu ntrex128 0.54561 24.0 1997 48761
spa-eng ntrex128 0.64305 37.3 1997 47673
spa-nld ntrex128 0.56397 27.8 1997 51884
fra-eng tico19-test 0.62059 39.2 2100 56323
por-eng tico19-test 0.73896 50.3 2100 56315
spa-eng tico19-test 0.72923 49.6 2100 56315
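
The chr-F and BLEU columns above are corpus-level scores. A hedged sketch of how such scores can be reproduced with the sacrebleu library (file names are placeholders, not files shipped with the model):

import sacrebleu

# hypotheses: model outputs, one sentence per line; references: gold translations
with open("hypotheses.txt") as f:
    hyps = [line.strip() for line in f]
with open("references.txt") as f:
    refs = [line.strip() for line in f]

print(sacrebleu.corpus_bleu(hyps, [refs]))   # BLEU
print(sacrebleu.corpus_chrf(hyps, [refs]))   # chr-F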

Citation Information

@article{tiedemann2023democratizing,
  title={Democratizing neural machine translation with {OPUS-MT}},
  author={Tiedemann, J{\"o}rg and Aulamo, Mikko and Bakshandaeva, Daria and Boggia, Michele and Gr{\"o}nroos, Stig-Arne and Nieminen, Tommi and Raganato, Alessandro and Scherrer, Yves and Vazquez, Raul and Virpioja, Sami},
  journal={Language Resources and Evaluation},
  number={58},
  pages={713--755},
  year={2023},
  publisher={Springer Nature},
  issn={1574-0218},
  doi={10.1007/s10579-023-09704-w}
}

@inproceedings{tiedemann-thottingal-2020-opus,
    title = "{OPUS}-{MT} {--} Building open translation services for the World",
    author = {Tiedemann, J{\"o}rg  and Thottingal, Santhosh},
    booktitle = "Proceedings of the 22nd Annual Conference of the European Association for Machine Translation",
    month = nov,
    year = "2020",
    address = "Lisboa, Portugal",
    publisher = "European Association for Machine Translation",
    url = "https://aclanthology.org/2020.eamt-1.61",
    pages = "479--480",
}

@inproceedings{tiedemann-2020-tatoeba,
    title = "The Tatoeba Translation Challenge {--} Realistic Data Sets for Low Resource and Multilingual {MT}",
    author = {Tiedemann, J{\"o}rg},
    booktitle = "Proceedings of the Fifth Conference on Machine Translation",
    month = nov,
    year = "2020",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2020.wmt-1.139",
    pages = "1174--1182",
}

Acknowledgements

The work is supported by the HPLT project, funded by the European Union’s Horizon Europe research and innovation programme under grant agreement No 101070350. We are also grateful for the generous computational resources and IT infrastructure provided by CSC -- IT Center for Science, Finland, and the EuroHPC supercomputer LUMI.

Model conversion info

  • transformers version: 4.45.1
  • OPUS-MT git hash: 0882077
  • port time: Tue Oct 8 10:01:07 EEST 2024
  • port machine: LM0-400-22516.local