opus-mt-tc-bible-big-deu_eng_fra_por_spa-gmw

Model Details

Neural machine translation model for translating from German, English, French, Portuguese and Spanish (deu+eng+fra+por+spa) to West Germanic languages (gmw).

This model is part of the OPUS-MT project, an effort to make neural machine translation models widely available and accessible for many languages of the world. All models were originally trained with Marian NMT, an efficient NMT implementation written in pure C++, and have been converted to PyTorch using the Hugging Face transformers library. Training data is taken from OPUS, and the training pipelines follow the procedures of OPUS-MT-train.

Model Description:

  • Developed by: Language Technology Research Group at the University of Helsinki
  • Model Type: Translation (transformer-big)
  • Release: 2024-05-30
  • License: Apache-2.0
  • Language(s):
    • Source Language(s): deu eng fra por spa
    • Target Language(s): afr ang bar bis bzj deu djk drt eng enm frr fry gos gsw hrx hwc icr jam kri ksh lim ltz nds nld ofs pcm pdc pfl pih pis rop sco srm srn stq swg tcs tpi vls wae yid zea
    • Valid Target Language Labels: >>act<< >>afr<< >>afs<< >>aig<< >>ang<< >>ang_Latn<< >>bah<< >>bar<< >>bis<< >>bjs<< >>brc<< >>bzj<< >>bzk<< >>cim<< >>dcr<< >>deu<< >>djk<< >>djk_Latn<< >>drt<< >>drt_Latn<< >>dum<< >>eng<< >>enm<< >>enm_Latn<< >>fpe<< >>frk<< >>frr<< >>fry<< >>gcl<< >>gct<< >>geh<< >>gmh<< >>gml<< >>goh<< >>gos<< >>gpe<< >>gsw<< >>gul<< >>gyn<< >>hrx<< >>hrx_Latn<< >>hwc<< >>icr<< >>jam<< >>jvd<< >>kri<< >>ksh<< >>kww<< >>lim<< >>lng<< >>ltz<< >>mhn<< >>nds<< >>nld<< >>odt<< >>ofs<< >>ofs_Latn<< >>oor<< >>osx<< >>pcm<< >>pdc<< >>pdt<< >>pey<< >>pfl<< >>pih<< >>pih_Latn<< >>pis<< >>rop<< >>sco<< >>sdz<< >>skw<< >>sli<< >>srm<< >>srn<< >>stl<< >>stq<< >>svc<< >>swg<< >>sxu<< >>tch<< >>tcs<< >>tgh<< >>tpi<< >>trf<< >>twd<< >>uln<< >>vel<< >>vic<< >>vls<< >>vmf<< >>wae<< >>wep<< >>wes<< >>wym<< >>xxx<< >>yec<< >>yid<< >>zea<<
  • Original Model: opusTCv20230926max50+bt+jhubc_transformer-big_2024-05-30.zip
  • Resources for more information:

This is a multilingual translation model with multiple target languages. A sentence-initial language token is required in the form >>id<< (where id is a valid target language ID), e.g. >>afr<<.
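As a minimal illustration of this convention (the helper name below is hypothetical, not part of the model's API), the token can be prepended programmatically:

def add_target_token(text: str, target_lang: str) -> str:
    # prepend the required sentence-initial target-language token, e.g. ">>afr<<"
    return f">>{target_lang}<< {text}"

print(add_target_token("This is a test.", "afr"))  # >>afr<< This is a test.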

Uses

This model can be used for translation and text-to-text generation.

Risks, Limitations and Biases

CONTENT WARNING: Readers should be aware that the model is trained on various public data sets that may contain content that is disturbing, offensive, and can propagate historical and current stereotypes.

Significant research has explored bias and fairness issues with language models (see, e.g., Sheng et al. (2021) and Bender et al. (2021)).

How to Get Started With the Model

A short code example:

from transformers import MarianMTModel, MarianTokenizer

# Each source sentence must start with the target-language token (>>id<<).
src_text = [
    ">>afr<< Replace this with text in an accepted source language.",
    ">>zea<< This is the second sentence."
]

model_name = "Helsinki-NLP/opus-mt-tc-bible-big-deu_eng_fra_por_spa-gmw"
tokenizer = MarianTokenizer.from_pretrained(model_name)
model = MarianMTModel.from_pretrained(model_name)

# Tokenize, translate and decode
translated = model.generate(**tokenizer(src_text, return_tensors="pt", padding=True))
for t in translated:
    print(tokenizer.decode(t, skip_special_tokens=True))
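
Generation can also be tuned through the standard transformers generation arguments (continuing the example above); the values here are illustrative, not settings recommended by the card:

translated = model.generate(
    **tokenizer(src_text, return_tensors="pt", padding=True),
    num_beams=4,         # beam search width
    max_new_tokens=128,  # cap on generated tokens per sentence
)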

You can also use OPUS-MT models with the transformers pipelines, for example:

from transformers import pipeline
pipe = pipeline("translation", model="Helsinki-NLP/opus-mt-tc-bible-big-deu_eng_fra_por_spa-gmw")
print(pipe(">>afr<< Replace this with text in an accepted source language."))
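A minimal sketch of batched use, assuming the standard translation-pipeline output format (a list of dicts with a translation_text field); the batch size is illustrative:

outputs = pipe(
    [">>afr<< First sentence to translate.", ">>nld<< Second sentence to translate."],
    batch_size=8,
)
for out in outputs:
    print(out["translation_text"])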

Training

Evaluation

langpair testset chr-F BLEU #sent #words
deu-afr tatoeba-test-v2021-08-07 0.72039 56.7 1583 9507
deu-deu tatoeba-test-v2021-08-07 0.59545 33.7 2500 20806
deu-eng tatoeba-test-v2021-08-07 0.66015 48.6 17565 149462
deu-ltz tatoeba-test-v2021-08-07 0.53760 34.2 347 2206
deu-nds tatoeba-test-v2021-08-07 0.44534 20.1 9999 76137
deu-nld tatoeba-test-v2021-08-07 0.71276 54.4 10218 75235
eng-afr tatoeba-test-v2021-08-07 0.72087 56.6 1374 10317
eng-deu tatoeba-test-v2021-08-07 0.62971 41.4 17565 151568
eng-eng tatoeba-test-v2021-08-07 0.80306 58.0 12062 115106
eng-fry tatoeba-test-v2021-08-07 0.40324 13.8 220 1600
eng-ltz tatoeba-test-v2021-08-07 0.64423 45.8 293 1828
eng-nds tatoeba-test-v2021-08-07 0.46446 22.2 2500 18264
eng-nld tatoeba-test-v2021-08-07 0.71190 54.5 12696 91796
fra-deu tatoeba-test-v2021-08-07 0.68991 50.3 12418 100545
fra-eng tatoeba-test-v2021-08-07 0.72564 58.0 12681 101754
fra-nld tatoeba-test-v2021-08-07 0.67078 48.7 11548 82164
por-deu tatoeba-test-v2021-08-07 0.68437 48.7 10000 81246
por-eng tatoeba-test-v2021-08-07 0.77081 64.3 13222 105351
por-nds tatoeba-test-v2021-08-07 0.45864 20.7 207 1292
por-nld tatoeba-test-v2021-08-07 0.69865 52.8 2500 17816
spa-afr tatoeba-test-v2021-08-07 0.77148 63.3 448 3044
spa-deu tatoeba-test-v2021-08-07 0.68037 49.1 10521 86430
spa-eng tatoeba-test-v2021-08-07 0.74575 60.2 16583 138123
spa-nds tatoeba-test-v2021-08-07 0.43154 18.5 923 5941
spa-nld tatoeba-test-v2021-08-07 0.68988 51.1 10113 79162
deu-afr flores101-devtest 0.57287 26.0 1012 25740
deu-eng flores101-devtest 0.66660 40.9 1012 24721
deu-nld flores101-devtest 0.55423 23.6 1012 25467
eng-afr flores101-devtest 0.67793 40.0 1012 25740
eng-deu flores101-devtest 0.64295 37.2 1012 25094
eng-nld flores101-devtest 0.57690 26.2 1012 25467
fra-ltz flores101-devtest 0.49430 17.3 1012 25087
fra-nld flores101-devtest 0.54318 22.2 1012 25467
por-deu flores101-devtest 0.58851 29.8 1012 25094
por-nld flores101-devtest 0.54571 22.6 1012 25467
spa-nld flores101-devtest 0.50968 17.5 1012 25467
deu-afr flores200-devtest 0.57725 26.2 1012 25740
deu-eng flores200-devtest 0.67043 41.5 1012 24721
deu-ltz flores200-devtest 0.54626 21.6 1012 25087
deu-nld flores200-devtest 0.55679 24.0 1012 25467
eng-afr flores200-devtest 0.68115 40.2 1012 25740
eng-deu flores200-devtest 0.64561 37.4 1012 25094
eng-ltz flores200-devtest 0.54932 22.0 1012 25087
eng-nld flores200-devtest 0.58124 26.8 1012 25467
eng-tpi flores200-devtest 0.40338 15.9 1012 35240
fra-afr flores200-devtest 0.57320 26.4 1012 25740
fra-deu flores200-devtest 0.58974 29.5 1012 25094
fra-eng flores200-devtest 0.68106 43.7 1012 24721
fra-ltz flores200-devtest 0.49618 17.8 1012 25087
fra-nld flores200-devtest 0.54623 22.5 1012 25467
por-afr flores200-devtest 0.58408 27.6 1012 25740
por-deu flores200-devtest 0.59121 30.4 1012 25094
por-eng flores200-devtest 0.71418 48.3 1012 24721
por-nld flores200-devtest 0.54828 22.9 1012 25467
spa-afr flores200-devtest 0.51514 17.8 1012 25740
spa-deu flores200-devtest 0.53603 21.4 1012 25094
spa-eng flores200-devtest 0.58604 28.2 1012 24721
spa-nld flores200-devtest 0.51244 17.9 1012 25467
deu-eng generaltest2022 0.55777 30.6 1984 37634
eng-deu generaltest2022 0.60792 33.0 2037 38914
fra-deu generaltest2022 0.67039 44.5 2006 37696
deu-eng multi30k_test_2016_flickr 0.60981 40.1 1000 12955
eng-deu multi30k_test_2016_flickr 0.64153 34.9 1000 12106
fra-deu multi30k_test_2016_flickr 0.61781 32.1 1000 12106
fra-eng multi30k_test_2016_flickr 0.66703 47.9 1000 12955
deu-eng multi30k_test_2017_flickr 0.63624 41.0 1000 11374
eng-deu multi30k_test_2017_flickr 0.63423 34.6 1000 10755
fra-deu multi30k_test_2017_flickr 0.60084 29.7 1000 10755
fra-eng multi30k_test_2017_flickr 0.69254 50.4 1000 11374
deu-eng multi30k_test_2017_mscoco 0.55790 32.5 461 5231
eng-deu multi30k_test_2017_mscoco 0.57491 28.6 461 5158
fra-deu multi30k_test_2017_mscoco 0.56108 26.4 461 5158
fra-eng multi30k_test_2017_mscoco 0.68212 49.1 461 5231
deu-eng multi30k_test_2018_flickr 0.59322 36.6 1071 14689
eng-deu multi30k_test_2018_flickr 0.59858 30.0 1071 13703
fra-deu multi30k_test_2018_flickr 0.55667 24.7 1071 13703
fra-eng multi30k_test_2018_flickr 0.64702 43.4 1071 14689
fra-eng newsdiscusstest2015 0.61399 38.5 1500 26982
deu-eng newssyscomb2009 0.55180 28.8 502 11818
eng-deu newssyscomb2009 0.53676 22.9 502 11271
fra-deu newssyscomb2009 0.53733 23.9 502 11271
fra-eng newssyscomb2009 0.57219 31.1 502 11818
spa-deu newssyscomb2009 0.53056 22.0 502 11271
spa-eng newssyscomb2009 0.57225 30.8 502 11818
deu-eng newstest2008 0.54506 26.9 2051 49380
eng-deu newstest2008 0.53077 23.1 2051 47447
fra-deu newstest2008 0.53204 22.9 2051 47447
fra-eng newstest2008 0.54320 26.4 2051 49380
spa-deu newstest2008 0.52066 21.6 2051 47447
spa-eng newstest2008 0.55305 27.9 2051 49380
deu-eng newstest2009 0.53773 26.2 2525 65399
eng-deu newstest2009 0.53217 22.3 2525 62816
fra-deu newstest2009 0.52995 22.9 2525 62816
fra-eng newstest2009 0.56663 30.0 2525 65399
spa-deu newstest2009 0.52586 22.1 2525 62816
spa-eng newstest2009 0.56756 29.9 2525 65399
deu-eng newstest2010 0.58365 30.4 2489 61711
eng-deu newstest2010 0.54917 25.7 2489 61503
fra-deu newstest2010 0.53904 24.3 2489 61503
fra-eng newstest2010 0.59241 32.4 2489 61711
spa-deu newstest2010 0.55378 26.2 2489 61503
spa-eng newstest2010 0.61316 35.8 2489 61711
deu-eng newstest2011 0.54907 26.1 3003 74681
eng-deu newstest2011 0.52873 23.0 3003 72981
fra-deu newstest2011 0.52977 23.0 3003 72981
fra-eng newstest2011 0.59565 32.8 3003 74681
spa-deu newstest2011 0.53095 23.4 3003 72981
spa-eng newstest2011 0.59513 33.3 3003 74681
deu-eng newstest2012 0.56230 28.1 3003 72812
eng-deu newstest2012 0.52871 23.7 3003 72886
fra-deu newstest2012 0.53035 24.1 3003 72886
fra-eng newstest2012 0.59137 33.0 3003 72812
spa-deu newstest2012 0.53438 24.3 3003 72886
spa-eng newstest2012 0.62058 37.0 3003 72812
deu-eng newstest2013 0.57940 31.5 3000 64505
eng-deu newstest2013 0.55718 27.5 3000 63737
fra-deu newstest2013 0.54408 25.6 3000 63737
fra-eng newstest2013 0.59151 33.9 3000 64505
spa-deu newstest2013 0.55215 26.2 3000 63737
spa-eng newstest2013 0.60465 34.4 3000 64505
deu-eng newstest2014 0.59723 33.1 3003 67337
eng-deu newstest2014 0.59127 28.5 3003 62688
fra-eng newstest2014 0.63411 38.0 3003 70708
deu-eng newstest2015 0.59799 33.7 2169 46443
eng-deu newstest2015 0.59977 32.0 2169 44260
deu-eng newstest2016 0.65039 40.4 2999 64119
eng-deu newstest2016 0.64144 37.9 2999 62669
deu-eng newstest2017 0.60921 35.3 3004 64399
eng-deu newstest2017 0.59114 30.4 3004 61287
deu-eng newstest2018 0.66680 42.6 2998 67012
eng-deu newstest2018 0.69428 45.8 2998 64276
deu-eng newstest2019 0.63482 39.1 2000 39227
eng-deu newstest2019 0.66430 42.0 1997 48746
fra-deu newstest2019 0.60993 29.4 1701 36446
deu-eng newstest2020 0.60403 34.0 785 38220
eng-deu newstest2020 0.60255 32.3 1418 52383
fra-deu newstest2020 0.61470 29.2 1619 30265
deu-eng newstest2021 0.59738 31.9 1000 20180
eng-deu newstest2021 0.56399 26.1 1002 27970
fra-deu newstest2021 0.66155 40.0 1026 26077
deu-eng newstestALL2020 0.60403 34.0 785 38220
eng-deu newstestALL2020 0.60255 32.3 1418 52383
deu-eng newstestB2020 0.60520 34.2 785 37696
eng-deu newstestB2020 0.59226 31.6 1418 53092
deu-afr ntrex128 0.57109 27.9 1997 50050
deu-eng ntrex128 0.62043 34.5 1997 47673
deu-ltz ntrex128 0.47642 15.4 1997 49763
deu-nld ntrex128 0.56777 27.6 1997 51884
eng-afr ntrex128 0.68616 44.1 1997 50050
eng-deu ntrex128 0.58743 30.2 1997 48761
eng-ltz ntrex128 0.50083 18.0 1997 49763
eng-nld ntrex128 0.61041 33.8 1997 51884
fra-afr ntrex128 0.55607 26.5 1997 50050
fra-deu ntrex128 0.53269 23.6 1997 48761
fra-eng ntrex128 0.61058 34.4 1997 47673
fra-ltz ntrex128 0.41312 12.0 1997 49763
fra-nld ntrex128 0.54615 25.2 1997 51884
por-afr ntrex128 0.58296 29.2 1997 50050
por-deu ntrex128 0.54944 24.7 1997 48761
por-eng ntrex128 0.65002 39.6 1997 47673
por-nld ntrex128 0.56384 28.1 1997 51884
spa-afr ntrex128 0.57772 27.7 1997 50050
spa-deu ntrex128 0.54561 24.0 1997 48761
spa-eng ntrex128 0.64305 37.3 1997 47673
spa-nld ntrex128 0.56397 27.8 1997 51884
fra-eng tico19-test 0.62059 39.2 2100 56323
por-eng tico19-test 0.73896 50.3 2100 56315
spa-eng tico19-test 0.72923 49.6 2100 56315
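
The chr-F and BLEU columns above are corpus-level scores. A hedged sketch of how such scores can be reproduced with the sacrebleu library (file names are placeholders, not files shipped with the model):

import sacrebleu

# hypotheses: model outputs, one sentence per line; references: gold translations
with open("hypotheses.txt") as f:
    hyps = [line.strip() for line in f]
with open("references.txt") as f:
    refs = [line.strip() for line in f]

print(sacrebleu.corpus_bleu(hyps, [refs]))   # BLEU
print(sacrebleu.corpus_chrf(hyps, [refs]))   # chr-F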

Citation Information

@article{tiedemann2023democratizing,
  title={Democratizing neural machine translation with {OPUS-MT}},
  author={Tiedemann, J{\"o}rg and Aulamo, Mikko and Bakshandaeva, Daria and Boggia, Michele and Gr{\"o}nroos, Stig-Arne and Nieminen, Tommi and Raganato, Alessandro and Scherrer, Yves and Vazquez, Raul and Virpioja, Sami},
  journal={Language Resources and Evaluation},
  number={58},
  pages={713--755},
  year={2023},
  publisher={Springer Nature},
  issn={1574-0218},
  doi={10.1007/s10579-023-09704-w}
}

@inproceedings{tiedemann-thottingal-2020-opus,
    title = "{OPUS}-{MT} {--} Building open translation services for the World",
    author = {Tiedemann, J{\"o}rg  and Thottingal, Santhosh},
    booktitle = "Proceedings of the 22nd Annual Conference of the European Association for Machine Translation",
    month = nov,
    year = "2020",
    address = "Lisboa, Portugal",
    publisher = "European Association for Machine Translation",
    url = "https://aclanthology.org/2020.eamt-1.61",
    pages = "479--480",
}

@inproceedings{tiedemann-2020-tatoeba,
    title = "The Tatoeba Translation Challenge {--} Realistic Data Sets for Low Resource and Multilingual {MT}",
    author = {Tiedemann, J{\"o}rg},
    booktitle = "Proceedings of the Fifth Conference on Machine Translation",
    month = nov,
    year = "2020",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2020.wmt-1.139",
    pages = "1174--1182",
}

Acknowledgements

The work is supported by the HPLT project, funded by the European Union’s Horizon Europe research and innovation programme under grant agreement No 101070350. We are also grateful for the generous computational resources and IT infrastructure provided by CSC -- IT Center for Science, Finland, and the EuroHPC supercomputer LUMI.

Model conversion info

  • transformers version: 4.45.1
  • OPUS-MT git hash: 0882077
  • port time: Tue Oct 8 10:01:07 EEST 2024
  • port machine: LM0-400-22516.local