SnypzZz's picture
Update README.md (#5)
0b7d5cf
|
raw
history blame
4.19 kB
metadata
language:
  - multilingual
  - ar
  - cs
  - de
  - en
  - es
  - et
  - fi
  - fr
  - gu
  - hi
  - it
  - ja
  - kk
  - ko
  - lt
  - lv
  - my
  - ne
  - nl
  - ro
  - ru
  - si
  - tr
  - vi
  - zh
  - af
  - az
  - bn
  - fa
  - he
  - hr
  - id
  - ka
  - km
  - mk
  - ml
  - mn
  - mr
  - pl
  - ps
  - pt
  - sv
  - sw
  - ta
  - te
  - th
  - tl
  - uk
  - ur
  - xh
  - gl
  - sl
tags:
  - transformers
  - text-generation-inference
  - code
  - PyTorch
library_name: transformers

mBART-50 one to many multilingual machine translation GGML

This model is a fine-tuned checkpoint of TheBloke-Llama-2-13B. mbart-large-50-one-to-many-mmt is fine-tuned for multilingual machine translation. It was introduced in Multilingual Translation with Extensible Multilingual Pretraining and Finetuning paper.

The model can translate English to other 49 languages mentioned below. To translate into a target language, the target language id is forced as the first generated token. To force the target language id as the first generated token, pass the forced_bos_token_id parameter to the generate method.

from transformers import MBartForConditionalGeneration, MBart50TokenizerFast
article_en = "The head of the United Nations says there is no military solution in Syria"
model = MBartForConditionalGeneration.from_pretrained("SnypzZz/Llama2-13b-Language-translate")
tokenizer = MBart50TokenizerFast.from_pretrained("SnypzZz/Llama2-13b-Language-translate", src_lang="en_XX")

model_inputs = tokenizer(article_en, return_tensors="pt")

# translate from English to Hindi
generated_tokens = model.generate(
    **model_inputs,
    forced_bos_token_id=tokenizer.lang_code_to_id["hi_IN"]
)
tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)
# => 'संयुक्त राष्ट्र के नेता कहते हैं कि सीरिया में कोई सैन्य समाधान नहीं है'

# translate from English to Chinese
generated_tokens = model.generate(
    **model_inputs,
    forced_bos_token_id=tokenizer.lang_code_to_id["ko_KR"]
)
tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)
# => '联合国首脑说,叙利亚没有军事解决办法'

See the model hub to look for more fine-tuned versions.

Languages covered

Arabic (ar_AR), Czech (cs_CZ), German (de_DE), English (en_XX), Spanish (es_XX), Estonian (et_EE), Finnish (fi_FI), French (fr_XX), Gujarati (gu_IN), Hindi (hi_IN), Italian (it_IT), Japanese (ja_XX), Kazakh (kk_KZ), Korean (ko_KR), Lithuanian (lt_LT), Latvian (lv_LV), Burmese (my_MM), Nepali (ne_NP), Dutch (nl_XX), Romanian (ro_RO), Russian (ru_RU), Sinhala (si_LK), Turkish (tr_TR), Vietnamese (vi_VN), Chinese (zh_CN), Afrikaans (af_ZA), Azerbaijani (az_AZ), Bengali (bn_IN), Persian (fa_IR), Hebrew (he_IL), Croatian (hr_HR), Indonesian (id_ID), Georgian (ka_GE), Khmer (km_KH), Macedonian (mk_MK), Malayalam (ml_IN), Mongolian (mn_MN), Marathi (mr_IN), Polish (pl_PL), Pashto (ps_AF), Portuguese (pt_XX), Swedish (sv_SE), Swahili (sw_KE), Tamil (ta_IN), Telugu (te_IN), Thai (th_TH), Tagalog (tl_XX), Ukrainian (uk_UA), Urdu (ur_PK), Xhosa (xh_ZA), Galician (gl_ES), Slovene (sl_SI)

BibTeX entry and citation info

@article{tang2020multilingual,
    title={Multilingual Translation with Extensible Multilingual Pretraining and Finetuning},
    author={Yuqing Tang and Chau Tran and Xian Li and Peng-Jen Chen and Naman Goyal and Vishrav Chaudhary and Jiatao Gu and Angela Fan},
    year={2020},
    eprint={2008.00401},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}

Discord

For further support, and discussions on these models and AI in general, join us at:

SnypzZz's Discord server

PS i am a real gaming fanatic and this is also my gaming server so if anyone wants to play VALORANT or any other games, feel free to ping me--- @SNYPER#1942.

instagram

SnypzZz's Instagram

LinkedIn

SnypzZz's LinkedIn profile