Adding new languages to model
Hello, friends!
Your model is a very good solution for MNMT tasks, and I want to fine-tune it for the Arabic and Belarusian languages. Could you please give instructions on how to add new languages to your tokenizer?
Thanks for paying attention to MITRE!
Unfortunately, we do not have and do not plan to implement a pipeline for adding new languages to our models.
However, in MNMT, there are many mature methods to do this.
Here, we recommend referring to [1] and [2] and following these steps:
(1) Train a new SentencePiece model whose vocabulary covers the new languages.
(2) Extend MITRE's embedding parameters to the size of the new vocabulary.
(3) Initialize the new embedding parameters (the extended part only) with the average of the original embeddings plus random noise. Notably, the original part must remain unchanged.
(4) Note that you also have to add new language tags for the added languages.
After these steps, you can fine-tune MITRE on the new languages. The limitation is that you have to set up a new tokenizer (based on your SentencePiece model) to handle your data.
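The steps above can be sketched as follows. This is a minimal illustration, not MITRE's actual API: the corpus path, vocabulary sizes, and function names are assumptions, and the embedding extension is shown on a plain NumPy matrix that you would copy into the model's embedding layer yourself.

```python
import numpy as np


def train_new_spm(corpus_path: str, model_prefix: str = "new_spm",
                  vocab_size: int = 32000) -> None:
    """Step (1): train a new SentencePiece model.

    The corpus path and vocab size are placeholders for your own data.
    """
    import sentencepiece as spm
    spm.SentencePieceTrainer.train(
        input=corpus_path,
        model_prefix=model_prefix,
        vocab_size=vocab_size,
        character_coverage=1.0,  # full coverage helps for Arabic/Cyrillic scripts
    )


def extend_embedding(old_weight: np.ndarray, new_vocab_size: int,
                     noise_std: float = 0.02) -> np.ndarray:
    """Steps (2)+(3): append rows for the new tokens.

    The original rows are kept unchanged; each new row is initialized
    as the mean of the original rows plus small Gaussian noise.
    """
    old_vocab_size, dim = old_weight.shape
    assert new_vocab_size >= old_vocab_size
    mean = old_weight.mean(axis=0, keepdims=True)
    extra = np.repeat(mean, new_vocab_size - old_vocab_size, axis=0)
    extra = extra + noise_std * np.random.randn(*extra.shape)
    return np.concatenate([old_weight, extra], axis=0)
```

The same averaging-plus-noise initialization applies to the embeddings of the new language tags in step (4), since tags are simply extra vocabulary entries.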
References:
[1] Improving Zero-Shot Translation by Disentangling Positional Information. ACL 2021. (https://aclanthology.org/2021.acl-long.101.pdf) (Section 4.2)
[2] Adapting to Non-Centered Languages for Zero-shot Multilingual Translation. COLING 2022. (https://aclanthology.org/2022.coling-1.467.pdf) (Section 4.4)
Thank you!