Model Card for NLLB-with-myv-v2024 (a translation model for Erzya)

This is a version of the nllb-200-distilled-600M machine translation model with one added language: Erzya (the new language code is myv_Cyrl). It can probably translate from all 202 NLLB languages, but it was fine-tuned with a focus on Erzya and Russian and, to a lesser extent, on Arabic, English, Estonian, Finnish, French, German, Kazakh, Mandarin, Mongolian, Spanish, Turkish, Ukrainian, and Uzbek.

Model Details

Model Description

Developed by: Isai Gordeev, Sergey Kuldin and David Dale
Model type: Encoder-decoder transformer
Language(s) (NLP): Erzya, Russian, and all the 202 NLLB languages.
License: CC-BY-NC-4.0
Finetuned from model: nllb-200-distilled-600M

Model Sources

Repository: will be published later
Paper: will be published later
Demo: https://lango.to/ (it is powered by a similar model)

Uses

Direct Use

Translation between Erzya, Russian, and potentially other languages. The model appears to be state of the art for translation into Erzya.

Out-of-Scope Use

Translation between other NLLB languages that does not involve Erzya as the source or target.

Bias, Risks, and Limitations

The model does not produce the most fluent translations into Russian and other high-resource languages.

Its translations into Erzya seem to be better than those of any available alternative, but they may still be inaccurate or ungrammatical, so they should always be manually reviewed before any high-stakes use.

Recommendations

Please contact the authors for any substantial recommendation.

How to Get Started with the Model

See the NLLB generation code: https://huggingface.co/docs/transformers/v4.44.2/en/model_doc/nllb#generating-with-nllb.
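The linked recipe should carry over to this model, with myv_Cyrl used as an ordinary NLLB language code. Below is a minimal sketch, assuming the checkpoint is available on the Hugging Face Hub under the id slone/nllb-with-myv-v2024 (an assumption based on the model name) and that its tokenizer recognizes myv_Cyrl. The helper names load_model and translate are hypothetical, and the beam size of 4 is an arbitrary choice.

```python
def load_model(model_name="slone/nllb-with-myv-v2024"):  # assumed Hub id
    """Load the tokenizer and the model (downloads the weights on first call)."""
    # Imported lazily so that translate() can be reused without this dependency.
    from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
    return tokenizer, model


def translate(texts, tokenizer, model, src_lang="rus_Cyrl", tgt_lang="myv_Cyrl",
              max_length=128, num_beams=4):
    """Translate a list of sentences following the standard NLLB generation recipe."""
    tgt_id = tokenizer.convert_tokens_to_ids(tgt_lang)
    if tgt_id == tokenizer.unk_token_id:
        # The tokenizer did not pick up the language code (e.g. the new myv_Cyrl).
        raise ValueError(f"Tokenizer does not know the language code {tgt_lang!r}")
    tokenizer.src_lang = src_lang  # the source language tag is added to the input
    inputs = tokenizer(texts, return_tensors="pt", padding=True,
                       truncation=True, max_length=max_length)
    generated = model.generate(**inputs, forced_bos_token_id=tgt_id,
                               max_new_tokens=max_length, num_beams=num_beams)
    return tokenizer.batch_decode(generated, skip_special_tokens=True)


# Usage (downloads the checkpoint on first run):
# tokenizer, model = load_model()
# print(translate(["Привет, как дела?"], tokenizer, model))
```

The explicit check on the target language id is a defensive choice: if the language code is missing from the vocabulary, generation would otherwise silently start from the unknown token.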

Training Details

Training Data

Training Procedure

Preprocessing

The preprocessing code is adapted from the Stopes repo of the NLLB team: https://github.com/facebookresearch/stopes/blob/main/stopes/pipelines/monolingual/monolingual_line_processor.py#L214

It performs punctuation normalization, nonprintable character removal and Unicode normalization.
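The three steps can be illustrated with a standard-library-only sketch. It is not the Stopes code: the punctuation table is a small illustrative subset, the NFKC normalization form is an assumption, and the names PUNCT_MAP, remove_nonprintable, and preprocess are hypothetical.

```python
import re
import unicodedata

# A small, illustrative subset of punctuation replacements (not the Stopes table).
PUNCT_MAP = {
    "«": '"', "»": '"', "“": '"', "”": '"', "„": '"',
    "‘": "'", "’": "'",
    "\u2013": "-",    # en dash
    "\u2014": "-",    # em dash
    "\u2026": "...",  # horizontal ellipsis
}
PUNCT_TABLE = str.maketrans(PUNCT_MAP)


def remove_nonprintable(text):
    """Replace control, format, and other 'C*' category characters with spaces."""
    return "".join(
        " " if unicodedata.category(ch).startswith("C") else ch for ch in text
    )


def preprocess(text):
    """Apply the three cleaning steps to a single line of text."""
    text = text.translate(PUNCT_TABLE)          # 1. punctuation normalization
    text = remove_nonprintable(text)            # 2. non-printable character removal
    text = unicodedata.normalize("NFKC", text)  # 3. Unicode normalization (assumed NFKC)
    return re.sub(r"\s+", " ", text).strip()    # collapse whitespace left by step 2


print(preprocess("«Hello»\u200b world\u00a0!"))
```

Applying the same cleaning at inference time as at training time keeps the input distribution consistent, which is why such a function is usually shared between the two.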

Training Hyperparameters

The tokenizer of the model was extended with 6209 new Erzya tokens. Each new token was initialized with the average of the embeddings of the old tokens from which it is composed.
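The initialization rule can be shown on a toy example. This is a pure-Python sketch with plain lists instead of model tensors; the function names, the character-level toy tokenizer, and the two-dimensional embeddings are all hypothetical.

```python
def average_embedding(piece_ids, embeddings):
    """Return the element-wise mean of the embedding rows listed in piece_ids."""
    dim = len(embeddings[piece_ids[0]])
    return [
        sum(embeddings[i][d] for i in piece_ids) / len(piece_ids)
        for d in range(dim)
    ]


def init_new_tokens(new_tokens, old_tokenize, embeddings):
    """Return a copy of the embedding table with one appended row per new token.

    old_tokenize maps a new token to the ids of the old tokens it is composed of.
    """
    extended = [row[:] for row in embeddings]  # leave the original table untouched
    for token in new_tokens:
        piece_ids = old_tokenize(token)
        extended.append(average_embedding(piece_ids, embeddings))
    return extended


# Toy setup: three old tokens with 2-dimensional embeddings.
old_vocab = {"a": 0, "b": 1, "c": 2}
old_embeddings = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
new_table = init_new_tokens(
    ["abc", "ab"], lambda tok: [old_vocab[ch] for ch in tok], old_embeddings
)
print(new_table[3:])  # rows for the new tokens "abc" and "ab"
```

Starting from the average of the constituent embeddings places each new token near the representation the model already had for the same text, rather than at a random point.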

training regime: fp32
batch_size: 6
grad_acc_steps: 4
max_length: 128
optimizer: Adafactor
lr: 1e-4
clip_threshold: 1.0
weight_decay: 1e-3
warmup_steps: 3_000 (with a linear warmup from 0)
training_steps: 220_000
weight_loss_coef: 100 (a coefficient for the additional penalty, MSE between the embeddings of old tokens and their values for NLLB-200)
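A short sketch of how some of these values interact, using plain Python rather than the training code. It assumes that the learning rate stays constant after the warmup (the list above does not say what follows it) and that the embedding penalty is simply added to the cross-entropy loss; all function names are hypothetical.

```python
BATCH_SIZE = 6
GRAD_ACC_STEPS = 4
LR = 1e-4
WARMUP_STEPS = 3_000
WEIGHT_LOSS_COEF = 100


def effective_batch_size():
    """Number of examples that contribute to one optimizer update."""
    return BATCH_SIZE * GRAD_ACC_STEPS


def learning_rate(step):
    """Linear warmup from 0 to LR; constant afterwards (an assumption)."""
    return LR * min(1.0, step / WARMUP_STEPS)


def embedding_penalty(current, reference):
    """Scaled MSE between the current embeddings of old tokens and their
    original NLLB-200 values (both given as lists of rows)."""
    squared_diffs = [
        (c - r) ** 2
        for cur_row, ref_row in zip(current, reference)
        for c, r in zip(cur_row, ref_row)
    ]
    return WEIGHT_LOSS_COEF * sum(squared_diffs) / len(squared_diffs)


def total_loss(cross_entropy, current, reference):
    """Translation loss plus the penalty (additive combination is an assumption)."""
    return cross_entropy + embedding_penalty(current, reference)


print(effective_batch_size())
print(learning_rate(1_500), learning_rate(220_000))
print(total_loss(2.0, [[1.0, 2.0]], [[1.0, 1.0]]))
```

The penalty discourages the embeddings of the original NLLB tokens from drifting during fine-tuning, which is presumably intended to preserve translation quality for the languages that were not the focus of training.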

Technical Specifications

Model Architecture and Objective

A standard encoder-decoder translation model with cross-entropy loss.

Compute Infrastructure

Google Colab with a T4 GPU.

pip install --upgrade sentencepiece transformers==4.40 datasets sacremoses editdistance sacrebleu razdel ctranslate2

Model Card Contact

@cointegrated
