Strange results

#3
by felixduner - opened

Hey! Not sure if this is a valid question or the right forum, but I'm getting quite strange results trying out the model.

When translating longer texts (multiple sentences), quite often it translates 1-2 sentences before starting to repeat itself (either a specific word or a sentence) until max_length is reached.

Characters like | or & sometimes seem to trigger this behavior, but it also happens without any such characters present.
Email addresses seem to be a problem too, with "[email protected]" becoming "The following information is provided for each Member State:" when translating from French to English.

I'm just wondering if anyone else has experienced something similar? I haven't seen these problems with M2M100, for example.

Thanks!

Did you find any solution to this behavior?

I think the no_repeat_ngram_size parameter should help; you can try it if you still have problems.
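
A minimal sketch of passing it through the pipeline (the checkpoint, the French example text, and the value 3 are illustrative assumptions, not recommendations):

import transformers

translator = transformers.pipeline(
    "translation",
    model="facebook/nllb-200-distilled-600M",  # assumed checkpoint, for illustration
    src_lang="fra_Latn",
    tgt_lang="eng_Latn",
)

# Example input: any multi-sentence text that triggers the looping will do
long_text = (
    "Bonjour, voici un texte de plusieurs phrases. "
    "Il contient un symbole | au milieu. Merci de le traduire correctement."
)

# no_repeat_ngram_size is forwarded to model.generate() and blocks
# any 3-gram from appearing twice in the generated output
result = translator(long_text, no_repeat_ngram_size=3, max_length=512)
print(result[0]["translation_text"])

Note that this is only a decoding-time workaround: it suppresses the looping symptom, but it can also block repetitions that legitimately occur in the source text.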

Hi, I have similar problems; consider the following example:

import transformers

# Initialize the translation pipeline
model_name = "facebook/nllb-200-3.3B"  # or the MoE checkpoint "facebook/nllb-moe-54b" if you have the GPU memory

translator_pipeline = transformers.pipeline(
    "translation",
    model=model_name,
    batch_size=8,
    src_lang="eng_Latn",
    tgt_lang="fra_Latn",
    truncation=True,
    device="cuda:0",
)

print(translator_pipeline("France"))   # Should return "France" or "La France"
print(translator_pipeline("Germany"))  # Should return "Allemagne"
print(translator_pipeline("Paris"))    # Should return "Paris"

Actually outputs:

['Autriche']
['Allemagne']
['Le président']

This is wrong and very unexpected for such a big model on such simple translations. Maybe there is something wrong with the tokenizer?
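
One way to check (a sketch assuming the 3.3B checkpoint; the manual generate() call bypasses the pipeline's own language handling, so it isolates the tokenizer):

from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("facebook/nllb-200-3.3B", src_lang="eng_Latn")
model = AutoModelForSeq2SeqLM.from_pretrained("facebook/nllb-200-3.3B")

# Inspect how the input is actually tokenized
ids = tokenizer("France").input_ids
print(tokenizer.convert_ids_to_tokens(ids))  # e.g. ['eng_Latn', '▁France', '</s>']

# Call generate directly, forcing the French language token as BOS,
# to see whether the pipeline's language handling is the culprit
inputs = tokenizer("France", return_tensors="pt")
out = model.generate(
    **inputs,
    forced_bos_token_id=tokenizer.convert_tokens_to_ids("fra_Latn"),
    max_length=30,
)
print(tokenizer.batch_decode(out, skip_special_tokens=True))

If the tokens look right and the direct generate() call still produces "Autriche", the problem is more likely in decoding than in tokenization.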
