Strange results

#3
by felixduner - opened

Hey! Not sure if this is a valid question or the right forum, but I'm getting quite strange results trying out the model.

When translating longer texts (multiple sentences), quite often it translates 1-2 sentences before starting to repeat itself (either a specific word or a sentence) until max_length is reached.

Characters like | or & sometimes seem to trigger this behavior, but it also happens without any such characters present.
Email addresses seem to be a problem too, with "[email protected]" becoming "The following information is provided for each Member State:" when translating from French to English.

I'm just wondering if anyone else has experienced something similar? I haven't seen these problems with M2M100, for example.

Thanks!

Did you find any solution to this behavior?

I think the no_repeat_ngram_size parameter should help; you can try it if you still have problems.
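
A minimal sketch of passing it through the pipeline (the checkpoint, the French example text, and the value 3 are illustrative assumptions, not recommendations):

import transformers

translator = transformers.pipeline(
    "translation",
    model="facebook/nllb-200-distilled-600M",  # assumed checkpoint, for illustration
    src_lang="fra_Latn",
    tgt_lang="eng_Latn",
)

# Example input: any multi-sentence text that triggers the looping will do
long_text = (
    "Bonjour, voici un texte de plusieurs phrases. "
    "Il contient un symbole | au milieu. Merci de le traduire correctement."
)

# no_repeat_ngram_size is forwarded to model.generate() and blocks
# any 3-gram from appearing twice in the generated output
result = translator(long_text, no_repeat_ngram_size=3, max_length=512)
print(result[0]["translation_text"])

Note that this is only a decoding-time workaround: it suppresses the looping symptom, but it can also block repetitions that legitimately occur in the source text.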

Hi, I have similar problems; consider the following example:

import transformers

# Initialize the translation pipeline
model_name = "facebook/nllb-200-3.3B"  # or the MoE checkpoint "facebook/nllb-moe-54b" if you have the GPU memory

translator_pipeline = transformers.pipeline(
    "translation",
    model=model_name,
    batch_size=8,
    src_lang="eng_Latn",
    tgt_lang="fra_Latn",
    truncation=True,
    device="cuda:0",
)

print(translator_pipeline("France"))   # Should return "France" or "La France"
print(translator_pipeline("Germany"))  # Should return "Allemagne"
print(translator_pipeline("Paris"))    # Should return "Paris"

Actually outputs:

['Autriche']
['Allemagne']
['Le président']

This is wrong and very unexpected for such a big model on such simple translations. Maybe there is something wrong with the tokenizer?
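
One way to check (a sketch assuming the 3.3B checkpoint; the manual generate() call bypasses the pipeline's own language handling, so it isolates the tokenizer):

from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("facebook/nllb-200-3.3B", src_lang="eng_Latn")
model = AutoModelForSeq2SeqLM.from_pretrained("facebook/nllb-200-3.3B")

# Inspect how the input is actually tokenized
ids = tokenizer("France").input_ids
print(tokenizer.convert_ids_to_tokens(ids))  # e.g. ['eng_Latn', '▁France', '</s>']

# Call generate directly, forcing the French language token as BOS,
# to see whether the pipeline's language handling is the culprit
inputs = tokenizer("France", return_tensors="pt")
out = model.generate(
    **inputs,
    forced_bos_token_id=tokenizer.convert_tokens_to_ids("fra_Latn"),
    max_length=30,
)
print(tokenizer.batch_decode(out, skip_special_tokens=True))

If the tokens look right and the direct generate() call still produces "Autriche", the problem is more likely in decoding than in tokenization.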
