fdschmidt93/NLLB-LLM2Vec-Meta-Llama-31-8B-Instruct-mntp-unsup-simcse

fdschmidt93 committed on Oct 31, 2024

Commit 32d323e (verified) · 1 parent: 0a0e65e

docs: add info about tokenizer src_lang

Files changed (1)

README.md

@@ -37,6 +37,8 @@ tags:
 This model has only been trained on self-supervised data and not yet been fine-tuned on any downstream task! This version is expected to perform better than self-supervised adaptation in the original paper, as LoRAs are merged into the model prior to task fine-tuning. The backbone of this model is [LLM2Vec-Meta-Llama-31-8B-Instruct-mntp-unsup-simcse](https://huggingface.co/McGill-NLP/LLM2Vec-Meta-Llama-31-8B-Instruct-mntp-unsup-simcse). We use the encoder of [NLLB-600M](https://huggingface.co/facebook/nllb-200-distilled-600M).
+> ⚠️ Make sure that you correctly set `src_lang` for the language you are using NLLB-LLM2Vec with, i.e., `AutoTokenizer.from_pretrained("facebook/nllb-200-distilled-600M", src_lang=LANG_CODE)`! You can find a list of supported languages [here](https://huggingface.co/facebook/nllb-200-distilled-600M/blob/main/special_tokens_map.json).
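The warning above can be sketched as follows. This is a minimal illustration, not part of the committed README: `check_lang_code` and `load_nllb_tokenizer` are hypothetical helper names, and the regular expression only checks the general shape of NLLB language codes (three lowercase letters, an underscore, a four-letter script code such as `eng_Latn` or `zho_Hans`). It does not confirm that a code is actually in the supported list linked above.

```python
import re

# Assumption: NLLB language codes look like "eng_Latn" or "zho_Hans"
# (language code, underscore, script code). This is a format check only.
_LANG_CODE_SHAPE = re.compile(r"[a-z]{3}_[A-Z][a-z]{3}")


def check_lang_code(lang_code: str) -> str:
    """Return lang_code unchanged if it has the expected shape, else raise."""
    if not _LANG_CODE_SHAPE.fullmatch(lang_code):
        raise ValueError(
            f"{lang_code!r} does not look like an NLLB code such as 'eng_Latn'"
        )
    return lang_code


def load_nllb_tokenizer(lang_code: str):
    """Load the NLLB tokenizer with src_lang set for the input language."""
    # Imported lazily so the format check above works without transformers.
    from transformers import AutoTokenizer

    return AutoTokenizer.from_pretrained(
        "facebook/nllb-200-distilled-600M",
        src_lang=check_lang_code(lang_code),
    )
```

For German input, for example, one would call `load_nllb_tokenizer("deu_Latn")`. Calling it downloads the tokenizer files from the Hub, so it needs network access or a local cache.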
 ## Usage
 ```python
 import torch