Part of the Nomic Embed v2 collection of multilingual embedding models.
`nomic-xlm-2048` is a finetuned XLM-RoBERTa Base model with its learned positional embeddings swapped for RoPE, trained for 10k steps on CC100. `nomic-xlm-2048` performs competitively with other multilingual encoders on GLUE and XTREME-R.
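Since the model replaces learned absolute positional embeddings with RoPE, here is a minimal NumPy sketch of rotary position embeddings, for illustration only (the model's actual implementation lives in its remote code and applies the rotation inside attention):

```python
import numpy as np

def rope(x, base=10000.0):
    """Apply rotary position embeddings to x of shape (seq_len, dim).

    Dimensions are split into pairs, and each pair is rotated by an
    angle that grows with the token's position, so attention scores
    end up depending on relative rather than absolute position.
    """
    seq_len, dim = x.shape
    half = dim // 2
    # one rotation frequency per dimension pair, geometrically spaced
    inv_freq = 1.0 / (base ** (np.arange(half) / half))
    angles = np.outer(np.arange(seq_len), inv_freq)  # (seq_len, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:]
    return np.concatenate([x1 * cos - x2 * sin,
                           x1 * sin + x2 * cos], axis=-1)

# Rotations preserve norms, and position 0 is left unchanged.
q = np.random.randn(8, 64)
assert np.allclose(np.linalg.norm(rope(q), axis=-1),
                   np.linalg.norm(q, axis=-1))
```

Because the positions are encoded as rotations rather than learned vectors, the sequence length can be extended (here to 2048) without adding parameters.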
GLUE results:

Model | Params | Pos. | Seq. | Avg. | CoLA | SST-2 | MRPC | STS-B | QQP | MNLI | QNLI | RTE |
---|---|---|---|---|---|---|---|---|---|---|---|---|
XLM-R-Base | 279M | Abs. | 512 | 82.35 | 46.95 | 92.54 | 87.37 | 89.32 | 90.69 | 84.34 | 90.35 | 77.26 |
nomic-xlm-2048 | 278M | RoPE | 2048 | 81.63 | 44.69 | 91.97 | 87.50 | 88.48 | 90.38 | 83.59 | 89.38 | 76.54 |
mGTE-Base | 306M | RoPE | 8192 | 80.77 | 27.22 | 91.97 | 89.71 | 89.55 | 91.20 | 85.16 | 90.91 | 80.41 |
XTREME-R results:

Model | Avg. | XNLI | XCOPA | UDPOS | WikiANN | XQuAD | MLQA | TyDiQA-GoldP | Mewsli-X | LAReQA | Tatoeba |
---|---|---|---|---|---|---|---|---|---|---|---|
XLM-R-Base | 62.31 | 74.49 | 51.8 | 74.33 | 60.99 | 72.96 | 61.45 | 54.31 | 42.45 | 63.49 | 66.79 |
nomic-xlm-2048 | 62.70 | 73.57 | 61.71 | 74.92 | 60.96 | 71.13 | 59.61 | 43.46 | 45.27 | 67.49 | 70.82 |
mGTE-Base | 64.63 | 73.58 | 63.62 | 73.52 | 60.72 | 74.71 | 63.88 | 49.68 | 44.58 | 71.90 | 70.07 |
```python
from transformers import AutoModelForMaskedLM, AutoConfig, AutoTokenizer, pipeline

tokenizer = AutoTokenizer.from_pretrained('nomic-ai/nomic-xlm-2048')  # uses the XLM-RoBERTa tokenizer
config = AutoConfig.from_pretrained('nomic-ai/nomic-xlm-2048', trust_remote_code=True)  # the config needs to be passed in
model = AutoModelForMaskedLM.from_pretrained('nomic-ai/nomic-xlm-2048', config=config, trust_remote_code=True)

# To use this model directly for masked language modeling
classifier = pipeline('fill-mask', model=model, tokenizer=tokenizer, device="cpu")
print(classifier("I <mask> to the store yesterday."))
```
To finetune the model for a Sequence Classification task, you can use the following snippet:

```python
from transformers import AutoConfig, AutoModelForSequenceClassification

model_path = "nomic-ai/nomic-xlm-2048"
config = AutoConfig.from_pretrained(model_path, trust_remote_code=True)
# strict needs to be false here since we're initializing some new params
model = AutoModelForSequenceClassification.from_pretrained(model_path, config=config, trust_remote_code=True, strict=False)
```