Part of the Nomic Embed v2 collection of multilingual embedding models.
`nomic-xlm-2048` is a finetuned XLM-RoBERTa Base model with its learned positional embeddings swapped for RoPE, trained for 10k steps on CC100. `nomic-xlm-2048` performs competitively with other multilingual encoders on GLUE and XTREME-R.
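Since the model replaces learned absolute positional embeddings with RoPE, here is a minimal NumPy sketch of rotary position embeddings, for illustration only (the model's actual implementation lives in its remote code and applies the rotation inside attention):

```python
import numpy as np

def rope(x, base=10000.0):
    """Apply rotary position embeddings to x of shape (seq_len, dim).

    Dimensions are split into pairs, and each pair is rotated by an
    angle that grows with the token's position, so attention scores
    end up depending on relative rather than absolute position.
    """
    seq_len, dim = x.shape
    half = dim // 2
    # one rotation frequency per dimension pair, geometrically spaced
    inv_freq = 1.0 / (base ** (np.arange(half) / half))
    angles = np.outer(np.arange(seq_len), inv_freq)  # (seq_len, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:]
    return np.concatenate([x1 * cos - x2 * sin,
                           x1 * sin + x2 * cos], axis=-1)

# Rotations preserve norms, and position 0 is left unchanged.
q = np.random.randn(8, 64)
assert np.allclose(np.linalg.norm(rope(q), axis=-1),
                   np.linalg.norm(q, axis=-1))
```

Because the positions are encoded as rotations rather than learned vectors, the sequence length can be extended (here to 2048) without adding parameters.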
GLUE results:

Model | Params | Pos. | Seq. | Avg. | CoLA | SST-2 | MRPC | STS-B | QQP | MNLI | QNLI | RTE |
---|---|---|---|---|---|---|---|---|---|---|---|---|
XLM-R-Base | 279M | Abs. | 512 | 82.35 | 46.95 | 92.54 | 87.37 | 89.32 | 90.69 | 84.34 | 90.35 | 77.26 |
nomic-xlm-2048 | 278M | RoPE | 2048 | 81.63 | 44.69 | 91.97 | 87.50 | 88.48 | 90.38 | 83.59 | 89.38 | 76.54 |
mGTE-Base | 306M | RoPE | 8192 | 80.77 | 27.22 | 91.97 | 89.71 | 89.55 | 91.20 | 85.16 | 90.91 | 80.41 |
XTREME-R results:

Model | Avg. | XNLI | XCOPA | UDPOS | WikiANN | XQuAD | MLQA | TyDiQA-GoldP | Mewsli-X | LAReQA | Tatoeba |
---|---|---|---|---|---|---|---|---|---|---|---|
XLM-R-Base | 62.31 | 74.49 | 51.8 | 74.33 | 60.99 | 72.96 | 61.45 | 54.31 | 42.45 | 63.49 | 66.79 |
nomic-xlm-2048 | 62.70 | 73.57 | 61.71 | 74.92 | 60.96 | 71.13 | 59.61 | 43.46 | 45.27 | 67.49 | 70.82 |
mGTE-Base | 64.63 | 73.58 | 63.62 | 73.52 | 60.72 | 74.71 | 63.88 | 49.68 | 44.58 | 71.90 | 70.07 |
```python
from transformers import AutoModelForMaskedLM, AutoConfig, AutoTokenizer, pipeline

tokenizer = AutoTokenizer.from_pretrained('nomic-ai/nomic-xlm-2048')  # uses the XLM-RoBERTa tokenizer
config = AutoConfig.from_pretrained('nomic-ai/nomic-xlm-2048', trust_remote_code=True)  # the config needs to be passed in
model = AutoModelForMaskedLM.from_pretrained('nomic-ai/nomic-xlm-2048', config=config, trust_remote_code=True)

# To use this model directly for masked language modeling
classifier = pipeline('fill-mask', model=model, tokenizer=tokenizer, device="cpu")
print(classifier("I <mask> to the store yesterday."))
```
To finetune the model for a Sequence Classification task, you can use the following snippet:

```python
from transformers import AutoConfig, AutoModelForSequenceClassification

model_path = "nomic-ai/nomic-xlm-2048"
config = AutoConfig.from_pretrained(model_path, trust_remote_code=True)
# strict needs to be false here since we're initializing some new params
model = AutoModelForSequenceClassification.from_pretrained(model_path, config=config, trust_remote_code=True, strict=False)
```