Model for English and Russian

This is a truncated version of BAAI/bge-m3.

This model keeps only the English and Russian tokens in its vocabulary, which makes it about 1.5 times smaller than the original model while producing the same embeddings.

The model has been truncated in this notebook.
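
As a rough illustration of the general idea behind vocabulary truncation (not a reproduction of the notebook), the sketch below keeps a subset of rows from a token-embedding table and re-indexes them. The function `truncate_vocab`, the five-token vocabulary, and the two-dimensional vectors are all hypothetical toy data. It shows why kept tokens map to exactly the same vectors after truncation, which is what lets a truncated model reproduce the original embeddings for text made of kept tokens.

```python
# Toy illustration of vocabulary truncation (hypothetical data, not the real model).

def truncate_vocab(vocab, embeddings, keep_tokens):
    """Keep only `keep_tokens`, returning a new vocab and embedding table.

    vocab: dict mapping token -> row index in `embeddings`
    embeddings: list of rows, one vector per token
    keep_tokens: tokens to retain, in the order they should be re-indexed
    """
    new_vocab = {}
    new_embeddings = []
    for token in keep_tokens:
        old_id = vocab[token]
        new_vocab[token] = len(new_embeddings)  # assign the next free id
        new_embeddings.append(embeddings[old_id])  # copy the row unchanged
    return new_vocab, new_embeddings


# Made-up 5-token vocabulary with 2-dimensional vectors.
vocab = {"<s>": 0, "hello": 1, "привет": 2, "bonjour": 3, "你好": 4}
embeddings = [[0.0, 0.1], [1.0, 1.1], [2.0, 2.1], [3.0, 3.1], [4.0, 4.1]]

small_vocab, small_embeddings = truncate_vocab(
    vocab, embeddings, keep_tokens=["<s>", "hello", "привет"]
)

# Kept tokens may get new ids, but their vectors are identical.
for token in small_vocab:
    assert small_embeddings[small_vocab[token]] == embeddings[vocab[token]]

print(len(embeddings), "->", len(small_embeddings))
```

Only the embedding table shrinks in this sketch; the rest of the network would be left untouched, so the size reduction comes entirely from the dropped rows.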

FAQ

Generate Embedding for text

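import torch
from transformers import XLMRobertaModel, XLMRobertaTokenizer

# Load the truncated tokenizer and model from the Hugging Face Hub
tokenizer = XLMRobertaTokenizer.from_pretrained('qilowoq/bge-m3-en-ru')
model = XLMRobertaModel.from_pretrained('qilowoq/bge-m3-en-ru')

sentences = ["This is an example sentence", "Это пример предложения"]

# Encode both sentences in one batch; gradients are not needed for inference
with torch.no_grad():
  inputs = tokenizer(sentences, return_tensors="pt", padding=True, truncation=True)
  embeddings = model(**inputs).pooler_output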
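
A common next step is to compare two sentence embeddings with cosine similarity. The sketch below is a minimal pure-Python version; the helper `cosine_similarity` and the three-dimensional vectors are hypothetical stand-ins, not real model output. With real embeddings, each row of the tensor could be converted with `.tolist()` first, or `torch.nn.functional.cosine_similarity` could be used directly on the tensors.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Made-up vectors standing in for the embeddings of the two example sentences.
emb_en = [0.2, 0.1, 0.7]
emb_ru = [0.25, 0.05, 0.65]

score = cosine_similarity(emb_en, emb_ru)
print(f"cosine similarity: {score:.3f}")
```

Values close to 1 indicate vectors pointing in nearly the same direction, which for sentence embeddings is usually read as similar meaning.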

Acknowledgement

Thanks to the authors of open-sourced datasets, including MIRACL, MKQA, NarrativeQA, etc. Thanks to the open-sourced libraries like Tevatron and Pyserini.

Citation

If you find this repository useful, please consider giving it a star and citing it:

@misc{bge-m3,
      title={BGE M3-Embedding: Multi-Lingual, Multi-Functionality, Multi-Granularity Text Embeddings Through Self-Knowledge Distillation}, 
      author={Jianlv Chen and Shitao Xiao and Peitian Zhang and Kun Luo and Defu Lian and Zheng Liu},
      year={2024},
      eprint={2402.03216},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

Model size: 375M params (Safetensors, tensor type F32)