BAAI/bge-m3

@@ -9,7 +9,8 @@ license: mit
 For more details please refer to our github repo: https://github.com/FlagOpen/FlagEmbedding
-# BGE-M3
 In this project, we introduce BGE-M3, which is distinguished for its versatility in Multi-Functionality, Multi-Linguality, and Multi-Granularity.
 - Multi-Functionality: It can simultaneously perform the three common retrieval functionalities of an embedding model: dense retrieval, multi-vector retrieval, and sparse retrieval.
 - Multi-Linguality: It can support more than 100 working languages.
@@ -26,12 +27,14 @@ Utilizing the re-ranking model (e.g., [bge-reranker](https://github.com/FlagOpen
 ## News:
 - 2/1/2024: **Thanks for the excellent tool from Vespa.** You can easily use multiple modes of BGE-M3 following this [notebook](https://github.com/vespa-engine/pyvespa/blob/master/docs/sphinx/source/examples/mother-of-all-embedding-models-cloud.ipynb)
 ## Specs
 - Model
 | Model Name |  Dimension | Sequence Length | Introduction |
 |:----:|:---:|:---:|:---:|
 | [BAAI/bge-m3](https://huggingface.co/BAAI/bge-m3) | 1024 | 8192 | multilingual; unified fine-tuning (dense, sparse, and colbert) from bge-m3-unsupervised|
@@ -48,7 +51,6 @@ Utilizing the re-ranking model (e.g., [bge-reranker](https://github.com/FlagOpen
 | [MLDR](https://huggingface.co/datasets/Shitao/MLDR) | Document Retrieval Dataset, covering 13 languages|
 ## FAQ
 **1. Introduction for different retrieval methods**
@@ -57,7 +59,6 @@ Utilizing the re-ranking model (e.g., [bge-reranker](https://github.com/FlagOpen
 - Sparse retrieval (lexical matching): a vector of size equal to the vocabulary, with the majority of positions set to zero, calculating a weight only for tokens present in the text. e.g., BM25, [unicoil](https://arxiv.org/pdf/2106.14807.pdf), and [splade](https://arxiv.org/abs/2107.05720)
 - Multi-vector retrieval: use multiple vectors to represent a text, e.g., [ColBERT](https://arxiv.org/abs/2004.12832).
 **2. Comparison with BGE-v1.5 and other monolingual models**
 BGE-M3 is a multilingual model, and its ability in monolingual embedding retrieval may not surpass models specifically designed for single languages.
@@ -77,6 +78,11 @@ For sparse retrieval methods, most open-source libraries currently do not suppor
 Contributions from the community are welcome.
 **4. How to fine-tune bge-M3 model?**
 You can follow the common practice in this [example](https://github.com/FlagOpen/FlagEmbedding/tree/master/examples/finetune)
@@ -218,10 +224,10 @@ print(model.compute_score(sentence_pairs,
 - Long Document Retrieval
   - MLDR:
   ![avatar](./imgs/long.jpg)
-  Please note that MLDR is a document retrieval dataset we constructed via LLM,
   covering 13 languages, including test set, validation set, and training set.
   We utilized the training set from MLDR to enhance the model's long document retrieval capabilities.
-  Therefore, comparing baseline with `Dense w.o.long`(fine-tuning without long document dataset) is more equitable.
   Additionally, this long document retrieval dataset will be open-sourced to address the current lack of open-source multilingual long text retrieval datasets.
   We believe that this data will be helpful for the open-source community in training document retrieval models.

 For more details please refer to our github repo: https://github.com/FlagOpen/FlagEmbedding
+# BGE-M3 ([paper](https://arxiv.org/pdf/2402.03216.pdf), [code](https://github.com/FlagOpen/FlagEmbedding/tree/master/FlagEmbedding/BGE_M3))
 In this project, we introduce BGE-M3, which is distinguished for its versatility in Multi-Functionality, Multi-Linguality, and Multi-Granularity.
 - Multi-Functionality: It can simultaneously perform the three common retrieval functionalities of an embedding model: dense retrieval, multi-vector retrieval, and sparse retrieval.
 - Multi-Linguality: It can support more than 100 working languages.
 ## News:
+- 2/6/2024: We release the [MLDR](https://huggingface.co/datasets/Shitao/MLDR), a long document retrieval dataset covering 13 languages.
 - 2/1/2024: **Thanks for the excellent tool from Vespa.** You can easily use multiple modes of BGE-M3 following this [notebook](https://github.com/vespa-engine/pyvespa/blob/master/docs/sphinx/source/examples/mother-of-all-embedding-models-cloud.ipynb)
 ## Specs
 - Model
 | Model Name |  Dimension | Sequence Length | Introduction |
 |:----:|:---:|:---:|:---:|
 | [BAAI/bge-m3](https://huggingface.co/BAAI/bge-m3) | 1024 | 8192 | multilingual; unified fine-tuning (dense, sparse, and colbert) from bge-m3-unsupervised|
 | [MLDR](https://huggingface.co/datasets/Shitao/MLDR) | Document Retrieval Dataset, covering 13 languages|
 ## FAQ
 **1. Introduction for different retrieval methods**
 - Sparse retrieval (lexical matching): a vector of size equal to the vocabulary, with the majority of positions set to zero, calculating a weight only for tokens present in the text. e.g., BM25, [unicoil](https://arxiv.org/pdf/2106.14807.pdf), and [splade](https://arxiv.org/abs/2107.05720)
 - Multi-vector retrieval: use multiple vectors to represent a text, e.g., [ColBERT](https://arxiv.org/abs/2004.12832).
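
The sparse and multi-vector scoring schemes described above can be sketched in a few lines. This is a minimal illustration, not the FlagEmbedding implementation: the function names `lexical_score` and `maxsim_score`, the token weights, and the toy vectors are all hypothetical. The multi-vector score follows the sum-of-maxima formulation from the ColBERT paper; other implementations may average instead of summing.

```python
import numpy as np


def lexical_score(query_weights, doc_weights):
    """Sparse (lexical) score: sum of weight products over shared tokens.

    Each argument maps token -> weight. Tokens absent from a text have an
    implicit weight of zero, so only the intersection contributes.
    """
    shared = query_weights.keys() & doc_weights.keys()
    return sum(query_weights[t] * doc_weights[t] for t in shared)


def maxsim_score(query_vecs, doc_vecs):
    """Multi-vector (late interaction) score in the style of ColBERT.

    For every query token vector, take the highest cosine similarity to any
    document token vector, then sum those maxima over the query tokens.
    """
    q = query_vecs / np.linalg.norm(query_vecs, axis=1, keepdims=True)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    sim = q @ d.T  # shape: (num_query_tokens, num_doc_tokens)
    return float(sim.max(axis=1).sum())


# Hypothetical token weights: only "bge" and "m3" appear in both texts.
query_sparse = {"what": 0.1, "bge": 0.8, "m3": 0.7}
doc_sparse = {"bge": 0.5, "m3": 0.6, "embedding": 0.4}
print(lexical_score(query_sparse, doc_sparse))

# Hypothetical 2-dimensional token vectors (real models use far more dimensions).
query_vecs = np.array([[1.0, 0.0], [0.0, 1.0]])
doc_vecs = np.array([[1.0, 0.0], [1.0, 1.0]])
print(maxsim_score(query_vecs, doc_vecs))
```

Storing sparse vectors as token-to-weight dictionaries avoids materializing a vocabulary-sized array in which almost every position is zero.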
 **2. Comparison with BGE-v1.5 and other monolingual models**
 BGE-M3 is a multilingual model, and its ability in monolingual embedding retrieval may not surpass models specifically designed for single languages.
 Contributions from the community are welcome.
+In our experiments, we use [Pyserini](https://github.com/FlagOpen/FlagEmbedding/tree/master/C_MTEB/MLDR#hybrid-retrieval-dense--sparse) and Faiss to do hybrid retrieval.
+**Now you can try the hybrid mode of BGE-M3 in [Vespa](https://github.com/vespa-engine/pyvespa/blob/master/docs/sphinx/source/examples/mother-of-all-embedding-models-cloud.ipynb). Thanks @jobergum.**
 **4. How to fine-tune bge-M3 model?**
 You can follow the common practice in this [example](https://github.com/FlagOpen/FlagEmbedding/tree/master/examples/finetune)
 - Long Document Retrieval
   - MLDR:
   ![avatar](./imgs/long.jpg)
+  Please note that [MLDR](https://huggingface.co/datasets/Shitao/MLDR) is a document retrieval dataset we constructed via LLM,
   covering 13 languages, including test set, validation set, and training set.
   We utilized the training set from MLDR to enhance the model's long document retrieval capabilities.
+  Therefore, comparing baselines with `Dense w.o.long`(fine-tuning without long document dataset) is more equitable.
   Additionally, this long document retrieval dataset will be open-sourced to address the current lack of open-source multilingual long text retrieval datasets.
   We believe that this data will be helpful for the open-source community in training document retrieval models.