antoinelouis committed on
Commit 8784293
Parent(s): 9376ca0

Update README.md

Files changed (1)
  1. README.md +17 -21
README.md CHANGED
@@ -12,7 +12,7 @@ tags:
  library_name: sentence-transformers
  ---
 
- # biencoder-distilcamembert-base-mmarcoFR
+ # biencoder-distilcamembert-mmarcoFR
 
  This is a [sentence-transformers](https://www.SBERT.net) model: it maps sentences and paragraphs to a 768-dimensional dense vector space and can be used for tasks like clustering or semantic search. The model was trained on the **French** portion of the [mMARCO](https://huggingface.co/datasets/unicamp-dl/mmarco) dataset.
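Once loaded, the checkpoint can serve as a drop-in French retriever. Below is a minimal semantic-search sketch (illustrative sentences and scoring only, not part of the committed file; `util.cos_sim` is the stock sentence-transformers helper):

```python
# Illustration only: score French passages against a query by cosine similarity.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('antoinelouis/biencoder-distilcamembert-mmarcoFR')
query_emb = model.encode("Quelle est la capitale de la France ?")
passage_embs = model.encode([
    "Paris est la capitale de la France.",
    "Le mont Blanc est le plus haut sommet des Alpes.",
])
print(util.cos_sim(query_emb, passage_embs))  # higher score = more relevant passage
```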
 
@@ -33,13 +33,11 @@ Then you can use the model like this:
  from sentence_transformers import SentenceTransformer
  sentences = ["This is an example sentence", "Each sentence is converted"]
 
- model = SentenceTransformer('antoinelouis/biencoder-distilcamembert-base-mmarcoFR')
+ model = SentenceTransformer('antoinelouis/biencoder-distilcamembert-mmarcoFR')
  embeddings = model.encode(sentences)
  print(embeddings)
  ```
 
-
-
  #### 🤗 Transformers
 
  Without [sentence-transformers](https://www.SBERT.net), you can use the model like this: first, pass your input through the transformer model, then apply the right pooling operation on top of the contextualized word embeddings.
@@ -60,8 +58,8 @@ def mean_pooling(model_output, attention_mask):
  sentences = ['This is an example sentence', 'Each sentence is converted']
 
  # Load model from HuggingFace Hub
- tokenizer = AutoTokenizer.from_pretrained('antoinelouis/biencoder-distilcamembert-base-mmarcoFR')
- model = AutoModel.from_pretrained('antoinelouis/biencoder-distilcamembert-base-mmarcoFR')
+ tokenizer = AutoTokenizer.from_pretrained('antoinelouis/biencoder-distilcamembert-mmarcoFR')
+ model = AutoModel.from_pretrained('antoinelouis/biencoder-distilcamembert-mmarcoFR')
 
  # Tokenize sentences
  encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')
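The hunk above only carries the renamed `from_pretrained()` calls; the surrounding README recipe is the standard mean-pooling pattern. A self-contained reconstruction of that pattern (a sketch, not a verbatim copy of the file):

```python
# Minimal sketch of the full 🤗 Transformers usage around this hunk,
# reconstructed from the standard sentence-transformers mean-pooling recipe.
import torch
from transformers import AutoTokenizer, AutoModel

def mean_pooling(model_output, attention_mask):
    # Average token embeddings, ignoring padding positions.
    token_embeddings = model_output[0]
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)

sentences = ['This is an example sentence', 'Each sentence is converted']

tokenizer = AutoTokenizer.from_pretrained('antoinelouis/biencoder-distilcamembert-mmarcoFR')
model = AutoModel.from_pretrained('antoinelouis/biencoder-distilcamembert-mmarcoFR')

encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')
with torch.no_grad():
    model_output = model(**encoded_input)

sentence_embeddings = mean_pooling(model_output, encoded_input['attention_mask'])
print(sentence_embeddings)
```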
@@ -79,19 +77,19 @@ print(sentence_embeddings)
 
  ## Evaluation
  ***
+
  We evaluated our model on the smaller development set of mMARCO-fr, which consists of 6,980 queries for a corpus of 8.8M candidate passages. Below, we compare the model's performance with that of other biencoder models fine-tuned on the same dataset. We report the mean reciprocal rank (MRR), normalized discounted cumulative gain (NDCG), mean average precision (MAP), and recall at various cut-offs (R@k).
 
- |   | model | Size | MRR@10 | NDCG@10 | MAP@10 | R@10 | R@100 (↑) | R@500 |
- |--:|:------|-----:|-------:|--------:|-------:|-----:|----------:|------:|
- | 1 | [biencoder-camembert-base-mmarcoFR](https://huggingface.co/antoinelouis/biencoder-camembert-base-mmarcoFR) | 443MB | 28.53 | 33.72 | 27.93 | 51.46 | 77.82 | 89.13 |
- | 2 | [biencoder-all-mpnet-base-v2-mmarcoFR](https://huggingface.co/antoinelouis/biencoder-all-mpnet-base-v2-mmarcoFR) | 438MB | 28.04 | 33.28 | 27.50 | 51.07 | 77.68 | 88.67 |
- | 3 | [biencoder-sentence-camembert-base-mmarcoFR](https://huggingface.co/antoinelouis/biencoder-sentence-camembert-base-mmarcoFR) | 443MB | 27.63 | 32.70 | 27.01 | 50.10 | 76.85 | 88.73 |
- | 4 | **biencoder-distilcamembert-base-mmarcoFR** | 272MB | 26.80 | 31.87 | 26.23 | 49.20 | 76.44 | 87.87 |
- | 5 | [biencoder-mMiniLMv2-L12-H384-distilled-from-XLMR-Large-mmarcoFR](https://huggingface.co/antoinelouis/biencoder-mMiniLMv2-L12-H384-distilled-from-XLMR-Large-mmarcoFR) | 471MB | 24.74 | 29.41 | 24.23 | 45.40 | 71.52 | 84.42 |
- | 6 | [biencoder-camemberta-base-mmarcoFR](https://huggingface.co/antoinelouis/biencoder-camemberta-base-mmarcoFR) | 447MB | 24.78 | 29.24 | 24.23 | 44.58 | 69.59 | 82.18 |
- | 7 | [biencoder-electra-base-french-europeana-cased-discriminator-mmarcoFR](https://huggingface.co/antoinelouis/biencoder-electra-base-french-europeana-cased-discriminator-mmarcoFR) | 440MB | 23.38 | 27.97 | 22.91 | 43.50 | 68.96 | 81.61 |
- | 8 | [biencoder-mMiniLM-L6-v2-mmarco-mmarcoFR](https://huggingface.co/antoinelouis/biencoder-mMiniLM-L6-v2-mmarco-mmarcoFR) | 428MB | 22.87 | 27.26 | 22.37 | 42.30 | 68.78 | 81.39 |
- | 9 | [biencoder-mMiniLMv2-L6-H384-distilled-from-XLMR-Large-mmarcoFR](https://huggingface.co/antoinelouis/biencoder-mMiniLMv2-L6-H384-distilled-from-XLMR-Large-mmarcoFR) | 428MB | 22.29 | 26.57 | 21.80 | 41.25 | 66.78 | 79.83 |
+ |   | model | Vocab. | #Param. | Size | MRR@10 | NDCG@10 | MAP@10 | R@10 | R@100 (↑) | R@500 |
+ |--:|:------|:-------|--------:|-----:|-------:|--------:|-------:|-----:|----------:|------:|
+ | 1 | [biencoder-camembert-base-mmarcoFR](https://huggingface.co/antoinelouis/biencoder-camembert-base-mmarcoFR) | 🇫🇷 | 110M | 443MB | 28.53 | 33.72 | 27.93 | 51.46 | 77.82 | 89.13 |
+ | 2 | [biencoder-mpnet-base-all-v2-mmarcoFR](https://huggingface.co/antoinelouis/biencoder-mpnet-base-all-v2-mmarcoFR) | 🇬🇧 | 109M | 438MB | 28.04 | 33.28 | 27.50 | 51.07 | 77.68 | 88.67 |
+ | 3 | **biencoder-distilcamembert-mmarcoFR** | 🇫🇷 | 68M | 272MB | 26.80 | 31.87 | 26.23 | 49.20 | 76.44 | 87.87 |
+ | 4 | [biencoder-MiniLM-L6-all-v2-mmarcoFR](https://huggingface.co/antoinelouis/biencoder-MiniLM-L6-all-v2-mmarcoFR) | 🇬🇧 | 23M | 91MB | 25.49 | 30.39 | 24.99 | 47.10 | 73.48 | 86.09 |
+ | 5 | [biencoder-mMiniLMv2-L12-mmarcoFR](https://huggingface.co/antoinelouis/biencoder-mMiniLMv2-L12-mmarcoFR) | 🇫🇷,99+ | 117M | 471MB | 24.74 | 29.41 | 24.23 | 45.40 | 71.52 | 84.42 |
+ | 6 | [biencoder-camemberta-base-mmarcoFR](https://huggingface.co/antoinelouis/biencoder-camemberta-base-mmarcoFR) | 🇫🇷 | 112M | 447MB | 24.78 | 29.24 | 24.23 | 44.58 | 69.59 | 82.18 |
+ | 7 | [biencoder-electra-base-french-mmarcoFR](https://huggingface.co/antoinelouis/biencoder-electra-base-french-mmarcoFR) | 🇫🇷 | 110M | 440MB | 23.38 | 27.97 | 22.91 | 43.50 | 68.96 | 81.61 |
+ | 8 | [biencoder-mMiniLMv2-L6-mmarcoFR](https://huggingface.co/antoinelouis/biencoder-mMiniLMv2-L6-mmarcoFR) | 🇫🇷,99+ | 107M | 428MB | 22.29 | 26.57 | 21.80 | 41.25 | 66.78 | 79.83 |
 
  ## Training
  ***
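For reference, MRR@10 (the headline column) is the mean over queries of 1/rank of the first relevant passage among the top 10 results. A small illustrative helper, with hypothetical `run`/`qrels` inputs that are not part of the repository:

```python
# Hedged sketch: computing MRR@k from ranked passage IDs.
# `run` maps query_id -> ranked list of passage ids (retriever output);
# `qrels` maps query_id -> set of relevant passage ids. Both names are made up.
def mrr_at_k(run, qrels, k=10):
    total = 0.0
    for qid, ranking in run.items():
        for rank, pid in enumerate(ranking[:k], start=1):
            if pid in qrels.get(qid, set()):
                total += 1.0 / rank  # reciprocal rank of first relevant hit
                break
    return total / len(run)
```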
@@ -113,17 +111,15 @@ We used the French version of the [mMARCO](https://huggingface.co/datasets/unicamp-dl/mmarco)
  - a smaller dev set of 6,980 queries (which is actually used for evaluation in most published works).
  Link: [https://ir-datasets.com/mmarco.html#mmarco/v2/fr/](https://ir-datasets.com/mmarco.html#mmarco/v2/fr/)
 
-
-
  ## Citation
 
  ```bibtex
  @online{louis2023,
    author = 'Antoine Louis',
-   title = 'biencoder-distilcamembert-base-mmarcoFR: A Biencoder Model Trained on French mMARCO',
+   title = 'biencoder-distilcamembert-mmarcoFR: A Biencoder Model Trained on French mMARCO',
    publisher = 'Hugging Face',
    month = 'may',
    year = '2023',
-   url = 'https://huggingface.co/antoinelouis/biencoder-distilcamembert-base-mmarcoFR',
+   url = 'https://huggingface.co/antoinelouis/biencoder-distilcamembert-mmarcoFR',
  }
  ```
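To reproduce the dev-set numbers, the queries behind the link above can be pulled with the `ir_datasets` package; a minimal sketch (the exact dataset ID is an assumption to verify against the linked page):

```python
# Hedged sketch: iterating the mMARCO-fr dev queries via ir_datasets.
# The dataset ID below is assumed from the linked page; check it there.
import ir_datasets

dataset = ir_datasets.load('mmarco/v2/fr/dev/small')
for query in dataset.queries_iter():
    print(query.query_id, query.text)
    break  # print only the first query
```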
 