RomainDarous committed · Commit 7bf3685 · verified · 1 Parent(s): 123f4ca

Add new SentenceTransformer model

1_MultiHeadGeneralizedPooling/config.json ADDED
@@ -0,0 +1,6 @@
+ {
+   "sentence_dim": 768,
+   "token_dim": 768,
+   "num_heads": 8,
+   "initialize": "random"
+ }
2_Dense/config.json ADDED
@@ -0,0 +1 @@
+ {"in_features": 768, "out_features": 512, "bias": true, "activation_function": "torch.nn.modules.activation.Tanh"}
2_Dense/model.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:30d70ccca33c10dc41544e294d3a570d35d4cedec2bbfdb9c93d3d9c6e283661
+ size 1575072
README.md ADDED
@@ -0,0 +1,917 @@
+ ---
+ language:
+ - bn
+ - cs
+ - de
+ - en
+ - et
+ - fi
+ - fr
+ - gu
+ - ha
+ - hi
+ - is
+ - ja
+ - kk
+ - km
+ - lt
+ - lv
+ - pl
+ - ps
+ - ru
+ - ta
+ - tr
+ - uk
+ - xh
+ - zh
+ - zu
+ - ne
+ - ro
+ - si
+ tags:
+ - sentence-transformers
+ - sentence-similarity
+ - feature-extraction
+ - generated_from_trainer
+ - dataset_size:1327190
+ - loss:CoSENTLoss
+ base_model: sentence-transformers/distiluse-base-multilingual-cased-v2
+ widget:
+ - source_sentence: यहाँका केही धार्मिक सम्पदाहरू यस प्रकार रहेका छन्।
+   sentences:
+   - A party works journalists from advertisements about a massive Himalayan post.
+   - Some religious affiliations here remain.
+   - In Spain, the strict opposition of Roman Catholic churches is found to have assumed a marriage similar to male beach wives.
+ - source_sentence: West puzzled many with a performance of the song I Love it, dressed as a Perrier Bottle.
+   sentences:
+   - It studies what kind of moral code of conduct people should adhere to), applied ethics (the study of the application of ethical theory to real life situations, including bioethics, political ethics, etc.), and descriptive ethics (the collection of information about how people live, summarizing it from observed patterns.
+   - West rätselte viele mit einer Aufführung des Songs I Love it, gekleidet als Perrier Bottle.
+   - Однако явка составила всего 16 процентов по сравнению с 34 процентами на последних парламентских выборах в 2016 году, когда 66 процентов зарегистрированных избирателей отдали свои голоса.
+ - source_sentence: He possesses a pistol with silver bullets for protection from vampires and werewolves.
+   sentences:
+   - Er besitzt eine Pistole mit silbernen Kugeln zum Schutz vor Vampiren und Werwölfen.
+   - Bibimbap umfasst Reis, Spinat, Rettich, Bohnensprossen.
+   - BSAC profitierte auch von den großen, aber nicht unbeschränkten persönlichen Vermögen von Rhodos und Beit vor ihrem Tod.
+ - source_sentence: To the west of the Badger Head Inlier is the Port Sorell Formation, a tectonic mélange of marine sediments and dolerite.
+   sentences:
+   - Er brennt einen Speer und brennt Flammen aus seinem Mund, wenn er wütend ist.
+   - Westlich des Badger Head Inlier befindet sich die Port Sorell Formation, eine tektonische Mischung aus Sedimenten und Dolerit.
+   - Public Lynching and Mob Violence under Modi Government
+ - source_sentence: Garnizoana otomană se retrage în sudul Dunării, iar după 164 de ani cetatea intră din nou sub stăpânirea europenilor.
+   sentences:
+   - This is because, once again, we have taken into account the fact that we have adopted a large number of legislative proposals.
+   - Helsinki University ranks 75th among universities for 2010.
+   - Ottoman garnisoana is withdrawing into the south of the Danube and, after 164 years, it is once again under the control of Europeans.
+ datasets:
+ - RicardoRei/wmt-da-human-evaluation
+ - wmt/wmt20_mlqe_task1
+ pipeline_tag: sentence-similarity
+ library_name: sentence-transformers
+ metrics:
+ - pearson_cosine
+ - spearman_cosine
+ model-index:
+ - name: SentenceTransformer based on sentence-transformers/distiluse-base-multilingual-cased-v2
+   results:
+   - task:
+       type: semantic-similarity
+       name: Semantic Similarity
+     dataset:
+       name: sts eval
+       type: sts-eval
+     metrics:
+     - type: pearson_cosine
+       value: 0.4243667202500231
+       name: Pearson Cosine
+     - type: spearman_cosine
+       value: 0.41746981927882326
+       name: Spearman Cosine
+     - type: pearson_cosine
+       value: 0.036498633959812184
+       name: Pearson Cosine
+     - type: spearman_cosine
+       value: 0.09384356852669157
+       name: Spearman Cosine
+     - type: pearson_cosine
+       value: 0.19141187125868983
+       name: Pearson Cosine
+     - type: spearman_cosine
+       value: 0.2047001484242623
+       name: Spearman Cosine
+     - type: pearson_cosine
+       value: 0.3733296015852904
+       name: Pearson Cosine
+     - type: spearman_cosine
+       value: 0.3778308496486885
+       name: Spearman Cosine
+     - type: pearson_cosine
+       value: 0.4091855502665824
+       name: Pearson Cosine
+     - type: spearman_cosine
+       value: 0.40251691896505326
+       name: Spearman Cosine
+     - type: pearson_cosine
+       value: 0.48640964316891533
+       name: Pearson Cosine
+     - type: spearman_cosine
+       value: 0.46565916114817835
+       name: Spearman Cosine
+     - type: pearson_cosine
+       value: 0.29881047072627687
+       name: Pearson Cosine
+     - type: spearman_cosine
+       value: 0.2767845088983564
+       name: Spearman Cosine
+   - task:
+       type: semantic-similarity
+       name: Semantic Similarity
+     dataset:
+       name: sts test
+       type: sts-test
+     metrics:
+     - type: pearson_cosine
+       value: 0.41810030878371285
+       name: Pearson Cosine
+     - type: spearman_cosine
+       value: 0.41259857114370785
+       name: Spearman Cosine
+     - type: pearson_cosine
+       value: 0.04780665638721907
+       name: Pearson Cosine
+     - type: spearman_cosine
+       value: 0.07961038715143137
+       name: Spearman Cosine
+     - type: pearson_cosine
+       value: 0.12785313730453238
+       name: Pearson Cosine
+     - type: spearman_cosine
+       value: 0.19638277823696285
+       name: Spearman Cosine
+     - type: pearson_cosine
+       value: 0.3754522642012458
+       name: Pearson Cosine
+     - type: spearman_cosine
+       value: 0.37252866177121946
+       name: Spearman Cosine
+     - type: pearson_cosine
+       value: 0.4320012607869886
+       name: Pearson Cosine
+     - type: spearman_cosine
+       value: 0.4394031152482244
+       name: Spearman Cosine
+     - type: pearson_cosine
+       value: 0.4399520313853801
+       name: Pearson Cosine
+     - type: spearman_cosine
+       value: 0.4113638664308507
+       name: Spearman Cosine
+     - type: pearson_cosine
+       value: 0.3045620930146385
+       name: Pearson Cosine
+     - type: spearman_cosine
+       value: 0.2675578288363888
+       name: Spearman Cosine
+ ---
+
+ # SentenceTransformer based on sentence-transformers/distiluse-base-multilingual-cased-v2
+
+ This is a [sentence-transformers](https://www.SBERT.net) model finetuned from [sentence-transformers/distiluse-base-multilingual-cased-v2](https://huggingface.co/sentence-transformers/distiluse-base-multilingual-cased-v2) on the [wmt_da](https://huggingface.co/datasets/RicardoRei/wmt-da-human-evaluation), [mlqe_en_de](https://huggingface.co/datasets/wmt/wmt20_mlqe_task1), [mlqe_en_zh](https://huggingface.co/datasets/wmt/wmt20_mlqe_task1), [mlqe_et_en](https://huggingface.co/datasets/wmt/wmt20_mlqe_task1), [mlqe_ne_en](https://huggingface.co/datasets/wmt/wmt20_mlqe_task1), [mlqe_ro_en](https://huggingface.co/datasets/wmt/wmt20_mlqe_task1) and [mlqe_si_en](https://huggingface.co/datasets/wmt/wmt20_mlqe_task1) datasets. It maps sentences & paragraphs to a 512-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.
+
+ ## Model Details
+
+ ### Model Description
+ - **Model Type:** Sentence Transformer
+ - **Base model:** [sentence-transformers/distiluse-base-multilingual-cased-v2](https://huggingface.co/sentence-transformers/distiluse-base-multilingual-cased-v2) <!-- at revision dad0fa1ee4fa6e982d3adbce87c73c02e6aee838 -->
+ - **Maximum Sequence Length:** 128 tokens
+ - **Output Dimensionality:** 512 dimensions
+ - **Similarity Function:** Cosine Similarity
+ - **Training Datasets:**
+   - [wmt_da](https://huggingface.co/datasets/RicardoRei/wmt-da-human-evaluation)
+   - [mlqe_en_de](https://huggingface.co/datasets/wmt/wmt20_mlqe_task1)
+   - [mlqe_en_zh](https://huggingface.co/datasets/wmt/wmt20_mlqe_task1)
+   - [mlqe_et_en](https://huggingface.co/datasets/wmt/wmt20_mlqe_task1)
+   - [mlqe_ne_en](https://huggingface.co/datasets/wmt/wmt20_mlqe_task1)
+   - [mlqe_ro_en](https://huggingface.co/datasets/wmt/wmt20_mlqe_task1)
+   - [mlqe_si_en](https://huggingface.co/datasets/wmt/wmt20_mlqe_task1)
+ - **Languages:** bn, cs, de, en, et, fi, fr, gu, ha, hi, is, ja, kk, km, lt, lv, pl, ps, ru, ta, tr, uk, xh, zh, zu, ne, ro, si
+ <!-- - **License:** Unknown -->
+
+ ### Model Sources
+
+ - **Documentation:** [Sentence Transformers Documentation](https://sbert.net)
+ - **Repository:** [Sentence Transformers on GitHub](https://github.com/UKPLab/sentence-transformers)
+ - **Hugging Face:** [Sentence Transformers on Hugging Face](https://huggingface.co/models?library=sentence-transformers)
+
+ ### Full Model Architecture
+
+ ```
+ SentenceTransformer(
+   (0): Transformer({'max_seq_length': 128, 'do_lower_case': False}) with Transformer model: DistilBertModel
+   (1): MultiHeadGeneralizedPooling(
+     (P): ModuleList(
+       (0-7): 8 x Linear(in_features=768, out_features=96, bias=True)
+     )
+     (W1): ModuleList(
+       (0-7): 8 x Linear(in_features=96, out_features=384, bias=True)
+     )
+     (W2): ModuleList(
+       (0-7): 8 x Linear(in_features=384, out_features=96, bias=True)
+     )
+   )
+   (2): Dense({'in_features': 768, 'out_features': 512, 'bias': True, 'activation_function': 'torch.nn.modules.activation.Tanh'})
+ )
+ ```
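+
+ The custom `MultiHeadGeneralizedPooling` module replaces the usual mean pooling. Below is a minimal, illustrative sketch of what a pooling layer with the printed shapes (8 heads, `P`: 768→96, `W1`: 96→384, `W2`: 384→96) might compute; it is inferred only from the module summary above, and the actual implementation in `sentence_pooling.multihead_generalized_pooling` may differ in details such as the activation, masking, or normalization:
+
+ ```python
+ import torch
+ import torch.nn as nn
+
+ class MultiHeadGeneralizedPoolingSketch(nn.Module):
+     """Illustrative only: one plausible reading of the printed module shapes."""
+     def __init__(self, token_dim: int = 768, num_heads: int = 8):
+         super().__init__()
+         head_dim = token_dim // num_heads  # 96
+         self.P = nn.ModuleList(nn.Linear(token_dim, head_dim) for _ in range(num_heads))
+         self.W1 = nn.ModuleList(nn.Linear(head_dim, 4 * head_dim) for _ in range(num_heads))
+         self.W2 = nn.ModuleList(nn.Linear(4 * head_dim, head_dim) for _ in range(num_heads))
+
+     def forward(self, token_embeddings: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
+         # token_embeddings: (batch, seq_len, 768); attention_mask: (batch, seq_len)
+         mask = attention_mask.unsqueeze(-1)
+         pooled_heads = []
+         for P, W1, W2 in zip(self.P, self.W1, self.W2):
+             h = P(token_embeddings)                     # project tokens into this head's subspace
+             scores = W2(torch.tanh(W1(h)))              # per-token, per-dimension importance scores
+             scores = scores.masked_fill(mask == 0, -1e9)
+             weights = torch.softmax(scores, dim=1)      # attention over the sequence positions
+             pooled_heads.append((weights * h).sum(dim=1))
+         return torch.cat(pooled_heads, dim=-1)          # (batch, 768), then fed to the Dense layer
+ ```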
+
+ ## Usage
+
+ ### Direct Usage (Sentence Transformers)
+
+ First install the Sentence Transformers library:
+
+ ```bash
+ pip install -U sentence-transformers
+ ```
+
+ Then you can load this model and run inference.
+ ```python
+ from sentence_transformers import SentenceTransformer
+
+ # Download from the 🤗 Hub
+ model = SentenceTransformer("RomainDarous/pre_training_additive_generalized_model-sts")
+ # Run inference
+ sentences = [
+     'Garnizoana otomană se retrage în sudul Dunării, iar după 164 de ani cetatea intră din nou sub stăpânirea europenilor.',
+     'Ottoman garnisoana is withdrawing into the south of the Danube and, after 164 years, it is once again under the control of Europeans.',
+     'This is because, once again, we have taken into account the fact that we have adopted a large number of legislative proposals.',
+ ]
+ embeddings = model.encode(sentences)
+ print(embeddings.shape)
+ # [3, 512]
+
+ # Get the similarity scores for the embeddings
+ similarities = model.similarity(embeddings, embeddings)
+ print(similarities.shape)
+ # [3, 3]
+ ```
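+
+ Because the training targets described below are human judgements of machine-translation quality, one natural use (shown here only as an illustrative sketch) is to score a candidate translation against its source sentence; the example pair is taken from the widget examples above:
+
+ ```python
+ from sentence_transformers import SentenceTransformer
+
+ model = SentenceTransformer("RomainDarous/pre_training_additive_generalized_model-sts")
+
+ source = "He possesses a pistol with silver bullets for protection from vampires and werewolves."
+ translation = "Er besitzt eine Pistole mit silbernen Kugeln zum Schutz vor Vampiren und Werwölfen."
+
+ embeddings = model.encode([source, translation])
+ score = model.similarity(embeddings[0:1], embeddings[1:2]).item()  # cosine similarity
+ print(f"Estimated translation quality: {score:.3f}")
+ ```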
+
+ <!--
+ ### Direct Usage (Transformers)
+
+ <details><summary>Click to see the direct usage in Transformers</summary>
+
+ </details>
+ -->
+
+ <!--
+ ### Downstream Usage (Sentence Transformers)
+
+ You can finetune this model on your own dataset.
+
+ <details><summary>Click to expand</summary>
+
+ </details>
+ -->
+
+ <!--
+ ### Out-of-Scope Use
+
+ *List how the model may foreseeably be misused and address what users ought not to do with the model.*
+ -->
+
+ ## Evaluation
+
+ ### Metrics
+
+ #### Semantic Similarity
+
+ * Datasets: `sts-eval` and `sts-test`
+ * Evaluated with [<code>EmbeddingSimilarityEvaluator</code>](https://sbert.net/docs/package_reference/sentence_transformer/evaluation.html#sentence_transformers.evaluation.EmbeddingSimilarityEvaluator)
+
+ | Metric              | sts-eval   | sts-test   |
+ |:--------------------|:-----------|:-----------|
+ | pearson_cosine      | 0.4244     | 0.3046     |
+ | **spearman_cosine** | **0.4175** | **0.2676** |
+
+ #### Semantic Similarity
+
+ * Dataset: `sts-eval`
+ * Evaluated with [<code>EmbeddingSimilarityEvaluator</code>](https://sbert.net/docs/package_reference/sentence_transformer/evaluation.html#sentence_transformers.evaluation.EmbeddingSimilarityEvaluator)
+
+ | Metric              | Value      |
+ |:--------------------|:-----------|
+ | pearson_cosine      | 0.0365     |
+ | **spearman_cosine** | **0.0938** |
+
+ #### Semantic Similarity
+
+ * Dataset: `sts-eval`
+ * Evaluated with [<code>EmbeddingSimilarityEvaluator</code>](https://sbert.net/docs/package_reference/sentence_transformer/evaluation.html#sentence_transformers.evaluation.EmbeddingSimilarityEvaluator)
+
+ | Metric              | Value      |
+ |:--------------------|:-----------|
+ | pearson_cosine      | 0.1914     |
+ | **spearman_cosine** | **0.2047** |
+
+ #### Semantic Similarity
+
+ * Dataset: `sts-eval`
+ * Evaluated with [<code>EmbeddingSimilarityEvaluator</code>](https://sbert.net/docs/package_reference/sentence_transformer/evaluation.html#sentence_transformers.evaluation.EmbeddingSimilarityEvaluator)
+
+ | Metric              | Value      |
+ |:--------------------|:-----------|
+ | pearson_cosine      | 0.3733     |
+ | **spearman_cosine** | **0.3778** |
+
+ #### Semantic Similarity
+
+ * Dataset: `sts-eval`
+ * Evaluated with [<code>EmbeddingSimilarityEvaluator</code>](https://sbert.net/docs/package_reference/sentence_transformer/evaluation.html#sentence_transformers.evaluation.EmbeddingSimilarityEvaluator)
+
+ | Metric              | Value      |
+ |:--------------------|:-----------|
+ | pearson_cosine      | 0.4092     |
+ | **spearman_cosine** | **0.4025** |
+
+ #### Semantic Similarity
+
+ * Dataset: `sts-eval`
+ * Evaluated with [<code>EmbeddingSimilarityEvaluator</code>](https://sbert.net/docs/package_reference/sentence_transformer/evaluation.html#sentence_transformers.evaluation.EmbeddingSimilarityEvaluator)
+
+ | Metric              | Value      |
+ |:--------------------|:-----------|
+ | pearson_cosine      | 0.4864     |
+ | **spearman_cosine** | **0.4657** |
+
+ #### Semantic Similarity
+
+ * Dataset: `sts-eval`
+ * Evaluated with [<code>EmbeddingSimilarityEvaluator</code>](https://sbert.net/docs/package_reference/sentence_transformer/evaluation.html#sentence_transformers.evaluation.EmbeddingSimilarityEvaluator)
+
+ | Metric              | Value      |
+ |:--------------------|:-----------|
+ | pearson_cosine      | 0.2988     |
+ | **spearman_cosine** | **0.2768** |
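+
+ A minimal sketch of running this evaluator yourself; the sentence pairs and gold scores below are placeholders, not taken from the actual evaluation sets, so swap in a real dev set with the sentence1 / sentence2 / score columns described in the training details:
+
+ ```python
+ from sentence_transformers import SentenceTransformer
+ from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator
+
+ model = SentenceTransformer("RomainDarous/pre_training_additive_generalized_model-sts")
+
+ # Placeholder pairs with reference quality scores in [0, 1].
+ sentences1 = ["Early Muslim traders and merchants visited Bengal.", "Ich bitte Sie."]
+ sentences2 = ["Frühe muslimische Händler und Kaufleute besuchten Bengalen.", "The city will be divided into four districts."]
+ gold_scores = [0.92, 0.05]
+
+ evaluator = EmbeddingSimilarityEvaluator(sentences1, sentences2, gold_scores, name="sts-eval")
+ print(evaluator(model))  # includes sts-eval_pearson_cosine and sts-eval_spearman_cosine
+ ```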
+
+ <!--
+ ## Bias, Risks and Limitations
+
+ *What are the known or foreseeable issues stemming from this model? You could also flag here known failure cases or weaknesses of the model.*
+ -->
+
+ <!--
+ ### Recommendations
+
+ *What are recommendations with respect to the foreseeable issues? For example, filtering explicit content.*
+ -->
+
+ ## Training Details
+
+ ### Training Datasets
+
386
+ #### wmt_da
387
+
388
+ * Dataset: [wmt_da](https://huggingface.co/datasets/RicardoRei/wmt-da-human-evaluation) at [301de38](https://huggingface.co/datasets/RicardoRei/wmt-da-human-evaluation/tree/301de385bf05b0c00a8f4be74965e186164dd425)
389
+ * Size: 1,285,190 training samples
390
+ * Columns: <code>sentence1</code>, <code>sentence2</code>, and <code>score</code>
391
+ * Approximate statistics based on the first 1000 samples:
392
+ | | sentence1 | sentence2 | score |
393
+ |:--------|:-----------------------------------------------------------------------------------|:-----------------------------------------------------------------------------------|:--------------------------------------------------------------|
394
+ | type | string | string | float |
395
+ | details | <ul><li>min: 5 tokens</li><li>mean: 36.96 tokens</li><li>max: 128 tokens</li></ul> | <ul><li>min: 5 tokens</li><li>mean: 37.03 tokens</li><li>max: 128 tokens</li></ul> | <ul><li>min: 0.0</li><li>mean: 0.7</li><li>max: 1.0</li></ul> |
396
+ * Samples:
397
+ | sentence1 | sentence2 | score |
398
+ |:---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|:---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|:-------------------|
399
+ | <code>Lifeguard Capt. Larry Giles said at a media briefing that a shark had been spotted in the area a few weeks earlier, but it was determined not to be a dangerous species of shark.</code> | <code>Lifeguard કેપ્ટન. લેરી Giles ખાતે જણાવ્યું હતું કે, એક મીડિયા પરિષદ કે એક શાર્ક કરવામાં આવી હતી દેખાયો વિસ્તારમાં થોડા અઠવાડિયા અગાઉ, પરંતુ તે નક્કી કરવામાં આવ્યું નથી કરી એક ખતરનાક પ્રજાતિઓ શાર્ક છે.</code> | <code>0.175</code> |
400
+ | <code>Structural biologists can now take this information and reclassify the structure of the viruses, which will help unveil molecular and evolutionary relationships between different viruses.</code> | <code>Strukturbiologen können nun diese Informationen aufnehmen und die Struktur der Viren neu klassifizieren, was dazu beitragen wird, molekulare und evolutionäre Beziehungen zwischen verschiedenen Viren aufzudecken.</code> | <code>1.0</code> |
401
+ | <code>Ich bitte Sie“.</code> | <code>Žádám vás. "</code> | <code>0.92</code> |
402
+ * Loss: [<code>CoSENTLoss</code>](https://sbert.net/docs/package_reference/sentence_transformer/losses.html#cosentloss) with these parameters:
403
+ ```json
404
+ {
405
+ "scale": 20.0,
406
+ "similarity_fct": "pairwise_cos_sim"
407
+ }
408
+ ```
409
+
410
+ #### mlqe_en_de
411
+
412
+ * Dataset: [mlqe_en_de](https://huggingface.co/datasets/wmt/wmt20_mlqe_task1) at [0783ed2](https://huggingface.co/datasets/wmt/wmt20_mlqe_task1/tree/0783ed2bd75f44835df4ea664f9ccb85812c8563)
413
+ * Size: 7,000 training samples
414
+ * Columns: <code>sentence1</code>, <code>sentence2</code>, and <code>score</code>
415
+ * Approximate statistics based on the first 1000 samples:
416
+ | | sentence1 | sentence2 | score |
417
+ |:--------|:-----------------------------------------------------------------------------------|:-----------------------------------------------------------------------------------|:----------------------------------------------------------------|
418
+ | type | string | string | float |
419
+ | details | <ul><li>min: 11 tokens</li><li>mean: 23.78 tokens</li><li>max: 44 tokens</li></ul> | <ul><li>min: 11 tokens</li><li>mean: 26.51 tokens</li><li>max: 54 tokens</li></ul> | <ul><li>min: 0.06</li><li>mean: 0.86</li><li>max: 1.0</li></ul> |
420
+ * Samples:
421
+ | sentence1 | sentence2 | score |
422
+ |:-------------------------------------------------------------------------------------------------------------------------|:------------------------------------------------------------------------------------------------------------------------------------------|:--------------------------------|
423
+ | <code>Early Muslim traders and merchants visited Bengal while traversing the Silk Road in the first millennium.</code> | <code>Frühe muslimische Händler und Kaufleute besuchten Bengalen, während sie im ersten Jahrtausend die Seidenstraße durchquerten.</code> | <code>0.9233333468437195</code> |
424
+ | <code>While Fran dissipated shortly after that, the tropical wave progressed into the northeastern Pacific Ocean.</code> | <code>Während Fran kurz danach zerstreute, entwickelte sich die tropische Welle in den nordöstlichen Pazifischen Ozean.</code> | <code>0.8899999856948853</code> |
425
+ | <code>Distressed securities include such events as restructurings, recapitalizations, and bankruptcies.</code> | <code>Zu den belasteten Wertpapieren gehören Restrukturierungen, Rekapitalisierungen und Insolvenzen.</code> | <code>0.9300000071525574</code> |
426
+ * Loss: [<code>CoSENTLoss</code>](https://sbert.net/docs/package_reference/sentence_transformer/losses.html#cosentloss) with these parameters:
427
+ ```json
428
+ {
429
+ "scale": 20.0,
430
+ "similarity_fct": "pairwise_cos_sim"
431
+ }
432
+ ```
433
+
434
+ #### mlqe_en_zh
435
+
436
+ * Dataset: [mlqe_en_zh](https://huggingface.co/datasets/wmt/wmt20_mlqe_task1) at [0783ed2](https://huggingface.co/datasets/wmt/wmt20_mlqe_task1/tree/0783ed2bd75f44835df4ea664f9ccb85812c8563)
437
+ * Size: 7,000 training samples
438
+ * Columns: <code>sentence1</code>, <code>sentence2</code>, and <code>score</code>
439
+ * Approximate statistics based on the first 1000 samples:
440
+ | | sentence1 | sentence2 | score |
441
+ |:--------|:----------------------------------------------------------------------------------|:-----------------------------------------------------------------------------------|:-----------------------------------------------------------------|
442
+ | type | string | string | float |
443
+ | details | <ul><li>min: 9 tokens</li><li>mean: 24.09 tokens</li><li>max: 47 tokens</li></ul> | <ul><li>min: 12 tokens</li><li>mean: 29.93 tokens</li><li>max: 74 tokens</li></ul> | <ul><li>min: 0.01</li><li>mean: 0.68</li><li>max: 0.98</li></ul> |
444
+ * Samples:
445
+ | sentence1 | sentence2 | score |
446
+ |:-------------------------------------------------------------------------------------------------------------------------|:--------------------------------------------------------------|:---------------------------------|
447
+ | <code>In the late 1980s, the hotel's reputation declined, and it functioned partly as a "backpackers hangout."</code> | <code>在 20 世纪 80 年代末 , 这家旅馆的声誉下降了 , 部分地起到了 "背包吊销" 的作用。</code> | <code>0.40666666626930237</code> |
448
+ | <code>From 1870 to 1915, 36 million Europeans migrated away from Europe.</code> | <code>从 1870 年到 1915 年 , 3, 600 万欧洲人从欧洲移民。</code> | <code>0.8333333730697632</code> |
449
+ | <code>In some photos, the footpads did press into the regolith, especially when they moved sideways at touchdown.</code> | <code>在一些照片中 , 脚垫确实挤进了后台 , 尤其是当他们在触地时侧面移动时。</code> | <code>0.33000001311302185</code> |
450
+ * Loss: [<code>CoSENTLoss</code>](https://sbert.net/docs/package_reference/sentence_transformer/losses.html#cosentloss) with these parameters:
451
+ ```json
452
+ {
453
+ "scale": 20.0,
454
+ "similarity_fct": "pairwise_cos_sim"
455
+ }
456
+ ```
457
+
458
+ #### mlqe_et_en
459
+
460
+ * Dataset: [mlqe_et_en](https://huggingface.co/datasets/wmt/wmt20_mlqe_task1) at [0783ed2](https://huggingface.co/datasets/wmt/wmt20_mlqe_task1/tree/0783ed2bd75f44835df4ea664f9ccb85812c8563)
461
+ * Size: 7,000 training samples
462
+ * Columns: <code>sentence1</code>, <code>sentence2</code>, and <code>score</code>
463
+ * Approximate statistics based on the first 1000 samples:
464
+ | | sentence1 | sentence2 | score |
465
+ |:--------|:-----------------------------------------------------------------------------------|:-----------------------------------------------------------------------------------|:----------------------------------------------------------------|
466
+ | type | string | string | float |
467
+ | details | <ul><li>min: 14 tokens</li><li>mean: 31.88 tokens</li><li>max: 63 tokens</li></ul> | <ul><li>min: 11 tokens</li><li>mean: 24.57 tokens</li><li>max: 56 tokens</li></ul> | <ul><li>min: 0.03</li><li>mean: 0.67</li><li>max: 1.0</li></ul> |
468
+ * Samples:
469
+ | sentence1 | sentence2 | score |
470
+ |:----------------------------------------------------------------------------------------------------------------------------------|:-----------------------------------------------------------------------------------------------------------------------------------------------|:--------------------------------|
471
+ | <code>Gruusias vahistati president Mihhail Saakašvili pressibüroo nõunik Simon Kiladze, keda süüdistati spioneerimises.</code> | <code>In Georgia, an adviser to the press office of President Mikhail Saakashvili, Simon Kiladze, was arrested and accused of spying.</code> | <code>0.9466666579246521</code> |
472
+ | <code>Nii teadmissotsioloogia pooldajad tavaliselt Kuhni tõlgendavadki, arendades tema vaated sõnaselgeks relativismiks.</code> | <code>This is how supporters of knowledge sociology usually interpret Kuhn by developing his views into an explicit relativism.</code> | <code>0.9366666674613953</code> |
473
+ | <code>18. jaanuaril 2003 haarasid mitmeid Canberra eeslinnu võsapõlengud, milles hukkus neli ja sai vigastada 435 inimest.</code> | <code>On 18 January 2003, several of the suburbs of Canberra were seized by debt fires which killed four people and injured 435 people.</code> | <code>0.8666666150093079</code> |
474
+ * Loss: [<code>CoSENTLoss</code>](https://sbert.net/docs/package_reference/sentence_transformer/losses.html#cosentloss) with these parameters:
475
+ ```json
476
+ {
477
+ "scale": 20.0,
478
+ "similarity_fct": "pairwise_cos_sim"
479
+ }
480
+ ```
481
+
482
+ #### mlqe_ne_en
483
+
484
+ * Dataset: [mlqe_ne_en](https://huggingface.co/datasets/wmt/wmt20_mlqe_task1) at [0783ed2](https://huggingface.co/datasets/wmt/wmt20_mlqe_task1/tree/0783ed2bd75f44835df4ea664f9ccb85812c8563)
485
+ * Size: 7,000 training samples
486
+ * Columns: <code>sentence1</code>, <code>sentence2</code>, and <code>score</code>
487
+ * Approximate statistics based on the first 1000 samples:
488
+ | | sentence1 | sentence2 | score |
489
+ |:--------|:-----------------------------------------------------------------------------------|:-----------------------------------------------------------------------------------|:----------------------------------------------------------------|
490
+ | type | string | string | float |
491
+ | details | <ul><li>min: 17 tokens</li><li>mean: 40.67 tokens</li><li>max: 77 tokens</li></ul> | <ul><li>min: 9 tokens</li><li>mean: 24.66 tokens</li><li>max: 128 tokens</li></ul> | <ul><li>min: 0.01</li><li>mean: 0.39</li><li>max: 1.0</li></ul> |
492
+ * Samples:
493
+ | sentence1 | sentence2 | score |
494
+ |:------------------------------------------------------------------------------------------------------------|:-----------------------------------------------------------------------------------------------------------|:---------------------------------|
495
+ | <code>सामान्‍य बजट प्रायः फेब्रुअरीका अंतिम कार्य दिवसमा लाईन्छ।</code> | <code>A normal budget is usually awarded to the digital working day of February.</code> | <code>0.5600000023841858</code> |
496
+ | <code>कविताका यस्ता स्वरूपमा दुई, तिन वा चार पाउसम्मका मुक्तक, हाइकु, सायरी र लोकसूक्तिहरू पर्दछन् ।</code> | <code>The book consists of two, free of her or four paulets, haiku, Sairi, and locus in such forms.</code> | <code>0.23666666448116302</code> |
497
+ | <code>ब्रिट्नीले यस बारेमा प्रतिक्रिया ब्यक्ता गरदै भनिन,"कुन ठूलो कुरा हो र?</code> | <code>Britney did not respond to this, saying "which is a big thing and a big thing?</code> | <code>0.21666665375232697</code> |
498
+ * Loss: [<code>CoSENTLoss</code>](https://sbert.net/docs/package_reference/sentence_transformer/losses.html#cosentloss) with these parameters:
499
+ ```json
500
+ {
501
+ "scale": 20.0,
502
+ "similarity_fct": "pairwise_cos_sim"
503
+ }
504
+ ```
505
+
506
+ #### mlqe_ro_en
507
+
508
+ * Dataset: [mlqe_ro_en](https://huggingface.co/datasets/wmt/wmt20_mlqe_task1) at [0783ed2](https://huggingface.co/datasets/wmt/wmt20_mlqe_task1/tree/0783ed2bd75f44835df4ea664f9ccb85812c8563)
509
+ * Size: 7,000 training samples
510
+ * Columns: <code>sentence1</code>, <code>sentence2</code>, and <code>score</code>
511
+ * Approximate statistics based on the first 1000 samples:
512
+ | | sentence1 | sentence2 | score |
513
+ |:--------|:-----------------------------------------------------------------------------------|:-----------------------------------------------------------------------------------|:----------------------------------------------------------------|
514
+ | type | string | string | float |
515
+ | details | <ul><li>min: 12 tokens</li><li>mean: 29.44 tokens</li><li>max: 60 tokens</li></ul> | <ul><li>min: 10 tokens</li><li>mean: 22.38 tokens</li><li>max: 65 tokens</li></ul> | <ul><li>min: 0.01</li><li>mean: 0.68</li><li>max: 1.0</li></ul> |
516
+ * Samples:
517
+ | sentence1 | sentence2 | score |
518
+ |:---------------------------------------------------------------------------------------------------------------------------------------------------------|:-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|:---------------------------------|
519
+ | <code>Orașul va fi împărțit în patru districte, iar suburbiile în 10 mahalale.</code> | <code>The city will be divided into four districts and suburbs into 10 mahalals.</code> | <code>0.4699999988079071</code> |
520
+ | <code>La scurt timp după aceasta, au devenit cunoscute debarcările germane de la Trondheim, Bergen și Stavanger, precum și luptele din Oslofjord.</code> | <code>In the light of the above, the Authority concludes that the aid granted to ADIF is compatible with the internal market pursuant to Article 61 (3) (c) of the EEA Agreement.</code> | <code>0.02666666731238365</code> |
521
+ | <code>Până în vara 1791, în Clubul iacobinilor au dominat reprezentanții monarhismului liberal constituțional.</code> | <code>Until the summer of 1791, representatives of liberal constitutional monarchism dominated in the Jacobins Club.</code> | <code>0.8733333349227905</code> |
522
+ * Loss: [<code>CoSENTLoss</code>](https://sbert.net/docs/package_reference/sentence_transformer/losses.html#cosentloss) with these parameters:
523
+ ```json
524
+ {
525
+ "scale": 20.0,
526
+ "similarity_fct": "pairwise_cos_sim"
527
+ }
528
+ ```
529
+
530
+ #### mlqe_si_en
531
+
532
+ * Dataset: [mlqe_si_en](https://huggingface.co/datasets/wmt/wmt20_mlqe_task1) at [0783ed2](https://huggingface.co/datasets/wmt/wmt20_mlqe_task1/tree/0783ed2bd75f44835df4ea664f9ccb85812c8563)
533
+ * Size: 7,000 training samples
534
+ * Columns: <code>sentence1</code>, <code>sentence2</code>, and <code>score</code>
535
+ * Approximate statistics based on the first 1000 samples:
536
+ | | sentence1 | sentence2 | score |
537
+ |:--------|:----------------------------------------------------------------------------------|:-----------------------------------------------------------------------------------|:----------------------------------------------------------------|
538
+ | type | string | string | float |
539
+ | details | <ul><li>min: 8 tokens</li><li>mean: 18.19 tokens</li><li>max: 38 tokens</li></ul> | <ul><li>min: 9 tokens</li><li>mean: 22.31 tokens</li><li>max: 128 tokens</li></ul> | <ul><li>min: 0.01</li><li>mean: 0.51</li><li>max: 1.0</li></ul> |
540
+ * Samples:
541
+ | sentence1 | sentence2 | score |
542
+ |:----------------------------------------------------------------------------------------------------------------|:----------------------------------------------------------------------------------------------------------------------------------------|:---------------------------------|
543
+ | <code>ඇපලෝ 4 සැටර්න් V බූස්ටරයේ ප්‍රථම පර්යේෂණ පියාසැරිය විය.</code> | <code>The first research flight of the Apollo 4 Saturn V Booster.</code> | <code>0.7966666221618652</code> |
544
+ | <code>මෙහි අවපාතය සැලකීමේ දී, මෙහි 48%ක අවරෝහණය $ මිලියන 125කට අධික චිත්‍රපටයක් ලද තෙවන කුඩාම අවපාතය වේ.</code> | <code>In conjunction with the depression here, 48 % of obesity here is the third smallest depression in over $ 125 million film.</code> | <code>0.17666666209697723</code> |
545
+ | <code>එසේම "බකමූණන් මගින් මෙම රාක්ෂසියගේ රාත්‍රී හැසිරීම සංකේතවත් වන බව" පවසයි.</code> | <code>Also "the owl says that this monster's night behavior is symbolic".</code> | <code>0.8799999952316284</code> |
546
+ * Loss: [<code>CoSENTLoss</code>](https://sbert.net/docs/package_reference/sentence_transformer/losses.html#cosentloss) with these parameters:
547
+ ```json
548
+ {
549
+ "scale": 20.0,
550
+ "similarity_fct": "pairwise_cos_sim"
551
+ }
552
+ ```
553
+
554
+ ### Evaluation Datasets
555
+
556
+ #### wmt_da
557
+
558
+ * Dataset: [wmt_da](https://huggingface.co/datasets/RicardoRei/wmt-da-human-evaluation) at [301de38](https://huggingface.co/datasets/RicardoRei/wmt-da-human-evaluation/tree/301de385bf05b0c00a8f4be74965e186164dd425)
559
+ * Size: 1,285,190 evaluation samples
560
+ * Columns: <code>sentence1</code>, <code>sentence2</code>, and <code>score</code>
561
+ * Approximate statistics based on the first 1000 samples:
562
+ | | sentence1 | sentence2 | score |
563
+ |:--------|:-----------------------------------------------------------------------------------|:-----------------------------------------------------------------------------------|:--------------------------------------------------------------|
564
+ | type | string | string | float |
565
+ | details | <ul><li>min: 5 tokens</li><li>mean: 35.67 tokens</li><li>max: 128 tokens</li></ul> | <ul><li>min: 3 tokens</li><li>mean: 36.53 tokens</li><li>max: 128 tokens</li></ul> | <ul><li>min: 0.0</li><li>mean: 0.7</li><li>max: 1.0</li></ul> |
566
+ * Samples:
567
+ | sentence1 | sentence2 | score |
568
+ |:---------------------------------------------------------------------------------------------------------------------------------------------------------------------|:-----------------------------------------------------------------------------------------------------------------------------------------|:------------------|
569
+ | <code>ARKASINDA TABİİ Kİ FETÖ VAR</code> | <code>Behind the TABILITY OF FEAR</code> | <code>0.02</code> |
570
+ | <code>អត្ថប្រយោជន៍ដ៏ធំ មួយ នៅក្នុង ការវាយបក គឺជា កំណើនកម្លាំងចលករ ទៅមុខ នៃ អ្ន��វាយប្រហារ ដែល រុញផ្ទប់ ពួកគេ ខ្លាំង បន្ថែមទៀត ចូលក្នុង ការវាយបក ឬ សង របស់អ្នក។</code> | <code>A big advantage in chaos is the growing strength of the attackers, which pushes them further into your grasp or reputation.</code> | <code>0.22</code> |
571
+ | <code>વર્ષ 2012માં તેમને આંગણવાડી સેવિકા તરીકે બઢતી આપવામાં આવી હતી.</code> | <code>In 2010, she was promoted to kindergarten Swinka.</code> | <code>0.19</code> |
572
+ * Loss: [<code>CoSENTLoss</code>](https://sbert.net/docs/package_reference/sentence_transformer/losses.html#cosentloss) with these parameters:
573
+ ```json
574
+ {
575
+ "scale": 20.0,
576
+ "similarity_fct": "pairwise_cos_sim"
577
+ }
578
+ ```
579
+
580
+ #### mlqe_en_de
581
+
582
+ * Dataset: [mlqe_en_de](https://huggingface.co/datasets/wmt/wmt20_mlqe_task1) at [0783ed2](https://huggingface.co/datasets/wmt/wmt20_mlqe_task1/tree/0783ed2bd75f44835df4ea664f9ccb85812c8563)
583
+ * Size: 1,000 evaluation samples
584
+ * Columns: <code>sentence1</code>, <code>sentence2</code>, and <code>score</code>
585
+ * Approximate statistics based on the first 1000 samples:
586
+ | | sentence1 | sentence2 | score |
587
+ |:--------|:-----------------------------------------------------------------------------------|:-----------------------------------------------------------------------------------|:----------------------------------------------------------------|
588
+ | type | string | string | float |
589
+ | details | <ul><li>min: 11 tokens</li><li>mean: 24.11 tokens</li><li>max: 49 tokens</li></ul> | <ul><li>min: 11 tokens</li><li>mean: 26.66 tokens</li><li>max: 55 tokens</li></ul> | <ul><li>min: 0.03</li><li>mean: 0.81</li><li>max: 1.0</li></ul> |
590
+ * Samples:
591
+ | sentence1 | sentence2 | score |
592
+ |:----------------------------------------------------------------------------------------------------------------------------------------------------------------|:------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|:--------------------------------|
593
+ | <code>Resuming her patrols, Constitution managed to recapture the American sloop Neutrality on 27 March and, a few days later, the French ship Carteret.</code> | <code>Mit der Wiederaufnahme ihrer Patrouillen gelang es der Verfassung, am 27. März die amerikanische Schleuderneutralität und wenige Tage später das französische Schiff Carteret zurückzuerobern.</code> | <code>0.9033333659172058</code> |
594
+ | <code>Blaine's nomination alienated many Republicans who viewed Blaine as ambitious and immoral.</code> | <code>Blaines Nominierung entfremdete viele Republikaner, die Blaine als ehrgeizig und unmoralisch betrachteten.</code> | <code>0.9216666221618652</code> |
595
+ | <code>This initiated a brief correspondence between the two which quickly descended into political rancor.</code> | <code>Dies leitete eine kurze Korrespondenz zwischen den beiden ein, die schnell zu politischem Groll abstieg.</code> | <code>0.878333330154419</code> |
596
+ * Loss: [<code>CoSENTLoss</code>](https://sbert.net/docs/package_reference/sentence_transformer/losses.html#cosentloss) with these parameters:
597
+ ```json
598
+ {
599
+ "scale": 20.0,
600
+ "similarity_fct": "pairwise_cos_sim"
601
+ }
602
+ ```
603
+
604
+ #### mlqe_en_zh
605
+
606
+ * Dataset: [mlqe_en_zh](https://huggingface.co/datasets/wmt/wmt20_mlqe_task1) at [0783ed2](https://huggingface.co/datasets/wmt/wmt20_mlqe_task1/tree/0783ed2bd75f44835df4ea664f9ccb85812c8563)
607
+ * Size: 1,000 evaluation samples
608
+ * Columns: <code>sentence1</code>, <code>sentence2</code>, and <code>score</code>
609
+ * Approximate statistics based on the first 1000 samples:
610
+ | | sentence1 | sentence2 | score |
611
+ |:--------|:----------------------------------------------------------------------------------|:-----------------------------------------------------------------------------------|:----------------------------------------------------------------|
612
+ | type | string | string | float |
613
+ | details | <ul><li>min: 9 tokens</li><li>mean: 23.75 tokens</li><li>max: 49 tokens</li></ul> | <ul><li>min: 11 tokens</li><li>mean: 29.56 tokens</li><li>max: 67 tokens</li></ul> | <ul><li>min: 0.26</li><li>mean: 0.65</li><li>max: 0.9</li></ul> |
614
+ * Samples:
615
+ | sentence1 | sentence2 | score |
616
+ |:---------------------------------------------------------------------------------------------------------------------|:------------------------------------------------------|:--------------------------------|
617
+ | <code>Freeman briefly stayed with the king before returning to Accra via Whydah, Ahgwey and Little Popo.</code> | <code>弗里曼在经过惠达、阿格威和小波波回到阿克拉之前与国王一起住了一会儿。</code> | <code>0.6683333516120911</code> |
618
+ | <code>Fantastic Fiction "Scratches in the Sky, Ben Peek, Agog!</code> | <code>奇特的虚构 "天空中的碎片 , 本佩克 , 阿戈 !</code> | <code>0.71833336353302</code> |
619
+ | <code>For Hermann Keller, the running quavers and semiquavers "suffuse the setting with health and strength."</code> | <code>对赫尔曼 · 凯勒来说 , 跑步的跳跃者和半跳跃者 "让环境充满健康和力量" 。</code> | <code>0.7066666483879089</code> |
620
+ * Loss: [<code>CoSENTLoss</code>](https://sbert.net/docs/package_reference/sentence_transformer/losses.html#cosentloss) with these parameters:
621
+ ```json
622
+ {
623
+ "scale": 20.0,
624
+ "similarity_fct": "pairwise_cos_sim"
625
+ }
626
+ ```
627
+
628
+ #### mlqe_et_en
629
+
630
+ * Dataset: [mlqe_et_en](https://huggingface.co/datasets/wmt/wmt20_mlqe_task1) at [0783ed2](https://huggingface.co/datasets/wmt/wmt20_mlqe_task1/tree/0783ed2bd75f44835df4ea664f9ccb85812c8563)
631
+ * Size: 1,000 evaluation samples
632
+ * Columns: <code>sentence1</code>, <code>sentence2</code>, and <code>score</code>
633
+ * Approximate statistics based on the first 1000 samples:
634
+ | | sentence1 | sentence2 | score |
635
+ |:--------|:----------------------------------------------------------------------------------|:-----------------------------------------------------------------------------------|:----------------------------------------------------------------|
636
+ | type | string | string | float |
637
+ | details | <ul><li>min: 12 tokens</li><li>mean: 32.4 tokens</li><li>max: 58 tokens</li></ul> | <ul><li>min: 10 tokens</li><li>mean: 24.87 tokens</li><li>max: 47 tokens</li></ul> | <ul><li>min: 0.03</li><li>mean: 0.6</li><li>max: 0.99</li></ul> |
638
+ * Samples:
639
+ | sentence1 | sentence2 | score |
640
+ |:----------------------------------------------------------------------------------------------|:----------------------------------------------------------------------------------------------------------------------|:---------------------------------|
641
+ | <code>Jackson pidas seal kõne, öeldes, et James Brown on tema suurim inspiratsioon.</code> | <code>Jackson gave a speech there saying that James Brown is his greatest inspiration.</code> | <code>0.9833333492279053</code> |
642
+ | <code>Kaanelugu rääkis loo kolme ungarlase üleelamistest Ungari revolutsiooni päevil.</code> | <code>The life of the Man spoke of a story of three Hungarians living in the days of the Hungarian Revolution.</code> | <code>0.28999999165534973</code> |
643
+ | <code>Teise maailmasõja ajal oli ta mitme Saksa juhatusele alluvate eesti väeosa ülem.</code> | <code>During World War II, he was the commander of several of the German leadership.</code> | <code>0.4516666829586029</code> |
644
+ * Loss: [<code>CoSENTLoss</code>](https://sbert.net/docs/package_reference/sentence_transformer/losses.html#cosentloss) with these parameters:
645
+ ```json
646
+ {
647
+ "scale": 20.0,
648
+ "similarity_fct": "pairwise_cos_sim"
649
+ }
650
+ ```
651
+
652
+ #### mlqe_ne_en
653
+
654
+ * Dataset: [mlqe_ne_en](https://huggingface.co/datasets/wmt/wmt20_mlqe_task1) at [0783ed2](https://huggingface.co/datasets/wmt/wmt20_mlqe_task1/tree/0783ed2bd75f44835df4ea664f9ccb85812c8563)
655
+ * Size: 1,000 evaluation samples
656
+ * Columns: <code>sentence1</code>, <code>sentence2</code>, and <code>score</code>
657
+ * Approximate statistics based on the first 1000 samples:
658
+ | | sentence1 | sentence2 | score |
659
+ |:--------|:-----------------------------------------------------------------------------------|:------------------------------------------------------------------------------------|:-----------------------------------------------------------------|
660
+ | type | string | string | float |
661
+ | details | <ul><li>min: 17 tokens</li><li>mean: 41.03 tokens</li><li>max: 85 tokens</li></ul> | <ul><li>min: 10 tokens</li><li>mean: 24.77 tokens</li><li>max: 128 tokens</li></ul> | <ul><li>min: 0.05</li><li>mean: 0.36</li><li>max: 0.92</li></ul> |
662
+ * Samples:
663
+ | sentence1 | sentence2 | score |
664
+ |:------------------------------------------------------------------------------------------|:---------------------------------------------------------------------------------------|:---------------------------------|
665
+ | <code>१८९२ तिर भवानीदत्त पाण्डेले 'मुद्रा राक्षस'को अनुवाद गरे।</code> | <code>Around 1892, Bhavani Pandit translated the 'money monster'.</code> | <code>0.8416666388511658</code> |
666
+ | <code>यस बच्चाको मुखले आमाको स्तन यस बच्चाको मुखले आमाको स्तन राम्ररी च्यापेको छ ।</code> | <code>The breasts of this child's mouth are taped well with the mother's mouth.</code> | <code>0.2150000035762787</code> |
667
+ | <code>बुवाको बन्दुक चोरेर हिँडेका बराललाई केआई सिंहले अब गोली ल्याउन लगाए ।...</code> | <code>Kei Singh, who stole the boy's closet, took the bullet to bring it now..</code> | <code>0.27000001072883606</code> |
668
+ * Loss: [<code>CoSENTLoss</code>](https://sbert.net/docs/package_reference/sentence_transformer/losses.html#cosentloss) with these parameters:
669
+ ```json
670
+ {
671
+ "scale": 20.0,
672
+ "similarity_fct": "pairwise_cos_sim"
673
+ }
674
+ ```
675
+
676
+ #### mlqe_ro_en
677
+
678
+ * Dataset: [mlqe_ro_en](https://huggingface.co/datasets/wmt/wmt20_mlqe_task1) at [0783ed2](https://huggingface.co/datasets/wmt/wmt20_mlqe_task1/tree/0783ed2bd75f44835df4ea664f9ccb85812c8563)
679
+ * Size: 1,000 evaluation samples
680
+ * Columns: <code>sentence1</code>, <code>sentence2</code>, and <code>score</code>
681
+ * Approximate statistics based on the first 1000 samples:
682
+ | | sentence1 | sentence2 | score |
683
+ |:--------|:-----------------------------------------------------------------------------------|:---------------------------------------------------------------------------------|:----------------------------------------------------------------|
684
+ | type | string | string | float |
685
+ | details | <ul><li>min: 14 tokens</li><li>mean: 30.25 tokens</li><li>max: 59 tokens</li></ul> | <ul><li>min: 6 tokens</li><li>mean: 22.7 tokens</li><li>max: 55 tokens</li></ul> | <ul><li>min: 0.01</li><li>mean: 0.68</li><li>max: 1.0</li></ul> |
686
+ * Samples:
687
+ | sentence1 | sentence2 | score |
688
+ |:----------------------------------------------------------------------------------------------------------------------------------------------------|:------------------------------------------------------------------------------------------------------------------------------------|:----------------------------------|
689
+ | <code>Cornwallis se afla înconjurat pe uscat de forțe armate net superioare și retragerea pe mare era îndoielnică din cauza flotei franceze.</code> | <code>Cornwallis was surrounded by shore by higher armed forces and the sea withdrawal was doubtful due to the French fleet.</code> | <code>0.8199999928474426</code> |
690
+ | <code>thumbrightuprightDansatori [[cretani de muzică tradițională.</code> | <code>Number of employees employed in the production of the like product in the Union.</code> | <code>0.009999999776482582</code> |
691
+ | <code>Potrivit documentelor vremii și tradiției orale, aceasta a fost cea mai grea perioadă din istoria orașului.</code> | <code>According to the documents of the oral weather and tradition, this was the hardest period in the city's history.</code> | <code>0.5383332967758179</code> |
692
+ * Loss: [<code>CoSENTLoss</code>](https://sbert.net/docs/package_reference/sentence_transformer/losses.html#cosentloss) with these parameters:
693
+ ```json
694
+ {
695
+ "scale": 20.0,
696
+ "similarity_fct": "pairwise_cos_sim"
697
+ }
698
+ ```
699
+
700
+ #### mlqe_si_en
701
+
702
+ * Dataset: [mlqe_si_en](https://huggingface.co/datasets/wmt/wmt20_mlqe_task1) at [0783ed2](https://huggingface.co/datasets/wmt/wmt20_mlqe_task1/tree/0783ed2bd75f44835df4ea664f9ccb85812c8563)
703
+ * Size: 1,000 evaluation samples
704
+ * Columns: <code>sentence1</code>, <code>sentence2</code>, and <code>score</code>
705
+ * Approximate statistics based on the first 1000 samples:
706
+ | | sentence1 | sentence2 | score |
707
+ |:--------|:----------------------------------------------------------------------------------|:-----------------------------------------------------------------------------------|:-----------------------------------------------------------------|
708
+ | type | string | string | float |
709
+ | details | <ul><li>min: 8 tokens</li><li>mean: 18.12 tokens</li><li>max: 36 tokens</li></ul> | <ul><li>min: 7 tokens</li><li>mean: 22.18 tokens</li><li>max: 128 tokens</li></ul> | <ul><li>min: 0.03</li><li>mean: 0.51</li><li>max: 0.99</li></ul> |
710
+ * Samples:
711
+ | sentence1 | sentence2 | score |
712
+ |:----------------------------------------------------------------------------------------------------------------------------------------|:----------------------------------------------------------------------------------------------|:--------------------------------|
713
+ | <code>එයට ශි්‍ර ලංකාවේ සාමය ඇති කිරිමටත් නැති කිරිමටත් පුළුවන්.</code> | <code>It can also cause peace in Sri Lanka.</code> | <code>0.3199999928474426</code> |
714
+ | <code>ඔහු මනෝ විද්‍යාව, සමාජ විද්‍යාව, ඉතිහාසය හා සන්නිවේදනය යන විෂය ක්ෂේත්‍රයන් පිලිබදවද අධ්‍යයනයන් සිදු කිරීමට උත්සාහ කරන ලදි.</code> | <code>He attempted to do subjects in psychology, sociology, history and communication.</code> | <code>0.5366666913032532</code> |
715
+ | <code>එහෙත් කිසිදු මිනිසෙක්‌ හෝ ගැහැනියෙක්‌ එලිමහනක නොවූහ.</code> | <code>But no man or woman was eliminated.</code> | <code>0.2783333361148834</code> |
716
+ * Loss: [<code>CoSENTLoss</code>](https://sbert.net/docs/package_reference/sentence_transformer/losses.html#cosentloss) with these parameters:
717
+ ```json
718
+ {
719
+ "scale": 20.0,
720
+ "similarity_fct": "pairwise_cos_sim"
721
+ }
722
+ ```
723
+
+ ### Training Hyperparameters
+ #### Non-Default Hyperparameters
+
+ - `eval_strategy`: steps
+ - `per_device_train_batch_size`: 64
+ - `per_device_eval_batch_size`: 64
+ - `num_train_epochs`: 2
+ - `warmup_ratio`: 0.1
+
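+ For reference, a hedged sketch of a fine-tuning run with CoSENTLoss and the non-default hyperparameters above. The data is a toy stand-in for the sentence1 / sentence2 / score columns; the real run used the multi-dataset setup and the custom pooling architecture described earlier, and also supplied an evaluation dataset for `eval_strategy: steps`:
+
+ ```python
+ from datasets import Dataset
+ from sentence_transformers import (
+     SentenceTransformer,
+     SentenceTransformerTrainer,
+     SentenceTransformerTrainingArguments,
+ )
+ from sentence_transformers.losses import CoSENTLoss
+
+ # Simplification: the base model with its default pooling, not this repo's custom module.
+ model = SentenceTransformer("sentence-transformers/distiluse-base-multilingual-cased-v2")
+
+ train_dataset = Dataset.from_dict({
+     "sentence1": ["Early Muslim traders and merchants visited Bengal.", "Ich bitte Sie."],
+     "sentence2": ["Frühe muslimische Händler und Kaufleute besuchten Bengalen.", "Žádám vás."],
+     "score": [0.92, 0.9],
+ })
+
+ args = SentenceTransformerTrainingArguments(
+     output_dir="outputs",
+     num_train_epochs=2,
+     per_device_train_batch_size=64,
+     per_device_eval_batch_size=64,
+     warmup_ratio=0.1,
+ )
+ trainer = SentenceTransformerTrainer(
+     model=model,
+     args=args,
+     train_dataset=train_dataset,
+     loss=CoSENTLoss(model, scale=20.0),
+ )
+ trainer.train()
+ ```
+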
+ #### All Hyperparameters
+ <details><summary>Click to expand</summary>
+
+ - `overwrite_output_dir`: False
+ - `do_predict`: False
+ - `eval_strategy`: steps
+ - `prediction_loss_only`: True
+ - `per_device_train_batch_size`: 64
+ - `per_device_eval_batch_size`: 64
+ - `per_gpu_train_batch_size`: None
+ - `per_gpu_eval_batch_size`: None
+ - `gradient_accumulation_steps`: 1
+ - `eval_accumulation_steps`: None
+ - `torch_empty_cache_steps`: None
+ - `learning_rate`: 5e-05
+ - `weight_decay`: 0.0
+ - `adam_beta1`: 0.9
+ - `adam_beta2`: 0.999
+ - `adam_epsilon`: 1e-08
+ - `max_grad_norm`: 1.0
+ - `num_train_epochs`: 2
+ - `max_steps`: -1
+ - `lr_scheduler_type`: linear
+ - `lr_scheduler_kwargs`: {}
+ - `warmup_ratio`: 0.1
+ - `warmup_steps`: 0
+ - `log_level`: passive
+ - `log_level_replica`: warning
+ - `log_on_each_node`: True
+ - `logging_nan_inf_filter`: True
+ - `save_safetensors`: True
+ - `save_on_each_node`: False
+ - `save_only_model`: False
+ - `restore_callback_states_from_checkpoint`: False
+ - `no_cuda`: False
+ - `use_cpu`: False
+ - `use_mps_device`: False
+ - `seed`: 42
+ - `data_seed`: None
+ - `jit_mode_eval`: False
+ - `use_ipex`: False
+ - `bf16`: False
+ - `fp16`: False
+ - `fp16_opt_level`: O1
+ - `half_precision_backend`: auto
+ - `bf16_full_eval`: False
+ - `fp16_full_eval`: False
+ - `tf32`: None
+ - `local_rank`: 0
+ - `ddp_backend`: None
+ - `tpu_num_cores`: None
+ - `tpu_metrics_debug`: False
+ - `debug`: []
+ - `dataloader_drop_last`: False
+ - `dataloader_num_workers`: 0
+ - `dataloader_prefetch_factor`: None
+ - `past_index`: -1
+ - `disable_tqdm`: False
+ - `remove_unused_columns`: True
+ - `label_names`: None
+ - `load_best_model_at_end`: False
+ - `ignore_data_skip`: False
+ - `fsdp`: []
+ - `fsdp_min_num_params`: 0
+ - `fsdp_config`: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
+ - `fsdp_transformer_layer_cls_to_wrap`: None
+ - `accelerator_config`: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
+ - `deepspeed`: None
+ - `label_smoothing_factor`: 0.0
+ - `optim`: adamw_torch
+ - `optim_args`: None
+ - `adafactor`: False
+ - `group_by_length`: False
+ - `length_column_name`: length
+ - `ddp_find_unused_parameters`: None
+ - `ddp_bucket_cap_mb`: None
+ - `ddp_broadcast_buffers`: False
+ - `dataloader_pin_memory`: True
+ - `dataloader_persistent_workers`: False
+ - `skip_memory_metrics`: True
+ - `use_legacy_prediction_loop`: False
+ - `push_to_hub`: False
+ - `resume_from_checkpoint`: None
+ - `hub_model_id`: None
+ - `hub_strategy`: every_save
+ - `hub_private_repo`: None
+ - `hub_always_push`: False
+ - `gradient_checkpointing`: False
+ - `gradient_checkpointing_kwargs`: None
+ - `include_inputs_for_metrics`: False
+ - `include_for_metrics`: []
+ - `eval_do_concat_batches`: True
+ - `fp16_backend`: auto
+ - `push_to_hub_model_id`: None
+ - `push_to_hub_organization`: None
+ - `mp_parameters`:
+ - `auto_find_batch_size`: False
+ - `full_determinism`: False
+ - `torchdynamo`: None
+ - `ray_scope`: last
+ - `ddp_timeout`: 1800
+ - `torch_compile`: False
+ - `torch_compile_backend`: None
+ - `torch_compile_mode`: None
+ - `dispatch_batches`: None
+ - `split_batches`: None
+ - `include_tokens_per_second`: False
+ - `include_num_input_tokens_seen`: False
+ - `neftune_noise_alpha`: None
+ - `optim_target_modules`: None
+ - `batch_eval_metrics`: False
+ - `eval_on_start`: False
+ - `use_liger_kernel`: False
+ - `eval_use_gather_object`: False
+ - `average_tokens_across_devices`: False
+ - `prompts`: None
+ - `batch_sampler`: batch_sampler
+ - `multi_dataset_batch_sampler`: proportional
+
+ </details>
+
+ ### Training Logs
+ | Epoch | Step  | Training Loss | wmt da loss | mlqe en de loss | mlqe en zh loss | mlqe et en loss | mlqe ne en loss | mlqe ro en loss | mlqe si en loss | sts-eval_spearman_cosine | sts-test_spearman_cosine |
+ |:-----:|:-----:|:-------------:|:-----------:|:---------------:|:---------------:|:---------------:|:---------------:|:---------------:|:---------------:|:------------------------:|:------------------------:|
+ | 0.4   | 6690  | 7.7924        | 7.5526      | 7.5730          | 7.5671          | 7.5310          | 7.5275          | 7.5066          | 7.5532          | 0.2149                   | -                        |
+ | 0.8   | 13380 | 7.5514        | 7.5407      | 7.5866          | 7.5611          | 7.5121          | 7.5192          | 7.4806          | 7.5379          | 0.2855                   | -                        |
+ | 1.2   | 20070 | 7.5208        | 7.5386      | 7.6114          | 7.5660          | 7.5198          | 7.5141          | 7.4859          | 7.5461          | 0.2722                   | -                        |
+ | 1.6   | 26760 | 7.5011        | 7.5307      | 7.6242          | 7.5659          | 7.5220          | 7.5073          | 7.4819          | 7.5440          | 0.2830                   | -                        |
+ | 2.0   | 33450 | 7.4927        | 7.5275      | 7.6315          | 7.5681          | 7.5200          | 7.5144          | 7.4908          | 7.5481          | 0.2768                   | 0.2676                   |
+
+
+ ### Framework Versions
+ - Python: 3.11.10
+ - Sentence Transformers: 3.3.1
+ - Transformers: 4.47.1
+ - PyTorch: 2.3.1+cu121
+ - Accelerate: 1.2.1
+ - Datasets: 3.2.0
+ - Tokenizers: 0.21.0
+
+ ## Citation
+
+ ### BibTeX
+
+ #### Sentence Transformers
+ ```bibtex
+ @inproceedings{reimers-2019-sentence-bert,
+     title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
+     author = "Reimers, Nils and Gurevych, Iryna",
+     booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
+     month = "11",
+     year = "2019",
+     publisher = "Association for Computational Linguistics",
+     url = "https://arxiv.org/abs/1908.10084",
+ }
+ ```
+
+ #### CoSENTLoss
+ ```bibtex
+ @online{kexuefm-8847,
+     title={CoSENT: A more efficient sentence vector scheme than Sentence-BERT},
+     author={Su Jianlin},
+     year={2022},
+     month={Jan},
+     url={https://kexue.fm/archives/8847},
+ }
+ ```
+
+ <!--
+ ## Glossary
+
+ *Clearly define terms in order to be accessible across audiences.*
+ -->
+
+ <!--
+ ## Model Card Authors
+
+ *Lists the people who create the model card, providing recognition and accountability for the detailed work that goes into its construction.*
+ -->
+
+ <!--
+ ## Model Card Contact
+
+ *Provides a way for people who have updates to the Model Card, suggestions, or questions, to contact the Model Card authors.*
+ -->
config.json ADDED
@@ -0,0 +1,26 @@
+ {
+   "_name_or_path": "sentence-transformers/distiluse-base-multilingual-cased-v2",
+   "activation": "gelu",
+   "architectures": [
+     "DistilBertModel"
+   ],
+   "attention_dropout": 0.1,
+   "dim": 768,
+   "dropout": 0.1,
+   "hidden_dim": 3072,
+   "initializer_range": 0.02,
+   "max_position_embeddings": 512,
+   "model_type": "distilbert",
+   "n_heads": 12,
+   "n_layers": 6,
+   "output_hidden_states": true,
+   "output_past": true,
+   "pad_token_id": 0,
+   "qa_dropout": 0.1,
+   "seq_classif_dropout": 0.2,
+   "sinusoidal_pos_embds": false,
+   "tie_weights_": true,
+   "torch_dtype": "float32",
+   "transformers_version": "4.47.1",
+   "vocab_size": 119547
+ }
config_sentence_transformers.json ADDED
@@ -0,0 +1,10 @@
+ {
+   "__version__": {
+     "sentence_transformers": "3.3.1",
+     "transformers": "4.47.1",
+     "pytorch": "2.3.1+cu121"
+   },
+   "prompts": {},
+   "default_prompt_name": null,
+   "similarity_fn_name": "cosine"
+ }
model.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:7836bfc3e4d585ad45947c8e5483f91de68361a854d8636f9439e2962aca3301
+ size 538947416
modules.json ADDED
@@ -0,0 +1,20 @@
+ [
+   {
+     "idx": 0,
+     "name": "0",
+     "path": "",
+     "type": "sentence_transformers.models.Transformer"
+   },
+   {
+     "idx": 1,
+     "name": "1",
+     "path": "1_MultiHeadGeneralizedPooling",
+     "type": "sentence_pooling.multihead_generalized_pooling.MultiHeadGeneralizedPooling"
+   },
+   {
+     "idx": 2,
+     "name": "2",
+     "path": "2_Dense",
+     "type": "sentence_transformers.models.Dense"
+   }
+ ]
sentence_bert_config.json ADDED
@@ -0,0 +1,4 @@
+ {
+   "max_seq_length": 128,
+   "do_lower_case": false
+ }
special_tokens_map.json ADDED
@@ -0,0 +1,37 @@
+ {
+   "cls_token": {
+     "content": "[CLS]",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "mask_token": {
+     "content": "[MASK]",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "pad_token": {
+     "content": "[PAD]",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "sep_token": {
+     "content": "[SEP]",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "unk_token": {
+     "content": "[UNK]",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   }
+ }
tokenizer.json ADDED
The diff for this file is too large to render. See raw diff
 
tokenizer_config.json ADDED
@@ -0,0 +1,60 @@
+ {
+   "added_tokens_decoder": {
+     "0": {
+       "content": "[PAD]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "100": {
+       "content": "[UNK]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "101": {
+       "content": "[CLS]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "102": {
+       "content": "[SEP]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "103": {
+       "content": "[MASK]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     }
+   },
+   "clean_up_tokenization_spaces": false,
+   "cls_token": "[CLS]",
+   "do_basic_tokenize": true,
+   "do_lower_case": false,
+   "extra_special_tokens": {},
+   "full_tokenizer_file": null,
+   "mask_token": "[MASK]",
+   "max_len": 512,
+   "model_max_length": 128,
+   "never_split": null,
+   "pad_token": "[PAD]",
+   "sep_token": "[SEP]",
+   "strip_accents": null,
+   "tokenize_chinese_chars": true,
+   "tokenizer_class": "DistilBertTokenizer",
+   "unk_token": "[UNK]"
+ }
vocab.txt ADDED
The diff for this file is too large to render. See raw diff