RomainDarous committed on
Commit 28d1e0d
1 Parent(s): 597a361

Add new SentenceTransformer model
1_MultiHeadGeneralizedPooling/config.json ADDED
@@ -0,0 +1,6 @@
{
    "sentence_dim": 768,
    "token_dim": 768,
    "num_heads": 8,
    "initialize": "random"
}
2_Dense/config.json ADDED
@@ -0,0 +1 @@
{"in_features": 768, "out_features": 512, "bias": true, "activation_function": "torch.nn.modules.activation.Tanh"}
2_Dense/model.safetensors ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:dd2d205d1ef60a79b731726fa0d39c3e083780648f145f43d8d52ff9ccb8f107
size 1575072
README.md ADDED
@@ -0,0 +1,915 @@
---
language:
- bn
- cs
- de
- en
- et
- fi
- fr
- gu
- ha
- hi
- is
- ja
- kk
- km
- lt
- lv
- pl
- ps
- ru
- ta
- tr
- uk
- xh
- zh
- zu
- ne
- ro
- si
tags:
- sentence-transformers
- sentence-similarity
- feature-extraction
- generated_from_trainer
- dataset_size:1327190
- loss:CoSENTLoss
base_model: sentence-transformers/distiluse-base-multilingual-cased-v2
widget:
- source_sentence: यहाँका केही धार्मिक सम्पदाहरू यस प्रकार रहेका छन्।
  sentences:
  - A party works journalists from advertisements about a massive Himalayan post.
  - Some religious affiliations here remain.
  - In Spain, the strict opposition of Roman Catholic churches is found to have assumed a marriage similar to male beach wives.
- source_sentence: AP White House reporter Jill Colvin greeted McEnany at her first briefing by asking, "Will you pledge never to lie to us from that podium?"
  sentences:
  - There is a need for the people of Kano State, especially those who are employed, to give the unemployed access to the program to address the problems of the unemployed youth in the country.
  - 美联社白宫记者吉尔·科尔文(Jill Colvin)在麦克纳尼的第一次简报会上向她打招呼,问道:“你能保证永远不会在讲台上对我们撒谎吗?”
  - The violence underscores the precarious security situation in Afghanistan as U.S. President Donald Trump weighs increasing the number of U.S. troops supporting the military and police in the country.
- source_sentence: He possesses a pistol with silver bullets for protection from vampires and werewolves.
  sentences:
  - Er besitzt eine Pistole mit silbernen Kugeln zum Schutz vor Vampiren und Werwölfen.
  - Bibimbap umfasst Reis, Spinat, Rettich, Bohnensprossen.
  - BSAC profitierte auch von den großen, aber nicht unbeschränkten persönlichen Vermögen von Rhodos und Beit vor ihrem Tod.
- source_sentence: To the west of the Badger Head Inlier is the Port Sorell Formation, a tectonic mélange of marine sediments and dolerite.
  sentences:
  - Er brennt einen Speer und brennt Flammen aus seinem Mund, wenn er wütend ist.
  - Westlich des Badger Head Inlier befindet sich die Port Sorell Formation, eine tektonische Mischung aus Sedimenten und Dolerit.
  - Public Lynching and Mob Violence under Modi Government
- source_sentence: Garnizoana otomană se retrage în sudul Dunării, iar după 164 de ani cetatea intră din nou sub stăpânirea europenilor.
  sentences:
  - This is because, once again, we have taken into account the fact that we have adopted a large number of legislative proposals.
  - Helsinki University ranks 75th among universities for 2010.
  - Ottoman garnisoana is withdrawing into the south of the Danube and, after 164 years, it is once again under the control of Europeans.
datasets:
- RicardoRei/wmt-da-human-evaluation
- wmt/wmt20_mlqe_task1
pipeline_tag: sentence-similarity
library_name: sentence-transformers
metrics:
- pearson_cosine
- spearman_cosine
model-index:
- name: SentenceTransformer based on sentence-transformers/distiluse-base-multilingual-cased-v2
  results:
  - task:
      type: semantic-similarity
      name: Semantic Similarity
    dataset:
      name: sts eval
      type: sts-eval
    metrics:
    - type: pearson_cosine
      value: 0.3206973346263331
      name: Pearson Cosine
    - type: spearman_cosine
      value: 0.30186185706678065
      name: Spearman Cosine
    - type: pearson_cosine
      value: 0.16415599381152823
      name: Pearson Cosine
    - type: spearman_cosine
      value: 0.2100212895924085
      name: Spearman Cosine
    - type: pearson_cosine
      value: 0.2835638593581582
      name: Pearson Cosine
    - type: spearman_cosine
      value: 0.28768623299130575
      name: Spearman Cosine
    - type: pearson_cosine
      value: 0.5058926579356612
      name: Pearson Cosine
    - type: spearman_cosine
      value: 0.4940621216662592
      name: Spearman Cosine
    - type: pearson_cosine
      value: 0.37071342497736826
      name: Pearson Cosine
    - type: spearman_cosine
      value: 0.3890195172034537
      name: Spearman Cosine
    - type: pearson_cosine
      value: 0.6655183783252212
      name: Pearson Cosine
    - type: spearman_cosine
      value: 0.6069408353469313
      name: Spearman Cosine
    - type: pearson_cosine
      value: 0.2833344156983574
      name: Pearson Cosine
    - type: spearman_cosine
      value: 0.2814491820129572
      name: Spearman Cosine
  - task:
      type: semantic-similarity
      name: Semantic Similarity
    dataset:
      name: sts test
      type: sts-test
    metrics:
    - type: pearson_cosine
      value: 0.31527674589721005
      name: Pearson Cosine
    - type: spearman_cosine
      value: 0.29671444308890826
      name: Spearman Cosine
    - type: pearson_cosine
      value: 0.1309209199952754
      name: Pearson Cosine
    - type: spearman_cosine
      value: 0.09868784578188826
      name: Spearman Cosine
    - type: pearson_cosine
      value: 0.22966057387948113
      name: Pearson Cosine
    - type: spearman_cosine
      value: 0.24221319169582142
      name: Spearman Cosine
    - type: pearson_cosine
      value: 0.49607072945477154
      name: Pearson Cosine
    - type: spearman_cosine
      value: 0.4952015667722211
      name: Spearman Cosine
    - type: pearson_cosine
      value: 0.3697043788503178
      name: Pearson Cosine
    - type: spearman_cosine
      value: 0.37691503947177424
      name: Spearman Cosine
    - type: pearson_cosine
      value: 0.7060091540128164
      name: Pearson Cosine
    - type: spearman_cosine
      value: 0.6354850557046146
      name: Spearman Cosine
    - type: pearson_cosine
      value: 0.34560690557182
      name: Pearson Cosine
    - type: spearman_cosine
      value: 0.3130941622579434
      name: Spearman Cosine
---
188
+
189
+ # SentenceTransformer based on sentence-transformers/distiluse-base-multilingual-cased-v2
190
+
191
+ This is a [sentence-transformers](https://www.SBERT.net) model finetuned from [sentence-transformers/distiluse-base-multilingual-cased-v2](https://huggingface.co/sentence-transformers/distiluse-base-multilingual-cased-v2) on the [wmt_da](https://huggingface.co/datasets/RicardoRei/wmt-da-human-evaluation), [mlqe_en_de](https://huggingface.co/datasets/wmt/wmt20_mlqe_task1), [mlqe_en_zh](https://huggingface.co/datasets/wmt/wmt20_mlqe_task1), [mlqe_et_en](https://huggingface.co/datasets/wmt/wmt20_mlqe_task1), [mlqe_ne_en](https://huggingface.co/datasets/wmt/wmt20_mlqe_task1), [mlqe_ro_en](https://huggingface.co/datasets/wmt/wmt20_mlqe_task1) and [mlqe_si_en](https://huggingface.co/datasets/wmt/wmt20_mlqe_task1) datasets. It maps sentences & paragraphs to a 512-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.
192
+
193
+ ## Model Details
194
+
195
+ ### Model Description
196
+ - **Model Type:** Sentence Transformer
197
+ - **Base model:** [sentence-transformers/distiluse-base-multilingual-cased-v2](https://huggingface.co/sentence-transformers/distiluse-base-multilingual-cased-v2) <!-- at revision dad0fa1ee4fa6e982d3adbce87c73c02e6aee838 -->
198
+ - **Maximum Sequence Length:** 128 tokens
199
+ - **Output Dimensionality:** 512 dimensions
200
+ - **Similarity Function:** Cosine Similarity
201
+ - **Training Datasets:**
202
+ - [wmt_da](https://huggingface.co/datasets/RicardoRei/wmt-da-human-evaluation)
203
+ - [mlqe_en_de](https://huggingface.co/datasets/wmt/wmt20_mlqe_task1)
204
+ - [mlqe_en_zh](https://huggingface.co/datasets/wmt/wmt20_mlqe_task1)
205
+ - [mlqe_et_en](https://huggingface.co/datasets/wmt/wmt20_mlqe_task1)
206
+ - [mlqe_ne_en](https://huggingface.co/datasets/wmt/wmt20_mlqe_task1)
207
+ - [mlqe_ro_en](https://huggingface.co/datasets/wmt/wmt20_mlqe_task1)
208
+ - [mlqe_si_en](https://huggingface.co/datasets/wmt/wmt20_mlqe_task1)
209
+ - **Languages:** bn, cs, de, en, et, fi, fr, gu, ha, hi, is, ja, kk, km, lt, lv, pl, ps, ru, ta, tr, uk, xh, zh, zu, ne, ro, si
210
+ <!-- - **License:** Unknown -->
211
+
212
+ ### Model Sources
213
+
214
+ - **Documentation:** [Sentence Transformers Documentation](https://sbert.net)
215
+ - **Repository:** [Sentence Transformers on GitHub](https://github.com/UKPLab/sentence-transformers)
216
+ - **Hugging Face:** [Sentence Transformers on Hugging Face](https://huggingface.co/models?library=sentence-transformers)
217
+
218
+ ### Full Model Architecture
219
+
220
+ ```
221
+ SentenceTransformer(
222
+ (0): Transformer({'max_seq_length': 128, 'do_lower_case': False}) with Transformer model: DistilBertModel
223
+ (1): MultiHeadGeneralizedPooling(
224
+ (P): ModuleList(
225
+ (0-7): 8 x Linear(in_features=768, out_features=96, bias=True)
226
+ )
227
+ (W1): ModuleList(
228
+ (0-7): 8 x Linear(in_features=96, out_features=384, bias=True)
229
+ )
230
+ (W2): ModuleList(
231
+ (0-7): 8 x Linear(in_features=384, out_features=96, bias=True)
232
+ )
233
+ )
234
+ (2): Dense({'in_features': 768, 'out_features': 512, 'bias': True, 'activation_function': 'torch.nn.modules.activation.Tanh'})
235
+ )
236
+ ```
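
The `MultiHeadGeneralizedPooling` module is custom and not part of the stock Sentence Transformers library, so the printout above is the only specification we have. As a rough sketch of what those shapes imply — 8 heads, a 768→96 projection `P` per head, and a 96→384→96 scoring MLP (`W1`, `W2`) whose output attention-weights the projected tokens before the head outputs are concatenated back to 768 dimensions — here is a minimal PyTorch reconstruction. The forward logic (tanh nonlinearity, softmax over tokens, masking) is our assumption, not the author's code:

```python
import torch
import torch.nn as nn


class MultiHeadGeneralizedPooling(nn.Module):
    """Hypothetical sketch matching the printed module shapes:
    per head, project tokens (P: 768->96), score them with a small
    MLP (W1: 96->384, W2: 384->96), softmax the scores over the
    token axis, and sum. Heads are concatenated back to 768 dims."""

    def __init__(self, token_dim: int = 768, num_heads: int = 8):
        super().__init__()
        head_dim = token_dim // num_heads          # 96
        hidden = token_dim // 2                    # 384
        self.P = nn.ModuleList([nn.Linear(token_dim, head_dim) for _ in range(num_heads)])
        self.W1 = nn.ModuleList([nn.Linear(head_dim, hidden) for _ in range(num_heads)])
        self.W2 = nn.ModuleList([nn.Linear(hidden, head_dim) for _ in range(num_heads)])

    def forward(self, token_embeddings: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
        heads = []
        for P, W1, W2 in zip(self.P, self.W1, self.W2):
            proj = P(token_embeddings)                         # (B, T, 96)
            scores = W2(torch.tanh(W1(proj)))                  # (B, T, 96)
            # mask out padding tokens before the softmax over tokens
            scores = scores.masked_fill(attention_mask.unsqueeze(-1) == 0, -1e9)
            weights = torch.softmax(scores, dim=1)             # attend over T
            heads.append((weights * proj).sum(dim=1))          # (B, 96)
        return torch.cat(heads, dim=-1)                        # (B, 768)
```

The 768-dimensional pooled vector then feeds the Dense layer (module (2) above), which maps it to the final 512-dimensional embedding.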

## Usage

### Direct Usage (Sentence Transformers)

First install the Sentence Transformers library:

```bash
pip install -U sentence-transformers
```

Then you can load this model and run inference.

```python
from sentence_transformers import SentenceTransformer

# Download from the 🤗 Hub
model = SentenceTransformer("RomainDarous/generalized")
# Run inference
sentences = [
    'Garnizoana otomană se retrage în sudul Dunării, iar după 164 de ani cetatea intră din nou sub stăpânirea europenilor.',
    'Ottoman garnisoana is withdrawing into the south of the Danube and, after 164 years, it is once again under the control of Europeans.',
    'This is because, once again, we have taken into account the fact that we have adopted a large number of legislative proposals.',
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# [3, 512]

# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# [3, 3]
```

<!--
### Direct Usage (Transformers)

<details><summary>Click to see the direct usage in Transformers</summary>

</details>
-->

<!--
### Downstream Usage (Sentence Transformers)

You can finetune this model on your own dataset.

<details><summary>Click to expand</summary>

</details>
-->

<!--
### Out-of-Scope Use

*List how the model may foreseeably be misused and address what users ought not to do with the model.*
-->

## Evaluation

### Metrics

#### Semantic Similarity

* Datasets: `sts-eval` and `sts-test`
* Evaluated with [<code>EmbeddingSimilarityEvaluator</code>](https://sbert.net/docs/package_reference/sentence_transformer/evaluation.html#sentence_transformers.evaluation.EmbeddingSimilarityEvaluator)

| Metric              | sts-eval   | sts-test   |
|:--------------------|:-----------|:-----------|
| pearson_cosine      | 0.3207     | 0.3456     |
| **spearman_cosine** | **0.3019** | **0.3131** |

#### Semantic Similarity

* Dataset: `sts-eval`
* Evaluated with [<code>EmbeddingSimilarityEvaluator</code>](https://sbert.net/docs/package_reference/sentence_transformer/evaluation.html#sentence_transformers.evaluation.EmbeddingSimilarityEvaluator)

| Metric              | Value    |
|:--------------------|:---------|
| pearson_cosine      | 0.1642   |
| **spearman_cosine** | **0.21** |

#### Semantic Similarity

* Dataset: `sts-eval`
* Evaluated with [<code>EmbeddingSimilarityEvaluator</code>](https://sbert.net/docs/package_reference/sentence_transformer/evaluation.html#sentence_transformers.evaluation.EmbeddingSimilarityEvaluator)

| Metric              | Value      |
|:--------------------|:-----------|
| pearson_cosine      | 0.2836     |
| **spearman_cosine** | **0.2877** |

#### Semantic Similarity

* Dataset: `sts-eval`
* Evaluated with [<code>EmbeddingSimilarityEvaluator</code>](https://sbert.net/docs/package_reference/sentence_transformer/evaluation.html#sentence_transformers.evaluation.EmbeddingSimilarityEvaluator)

| Metric              | Value      |
|:--------------------|:-----------|
| pearson_cosine      | 0.5059     |
| **spearman_cosine** | **0.4941** |

#### Semantic Similarity

* Dataset: `sts-eval`
* Evaluated with [<code>EmbeddingSimilarityEvaluator</code>](https://sbert.net/docs/package_reference/sentence_transformer/evaluation.html#sentence_transformers.evaluation.EmbeddingSimilarityEvaluator)

| Metric              | Value     |
|:--------------------|:----------|
| pearson_cosine      | 0.3707    |
| **spearman_cosine** | **0.389** |

#### Semantic Similarity

* Dataset: `sts-eval`
* Evaluated with [<code>EmbeddingSimilarityEvaluator</code>](https://sbert.net/docs/package_reference/sentence_transformer/evaluation.html#sentence_transformers.evaluation.EmbeddingSimilarityEvaluator)

| Metric              | Value      |
|:--------------------|:-----------|
| pearson_cosine      | 0.6655     |
| **spearman_cosine** | **0.6069** |

#### Semantic Similarity

* Dataset: `sts-eval`
* Evaluated with [<code>EmbeddingSimilarityEvaluator</code>](https://sbert.net/docs/package_reference/sentence_transformer/evaluation.html#sentence_transformers.evaluation.EmbeddingSimilarityEvaluator)

| Metric              | Value      |
|:--------------------|:-----------|
| pearson_cosine      | 0.2833     |
| **spearman_cosine** | **0.2814** |
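
The `pearson_cosine` / `spearman_cosine` numbers above are correlations between the cosine similarity of each embedding pair and its gold human score. A minimal re-implementation of that computation with NumPy and SciPy (the function name is ours, not part of any library):

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr


def cosine_correlations(emb1: np.ndarray, emb2: np.ndarray, gold_scores: np.ndarray):
    """Correlate pairwise cosine similarities with gold similarity scores.

    emb1, emb2: (N, D) arrays of sentence embeddings for the two sides
    of each pair; gold_scores: (N,) human judgments.
    Returns (pearson, spearman)."""
    cos = np.sum(emb1 * emb2, axis=1) / (
        np.linalg.norm(emb1, axis=1) * np.linalg.norm(emb2, axis=1)
    )
    return pearsonr(cos, gold_scores)[0], spearmanr(cos, gold_scores)[0]
```

Pearson measures linear agreement with the gold scores, while Spearman (the bolded headline metric) only cares about ranking the pairs in the right order.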
367
+
368
+ <!--
369
+ ## Bias, Risks and Limitations
370
+
371
+ *What are the known or foreseeable issues stemming from this model? You could also flag here known failure cases or weaknesses of the model.*
372
+ -->
373
+
374
+ <!--
375
+ ### Recommendations
376
+
377
+ *What are recommendations with respect to the foreseeable issues? For example, filtering explicit content.*
378
+ -->
379
+
380
+ ## Training Details
381
+
382
+ ### Training Datasets
383
+
384
+ #### wmt_da
385
+
386
+ * Dataset: [wmt_da](https://huggingface.co/datasets/RicardoRei/wmt-da-human-evaluation) at [301de38](https://huggingface.co/datasets/RicardoRei/wmt-da-human-evaluation/tree/301de385bf05b0c00a8f4be74965e186164dd425)
387
+ * Size: 1,285,190 training samples
388
+ * Columns: <code>sentence1</code>, <code>sentence2</code>, and <code>score</code>
389
+ * Approximate statistics based on the first 1000 samples:
390
+ | | sentence1 | sentence2 | score |
391
+ |:--------|:----------------------------------------------------------------------------------|:-----------------------------------------------------------------------------------|:---------------------------------------------------------------|
392
+ | type | string | string | float |
393
+ | details | <ul><li>min: 4 tokens</li><li>mean: 37.0 tokens</li><li>max: 128 tokens</li></ul> | <ul><li>min: 4 tokens</li><li>mean: 36.84 tokens</li><li>max: 128 tokens</li></ul> | <ul><li>min: 0.0</li><li>mean: 0.72</li><li>max: 1.0</li></ul> |
394
+ * Samples:
395
+ | sentence1 | sentence2 | score |
396
+ |:------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|:-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|:------------------|
397
+ | <code>在指挥员下达升旗指令后,升旗手奋力挥臂划出一道弧线,鲜艳的五星红旗如同“雄鹰展翅”一般舒展开旗面,伴随国歌激昂雄壮的旋律缓缓升起。</code> | <code>After the commander gave orders to raise the flag, the flag-bearer swung his arm to draw an arc, and the bright five-star red flag spread out like an eagle's wing, slowly rising with the national anthem's strong melody.</code> | <code>0.94</code> |
398
+ | <code>The report also said the monitoring team had received information that two senior Islamic State commanders, Abu Qutaibah and Abu Hajar al-Iraqi, had recently arrived in Afghanistan from the Middle East.</code> | <code>另外,报告还表示,监管小组目前已经得到消息称伊斯兰国两名高级指挥官阿布•库泰巴(Abu Qutaibah)和阿布•哈吉尔•伊拉克(Abu Qutaibah and Abu Hajar al-Iraqi)近期已从中东抵达阿富汗。</code> | <code>0.82</code> |
399
+ | <code>Aus der Schusswunde floss dann Blut.</code> | <code>From the gunshot wound then flowed blood.</code> | <code>0.73</code> |
400
+ * Loss: [<code>CoSENTLoss</code>](https://sbert.net/docs/package_reference/sentence_transformer/losses.html#cosentloss) with these parameters:
401
+ ```json
402
+ {
403
+ "scale": 20.0,
404
+ "similarity_fct": "pairwise_cos_sim"
405
+ }
406
+ ```
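
CoSENTLoss is a ranking loss: for any two training pairs whose gold scores differ, it pushes the cosine similarity of the higher-scored pair above that of the lower-scored one, with `scale` sharpening the penalty. A stripped-down version of the formula (our own reduction for illustration, not the library's implementation):

```python
import torch


def cosent_loss(cos_sims: torch.Tensor, gold: torch.Tensor, scale: float = 20.0) -> torch.Tensor:
    """log(1 + sum over ordered pairs of exp(scale * (cos_j - cos_i)))
    taken over all (i, j) with gold[i] > gold[j], i.e. pair i should
    end up with the larger cosine similarity.

    cos_sims: (N,) model cosine similarities, one per sentence pair
    gold: (N,) gold similarity scores."""
    diff = scale * (cos_sims[None, :] - cos_sims[:, None])  # diff[i, j] = scale * (cos_j - cos_i)
    mask = gold[:, None] > gold[None, :]                    # gold says cos_i should exceed cos_j
    terms = diff[mask]
    zero = torch.zeros(1, dtype=cos_sims.dtype)             # the "1 +" inside the log
    return torch.logsumexp(torch.cat([zero, terms]), dim=0)
```

With `scale=20.0` (the setting used here), even a small inversion in the ranking produces a large exponent, so the loss concentrates on the worst-ordered pairs in the batch.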

#### mlqe_en_de

* Dataset: [mlqe_en_de](https://huggingface.co/datasets/wmt/wmt20_mlqe_task1) at [0783ed2](https://huggingface.co/datasets/wmt/wmt20_mlqe_task1/tree/0783ed2bd75f44835df4ea664f9ccb85812c8563)
* Size: 7,000 training samples
* Columns: <code>sentence1</code>, <code>sentence2</code>, and <code>score</code>
* Approximate statistics based on the first 1000 samples:
  |         | sentence1 | sentence2 | score |
  |:--------|:----------|:----------|:------|
  | type    | string | string | float |
  | details | <ul><li>min: 11 tokens</li><li>mean: 23.78 tokens</li><li>max: 44 tokens</li></ul> | <ul><li>min: 11 tokens</li><li>mean: 26.51 tokens</li><li>max: 54 tokens</li></ul> | <ul><li>min: 0.06</li><li>mean: 0.86</li><li>max: 1.0</li></ul> |
* Samples:
  | sentence1 | sentence2 | score |
  |:----------|:----------|:------|
  | <code>Early Muslim traders and merchants visited Bengal while traversing the Silk Road in the first millennium.</code> | <code>Frühe muslimische Händler und Kaufleute besuchten Bengalen, während sie im ersten Jahrtausend die Seidenstraße durchquerten.</code> | <code>0.9233333468437195</code> |
  | <code>While Fran dissipated shortly after that, the tropical wave progressed into the northeastern Pacific Ocean.</code> | <code>Während Fran kurz danach zerstreute, entwickelte sich die tropische Welle in den nordöstlichen Pazifischen Ozean.</code> | <code>0.8899999856948853</code> |
  | <code>Distressed securities include such events as restructurings, recapitalizations, and bankruptcies.</code> | <code>Zu den belasteten Wertpapieren gehören Restrukturierungen, Rekapitalisierungen und Insolvenzen.</code> | <code>0.9300000071525574</code> |
* Loss: [<code>CoSENTLoss</code>](https://sbert.net/docs/package_reference/sentence_transformer/losses.html#cosentloss) with these parameters:
  ```json
  {
      "scale": 20.0,
      "similarity_fct": "pairwise_cos_sim"
  }
  ```

#### mlqe_en_zh

* Dataset: [mlqe_en_zh](https://huggingface.co/datasets/wmt/wmt20_mlqe_task1) at [0783ed2](https://huggingface.co/datasets/wmt/wmt20_mlqe_task1/tree/0783ed2bd75f44835df4ea664f9ccb85812c8563)
* Size: 7,000 training samples
* Columns: <code>sentence1</code>, <code>sentence2</code>, and <code>score</code>
* Approximate statistics based on the first 1000 samples:
  |         | sentence1 | sentence2 | score |
  |:--------|:----------|:----------|:------|
  | type    | string | string | float |
  | details | <ul><li>min: 9 tokens</li><li>mean: 24.09 tokens</li><li>max: 47 tokens</li></ul> | <ul><li>min: 12 tokens</li><li>mean: 29.93 tokens</li><li>max: 74 tokens</li></ul> | <ul><li>min: 0.01</li><li>mean: 0.68</li><li>max: 0.98</li></ul> |
* Samples:
  | sentence1 | sentence2 | score |
  |:----------|:----------|:------|
  | <code>In the late 1980s, the hotel's reputation declined, and it functioned partly as a "backpackers hangout."</code> | <code>在 20 世纪 80 年代末 , 这家旅馆的声誉下降了 , 部分地起到了 "背包吊销" 的作用。</code> | <code>0.40666666626930237</code> |
  | <code>From 1870 to 1915, 36 million Europeans migrated away from Europe.</code> | <code>从 1870 年到 1915 年 , 3, 600 万欧洲人从欧洲移民。</code> | <code>0.8333333730697632</code> |
  | <code>In some photos, the footpads did press into the regolith, especially when they moved sideways at touchdown.</code> | <code>在一些照片中 , 脚垫确实挤进了后台 , 尤其是当他们在触地时侧面移动时。</code> | <code>0.33000001311302185</code> |
* Loss: [<code>CoSENTLoss</code>](https://sbert.net/docs/package_reference/sentence_transformer/losses.html#cosentloss) with these parameters:
  ```json
  {
      "scale": 20.0,
      "similarity_fct": "pairwise_cos_sim"
  }
  ```

#### mlqe_et_en

* Dataset: [mlqe_et_en](https://huggingface.co/datasets/wmt/wmt20_mlqe_task1) at [0783ed2](https://huggingface.co/datasets/wmt/wmt20_mlqe_task1/tree/0783ed2bd75f44835df4ea664f9ccb85812c8563)
* Size: 7,000 training samples
* Columns: <code>sentence1</code>, <code>sentence2</code>, and <code>score</code>
* Approximate statistics based on the first 1000 samples:
  |         | sentence1 | sentence2 | score |
  |:--------|:----------|:----------|:------|
  | type    | string | string | float |
  | details | <ul><li>min: 14 tokens</li><li>mean: 31.88 tokens</li><li>max: 63 tokens</li></ul> | <ul><li>min: 11 tokens</li><li>mean: 24.57 tokens</li><li>max: 56 tokens</li></ul> | <ul><li>min: 0.03</li><li>mean: 0.67</li><li>max: 1.0</li></ul> |
* Samples:
  | sentence1 | sentence2 | score |
  |:----------|:----------|:------|
  | <code>Gruusias vahistati president Mihhail Saakašvili pressibüroo nõunik Simon Kiladze, keda süüdistati spioneerimises.</code> | <code>In Georgia, an adviser to the press office of President Mikhail Saakashvili, Simon Kiladze, was arrested and accused of spying.</code> | <code>0.9466666579246521</code> |
  | <code>Nii teadmissotsioloogia pooldajad tavaliselt Kuhni tõlgendavadki, arendades tema vaated sõnaselgeks relativismiks.</code> | <code>This is how supporters of knowledge sociology usually interpret Kuhn by developing his views into an explicit relativism.</code> | <code>0.9366666674613953</code> |
  | <code>18. jaanuaril 2003 haarasid mitmeid Canberra eeslinnu võsapõlengud, milles hukkus neli ja sai vigastada 435 inimest.</code> | <code>On 18 January 2003, several of the suburbs of Canberra were seized by debt fires which killed four people and injured 435 people.</code> | <code>0.8666666150093079</code> |
* Loss: [<code>CoSENTLoss</code>](https://sbert.net/docs/package_reference/sentence_transformer/losses.html#cosentloss) with these parameters:
  ```json
  {
      "scale": 20.0,
      "similarity_fct": "pairwise_cos_sim"
  }
  ```

#### mlqe_ne_en

* Dataset: [mlqe_ne_en](https://huggingface.co/datasets/wmt/wmt20_mlqe_task1) at [0783ed2](https://huggingface.co/datasets/wmt/wmt20_mlqe_task1/tree/0783ed2bd75f44835df4ea664f9ccb85812c8563)
* Size: 7,000 training samples
* Columns: <code>sentence1</code>, <code>sentence2</code>, and <code>score</code>
* Approximate statistics based on the first 1000 samples:
  |         | sentence1 | sentence2 | score |
  |:--------|:----------|:----------|:------|
  | type    | string | string | float |
  | details | <ul><li>min: 17 tokens</li><li>mean: 40.67 tokens</li><li>max: 77 tokens</li></ul> | <ul><li>min: 9 tokens</li><li>mean: 24.66 tokens</li><li>max: 128 tokens</li></ul> | <ul><li>min: 0.01</li><li>mean: 0.39</li><li>max: 1.0</li></ul> |
* Samples:
  | sentence1 | sentence2 | score |
  |:----------|:----------|:------|
  | <code>सामान्‍य बजट प्रायः फेब्रुअरीका अंतिम कार्य दिवसमा लाईन्छ।</code> | <code>A normal budget is usually awarded to the digital working day of February.</code> | <code>0.5600000023841858</code> |
  | <code>कविताका यस्ता स्वरूपमा दुई, तिन वा चार पाउसम्मका मुक्तक, हाइकु, सायरी र लोकसूक्तिहरू पर्दछन् ।</code> | <code>The book consists of two, free of her or four paulets, haiku, Sairi, and locus in such forms.</code> | <code>0.23666666448116302</code> |
  | <code>ब्रिट्नीले यस बारेमा प्रतिक्रिया ब्यक्ता गरदै भनिन,"कुन ठूलो ��ुरा हो र?</code> | <code>Britney did not respond to this, saying "which is a big thing and a big thing?</code> | <code>0.21666665375232697</code> |
* Loss: [<code>CoSENTLoss</code>](https://sbert.net/docs/package_reference/sentence_transformer/losses.html#cosentloss) with these parameters:
  ```json
  {
      "scale": 20.0,
      "similarity_fct": "pairwise_cos_sim"
  }
  ```

#### mlqe_ro_en

* Dataset: [mlqe_ro_en](https://huggingface.co/datasets/wmt/wmt20_mlqe_task1) at [0783ed2](https://huggingface.co/datasets/wmt/wmt20_mlqe_task1/tree/0783ed2bd75f44835df4ea664f9ccb85812c8563)
* Size: 7,000 training samples
* Columns: <code>sentence1</code>, <code>sentence2</code>, and <code>score</code>
* Approximate statistics based on the first 1000 samples:
  |         | sentence1 | sentence2 | score |
  |:--------|:----------|:----------|:------|
  | type    | string | string | float |
  | details | <ul><li>min: 12 tokens</li><li>mean: 29.44 tokens</li><li>max: 60 tokens</li></ul> | <ul><li>min: 10 tokens</li><li>mean: 22.38 tokens</li><li>max: 65 tokens</li></ul> | <ul><li>min: 0.01</li><li>mean: 0.68</li><li>max: 1.0</li></ul> |
* Samples:
  | sentence1 | sentence2 | score |
  |:----------|:----------|:------|
  | <code>Orașul va fi împărțit în patru districte, iar suburbiile în 10 mahalale.</code> | <code>The city will be divided into four districts and suburbs into 10 mahalals.</code> | <code>0.4699999988079071</code> |
  | <code>La scurt timp după aceasta, au devenit cunoscute debarcările germane de la Trondheim, Bergen și Stavanger, precum și luptele din Oslofjord.</code> | <code>In the light of the above, the Authority concludes that the aid granted to ADIF is compatible with the internal market pursuant to Article 61 (3) (c) of the EEA Agreement.</code> | <code>0.02666666731238365</code> |
  | <code>Până în vara 1791, în Clubul iacobinilor au dominat reprezentanții monarhismului liberal constituțional.</code> | <code>Until the summer of 1791, representatives of liberal constitutional monarchism dominated in the Jacobins Club.</code> | <code>0.8733333349227905</code> |
* Loss: [<code>CoSENTLoss</code>](https://sbert.net/docs/package_reference/sentence_transformer/losses.html#cosentloss) with these parameters:
  ```json
  {
      "scale": 20.0,
      "similarity_fct": "pairwise_cos_sim"
  }
  ```

#### mlqe_si_en

* Dataset: [mlqe_si_en](https://huggingface.co/datasets/wmt/wmt20_mlqe_task1) at [0783ed2](https://huggingface.co/datasets/wmt/wmt20_mlqe_task1/tree/0783ed2bd75f44835df4ea664f9ccb85812c8563)
* Size: 7,000 training samples
* Columns: <code>sentence1</code>, <code>sentence2</code>, and <code>score</code>
* Approximate statistics based on the first 1000 samples:
  |         | sentence1 | sentence2 | score |
  |:--------|:----------|:----------|:------|
  | type    | string | string | float |
  | details | <ul><li>min: 8 tokens</li><li>mean: 18.19 tokens</li><li>max: 38 tokens</li></ul> | <ul><li>min: 9 tokens</li><li>mean: 22.31 tokens</li><li>max: 128 tokens</li></ul> | <ul><li>min: 0.01</li><li>mean: 0.51</li><li>max: 1.0</li></ul> |
* Samples:
  | sentence1 | sentence2 | score |
  |:----------|:----------|:------|
  | <code>ඇපලෝ 4 සැටර්න් V බූස්ටරයේ ප්‍රථම පර්යේෂණ පියාසැරිය විය.</code> | <code>The first research flight of the Apollo 4 Saturn V Booster.</code> | <code>0.7966666221618652</code> |
  | <code>මෙහි අවපාතය සැලකීමේ දී, මෙහි 48%ක අවරෝහණය $ මිලියන 125කට අධික චිත්‍රපටයක් ලද තෙවන කුඩාම අවපාතය වේ.</code> | <code>In conjunction with the depression here, 48 % of obesity here is the third smallest depression in over $ 125 million film.</code> | <code>0.17666666209697723</code> |
  | <code>එසේම "බකමූණන් මගින් මෙම රාක්ෂසියගේ රාත්‍රී හැසිරීම සංකේතවත් වන බව" පවසයි.</code> | <code>Also "the owl says that this monster's night behavior is symbolic".</code> | <code>0.8799999952316284</code> |
544
+ * Loss: [<code>CoSENTLoss</code>](https://sbert.net/docs/package_reference/sentence_transformer/losses.html#cosentloss) with these parameters:
545
+ ```json
546
+ {
547
+ "scale": 20.0,
548
+ "similarity_fct": "pairwise_cos_sim"
549
+ }
550
+ ```
551
+
552
+ ### Evaluation Datasets
553
+
554
+ #### wmt_da
555
+
556
+ * Dataset: [wmt_da](https://huggingface.co/datasets/RicardoRei/wmt-da-human-evaluation) at [301de38](https://huggingface.co/datasets/RicardoRei/wmt-da-human-evaluation/tree/301de385bf05b0c00a8f4be74965e186164dd425)
557
+ * Size: 1,285,190 evaluation samples
558
+ * Columns: <code>sentence1</code>, <code>sentence2</code>, and <code>score</code>
559
+ * Approximate statistics based on the first 1000 samples:
560
+ | | sentence1 | sentence2 | score |
561
+ |:--------|:----------------------------------------------------------------------------------|:-----------------------------------------------------------------------------------|:---------------------------------------------------------------|
562
+ | type | string | string | float |
563
+ | details | <ul><li>min: 4 tokens</li><li>mean: 38.0 tokens</li><li>max: 128 tokens</li></ul> | <ul><li>min: 4 tokens</li><li>mean: 38.13 tokens</li><li>max: 128 tokens</li></ul> | <ul><li>min: 0.0</li><li>mean: 0.71</li><li>max: 1.0</li></ul> |
564
+ * Samples:
565
+ | sentence1 | sentence2 | score |
566
+ |:-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|:-------------------------------------------------------------------------------------------------------------------------|:------------------|
567
+ | <code>Langmajer v krvi kvůli sázce o pivo?</code> | <code>Langmajer in blood due to a beer bet?</code> | <code>0.51</code> |
568
+ | <code>Detective Inspector Brian O'Hagan said: 'The investigation is in the early stages but I would appeal to anyone who was in the vicinity of John Street in Birkenhead who saw or heard anything suspicious to contact us.</code> | <code>侦探督察布赖恩奥赫干说:"调查是在早期阶段,但我会呼吁任何人谁是在约翰街附近的伯肯黑德谁看到或听到任何可疑的联系我们。</code> | <code>0.65</code> |
569
+ | <code>また、政府として補償措置や人権啓発などの活動に取り組むとしていた。</code> | <code>The government also said it would take activities such as compensation measures and human rights awareness.</code> | <code>0.89</code> |
570
+ * Loss: [<code>CoSENTLoss</code>](https://sbert.net/docs/package_reference/sentence_transformer/losses.html#cosentloss) with these parameters:
571
+ ```json
572
+ {
573
+ "scale": 20.0,
574
+ "similarity_fct": "pairwise_cos_sim"
575
+ }
576
+ ```
577
+
578
+ #### mlqe_en_de
579
+
580
+ * Dataset: [mlqe_en_de](https://huggingface.co/datasets/wmt/wmt20_mlqe_task1) at [0783ed2](https://huggingface.co/datasets/wmt/wmt20_mlqe_task1/tree/0783ed2bd75f44835df4ea664f9ccb85812c8563)
581
+ * Size: 1,000 evaluation samples
582
+ * Columns: <code>sentence1</code>, <code>sentence2</code>, and <code>score</code>
583
+ * Approximate statistics based on the first 1000 samples:
584
+ | | sentence1 | sentence2 | score |
585
+ |:--------|:-----------------------------------------------------------------------------------|:-----------------------------------------------------------------------------------|:----------------------------------------------------------------|
586
+ | type | string | string | float |
587
+ | details | <ul><li>min: 11 tokens</li><li>mean: 24.11 tokens</li><li>max: 49 tokens</li></ul> | <ul><li>min: 11 tokens</li><li>mean: 26.66 tokens</li><li>max: 55 tokens</li></ul> | <ul><li>min: 0.03</li><li>mean: 0.81</li><li>max: 1.0</li></ul> |
588
+ * Samples:
589
+ | sentence1 | sentence2 | score |
590
+ |:----------------------------------------------------------------------------------------------------------------------------------------------------------------|:------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|:--------------------------------|
591
+ | <code>Resuming her patrols, Constitution managed to recapture the American sloop Neutrality on 27 March and, a few days later, the French ship Carteret.</code> | <code>Mit der Wiederaufnahme ihrer Patrouillen gelang es der Verfassung, am 27. März die amerikanische Schleuderneutralität und wenige Tage später das französische Schiff Carteret zurückzuerobern.</code> | <code>0.9033333659172058</code> |
592
+ | <code>Blaine's nomination alienated many Republicans who viewed Blaine as ambitious and immoral.</code> | <code>Blaines Nominierung entfremdete viele Republikaner, die Blaine als ehrgeizig und unmoralisch betrachteten.</code> | <code>0.9216666221618652</code> |
593
+ | <code>This initiated a brief correspondence between the two which quickly descended into political rancor.</code> | <code>Dies leitete eine kurze Korrespondenz zwischen den beiden ein, die schnell zu politischem Groll abstieg.</code> | <code>0.878333330154419</code> |
594
+ * Loss: [<code>CoSENTLoss</code>](https://sbert.net/docs/package_reference/sentence_transformer/losses.html#cosentloss) with these parameters:
595
+ ```json
596
+ {
597
+ "scale": 20.0,
598
+ "similarity_fct": "pairwise_cos_sim"
599
+ }
600
+ ```
601
+
602
+ #### mlqe_en_zh
603
+
604
+ * Dataset: [mlqe_en_zh](https://huggingface.co/datasets/wmt/wmt20_mlqe_task1) at [0783ed2](https://huggingface.co/datasets/wmt/wmt20_mlqe_task1/tree/0783ed2bd75f44835df4ea664f9ccb85812c8563)
605
+ * Size: 1,000 evaluation samples
606
+ * Columns: <code>sentence1</code>, <code>sentence2</code>, and <code>score</code>
607
+ * Approximate statistics based on the first 1000 samples:
608
+ | | sentence1 | sentence2 | score |
609
+ |:--------|:----------------------------------------------------------------------------------|:-----------------------------------------------------------------------------------|:----------------------------------------------------------------|
610
+ | type | string | string | float |
611
+ | details | <ul><li>min: 9 tokens</li><li>mean: 23.75 tokens</li><li>max: 49 tokens</li></ul> | <ul><li>min: 11 tokens</li><li>mean: 29.56 tokens</li><li>max: 67 tokens</li></ul> | <ul><li>min: 0.26</li><li>mean: 0.65</li><li>max: 0.9</li></ul> |
612
+ * Samples:
613
+ | sentence1 | sentence2 | score |
614
+ |:---------------------------------------------------------------------------------------------------------------------|:------------------------------------------------------|:--------------------------------|
615
+ | <code>Freeman briefly stayed with the king before returning to Accra via Whydah, Ahgwey and Little Popo.</code> | <code>弗里曼在经过惠达、阿格威和小波波回到阿克拉之前与国王一起住了一会儿。</code> | <code>0.6683333516120911</code> |
616
+ | <code>Fantastic Fiction "Scratches in the Sky, Ben Peek, Agog!</code> | <code>奇特的虚构 "天空中的碎片 , 本佩克 , 阿戈 !</code> | <code>0.71833336353302</code> |
617
+ | <code>For Hermann Keller, the running quavers and semiquavers "suffuse the setting with health and strength."</code> | <code>对赫尔曼 · 凯勒来说 , 跑步的跳跃者和半跳跃者 "让环境充满健康和力量" 。</code> | <code>0.7066666483879089</code> |
618
+ * Loss: [<code>CoSENTLoss</code>](https://sbert.net/docs/package_reference/sentence_transformer/losses.html#cosentloss) with these parameters:
619
+ ```json
620
+ {
621
+ "scale": 20.0,
622
+ "similarity_fct": "pairwise_cos_sim"
623
+ }
624
+ ```
625
+
626
+ #### mlqe_et_en
627
+
628
+ * Dataset: [mlqe_et_en](https://huggingface.co/datasets/wmt/wmt20_mlqe_task1) at [0783ed2](https://huggingface.co/datasets/wmt/wmt20_mlqe_task1/tree/0783ed2bd75f44835df4ea664f9ccb85812c8563)
629
+ * Size: 1,000 evaluation samples
630
+ * Columns: <code>sentence1</code>, <code>sentence2</code>, and <code>score</code>
631
+ * Approximate statistics based on the first 1000 samples:
632
+ | | sentence1 | sentence2 | score |
633
+ |:--------|:----------------------------------------------------------------------------------|:-----------------------------------------------------------------------------------|:----------------------------------------------------------------|
634
+ | type | string | string | float |
635
+ | details | <ul><li>min: 12 tokens</li><li>mean: 32.4 tokens</li><li>max: 58 tokens</li></ul> | <ul><li>min: 10 tokens</li><li>mean: 24.87 tokens</li><li>max: 47 tokens</li></ul> | <ul><li>min: 0.03</li><li>mean: 0.6</li><li>max: 0.99</li></ul> |
636
+ * Samples:
637
+ | sentence1 | sentence2 | score |
638
+ |:----------------------------------------------------------------------------------------------|:----------------------------------------------------------------------------------------------------------------------|:---------------------------------|
639
+ | <code>Jackson pidas seal kõne, öeldes, et James Brown on tema suurim inspiratsioon.</code> | <code>Jackson gave a speech there saying that James Brown is his greatest inspiration.</code> | <code>0.9833333492279053</code> |
640
+ | <code>Kaanelugu rääkis loo kolme ungarlase üleelamistest Ungari revolutsiooni päevil.</code> | <code>The life of the Man spoke of a story of three Hungarians living in the days of the Hungarian Revolution.</code> | <code>0.28999999165534973</code> |
641
+ | <code>Teise maailmasõja ajal oli ta mitme Saksa juhatusele alluvate eesti väeosa ülem.</code> | <code>During World War II, he was the commander of several of the German leadership.</code> | <code>0.4516666829586029</code> |
642
+ * Loss: [<code>CoSENTLoss</code>](https://sbert.net/docs/package_reference/sentence_transformer/losses.html#cosentloss) with these parameters:
643
+ ```json
644
+ {
645
+ "scale": 20.0,
646
+ "similarity_fct": "pairwise_cos_sim"
647
+ }
648
+ ```
649
+
650
+ #### mlqe_ne_en
651
+
652
+ * Dataset: [mlqe_ne_en](https://huggingface.co/datasets/wmt/wmt20_mlqe_task1) at [0783ed2](https://huggingface.co/datasets/wmt/wmt20_mlqe_task1/tree/0783ed2bd75f44835df4ea664f9ccb85812c8563)
653
+ * Size: 1,000 evaluation samples
654
+ * Columns: <code>sentence1</code>, <code>sentence2</code>, and <code>score</code>
655
+ * Approximate statistics based on the first 1000 samples:
656
+ | | sentence1 | sentence2 | score |
657
+ |:--------|:-----------------------------------------------------------------------------------|:------------------------------------------------------------------------------------|:-----------------------------------------------------------------|
658
+ | type | string | string | float |
659
+ | details | <ul><li>min: 17 tokens</li><li>mean: 41.03 tokens</li><li>max: 85 tokens</li></ul> | <ul><li>min: 10 tokens</li><li>mean: 24.77 tokens</li><li>max: 128 tokens</li></ul> | <ul><li>min: 0.05</li><li>mean: 0.36</li><li>max: 0.92</li></ul> |
660
+ * Samples:
661
+ | sentence1 | sentence2 | score |
662
+ |:------------------------------------------------------------------------------------------|:---------------------------------------------------------------------------------------|:---------------------------------|
663
+ | <code>१८९२ तिर भवानीदत्त पाण्डेले 'मुद्रा राक्षस'को अनुवाद गरे।</code> | <code>Around 1892, Bhavani Pandit translated the 'money monster'.</code> | <code>0.8416666388511658</code> |
664
+ | <code>यस बच्चाको मुखले आमाको स्तन यस बच्चाको मुखले आमाको स्तन राम्ररी च्यापेको छ ।</code> | <code>The breasts of this child's mouth are taped well with the mother's mouth.</code> | <code>0.2150000035762787</code> |
665
+ | <code>बुवाको बन्दुक चोरेर हिँडेका बराललाई केआई सिंहले अब गोली ल्याउन लगाए ।...</code> | <code>Kei Singh, who stole the boy's closet, took the bullet to bring it now..</code> | <code>0.27000001072883606</code> |
666
+ * Loss: [<code>CoSENTLoss</code>](https://sbert.net/docs/package_reference/sentence_transformer/losses.html#cosentloss) with these parameters:
667
+ ```json
668
+ {
669
+ "scale": 20.0,
670
+ "similarity_fct": "pairwise_cos_sim"
671
+ }
672
+ ```
673
+
674
+ #### mlqe_ro_en
675
+
676
+ * Dataset: [mlqe_ro_en](https://huggingface.co/datasets/wmt/wmt20_mlqe_task1) at [0783ed2](https://huggingface.co/datasets/wmt/wmt20_mlqe_task1/tree/0783ed2bd75f44835df4ea664f9ccb85812c8563)
677
+ * Size: 1,000 evaluation samples
678
+ * Columns: <code>sentence1</code>, <code>sentence2</code>, and <code>score</code>
679
+ * Approximate statistics based on the first 1000 samples:
680
+ | | sentence1 | sentence2 | score |
681
+ |:--------|:-----------------------------------------------------------------------------------|:---------------------------------------------------------------------------------|:----------------------------------------------------------------|
682
+ | type | string | string | float |
683
+ | details | <ul><li>min: 14 tokens</li><li>mean: 30.25 tokens</li><li>max: 59 tokens</li></ul> | <ul><li>min: 6 tokens</li><li>mean: 22.7 tokens</li><li>max: 55 tokens</li></ul> | <ul><li>min: 0.01</li><li>mean: 0.68</li><li>max: 1.0</li></ul> |
684
+ * Samples:
685
+ | sentence1 | sentence2 | score |
686
+ |:----------------------------------------------------------------------------------------------------------------------------------------------------|:------------------------------------------------------------------------------------------------------------------------------------|:----------------------------------|
687
+ | <code>Cornwallis se afla înconjurat pe uscat de forțe armate net superioare și retragerea pe mare era îndoielnică din cauza flotei franceze.</code> | <code>Cornwallis was surrounded by shore by higher armed forces and the sea withdrawal was doubtful due to the French fleet.</code> | <code>0.8199999928474426</code> |
688
+ | <code>thumbrightuprightDansatori [[cretani de muzică tradițională.</code> | <code>Number of employees employed in the production of the like product in the Union.</code> | <code>0.009999999776482582</code> |
689
+ | <code>Potrivit documentelor vremii și tradiției orale, aceasta a fost cea mai grea perioadă din istoria orașului.</code> | <code>According to the documents of the oral weather and tradition, this was the hardest period in the city's history.</code> | <code>0.5383332967758179</code> |
690
+ * Loss: [<code>CoSENTLoss</code>](https://sbert.net/docs/package_reference/sentence_transformer/losses.html#cosentloss) with these parameters:
691
+ ```json
692
+ {
693
+ "scale": 20.0,
694
+ "similarity_fct": "pairwise_cos_sim"
695
+ }
696
+ ```
697
+
698
+ #### mlqe_si_en
699
+
700
+ * Dataset: [mlqe_si_en](https://huggingface.co/datasets/wmt/wmt20_mlqe_task1) at [0783ed2](https://huggingface.co/datasets/wmt/wmt20_mlqe_task1/tree/0783ed2bd75f44835df4ea664f9ccb85812c8563)
701
+ * Size: 1,000 evaluation samples
702
+ * Columns: <code>sentence1</code>, <code>sentence2</code>, and <code>score</code>
703
+ * Approximate statistics based on the first 1000 samples:
704
+ | | sentence1 | sentence2 | score |
705
+ |:--------|:----------------------------------------------------------------------------------|:-----------------------------------------------------------------------------------|:-----------------------------------------------------------------|
706
+ | type | string | string | float |
707
+ | details | <ul><li>min: 8 tokens</li><li>mean: 18.12 tokens</li><li>max: 36 tokens</li></ul> | <ul><li>min: 7 tokens</li><li>mean: 22.18 tokens</li><li>max: 128 tokens</li></ul> | <ul><li>min: 0.03</li><li>mean: 0.51</li><li>max: 0.99</li></ul> |
708
+ * Samples:
709
+ | sentence1 | sentence2 | score |
710
+ |:----------------------------------------------------------------------------------------------------------------------------------------|:----------------------------------------------------------------------------------------------|:--------------------------------|
711
+ | <code>එයට ශි්‍ර ලංකාවේ සාමය ඇති කිරිමටත් නැති කිරිමටත් පුළුවන්.</code> | <code>It can also cause peace in Sri Lanka.</code> | <code>0.3199999928474426</code> |
712
+ | <code>ඔහු මනෝ විද්‍යාව, සමාජ විද්‍යාව, ඉතිහාසය හා සන්නිවේදනය යන විෂය ක්ෂේත්‍රයන් පිලිබදවද අධ්‍යයනයන් සිදු කිරීමට උත්සාහ කරන ලදි.</code> | <code>He attempted to do subjects in psychology, sociology, history and communication.</code> | <code>0.5366666913032532</code> |
713
+ | <code>එහෙත් කිසිදු මිනිසෙක්‌ හෝ ගැහැනියෙක්‌ එලිමහනක නොවූහ.</code> | <code>But no man or woman was eliminated.</code> | <code>0.2783333361148834</code> |
714
+ * Loss: [<code>CoSENTLoss</code>](https://sbert.net/docs/package_reference/sentence_transformer/losses.html#cosentloss) with these parameters:
715
+ ```json
716
+ {
717
+ "scale": 20.0,
718
+ "similarity_fct": "pairwise_cos_sim"
719
+ }
720
+ ```
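Every dataset above is trained and evaluated with CoSENTLoss using `scale: 20.0` and pairwise cosine similarity. As a rough illustration of what those parameters do — a minimal pure-Python sketch, not the sentence-transformers implementation — CoSENT penalizes every pair of examples whose cosine similarities are ordered differently from their gold scores:

```python
import math

def cosine(u, v):
    # plain cosine similarity between two vectors
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def cosent_loss(emb1, emb2, scores, scale=20.0):
    """CoSENT: log(1 + sum over pairs (i, j) with score_i < score_j
    of exp(scale * (cos_i - cos_j)))."""
    cos = [cosine(a, b) for a, b in zip(emb1, emb2)]
    terms = [
        math.exp(scale * (cos[i] - cos[j]))
        for i in range(len(cos))
        for j in range(len(cos))
        if scores[i] < scores[j]
    ]
    return math.log(1.0 + sum(terms))
```

When pairs with higher gold scores also have higher cosine similarity, every exponent is negative and the loss approaches log(1) = 0; misordered pairs are penalized exponentially, with `scale` controlling how sharply.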
721
+
722
+ ### Training Hyperparameters
723
+ #### Non-Default Hyperparameters
724
+
725
+ - `eval_strategy`: steps
726
+ - `per_device_train_batch_size`: 64
727
+ - `per_device_eval_batch_size`: 64
728
+ - `num_train_epochs`: 2
729
+ - `warmup_ratio`: 0.1
730
+
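With `warmup_ratio: 0.1` and the default `linear` scheduler, the learning rate ramps from 0 to the base value (5e-05) over the first 10% of training steps, then decays linearly back to 0. A minimal sketch of that schedule, assuming the standard linear-warmup-plus-linear-decay behavior (not the Hugging Face implementation itself):

```python
def linear_warmup_lr(step, total_steps, base_lr=5e-5, warmup_ratio=0.1):
    """Learning rate at a given step for linear warmup + linear decay."""
    warmup_steps = int(total_steps * warmup_ratio)
    if step < warmup_steps:
        # ramp up from 0 to base_lr over the warmup phase
        return base_lr * step / max(1, warmup_steps)
    # decay linearly from base_lr down to 0 over the remaining steps
    remaining = total_steps - step
    return base_lr * max(0.0, remaining / max(1, total_steps - warmup_steps))
```

The warmup phase avoids large early updates while the pooling and dense layers are still poorly calibrated.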
731
+ #### All Hyperparameters
732
+ <details><summary>Click to expand</summary>
733
+
734
+ - `overwrite_output_dir`: False
735
+ - `do_predict`: False
736
+ - `eval_strategy`: steps
737
+ - `prediction_loss_only`: True
738
+ - `per_device_train_batch_size`: 64
739
+ - `per_device_eval_batch_size`: 64
740
+ - `per_gpu_train_batch_size`: None
741
+ - `per_gpu_eval_batch_size`: None
742
+ - `gradient_accumulation_steps`: 1
743
+ - `eval_accumulation_steps`: None
744
+ - `torch_empty_cache_steps`: None
745
+ - `learning_rate`: 5e-05
746
+ - `weight_decay`: 0.0
747
+ - `adam_beta1`: 0.9
748
+ - `adam_beta2`: 0.999
749
+ - `adam_epsilon`: 1e-08
750
+ - `max_grad_norm`: 1.0
751
+ - `num_train_epochs`: 2
752
+ - `max_steps`: -1
753
+ - `lr_scheduler_type`: linear
754
+ - `lr_scheduler_kwargs`: {}
755
+ - `warmup_ratio`: 0.1
756
+ - `warmup_steps`: 0
757
+ - `log_level`: passive
758
+ - `log_level_replica`: warning
759
+ - `log_on_each_node`: True
760
+ - `logging_nan_inf_filter`: True
761
+ - `save_safetensors`: True
762
+ - `save_on_each_node`: False
763
+ - `save_only_model`: False
764
+ - `restore_callback_states_from_checkpoint`: False
765
+ - `no_cuda`: False
766
+ - `use_cpu`: False
767
+ - `use_mps_device`: False
768
+ - `seed`: 42
769
+ - `data_seed`: None
770
+ - `jit_mode_eval`: False
771
+ - `use_ipex`: False
772
+ - `bf16`: False
773
+ - `fp16`: False
774
+ - `fp16_opt_level`: O1
775
+ - `half_precision_backend`: auto
776
+ - `bf16_full_eval`: False
777
+ - `fp16_full_eval`: False
778
+ - `tf32`: None
779
+ - `local_rank`: 0
780
+ - `ddp_backend`: None
781
+ - `tpu_num_cores`: None
782
+ - `tpu_metrics_debug`: False
783
+ - `debug`: []
784
+ - `dataloader_drop_last`: False
785
+ - `dataloader_num_workers`: 0
786
+ - `dataloader_prefetch_factor`: None
787
+ - `past_index`: -1
788
+ - `disable_tqdm`: False
789
+ - `remove_unused_columns`: True
790
+ - `label_names`: None
791
+ - `load_best_model_at_end`: False
792
+ - `ignore_data_skip`: False
793
+ - `fsdp`: []
794
+ - `fsdp_min_num_params`: 0
795
+ - `fsdp_config`: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
796
+ - `fsdp_transformer_layer_cls_to_wrap`: None
797
+ - `accelerator_config`: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
798
+ - `deepspeed`: None
799
+ - `label_smoothing_factor`: 0.0
800
+ - `optim`: adamw_torch
801
+ - `optim_args`: None
802
+ - `adafactor`: False
803
+ - `group_by_length`: False
804
+ - `length_column_name`: length
805
+ - `ddp_find_unused_parameters`: None
806
+ - `ddp_bucket_cap_mb`: None
807
+ - `ddp_broadcast_buffers`: False
808
+ - `dataloader_pin_memory`: True
809
+ - `dataloader_persistent_workers`: False
810
+ - `skip_memory_metrics`: True
811
+ - `use_legacy_prediction_loop`: False
812
+ - `push_to_hub`: False
813
+ - `resume_from_checkpoint`: None
814
+ - `hub_model_id`: None
815
+ - `hub_strategy`: every_save
816
+ - `hub_private_repo`: False
817
+ - `hub_always_push`: False
818
+ - `gradient_checkpointing`: False
819
+ - `gradient_checkpointing_kwargs`: None
820
+ - `include_inputs_for_metrics`: False
821
+ - `include_for_metrics`: []
822
+ - `eval_do_concat_batches`: True
823
+ - `fp16_backend`: auto
824
+ - `push_to_hub_model_id`: None
825
+ - `push_to_hub_organization`: None
826
+ - `mp_parameters`:
827
+ - `auto_find_batch_size`: False
828
+ - `full_determinism`: False
829
+ - `torchdynamo`: None
830
+ - `ray_scope`: last
831
+ - `ddp_timeout`: 1800
832
+ - `torch_compile`: False
833
+ - `torch_compile_backend`: None
834
+ - `torch_compile_mode`: None
835
+ - `dispatch_batches`: None
836
+ - `split_batches`: None
837
+ - `include_tokens_per_second`: False
838
+ - `include_num_input_tokens_seen`: False
839
+ - `neftune_noise_alpha`: None
840
+ - `optim_target_modules`: None
841
+ - `batch_eval_metrics`: False
842
+ - `eval_on_start`: False
843
+ - `use_liger_kernel`: False
844
+ - `eval_use_gather_object`: False
845
+ - `average_tokens_across_devices`: False
846
+ - `prompts`: None
847
+ - `batch_sampler`: batch_sampler
848
+ - `multi_dataset_batch_sampler`: proportional
849
+
850
+ </details>
851
+
852
+ ### Training Logs
853
+ | Epoch | Step | Training Loss | wmt da loss | mlqe en de loss | mlqe en zh loss | mlqe et en loss | mlqe ne en loss | mlqe ro en loss | mlqe si en loss | sts-eval_spearman_cosine | sts-test_spearman_cosine |
854
+ |:-----:|:-----:|:-------------:|:-----------:|:---------------:|:---------------:|:---------------:|:---------------:|:---------------:|:---------------:|:------------------------:|:------------------------:|
855
+ | 0.4 | 6690 | 9.3414 | 7.5667 | 7.5538 | 7.5468 | 7.4966 | 7.5247 | 7.4379 | 7.5499 | 0.2263 | - |
856
+ | 0.8 | 13380 | 7.5636 | 7.5622 | 7.5517 | 7.5412 | 7.4917 | 7.5199 | 7.4313 | 7.5437 | 0.2703 | - |
857
+ | 1.2 | 20070 | 7.5579 | 7.5599 | 7.5515 | 7.5430 | 7.4876 | 7.5155 | 7.4235 | 7.5431 | 0.2693 | - |
858
+ | 1.6 | 26760 | 7.5556 | 7.5591 | 7.5501 | 7.5401 | 7.4876 | 7.5156 | 7.4202 | 7.5422 | 0.2707 | - |
859
+ | 2.0 | 33450 | 7.5527 | 7.5585 | 7.5498 | 7.5409 | 7.4837 | 7.5148 | 7.4185 | 7.5410 | 0.2814 | 0.3131 |
860
+
861
+
862
+ ### Framework Versions
863
+ - Python: 3.12.3
864
+ - Sentence Transformers: 3.3.1
865
+ - Transformers: 4.46.3
866
+ - PyTorch: 2.5.1+cu124
867
+ - Accelerate: 1.1.1
868
+ - Datasets: 3.1.0
869
+ - Tokenizers: 0.20.3
870
+
871
+ ## Citation
872
+
873
+ ### BibTeX
874
+
875
+ #### Sentence Transformers
876
+ ```bibtex
877
+ @inproceedings{reimers-2019-sentence-bert,
878
+ title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
879
+ author = "Reimers, Nils and Gurevych, Iryna",
880
+ booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
881
+ month = "11",
882
+ year = "2019",
883
+ publisher = "Association for Computational Linguistics",
884
+ url = "https://arxiv.org/abs/1908.10084",
885
+ }
886
+ ```
887
+
888
+ #### CoSENTLoss
889
+ ```bibtex
890
+ @online{kexuefm-8847,
891
+ title={CoSENT: A more efficient sentence vector scheme than Sentence-BERT},
892
+ author={Su Jianlin},
893
+ year={2022},
894
+ month={Jan},
895
+ url={https://kexue.fm/archives/8847},
896
+ }
897
+ ```
898
+
899
+ <!--
900
+ ## Glossary
901
+
902
+ *Clearly define terms in order to be accessible across audiences.*
903
+ -->
904
+
905
+ <!--
906
+ ## Model Card Authors
907
+
908
+ *Lists the people who create the model card, providing recognition and accountability for the detailed work that goes into its construction.*
909
+ -->
910
+
911
+ <!--
912
+ ## Model Card Contact
913
+
914
+ *Provides a way for people who have updates to the Model Card, suggestions, or questions, to contact the Model Card authors.*
915
+ -->
config.json ADDED
@@ -0,0 +1,26 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ {
2
+ "_name_or_path": "./output/pre_training_generalized_model-2024-12-13_15-32-01",
3
+ "activation": "gelu",
4
+ "architectures": [
5
+ "DistilBertModel"
6
+ ],
7
+ "attention_dropout": 0.1,
8
+ "dim": 768,
9
+ "dropout": 0.1,
10
+ "hidden_dim": 3072,
11
+ "initializer_range": 0.02,
12
+ "max_position_embeddings": 512,
13
+ "model_type": "distilbert",
14
+ "n_heads": 12,
15
+ "n_layers": 6,
16
+ "output_hidden_states": true,
17
+ "output_past": true,
18
+ "pad_token_id": 0,
19
+ "qa_dropout": 0.1,
20
+ "seq_classif_dropout": 0.2,
21
+ "sinusoidal_pos_embds": false,
22
+ "tie_weights_": true,
23
+ "torch_dtype": "float32",
24
+ "transformers_version": "4.46.3",
25
+ "vocab_size": 119547
26
+ }
config_sentence_transformers.json ADDED
@@ -0,0 +1,10 @@
 
 
 
 
 
 
 
 
 
 
 
1
+ {
2
+ "__version__": {
3
+ "sentence_transformers": "3.3.1",
4
+ "transformers": "4.46.3",
5
+ "pytorch": "2.5.1+cu124"
6
+ },
7
+ "prompts": {},
8
+ "default_prompt_name": null,
9
+ "similarity_fn_name": "cosine"
10
+ }
model.safetensors ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:e8c2aed21297045330bd7c36ad1fee2ca8a7c527ac94cf19d20c3dd2bee564d7
3
+ size 538947416
modules.json ADDED
@@ -0,0 +1,20 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ [
2
+ {
3
+ "idx": 0,
4
+ "name": "0",
5
+ "path": "",
6
+ "type": "sentence_transformers.models.Transformer"
7
+ },
8
+ {
9
+ "idx": 1,
10
+ "name": "1",
11
+ "path": "1_MultiHeadGeneralizedPooling",
12
+ "type": "sentence_pooling.multihead_generalized_pooling.MultiHeadGeneralizedPooling"
13
+ },
14
+ {
15
+ "idx": 2,
16
+ "name": "2",
17
+ "path": "2_Dense",
18
+ "type": "sentence_transformers.models.Dense"
19
+ }
20
+ ]
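The modules above chain a DistilBERT encoder, an 8-head generalized pooling layer from the repo's custom `sentence_pooling` package, and a Tanh dense projection from 768 to 512 dimensions. As a rough, hypothetical sketch of attention-style multi-head pooling — the actual `MultiHeadGeneralizedPooling` code may differ — each head scores its slice of every token embedding against a learned query vector and returns an attention-weighted average of those slices:

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def multihead_pooling(tokens, queries):
    """tokens: seq_len x dim token embeddings; queries: one learned
    query per head, each of length dim // num_heads (here passed in
    explicitly; in the real module they would be trained parameters)."""
    num_heads = len(queries)
    dim = len(tokens[0])
    head_dim = dim // num_heads
    pooled = []
    for h, q in enumerate(queries):
        # this head's slice of every token embedding
        sub = [t[h * head_dim:(h + 1) * head_dim] for t in tokens]
        scores = [sum(a * b for a, b in zip(v, q)) for v in sub]
        weights = softmax(scores)
        # attention-weighted average of the head's token slices
        pooled.extend(
            sum(weights[i] * sub[i][k] for i in range(len(sub)))
            for k in range(head_dim)
        )
    return pooled
```

Concatenating the per-head averages keeps the output at the full token dimension (768), matching the `sentence_dim`/`token_dim` values in `1_MultiHeadGeneralizedPooling/config.json`.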
sentence_bert_config.json ADDED
@@ -0,0 +1,4 @@
 
 
 
 
 
1
+ {
2
+ "max_seq_length": 128,
3
+ "do_lower_case": false
4
+ }
special_tokens_map.json ADDED
@@ -0,0 +1,37 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ {
2
+ "cls_token": {
3
+ "content": "[CLS]",
4
+ "lstrip": false,
5
+ "normalized": false,
6
+ "rstrip": false,
7
+ "single_word": false
8
+ },
9
+ "mask_token": {
10
+ "content": "[MASK]",
11
+ "lstrip": false,
12
+ "normalized": false,
13
+ "rstrip": false,
14
+ "single_word": false
15
+ },
16
+ "pad_token": {
17
+ "content": "[PAD]",
18
+ "lstrip": false,
19
+ "normalized": false,
20
+ "rstrip": false,
21
+ "single_word": false
22
+ },
23
+ "sep_token": {
24
+ "content": "[SEP]",
25
+ "lstrip": false,
26
+ "normalized": false,
27
+ "rstrip": false,
28
+ "single_word": false
29
+ },
30
+ "unk_token": {
31
+ "content": "[UNK]",
32
+ "lstrip": false,
33
+ "normalized": false,
34
+ "rstrip": false,
35
+ "single_word": false
36
+ }
37
+ }
tokenizer.json ADDED
The diff for this file is too large to render. See raw diff
 
tokenizer_config.json ADDED
@@ -0,0 +1,66 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ {
2
+ "added_tokens_decoder": {
3
+ "0": {
4
+ "content": "[PAD]",
5
+ "lstrip": false,
6
+ "normalized": false,
7
+ "rstrip": false,
8
+ "single_word": false,
9
+ "special": true
10
+ },
11
+ "100": {
12
+ "content": "[UNK]",
13
+ "lstrip": false,
14
+ "normalized": false,
15
+ "rstrip": false,
16
+ "single_word": false,
17
+ "special": true
18
+ },
19
+ "101": {
20
+ "content": "[CLS]",
21
+ "lstrip": false,
22
+ "normalized": false,
23
+ "rstrip": false,
24
+ "single_word": false,
25
+ "special": true
26
+ },
27
+ "102": {
28
+ "content": "[SEP]",
29
+ "lstrip": false,
30
+ "normalized": false,
31
+ "rstrip": false,
32
+ "single_word": false,
33
+ "special": true
34
+ },
35
+ "103": {
36
+ "content": "[MASK]",
37
+ "lstrip": false,
38
+ "normalized": false,
39
+ "rstrip": false,
40
+ "single_word": false,
41
+ "special": true
42
+ }
43
+ },
44
+ "clean_up_tokenization_spaces": false,
45
+ "cls_token": "[CLS]",
46
+ "do_basic_tokenize": true,
47
+ "do_lower_case": false,
48
+ "full_tokenizer_file": null,
49
+ "mask_token": "[MASK]",
50
+ "max_len": 512,
51
+ "max_length": 128,
52
+ "model_max_length": 128,
53
+ "never_split": null,
54
+ "pad_to_multiple_of": null,
55
+ "pad_token": "[PAD]",
56
+ "pad_token_type_id": 0,
57
+ "padding_side": "right",
58
+ "sep_token": "[SEP]",
59
+ "stride": 0,
60
+ "strip_accents": null,
61
+ "tokenize_chinese_chars": true,
62
+ "tokenizer_class": "DistilBertTokenizer",
63
+ "truncation_side": "right",
64
+ "truncation_strategy": "longest_first",
65
+ "unk_token": "[UNK]"
66
+ }
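The tokenizer caps inputs at 128 tokens with `truncation_strategy: longest_first`. For sentence pairs, this strategy trims one token at a time from whichever sequence is currently longer until the pair fits. A minimal sketch of that rule, with plain lists standing in for real token IDs:

```python
def truncate_longest_first(seq_a, seq_b, max_len):
    """Trim tokens from the end of the longer sequence until the
    combined length fits within max_len."""
    a, b = list(seq_a), list(seq_b)
    while len(a) + len(b) > max_len:
        if len(a) >= len(b):
            a.pop()
        else:
            b.pop()
    return a, b
```

In practice the 128-token budget also covers the `[CLS]` and `[SEP]` special tokens that wrap the pair, so slightly fewer content tokens survive than this sketch suggests.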
vocab.txt ADDED
The diff for this file is too large to render. See raw diff