davidadamczyk committed
Commit 48803a2 · verified · 1 parent: 2fcf154

Add SetFit model
1_Pooling/config.json ADDED
@@ -0,0 +1,10 @@
+ {
+   "word_embedding_dimension": 768,
+   "pooling_mode_cls_token": false,
+   "pooling_mode_mean_tokens": true,
+   "pooling_mode_max_tokens": false,
+   "pooling_mode_mean_sqrt_len_tokens": false,
+   "pooling_mode_weightedmean_tokens": false,
+   "pooling_mode_lasttoken": false,
+   "include_prompt": true
+ }
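
This pooling configuration selects plain mean pooling over the 768-dimensional token embeddings (CLS, max, and last-token modes are all disabled). As a hedged illustration, not code from this repo, mean pooling amounts to:

```python
import torch

def mean_pool(token_embeddings: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    # token_embeddings: (batch, seq_len, 768); attention_mask: (batch, seq_len)
    mask = attention_mask.unsqueeze(-1).float()    # broadcast mask over the hidden dim
    summed = (token_embeddings * mask).sum(dim=1)  # sum of non-padding token vectors
    counts = mask.sum(dim=1).clamp(min=1e-9)       # number of real tokens, avoid /0
    return summed / counts                         # (batch, 768) sentence embedding
```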
README.md ADDED
@@ -0,0 +1,262 @@
+ ---
+ base_model: sentence-transformers/all-mpnet-base-v2
+ library_name: setfit
+ metrics:
+ - accuracy
+ pipeline_tag: text-classification
+ tags:
+ - setfit
+ - sentence-transformers
+ - text-classification
+ - generated_from_setfit_trainer
+ widget:
+ - text: 'As soon as I saw my human pull out my favorite red leash, my tail started
+     wagging and I started barking enthusiastically. I had been waiting all day to
+     go to my favorite place in the whole world, outside. When my human clipped my
+     leash to my collar, I felt my heart sing with joy. Finally! As soon as I stepped
+     out the door, I felt the cool autumn breeze wash over me like a wave. The squirrels
+     scattered inside of the trees and the birds whistled in harmonies that floated
+     through the autumn breeze. The trees were a deep, rich emerald color that drifted
+     me off into a new universe. That is what I loved about the outside, it was always
+     so peaceful and serene. We walked a few blocks admiring nature''s beauty until
+     we suddenly halted to a stop. I looked around but found nothing that looked out
+     of the ordinary. My human opened the car door and placed me in the back seat.
+     My heart started to beat so fast I thought it would burst out of my chest and
+     my mind was racing. Where are we going? Millions of dreadful thoughts popped into
+     my brain. By the time we arrived, my fur was soaked with sweat. As soon as I walked
+     out the door, I stopped in my tracks. In front of me was a place I can only describe
+     as paradise. Behind the white gate, there were clusters of dogs and rubber balls
+     crowding the green grass. What more could a dog ever dream of? My heart sang with
+     joy as I stepped through the gate. I knew then that this would be the best day
+     of my life.
+
+     '
+ - text: 'Rock bottom interest rates and easy money, maybe. But many of these truly
+     tech companies like Microsoft, Apple, Facebook and so on have huge cash reserves. I
+     live in Gatesville, Seattle, and I will offer another explanation or at least
+     a contributing factor. A senior software engineer at Microsoft makes anywhere
+     from a new hire at $250K per year with gold plated benefits up to $500K per year
+     for someone with a few years under their belt. Microsoft hires numerous "independent
+     contractors" at half or less than what they pay full time employees also with
+     substantially lesser benefits who work from home. Look for them to increase their
+     base of independent contractors as long as the government lets them get away with
+     it.
+
+     '
+ - text: '“Amid this dynamic environment, we delivered record results in fiscal year
+     2022: We reported $198 billion in revenue and $83 billion in operating income.
+     And the Microsoft Cloud surpassed $100 billion in annualized revenue for the first
+     time.”- From Microsoft’s 2022 Annual Report Shareholder’s Letter
+
+     '
+ - text: 'Paresh Y Murudkar Hypothesis: Google wants it leaked. OpenAI has by being
+     public acquired huge amount of attention. Although Google will likely achieve
+     partity with OpenAI shortly, their immediate danger is to become the default definition
+     of the technology. Microsoft found out years ago that even though Bing had reached
+     technical parity with Google, the public had been convinced to search for something
+     was to "Google It.''Thus, Google has to ghet out there with its own stuff, before
+     the "GPT It" because the next generation term for search.
+
+     '
+ - text: 'Mor -- You sound like someone who has never experienced real hardship. Your
+     idea that homelessness is a "lifestyle", as if it were freely chosen, suggests
+     you have never been there. Try to imagine this: Your employer has a big layoff,
+     and with two week''s severance, you lose your job. For a while, you get by on
+     unemployment and your spouse''s part-time income. But then unemployment runs out
+     because your industry has tanked in your state. You search fruitlessly for a job,
+     and begin to get really depressed. Your spouse is diagnosed with cancer, and to
+     pay for their treatment, you sell your modest home and move in with your brother-in-law
+     and his family, living in their basement, sharing their one bathroom. Your teenage
+     child who has been uprooted to a new town and school starts taking drugs and acting
+     out, getting arrested, coming home really late, making a lot of noise, being very
+     depressed and angry at everyone. The brother-in-law says his sister with cancer
+     can stay but your teen cannot. You two move into another relative''s basement,
+     but that doesn''t last long. Your teen disappears, leaves a note "I can''t stand
+     it anymore. Sorry, love you, gotta go." You run out of your last cash sending
+     it to help your wife. The relative can''t afford to feed you. You end up on the
+     street. Open your mind.
+
+     '
+ inference: true
+ model-index:
+ - name: SetFit with sentence-transformers/all-mpnet-base-v2
+   results:
+   - task:
+       type: text-classification
+       name: Text Classification
+     dataset:
+       name: Unknown
+       type: unknown
+       split: test
+     metrics:
+     - type: accuracy
+       value: 1.0
+       name: Accuracy
+ ---
+
+ # SetFit with sentence-transformers/all-mpnet-base-v2
+
+ This is a [SetFit](https://github.com/huggingface/setfit) model that can be used for Text Classification. This SetFit model uses [sentence-transformers/all-mpnet-base-v2](https://huggingface.co/sentence-transformers/all-mpnet-base-v2) as the Sentence Transformer embedding model. A [LogisticRegression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html) instance is used for classification.
+
+ The model has been trained using an efficient few-shot learning technique that involves:
+
+ 1. Fine-tuning a [Sentence Transformer](https://www.sbert.net) with contrastive learning.
+ 2. Training a classification head with features from the fine-tuned Sentence Transformer.
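+
+ A minimal sketch of this procedure with the `setfit` trainer API (hedged: the few-shot examples below are placeholders, not this model's actual training data; the exact hyperparameters are listed under Training Details):
+
+ ```python
+ from datasets import Dataset
+ from setfit import SetFitModel, Trainer
+
+ # Start from the same embedding body this model uses; label strings match config_setfit.json.
+ model = SetFitModel.from_pretrained(
+     "sentence-transformers/all-mpnet-base-v2", labels=["no", "yes"]
+ )
+
+ # Placeholder few-shot data, two examples per class.
+ train_dataset = Dataset.from_dict({
+     "text": ["A comment about the topic.", "Another on-topic comment.",
+              "An unrelated comment.", "Another unrelated comment."],
+     "label": ["yes", "yes", "no", "no"],
+ })
+
+ trainer = Trainer(model=model, train_dataset=train_dataset)
+ # Step 1 (contrastive fine-tuning of the body) and step 2 (fitting the
+ # logistic-regression head on its embeddings) both run inside train().
+ trainer.train()
+ ```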
+
+ ## Model Details
+
+ ### Model Description
+ - **Model Type:** SetFit
+ - **Sentence Transformer body:** [sentence-transformers/all-mpnet-base-v2](https://huggingface.co/sentence-transformers/all-mpnet-base-v2)
+ - **Classification head:** a [LogisticRegression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html) instance
+ - **Maximum Sequence Length:** 384 tokens
+ - **Number of Classes:** 2 classes
+ <!-- - **Training Dataset:** [Unknown](https://huggingface.co/datasets/unknown) -->
+ <!-- - **Language:** Unknown -->
+ <!-- - **License:** Unknown -->
+
+ ### Model Sources
+
+ - **Repository:** [SetFit on GitHub](https://github.com/huggingface/setfit)
+ - **Paper:** [Efficient Few-Shot Learning Without Prompts](https://arxiv.org/abs/2209.11055)
+ - **Blogpost:** [SetFit: Efficient Few-Shot Learning Without Prompts](https://huggingface.co/blog/setfit)
+
+ ### Model Labels
+ | Label | Examples |
+ |:------|:---------|
+ | yes | <ul><li>'MS: Invests $10B into ChatGPT and then immediately lays off 10,000 workers to pay for it.\n'</li><li>'Skepticism aside, it\'s way too late to stop or even realistically control A.I. The genie is literally out of the bottle, with more sophisticated iterations of A.I. to come. There\'s too much financial momentum behind it. OpenAI, the research lab behind the viral ChatGPT chatbot, is in talks to sell existing shares in a tender offer that would value the company at around $29 billion, making it one of the most valuable U.S. startups on paper. Microsoft Corp. has also been in advanced talks to increase its investment in OpenAI. In 2019, Microsoft invested $1 billion in OpenAI and became its preferred partner for commercializing new technologies for services like the search engine Bing and the design app Microsoft Design. Other backers include Tesla CEO Elon Musk, LinkedIn co-founder Reid Hoffman. There are over 100 AI companies developing various Machine learning tasks, new features coming daily. ChatGPT is a genuine productivity boost and a technological wonder. It can write code in Python, TypeScript, and many other languages at my command. It does have bugs in the code, but they are fixable. The possibilities are endless. I can\'t imagine what version 2.0 or 3.0 would look like. For better and/or worse, this is the future. It is incredible, even at this early stage. This technology is mind-blowing and will unquestionably change the world. As Victor Hugo said, " A force more powerful than all of the armies in the world is an idea whose time has come." Indeed it has.\n'</li><li>'Microsoft Bets Big on the Creator of ChatGPT in Race to Dominate A.I. As a new chatbot wows the world with its conversational talents, a resurgent tech giant is poised to reap the benefits while doubling down on a relationship with the start-up OpenAI. When a chatbot called ChatGPT hit the internet late last year, executives at a number of Silicon Valley companies worried they were suddenly dealing with new artificial intelligence technology that could disrupt their businesses. As a new chatbot wows the world with its conversational talents, a resurgent tech giant is poised to reap the benefits while doubling down on a relationship with the start-up OpenAI.\n'</li></ul> |
+ | no | <ul><li>"The tragedy of this war, any war, is overwhelming. A city of 100,000 reduced to ruble and the smell of corpses. One can easily imagine all the families who went about their lives prior to the invasion. Schools ringing with children sounds. Shops and eateries filled with patrons, exchanging smiles, saying hello, friends getting together. Homes secure, places of family warmth, humor, love. All gone. Gone in this lifetime. Gone in the blink of a mad man's perverted notion of his needs. We have our mad men and women too - in our congress. We just saw their shameful show. Just the appetizer for a lousy meal to come. In response to the brave Ukrainians who resist, who fight and die, will the mad ones in the new congress stand for freedom or turn away?Will they do as the French did 250 years ago when they came to our aid against a king or will they allow King Putin to have his way?Americans have freedom in their blood. Make that blood boil if this congress forgets that and turns its back on the fight against a king.\n"</li><li>'The dangers of gas stoves are found in only a few studies funded by anti-fossil fuel groups. Anyone who distrusts studies by Exxon, big pharma, big tobacco, should be skeptical of these as well."The science" (tm) does not support these studies that proport to say that gas stoves are a specific problem. NO(x) forms at 2800 F under high pressure, and typically from Nitrogen in the fuel, not the air, where it is relatively stable, being bound to another Nitrogen as N2. Natural gas does not contain Nitrogen, and cooktops do not operate at high pressure. Likewise, natural gas, burning in excess air (open flame) does not produce significant CO. It is indeed a clean burning fuel.Cooking does release particulates and gasses, smoke and smells, but that does not depend on how the food is heated. Cooking bacon smells the same on electric or gas or charcoal or wood (may actually smell better on wood and charcoal) or dung (well maybe not dung).\n'</li><li>'When my electricity goes down due to winter storms, I still have hot water for showers, a place to cook food and heat all via my gas water heater, gas fireplace and gas cooktop. Easy to ignite with a match. We can briefly open windows to air out fumes. I’ll never willingly go all electric.\n'</li></ul> |
+
+ ## Evaluation
+
+ ### Metrics
+ | Label | Accuracy |
+ |:--------|:---------|
+ | **all** | 1.0 |
+
+ ## Uses
+
+ ### Direct Use for Inference
+
+ First install the SetFit library:
+
+ ```bash
+ pip install setfit
+ ```
+
+ Then you can load this model and run inference.
+
+ ```python
+ from setfit import SetFitModel
+
+ # Download from the 🤗 Hub
+ model = SetFitModel.from_pretrained("davidadamczyk/my-awesome-setfit-model")
+ # Run inference
+ preds = model("“Amid this dynamic environment, we delivered record results in fiscal year 2022: We reported $198 billion in revenue and $83 billion in operating income. And the Microsoft Cloud surpassed $100 billion in annualized revenue for the first time.”- From Microsoft’s 2022 Annual Report Shareholder’s Letter")
+ ```
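+
+ The returned labels are the strings stored in `config_setfit.json` ("no" / "yes"). Several texts can also be classified in one call (a small usage sketch, assuming the current SetFit API):
+
+ ```python
+ # Batched inference: one label string per input text.
+ preds = model.predict([
+     "First comment to classify.",
+     "Second comment to classify.",
+ ])
+ print(preds)  # e.g. ['yes', 'no']
+ ```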
+
+ <!--
+ ### Downstream Use
+
+ *List how someone could finetune this model on their own dataset.*
+ -->
+
+ <!--
+ ### Out-of-Scope Use
+
+ *List how the model may foreseeably be misused and address what users ought not to do with the model.*
+ -->
+
+ <!--
+ ## Bias, Risks and Limitations
+
+ *What are the known or foreseeable issues stemming from this model? You could also flag here known failure cases or weaknesses of the model.*
+ -->
+
+ <!--
+ ### Recommendations
+
+ *What are recommendations with respect to the foreseeable issues? For example, filtering explicit content.*
+ -->
+
+ ## Training Details
+
+ ### Training Set Metrics
+ | Training set | Min | Median | Max |
+ |:-------------|:----|:--------|:----|
+ | Word count | 13 | 132.875 | 296 |
+
+ | Label | Training Sample Count |
+ |:------|:----------------------|
+ | no | 18 |
+ | yes | 22 |
+
+ ### Training Hyperparameters
+ - batch_size: (16, 16)
+ - num_epochs: (1, 1)
+ - max_steps: -1
+ - sampling_strategy: oversampling
+ - num_iterations: 20
+ - body_learning_rate: (2e-05, 2e-05)
+ - head_learning_rate: 2e-05
+ - loss: CosineSimilarityLoss
+ - distance_metric: cosine_distance
+ - margin: 0.25
+ - end_to_end: False
+ - use_amp: False
+ - warmup_proportion: 0.1
+ - l2_weight: 0.01
+ - seed: 42
+ - eval_max_steps: -1
+ - load_best_model_at_end: False
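+
+ These values map one-to-one onto `setfit.TrainingArguments`; a hedged reconstruction (argument names as in SetFit 1.1; `distance_metric` and `margin` only affect triplet-style losses and are left at their defaults here):
+
+ ```python
+ from sentence_transformers.losses import CosineSimilarityLoss
+ from setfit import TrainingArguments
+
+ args = TrainingArguments(
+     batch_size=(16, 16),                # (embedding phase, classifier phase)
+     num_epochs=(1, 1),
+     max_steps=-1,
+     sampling_strategy="oversampling",
+     num_iterations=20,
+     body_learning_rate=(2e-05, 2e-05),
+     head_learning_rate=2e-05,
+     loss=CosineSimilarityLoss,
+     end_to_end=False,
+     use_amp=False,
+     warmup_proportion=0.1,
+     l2_weight=0.01,
+     seed=42,
+     eval_max_steps=-1,
+     load_best_model_at_end=False,
+ )
+ ```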
+
+ ### Training Results
+ | Epoch | Step | Training Loss | Validation Loss |
+ |:-----:|:----:|:-------------:|:---------------:|
+ | 0.01 | 1 | 0.3469 | - |
+ | 0.5 | 50 | 0.0603 | - |
+ | 1.0 | 100 | 0.0011 | - |
+
+ ### Framework Versions
+ - Python: 3.10.13
+ - SetFit: 1.1.0
+ - Sentence Transformers: 3.0.1
+ - Transformers: 4.45.2
+ - PyTorch: 2.4.0+cu124
+ - Datasets: 2.21.0
+ - Tokenizers: 0.20.0
+
+ ## Citation
+
+ ### BibTeX
+ ```bibtex
+ @article{https://doi.org/10.48550/arxiv.2209.11055,
+     doi = {10.48550/ARXIV.2209.11055},
+     url = {https://arxiv.org/abs/2209.11055},
+     author = {Tunstall, Lewis and Reimers, Nils and Jo, Unso Eun Seo and Bates, Luke and Korat, Daniel and Wasserblat, Moshe and Pereg, Oren},
+     keywords = {Computation and Language (cs.CL), FOS: Computer and information sciences, FOS: Computer and information sciences},
+     title = {Efficient Few-Shot Learning Without Prompts},
+     publisher = {arXiv},
+     year = {2022},
+     copyright = {Creative Commons Attribution 4.0 International}
+ }
+ ```
+
+ <!--
+ ## Glossary
+
+ *Clearly define terms in order to be accessible across audiences.*
+ -->
+
+ <!--
+ ## Model Card Authors
+
+ *Lists the people who create the model card, providing recognition and accountability for the detailed work that goes into its construction.*
+ -->
+
+ <!--
+ ## Model Card Contact
+
+ *Provides a way for people who have updates to the Model Card, suggestions, or questions, to contact the Model Card authors.*
+ -->
config.json ADDED
@@ -0,0 +1,24 @@
+ {
+   "_name_or_path": "sentence-transformers/all-mpnet-base-v2",
+   "architectures": [
+     "MPNetModel"
+   ],
+   "attention_probs_dropout_prob": 0.1,
+   "bos_token_id": 0,
+   "eos_token_id": 2,
+   "hidden_act": "gelu",
+   "hidden_dropout_prob": 0.1,
+   "hidden_size": 768,
+   "initializer_range": 0.02,
+   "intermediate_size": 3072,
+   "layer_norm_eps": 1e-05,
+   "max_position_embeddings": 514,
+   "model_type": "mpnet",
+   "num_attention_heads": 12,
+   "num_hidden_layers": 12,
+   "pad_token_id": 1,
+   "relative_attention_num_buckets": 32,
+   "torch_dtype": "float32",
+   "transformers_version": "4.45.2",
+   "vocab_size": 30527
+ }
config_sentence_transformers.json ADDED
@@ -0,0 +1,10 @@
+ {
+   "__version__": {
+     "sentence_transformers": "3.0.1",
+     "transformers": "4.45.2",
+     "pytorch": "2.4.0+cu124"
+   },
+   "prompts": {},
+   "default_prompt_name": null,
+   "similarity_fn_name": null
+ }
config_setfit.json ADDED
@@ -0,0 +1,7 @@
+ {
+   "normalize_embeddings": false,
+   "labels": [
+     "no",
+     "yes"
+   ]
+ }
model.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:e322255e8bf718c858c840e1b7e5ef9d4c7f51464cd3afb88ddc89ec47e976ae
+ size 437967672
model_head.pkl ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:727d6e469774ea8526d09a7068288ca9d5c4b2d11925f38f7dee018b85c05f44
+ size 7023
modules.json ADDED
@@ -0,0 +1,20 @@
+ [
+   {
+     "idx": 0,
+     "name": "0",
+     "path": "",
+     "type": "sentence_transformers.models.Transformer"
+   },
+   {
+     "idx": 1,
+     "name": "1",
+     "path": "1_Pooling",
+     "type": "sentence_transformers.models.Pooling"
+   },
+   {
+     "idx": 2,
+     "name": "2",
+     "path": "2_Normalize",
+     "type": "sentence_transformers.models.Normalize"
+   }
+ ]
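
modules.json chains three modules for the Sentence Transformer body: the MPNet transformer, the mean-pooling layer from 1_Pooling, and an L2-normalization step, so embeddings come out unit-length. A sketch of assembling the same pipeline explicitly with sentence-transformers (this mirrors what loading the repo does automatically, not an extra step users need):

```python
from sentence_transformers import SentenceTransformer, models

# Rebuild the three-stage pipeline that modules.json describes.
transformer = models.Transformer("sentence-transformers/all-mpnet-base-v2", max_seq_length=384)
pooling = models.Pooling(transformer.get_word_embedding_dimension(), pooling_mode="mean")
normalize = models.Normalize()

body = SentenceTransformer(modules=[transformer, pooling, normalize])
embeddings = body.encode(["An example sentence."])  # unit-norm 768-d vectors
```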
sentence_bert_config.json ADDED
@@ -0,0 +1,4 @@
+ {
+   "max_seq_length": 384,
+   "do_lower_case": false
+ }
special_tokens_map.json ADDED
@@ -0,0 +1,51 @@
+ {
+   "bos_token": {
+     "content": "<s>",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "cls_token": {
+     "content": "<s>",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "eos_token": {
+     "content": "</s>",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "mask_token": {
+     "content": "<mask>",
+     "lstrip": true,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "pad_token": {
+     "content": "<pad>",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "sep_token": {
+     "content": "</s>",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "unk_token": {
+     "content": "[UNK]",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   }
+ }
tokenizer.json ADDED
The diff for this file is too large to render. See raw diff
 
tokenizer_config.json ADDED
@@ -0,0 +1,72 @@
+ {
+   "added_tokens_decoder": {
+     "0": {
+       "content": "<s>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "1": {
+       "content": "<pad>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "2": {
+       "content": "</s>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "3": {
+       "content": "<unk>",
+       "lstrip": false,
+       "normalized": true,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "104": {
+       "content": "[UNK]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "30526": {
+       "content": "<mask>",
+       "lstrip": true,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     }
+   },
+   "bos_token": "<s>",
+   "clean_up_tokenization_spaces": false,
+   "cls_token": "<s>",
+   "do_lower_case": true,
+   "eos_token": "</s>",
+   "mask_token": "<mask>",
+   "max_length": 128,
+   "model_max_length": 384,
+   "pad_to_multiple_of": null,
+   "pad_token": "<pad>",
+   "pad_token_type_id": 0,
+   "padding_side": "right",
+   "sep_token": "</s>",
+   "stride": 0,
+   "strip_accents": null,
+   "tokenize_chinese_chars": true,
+   "tokenizer_class": "MPNetTokenizer",
+   "truncation_side": "right",
+   "truncation_strategy": "longest_first",
+   "unk_token": "[UNK]"
+ }
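
The tokenizer is a standard MPNetTokenizer with model_max_length pinned to 384 tokens, matching the sentence_bert_config.json above. A short sketch of loading it on its own (repo id taken from the model card's inference example):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("davidadamczyk/my-awesome-setfit-model")
batch = tokenizer("An example sentence.", truncation=True, max_length=384, return_tensors="pt")
print(batch["input_ids"].shape)  # (1, number_of_tokens)
```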
vocab.txt ADDED
The diff for this file is too large to render. See raw diff