KenyaNonaka0210 committed
Commit 3c71742 · verified · 1 Parent(s): 5f4e769

Upload tokenizer

README.md ADDED
@@ -0,0 +1,199 @@
1
+ ---
2
+ library_name: transformers
3
+ tags: []
4
+ ---
5
+
6
+ # Model Card for Model ID
7
+
8
+ <!-- Provide a quick summary of what the model is/does. -->
9
+
10
+
11
+
12
+ ## Model Details
13
+
14
+ ### Model Description
15
+
16
+ <!-- Provide a longer summary of what this model is. -->
17
+
18
+ This is the model card of a 🤗 transformers model that has been pushed to the Hub. This model card has been automatically generated.
19
+
20
+ - **Developed by:** [More Information Needed]
21
+ - **Funded by [optional]:** [More Information Needed]
22
+ - **Shared by [optional]:** [More Information Needed]
23
+ - **Model type:** [More Information Needed]
24
+ - **Language(s) (NLP):** [More Information Needed]
25
+ - **License:** [More Information Needed]
26
+ - **Finetuned from model [optional]:** [More Information Needed]
27
+
28
+ ### Model Sources [optional]
29
+
30
+ <!-- Provide the basic links for the model. -->
31
+
32
+ - **Repository:** [More Information Needed]
33
+ - **Paper [optional]:** [More Information Needed]
34
+ - **Demo [optional]:** [More Information Needed]
35
+
36
+ ## Uses
37
+
38
+ <!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->
39
+
40
+ ### Direct Use
41
+
42
+ <!-- This section is for the model use without fine-tuning or plugging into a larger ecosystem/app. -->
43
+
44
+ [More Information Needed]
45
+
46
+ ### Downstream Use [optional]
47
+
48
+ <!-- This section is for the model use when fine-tuned for a task, or when plugged into a larger ecosystem/app -->
49
+
50
+ [More Information Needed]
51
+
52
+ ### Out-of-Scope Use
53
+
54
+ <!-- This section addresses misuse, malicious use, and uses that the model will not work well for. -->
55
+
56
+ [More Information Needed]
57
+
58
+ ## Bias, Risks, and Limitations
59
+
60
+ <!-- This section is meant to convey both technical and sociotechnical limitations. -->
61
+
62
+ [More Information Needed]
63
+
64
+ ### Recommendations
65
+
66
+ <!-- This section is meant to convey recommendations with respect to the bias, risk, and technical limitations. -->
67
+
68
+ Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model. More information needed for further recommendations.
69
+
70
+ ## How to Get Started with the Model
71
+
72
+ Use the code below to get started with the model.
73
+
74
+ [More Information Needed]
75
+
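Until the authors fill this section in, here is a minimal loading sketch. The repository id is a placeholder, and `trust_remote_code=True` is an assumption based on the custom `tokenization_luke_bert_japanese.py` shipped in this commit.

```python
from transformers import AutoTokenizer

# Hypothetical repo id -- replace with the actual Hub repository name.
repo_id = "KenyaNonaka0210/<model-name>"

# trust_remote_code is assumed to be needed because the tokenizer class
# (LukeBertJapaneseTokenizer) is defined in this repository, not in transformers itself.
tokenizer = AutoTokenizer.from_pretrained(repo_id, trust_remote_code=True)

print(tokenizer.tokenize("東京は日本の首都です。"))
```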
76
+ ## Training Details
77
+
78
+ ### Training Data
79
+
80
+ <!-- This should link to a Dataset Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->
81
+
82
+ [More Information Needed]
83
+
84
+ ### Training Procedure
85
+
86
+ <!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. -->
87
+
88
+ #### Preprocessing [optional]
89
+
90
+ [More Information Needed]
91
+
92
+
93
+ #### Training Hyperparameters
94
+
95
+ - **Training regime:** [More Information Needed] <!--fp32, fp16 mixed precision, bf16 mixed precision, bf16 non-mixed precision, fp16 non-mixed precision, fp8 mixed precision -->
96
+
97
+ #### Speeds, Sizes, Times [optional]
98
+
99
+ <!-- This section provides information about throughput, start/end time, checkpoint size if relevant, etc. -->
100
+
101
+ [More Information Needed]
102
+
103
+ ## Evaluation
104
+
105
+ <!-- This section describes the evaluation protocols and provides the results. -->
106
+
107
+ ### Testing Data, Factors & Metrics
108
+
109
+ #### Testing Data
110
+
111
+ <!-- This should link to a Dataset Card if possible. -->
112
+
113
+ [More Information Needed]
114
+
115
+ #### Factors
116
+
117
+ <!-- These are the things the evaluation is disaggregating by, e.g., subpopulations or domains. -->
118
+
119
+ [More Information Needed]
120
+
121
+ #### Metrics
122
+
123
+ <!-- These are the evaluation metrics being used, ideally with a description of why. -->
124
+
125
+ [More Information Needed]
126
+
127
+ ### Results
128
+
129
+ [More Information Needed]
130
+
131
+ #### Summary
132
+
133
+
134
+
135
+ ## Model Examination [optional]
136
+
137
+ <!-- Relevant interpretability work for the model goes here -->
138
+
139
+ [More Information Needed]
140
+
141
+ ## Environmental Impact
142
+
143
+ <!-- Total emissions (in grams of CO2eq) and additional considerations, such as electricity usage, go here. Edit the suggested text below accordingly -->
144
+
145
+ Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).
146
+
147
+ - **Hardware Type:** [More Information Needed]
148
+ - **Hours used:** [More Information Needed]
149
+ - **Cloud Provider:** [More Information Needed]
150
+ - **Compute Region:** [More Information Needed]
151
+ - **Carbon Emitted:** [More Information Needed]
152
+
153
+ ## Technical Specifications [optional]
154
+
155
+ ### Model Architecture and Objective
156
+
157
+ [More Information Needed]
158
+
159
+ ### Compute Infrastructure
160
+
161
+ [More Information Needed]
162
+
163
+ #### Hardware
164
+
165
+ [More Information Needed]
166
+
167
+ #### Software
168
+
169
+ [More Information Needed]
170
+
171
+ ## Citation [optional]
172
+
173
+ <!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. -->
174
+
175
+ **BibTeX:**
176
+
177
+ [More Information Needed]
178
+
179
+ **APA:**
180
+
181
+ [More Information Needed]
182
+
183
+ ## Glossary [optional]
184
+
185
+ <!-- If relevant, include terms and calculations in this section that can help readers understand the model or model card. -->
186
+
187
+ [More Information Needed]
188
+
189
+ ## More Information [optional]
190
+
191
+ [More Information Needed]
192
+
193
+ ## Model Card Authors [optional]
194
+
195
+ [More Information Needed]
196
+
197
+ ## Model Card Contact
198
+
199
+ [More Information Needed]
added_tokens.json ADDED
@@ -0,0 +1,4 @@
1
+ {
2
+ "<ent2>": 32769,
3
+ "<ent>": 32768
4
+ }
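The two entries above append the entity marker tokens directly after the base vocabulary. A quick sanity check, assuming `tokenizer` has been loaded as in the README sketch:

```python
# Illustrative check; the ids come from added_tokens.json above.
assert tokenizer.convert_tokens_to_ids("<ent>") == 32768
assert tokenizer.convert_tokens_to_ids("<ent2>") == 32769
# The base WordPiece vocabulary therefore appears to span ids 0..32767.
```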
entity_vocab.json ADDED
The diff for this file is too large to render. See raw diff
special_tokens_map.json ADDED
@@ -0,0 +1,59 @@
1
+ {
2
+ "additional_special_tokens": [
3
+ "<ent>",
4
+ "<ent2>",
5
+ "<ent>",
6
+ "<ent2>",
7
+ "<ent>",
8
+ "<ent2>",
9
+ {
10
+ "content": "<ent>",
11
+ "lstrip": false,
12
+ "normalized": true,
13
+ "rstrip": false,
14
+ "single_word": false
15
+ },
16
+ {
17
+ "content": "<ent2>",
18
+ "lstrip": false,
19
+ "normalized": true,
20
+ "rstrip": false,
21
+ "single_word": false
22
+ }
23
+ ],
24
+ "cls_token": {
25
+ "content": "[CLS]",
26
+ "lstrip": false,
27
+ "normalized": false,
28
+ "rstrip": false,
29
+ "single_word": false
30
+ },
31
+ "mask_token": {
32
+ "content": "[MASK]",
33
+ "lstrip": false,
34
+ "normalized": false,
35
+ "rstrip": false,
36
+ "single_word": false
37
+ },
38
+ "pad_token": {
39
+ "content": "[PAD]",
40
+ "lstrip": false,
41
+ "normalized": false,
42
+ "rstrip": false,
43
+ "single_word": false
44
+ },
45
+ "sep_token": {
46
+ "content": "[SEP]",
47
+ "lstrip": false,
48
+ "normalized": false,
49
+ "rstrip": false,
50
+ "single_word": false
51
+ },
52
+ "unk_token": {
53
+ "content": "[UNK]",
54
+ "lstrip": false,
55
+ "normalized": false,
56
+ "rstrip": false,
57
+ "single_word": false
58
+ }
59
+ }
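Besides the standard BERT special tokens, the map registers `<ent>` and `<ent2>` as additional special tokens; the tokenizer code below inserts them around entity mentions for the entity (pair) classification tasks. A small inspection sketch, again assuming `tokenizer` from the README example:

```python
# Illustrative; these are standard transformers tokenizer attributes.
print(tokenizer.cls_token, tokenizer.sep_token, tokenizer.mask_token)  # [CLS] [SEP] [MASK]
print(tokenizer.additional_special_tokens)      # contains "<ent>" and "<ent2>"
print(tokenizer.additional_special_tokens_ids)  # expected: [32768, 32769]
```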
tokenization_luke_bert_japanese.py ADDED
@@ -0,0 +1,1580 @@
1
+ # coding=utf-8
2
+ # Copyright Studio Ousia and The HuggingFace Inc. team. All rights reserved.
3
+ #
4
+ # Licensed under the Apache License, Version 2.0 (the "License");
5
+ # you may not use this file except in compliance with the License.
6
+ # You may obtain a copy of the License at
7
+ #
8
+ # http://www.apache.org/licenses/LICENSE-2.0
9
+ #
10
+ # Unless required by applicable law or agreed to in writing, software
11
+ # distributed under the License is distributed on an "AS IS" BASIS,
12
+ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
13
+ # See the License for the specific language governing permissions and
14
+ # limitations under the License.
15
+ """Tokenization classes for LUKE."""
16
+
17
+ import collections
18
+ import copy
19
+ import itertools
20
+ import json
21
+ import os
22
+ from collections.abc import Mapping
23
+ from typing import Dict, List, Optional, Tuple, Union
24
+
25
+ import numpy as np
26
+ from transformers.models.bert_japanese.tokenization_bert_japanese import (
27
+ BasicTokenizer,
28
+ CharacterTokenizer,
29
+ JumanppTokenizer,
30
+ MecabTokenizer,
31
+ SentencepieceTokenizer,
32
+ SudachiTokenizer,
33
+ WordpieceTokenizer,
34
+ load_vocab,
35
+ )
36
+ from transformers.models.luke.tokenization_luke import (
37
+ ENCODE_PLUS_ADDITIONAL_KWARGS_DOCSTRING, EntityInput, EntitySpanInput
38
+ )
39
+ from transformers.tokenization_utils import PreTrainedTokenizer
40
+ from transformers.tokenization_utils_base import (
41
+ ENCODE_KWARGS_DOCSTRING,
42
+ AddedToken,
43
+ BatchEncoding,
44
+ EncodedInput,
45
+ PaddingStrategy,
46
+ TextInput,
47
+ TextInputPair,
48
+ TensorType,
49
+ TruncationStrategy,
50
+ to_py_obj,
51
+ )
52
+ from transformers.utils import add_end_docstrings, is_tf_tensor, is_torch_tensor, logging
53
+
54
+
55
+ logger = logging.get_logger(__name__)
56
+
57
+ VOCAB_FILES_NAMES = {"vocab_file": "vocab.txt", "entity_vocab_file": "entity_vocab.json", "spm_file": "spiece.model"}
58
+
59
+
60
+ class LukeBertJapaneseTokenizer(PreTrainedTokenizer):
61
+ vocab_files_names = VOCAB_FILES_NAMES
62
+ model_input_names = ["input_ids", "attention_mask", "position_ids"]
63
+
64
+ def __init__(
65
+ self,
66
+ vocab_file,
67
+ entity_vocab_file,
68
+ task=None,
69
+ max_entity_length=32,
70
+ max_mention_length=30,
71
+ entity_token_1="<ent>",
72
+ entity_token_2="<ent2>",
73
+ entity_unk_token="[UNK]",
74
+ entity_pad_token="[PAD]",
75
+ entity_mask_token="[MASK]",
76
+ entity_mask2_token="[MASK2]",
77
+ spm_file=None,
78
+ do_lower_case=False,
79
+ do_word_tokenize=True,
80
+ do_subword_tokenize=True,
81
+ word_tokenizer_type="basic",
82
+ subword_tokenizer_type="wordpiece",
83
+ never_split=None,
84
+ unk_token="[UNK]",
85
+ sep_token="[SEP]",
86
+ pad_token="[PAD]",
87
+ cls_token="[CLS]",
88
+ mask_token="[MASK]",
89
+ mecab_kwargs=None,
90
+ sudachi_kwargs=None,
91
+ jumanpp_kwargs=None,
92
+ **kwargs,
93
+ ):
94
+ ## Start of block copied from BertJapaneseTokenizer.__init__
95
+ if subword_tokenizer_type == "sentencepiece":
96
+ if not os.path.isfile(spm_file):
97
+ raise ValueError(
98
+ f"Can't find a vocabulary file at path '{spm_file}'. To load the vocabulary from a Google"
99
+ " pretrained model use `tokenizer = AutoTokenizer.from_pretrained(PRETRAINED_MODEL_NAME)`"
100
+ )
101
+ self.spm_file = spm_file
102
+ else:
103
+ if not os.path.isfile(vocab_file):
104
+ raise ValueError(
105
+ f"Can't find a vocabulary file at path '{vocab_file}'. To load the vocabulary from a Google"
106
+ " pretrained model use `tokenizer = AutoTokenizer.from_pretrained(PRETRAINED_MODEL_NAME)`"
107
+ )
108
+ self.vocab = load_vocab(vocab_file)
109
+ self.ids_to_tokens = collections.OrderedDict([(ids, tok) for tok, ids in self.vocab.items()])
110
+
111
+ self.do_word_tokenize = do_word_tokenize
112
+ self.word_tokenizer_type = word_tokenizer_type
113
+ self.lower_case = do_lower_case
114
+ self.never_split = never_split
115
+ self.mecab_kwargs = copy.deepcopy(mecab_kwargs)
116
+ self.sudachi_kwargs = copy.deepcopy(sudachi_kwargs)
117
+ self.jumanpp_kwargs = copy.deepcopy(jumanpp_kwargs)
118
+ if do_word_tokenize:
119
+ if word_tokenizer_type == "basic":
120
+ self.word_tokenizer = BasicTokenizer(
121
+ do_lower_case=do_lower_case, never_split=never_split, tokenize_chinese_chars=False
122
+ )
123
+ elif word_tokenizer_type == "mecab":
124
+ self.word_tokenizer = MecabTokenizer(
125
+ do_lower_case=do_lower_case, never_split=never_split, **(mecab_kwargs or {})
126
+ )
127
+ elif word_tokenizer_type == "sudachi":
128
+ self.word_tokenizer = SudachiTokenizer(
129
+ do_lower_case=do_lower_case, never_split=never_split, **(sudachi_kwargs or {})
130
+ )
131
+ elif word_tokenizer_type == "jumanpp":
132
+ self.word_tokenizer = JumanppTokenizer(
133
+ do_lower_case=do_lower_case, never_split=never_split, **(jumanpp_kwargs or {})
134
+ )
135
+ else:
136
+ raise ValueError(f"Invalid word_tokenizer_type '{word_tokenizer_type}' is specified.")
137
+
138
+ self.do_subword_tokenize = do_subword_tokenize
139
+ self.subword_tokenizer_type = subword_tokenizer_type
140
+ if do_subword_tokenize:
141
+ if subword_tokenizer_type == "wordpiece":
142
+ self.subword_tokenizer = WordpieceTokenizer(vocab=self.vocab, unk_token=str(unk_token))
143
+ elif subword_tokenizer_type == "character":
144
+ self.subword_tokenizer = CharacterTokenizer(vocab=self.vocab, unk_token=str(unk_token))
145
+ elif subword_tokenizer_type == "sentencepiece":
146
+ self.subword_tokenizer = SentencepieceTokenizer(vocab=self.spm_file, unk_token=str(unk_token))
147
+ else:
148
+ raise ValueError(f"Invalid subword_tokenizer_type '{subword_tokenizer_type}' is specified.")
149
+ ## End of block copied from BertJapaneseTokenizer.__init__
150
+
151
+ ## Start of block copied from LukeTokenizer.__init__
152
+ # we add 2 special tokens for downstream tasks
153
+ # for more information about lstrip and rstrip, see https://github.com/huggingface/transformers/pull/2778
154
+ entity_token_1 = (
155
+ AddedToken(entity_token_1, lstrip=False, rstrip=False)
156
+ if isinstance(entity_token_1, str)
157
+ else entity_token_1
158
+ )
159
+ entity_token_2 = (
160
+ AddedToken(entity_token_2, lstrip=False, rstrip=False)
161
+ if isinstance(entity_token_2, str)
162
+ else entity_token_2
163
+ )
164
+ kwargs["additional_special_tokens"] = kwargs.get("additional_special_tokens", [])
165
+ kwargs["additional_special_tokens"] += [entity_token_1, entity_token_2]
166
+
167
+ with open(entity_vocab_file, encoding="utf-8") as entity_vocab_handle:
168
+ self.entity_vocab = json.load(entity_vocab_handle)
169
+ for entity_special_token in [entity_unk_token, entity_pad_token, entity_mask_token, entity_mask2_token]:
170
+ if entity_special_token not in self.entity_vocab:
171
+ raise ValueError(
172
+ f"Specified entity special token ``{entity_special_token}`` is not found in entity_vocab. "
173
+ f"Probably an incorrect entity vocab file is loaded: {entity_vocab_file}."
174
+ )
175
+ self.entity_unk_token_id = self.entity_vocab[entity_unk_token]
176
+ self.entity_pad_token_id = self.entity_vocab[entity_pad_token]
177
+ self.entity_mask_token_id = self.entity_vocab[entity_mask_token]
178
+ self.entity_mask2_token_id = self.entity_vocab[entity_mask2_token]
179
+
180
+ self.task = task
181
+ if task is None or task == "entity_span_classification":
182
+ self.max_entity_length = max_entity_length
183
+ elif task == "entity_classification":
184
+ self.max_entity_length = 1
185
+ elif task == "entity_pair_classification":
186
+ self.max_entity_length = 2
187
+ else:
188
+ raise ValueError(
189
+ f"Task {task} not supported. Select task from ['entity_classification', 'entity_pair_classification',"
190
+ " 'entity_span_classification'] only."
191
+ )
192
+
193
+ self.max_mention_length = max_mention_length
194
+ ## End of block copied from LukeTokenizer.__init__
195
+
196
+ super().__init__(
197
+ spm_file=spm_file,
198
+ unk_token=unk_token,
199
+ sep_token=sep_token,
200
+ pad_token=pad_token,
201
+ cls_token=cls_token,
202
+ mask_token=mask_token,
203
+ do_lower_case=do_lower_case,
204
+ do_word_tokenize=do_word_tokenize,
205
+ do_subword_tokenize=do_subword_tokenize,
206
+ word_tokenizer_type=word_tokenizer_type,
207
+ subword_tokenizer_type=subword_tokenizer_type,
208
+ never_split=never_split,
209
+ mecab_kwargs=mecab_kwargs,
210
+ sudachi_kwargs=sudachi_kwargs,
211
+ jumanpp_kwargs=jumanpp_kwargs,
212
+ task=task,
213
+ max_entity_length=max_entity_length, # Fixed to set the correct value
214
+ max_mention_length=max_mention_length, # Fixed to set the correct value
215
+ entity_token_1=entity_token_1.content, # Fixed to set the correct value
216
+ entity_token_2=entity_token_2.content, # Fixed to set the correct value
217
+ entity_unk_token=entity_unk_token,
218
+ entity_pad_token=entity_pad_token,
219
+ entity_mask_token=entity_mask_token,
220
+ entity_mask2_token=entity_mask2_token,
221
+ **kwargs,
222
+ )
223
+
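The constructor combines the word/subword setup of `BertJapaneseTokenizer` with the entity-vocabulary handling of `LukeTokenizer`. A hedged sketch of direct instantiation from local files (file names follow `VOCAB_FILES_NAMES` above; the `mecab` word tokenizer is an assumption, since the real settings live in the repository's `tokenizer_config.json`):

```python
# Illustrative only; AutoTokenizer.from_pretrained normally restores these arguments
# from tokenizer_config.json, so manual construction is rarely needed.
tokenizer = LukeBertJapaneseTokenizer(
    vocab_file="vocab.txt",
    entity_vocab_file="entity_vocab.json",
    word_tokenizer_type="mecab",        # assumption; "basic", "sudachi" or "jumanpp" are also accepted
    subword_tokenizer_type="wordpiece",
    task=None,                          # or "entity_classification", "entity_pair_classification",
                                        # "entity_span_classification"
)
```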
224
+ ## Copied from BertJapaneseTokenizer
225
+ @property
226
+ def do_lower_case(self):
227
+ return self.lower_case
228
+
229
+ ## Copied from BertJapaneseTokenizer
230
+ def __getstate__(self):
231
+ state = dict(self.__dict__)
232
+ if self.word_tokenizer_type in ["mecab", "sudachi", "jumanpp"]:
233
+ del state["word_tokenizer"]
234
+ return state
235
+
236
+ ## Copied from BertJapaneseTokenizer
237
+ def __setstate__(self, state):
238
+ self.__dict__ = state
239
+ if self.word_tokenizer_type == "mecab":
240
+ self.word_tokenizer = MecabTokenizer(
241
+ do_lower_case=self.do_lower_case, never_split=self.never_split, **(self.mecab_kwargs or {})
242
+ )
243
+ elif self.word_tokenizer_type == "sudachi":
244
+ self.word_tokenizer = SudachiTokenizer(
245
+ do_lower_case=self.do_lower_case, never_split=self.never_split, **(self.sudachi_kwargs or {})
246
+ )
247
+ elif self.word_tokenizer_type == "jumanpp":
248
+ self.word_tokenizer = JumanppTokenizer(
249
+ do_lower_case=self.do_lower_case, never_split=self.never_split, **(self.jumanpp_kwargs or {})
250
+ )
251
+
252
+ ## Copied from BertJapaneseTokenizer
253
+ def _tokenize(self, text):
254
+ if self.do_word_tokenize:
255
+ tokens = self.word_tokenizer.tokenize(text, never_split=self.all_special_tokens)
256
+ else:
257
+ tokens = [text]
258
+
259
+ if self.do_subword_tokenize:
260
+ split_tokens = [sub_token for token in tokens for sub_token in self.subword_tokenizer.tokenize(token)]
261
+ else:
262
+ split_tokens = tokens
263
+
264
+ return split_tokens
265
+
266
+ # Copied from BertJapaneseTokenizer
267
+ @property
268
+ def vocab_size(self):
269
+ if self.subword_tokenizer_type == "sentencepiece":
270
+ return len(self.subword_tokenizer.sp_model)
271
+ return len(self.vocab)
272
+
273
+ ## Copied from BertJapaneseTokenizer
274
+ def get_vocab(self):
275
+ if self.subword_tokenizer_type == "sentencepiece":
276
+ vocab = {self.convert_ids_to_tokens(i): i for i in range(self.vocab_size)}
277
+ vocab.update(self.added_tokens_encoder)
278
+ return vocab
279
+ return dict(self.vocab, **self.added_tokens_encoder)
280
+
281
+ ## Copied from BertJapaneseTokenizer
282
+ def _convert_token_to_id(self, token):
283
+ """Converts a token (str) in an id using the vocab."""
284
+ if self.subword_tokenizer_type == "sentencepiece":
285
+ return self.subword_tokenizer.sp_model.PieceToId(token)
286
+ return self.vocab.get(token, self.vocab.get(self.unk_token))
287
+
288
+ ## Copied from BertJapaneseTokenizer
289
+ def _convert_id_to_token(self, index):
290
+ """Converts an index (integer) in a token (str) using the vocab."""
291
+ if self.subword_tokenizer_type == "sentencepiece":
292
+ return self.subword_tokenizer.sp_model.IdToPiece(index)
293
+ return self.ids_to_tokens.get(index, self.unk_token)
294
+
295
+ ## Copied from BertJapaneseTokenizer
296
+ def convert_tokens_to_string(self, tokens):
297
+ """Converts a sequence of tokens (string) in a single string."""
298
+ if self.subword_tokenizer_type == "sentencepiece":
299
+ return self.subword_tokenizer.sp_model.decode(tokens)
300
+ out_string = " ".join(tokens).replace(" ##", "").strip()
301
+ return out_string
302
+
303
+ ## Copied from BertJapaneseTokenizer
304
+ def build_inputs_with_special_tokens(
305
+ self, token_ids_0: List[int], token_ids_1: Optional[List[int]] = None
306
+ ) -> List[int]:
307
+ """
308
+ Build model inputs from a sequence or a pair of sequence for sequence classification tasks by concatenating and
309
+ adding special tokens. A BERT sequence has the following format:
310
+
311
+ - single sequence: `[CLS] X [SEP]`
312
+ - pair of sequences: `[CLS] A [SEP] B [SEP]`
313
+
314
+ Args:
315
+ token_ids_0 (`List[int]`):
316
+ List of IDs to which the special tokens will be added.
317
+ token_ids_1 (`List[int]`, *optional*):
318
+ Optional second list of IDs for sequence pairs.
319
+
320
+ Returns:
321
+ `List[int]`: List of [input IDs](../glossary#input-ids) with the appropriate special tokens.
322
+ """
323
+ if token_ids_1 is None:
324
+ return [self.cls_token_id] + token_ids_0 + [self.sep_token_id]
325
+ cls = [self.cls_token_id]
326
+ sep = [self.sep_token_id]
327
+ return cls + token_ids_0 + sep + token_ids_1 + sep
328
+
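A concrete illustration of the format described in the docstring above (10, 11, 20, 21 are placeholder token ids):

```python
single = tokenizer.build_inputs_with_special_tokens([10, 11])
# -> [cls_token_id, 10, 11, sep_token_id]                        i.e. [CLS] X [SEP]
pair = tokenizer.build_inputs_with_special_tokens([10, 11], [20, 21])
# -> [cls_token_id, 10, 11, sep_token_id, 20, 21, sep_token_id]  i.e. [CLS] A [SEP] B [SEP]
```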
329
+ ## Copied from BertJapaneseTokenizer
330
+ def get_special_tokens_mask(
331
+ self, token_ids_0: List[int], token_ids_1: Optional[List[int]] = None, already_has_special_tokens: bool = False
332
+ ) -> List[int]:
333
+ """
334
+ Retrieve sequence ids from a token list that has no special tokens added. This method is called when adding
335
+ special tokens using the tokenizer `prepare_for_model` method.
336
+
337
+ Args:
338
+ token_ids_0 (`List[int]`):
339
+ List of IDs.
340
+ token_ids_1 (`List[int]`, *optional*):
341
+ Optional second list of IDs for sequence pairs.
342
+ already_has_special_tokens (`bool`, *optional*, defaults to `False`):
343
+ Whether or not the token list is already formatted with special tokens for the model.
344
+
345
+ Returns:
346
+ `List[int]`: A list of integers in the range [0, 1]: 1 for a special token, 0 for a sequence token.
347
+ """
348
+
349
+ if already_has_special_tokens:
350
+ return super().get_special_tokens_mask(
351
+ token_ids_0=token_ids_0, token_ids_1=token_ids_1, already_has_special_tokens=True
352
+ )
353
+
354
+ if token_ids_1 is not None:
355
+ return [1] + ([0] * len(token_ids_0)) + [1] + ([0] * len(token_ids_1)) + [1]
356
+ return [1] + ([0] * len(token_ids_0)) + [1]
357
+
358
+ ## Copied from BertJapaneseTokenizer
359
+ def create_token_type_ids_from_sequences(
360
+ self, token_ids_0: List[int], token_ids_1: Optional[List[int]] = None
361
+ ) -> List[int]:
362
+ """
363
+ Create a mask from the two sequences passed to be used in a sequence-pair classification task. A BERT sequence
364
+ pair mask has the following format:
365
+
366
+ ```
367
+ 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1
368
+ | first sequence | second sequence |
369
+ ```
370
+
371
+ If `token_ids_1` is `None`, this method only returns the first portion of the mask (0s).
372
+
373
+ Args:
374
+ token_ids_0 (`List[int]`):
375
+ List of IDs.
376
+ token_ids_1 (`List[int]`, *optional*):
377
+ Optional second list of IDs for sequence pairs.
378
+
379
+ Returns:
380
+ `List[int]`: List of [token type IDs](../glossary#token-type-ids) according to the given sequence(s).
381
+ """
382
+ sep = [self.sep_token_id]
383
+ cls = [self.cls_token_id]
384
+ if token_ids_1 is None:
385
+ return len(cls + token_ids_0 + sep) * [0]
386
+ return len(cls + token_ids_0 + sep) * [0] + len(token_ids_1 + sep) * [1]
387
+
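A concrete result for the mask diagram above. Note that `model_input_names` at the top of this class omits `token_type_ids`, so encoding calls only return this mask when `return_token_type_ids=True` is passed explicitly:

```python
# Placeholder ids; only the lengths matter here.
tokenizer.create_token_type_ids_from_sequences([10, 11], [20, 21])
# -> [0, 0, 0, 0, 1, 1, 1]   ([CLS] A [SEP] -> 0s, B [SEP] -> 1s)
```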
388
+ ## Copied from LukeTokenizer
389
+ @add_end_docstrings(ENCODE_KWARGS_DOCSTRING, ENCODE_PLUS_ADDITIONAL_KWARGS_DOCSTRING)
390
+ def __call__(
391
+ self,
392
+ text: Union[TextInput, List[TextInput]],
393
+ text_pair: Optional[Union[TextInput, List[TextInput]]] = None,
394
+ entity_spans: Optional[Union[EntitySpanInput, List[EntitySpanInput]]] = None,
395
+ entity_spans_pair: Optional[Union[EntitySpanInput, List[EntitySpanInput]]] = None,
396
+ entities: Optional[Union[EntityInput, List[EntityInput]]] = None,
397
+ entities_pair: Optional[Union[EntityInput, List[EntityInput]]] = None,
398
+ add_special_tokens: bool = True,
399
+ padding: Union[bool, str, PaddingStrategy] = False,
400
+ truncation: Union[bool, str, TruncationStrategy] = None,
401
+ max_length: Optional[int] = None,
402
+ max_entity_length: Optional[int] = None,
403
+ stride: int = 0,
404
+ is_split_into_words: Optional[bool] = False,
405
+ pad_to_multiple_of: Optional[int] = None,
406
+ padding_side: Optional[str] = None,
407
+ return_tensors: Optional[Union[str, TensorType]] = None,
408
+ return_token_type_ids: Optional[bool] = None,
409
+ return_attention_mask: Optional[bool] = None,
410
+ return_overflowing_tokens: bool = False,
411
+ return_special_tokens_mask: bool = False,
412
+ return_offsets_mapping: bool = False,
413
+ return_length: bool = False,
414
+ verbose: bool = True,
415
+ **kwargs,
416
+ ) -> BatchEncoding:
417
+ """
418
+ Main method to tokenize and prepare for the model one or several sequence(s) or one or several pair(s) of
419
+ sequences, depending on the task you want to prepare them for.
420
+
421
+ Args:
422
+ text (`str`, `List[str]`, `List[List[str]]`):
423
+ The sequence or batch of sequences to be encoded. Each sequence must be a string. Note that this
424
+ tokenizer does not support tokenization based on pretokenized strings.
425
+ text_pair (`str`, `List[str]`, `List[List[str]]`):
426
+ The sequence or batch of sequences to be encoded. Each sequence must be a string. Note that this
427
+ tokenizer does not support tokenization based on pretokenized strings.
428
+ entity_spans (`List[Tuple[int, int]]`, `List[List[Tuple[int, int]]]`, *optional*):
429
+ The sequence or batch of sequences of entity spans to be encoded. Each sequence consists of tuples each
430
+ with two integers denoting character-based start and end positions of entities. If you specify
431
+ `"entity_classification"` or `"entity_pair_classification"` as the `task` argument in the constructor,
432
+ the length of each sequence must be 1 or 2, respectively. If you specify `entities`, the length of each
433
+ sequence must be equal to the length of each sequence of `entities`.
434
+ entity_spans_pair (`List[Tuple[int, int]]`, `List[List[Tuple[int, int]]]`, *optional*):
435
+ The sequence or batch of sequences of entity spans to be encoded. Each sequence consists of tuples each
436
+ with two integers denoting character-based start and end positions of entities. If you specify the
437
+ `task` argument in the constructor, this argument is ignored. If you specify `entities_pair`, the
438
+ length of each sequence must be equal to the length of each sequence of `entities_pair`.
439
+ entities (`List[str]`, `List[List[str]]`, *optional*):
440
+ The sequence or batch of sequences of entities to be encoded. Each sequence consists of strings
441
+ representing entities, i.e., special entities (e.g., [MASK]) or entity titles of Wikipedia (e.g., Los
442
+ Angeles). This argument is ignored if you specify the `task` argument in the constructor. The length of
443
+ each sequence must be equal to the length of each sequence of `entity_spans`. If you specify
444
+ `entity_spans` without specifying this argument, the entity sequence or the batch of entity sequences
445
+ is automatically constructed by filling it with the [MASK] entity.
446
+ entities_pair (`List[str]`, `List[List[str]]`, *optional*):
447
+ The sequence or batch of sequences of entities to be encoded. Each sequence consists of strings
448
+ representing entities, i.e., special entities (e.g., [MASK]) or entity titles of Wikipedia (e.g., Los
449
+ Angeles). This argument is ignored if you specify the `task` argument in the constructor. The length of
450
+ each sequence must be equal to the length of each sequence of `entity_spans_pair`. If you specify
451
+ `entity_spans_pair` without specifying this argument, the entity sequence or the batch of entity
452
+ sequences is automatically constructed by filling it with the [MASK] entity.
453
+ max_entity_length (`int`, *optional*):
454
+ The maximum length of `entity_ids`.
455
+ """
456
+ # Input type checking for clearer error
457
+ is_valid_single_text = isinstance(text, str)
458
+ is_valid_batch_text = isinstance(text, (list, tuple)) and (len(text) == 0 or (isinstance(text[0], str)))
459
+ if not (is_valid_single_text or is_valid_batch_text):
460
+ raise ValueError("text input must be of type `str` (single example) or `List[str]` (batch).")
461
+
462
+ is_valid_single_text_pair = isinstance(text_pair, str)
463
+ is_valid_batch_text_pair = isinstance(text_pair, (list, tuple)) and (
464
+ len(text_pair) == 0 or isinstance(text_pair[0], str)
465
+ )
466
+ if not (text_pair is None or is_valid_single_text_pair or is_valid_batch_text_pair):
467
+ raise ValueError("text_pair input must be of type `str` (single example) or `List[str]` (batch).")
468
+
469
+ is_batched = bool(isinstance(text, (list, tuple)))
470
+
471
+ if is_batched:
472
+ batch_text_or_text_pairs = list(zip(text, text_pair)) if text_pair is not None else text
473
+ if entities is None:
474
+ batch_entities_or_entities_pairs = None
475
+ else:
476
+ batch_entities_or_entities_pairs = (
477
+ list(zip(entities, entities_pair)) if entities_pair is not None else entities
478
+ )
479
+
480
+ if entity_spans is None:
481
+ batch_entity_spans_or_entity_spans_pairs = None
482
+ else:
483
+ batch_entity_spans_or_entity_spans_pairs = (
484
+ list(zip(entity_spans, entity_spans_pair)) if entity_spans_pair is not None else entity_spans
485
+ )
486
+
487
+ return self.batch_encode_plus(
488
+ batch_text_or_text_pairs=batch_text_or_text_pairs,
489
+ batch_entity_spans_or_entity_spans_pairs=batch_entity_spans_or_entity_spans_pairs,
490
+ batch_entities_or_entities_pairs=batch_entities_or_entities_pairs,
491
+ add_special_tokens=add_special_tokens,
492
+ padding=padding,
493
+ truncation=truncation,
494
+ max_length=max_length,
495
+ max_entity_length=max_entity_length,
496
+ stride=stride,
497
+ is_split_into_words=is_split_into_words,
498
+ pad_to_multiple_of=pad_to_multiple_of,
499
+ padding_side=padding_side,
500
+ return_tensors=return_tensors,
501
+ return_token_type_ids=return_token_type_ids,
502
+ return_attention_mask=return_attention_mask,
503
+ return_overflowing_tokens=return_overflowing_tokens,
504
+ return_special_tokens_mask=return_special_tokens_mask,
505
+ return_offsets_mapping=return_offsets_mapping,
506
+ return_length=return_length,
507
+ verbose=verbose,
508
+ **kwargs,
509
+ )
510
+ else:
511
+ return self.encode_plus(
512
+ text=text,
513
+ text_pair=text_pair,
514
+ entity_spans=entity_spans,
515
+ entity_spans_pair=entity_spans_pair,
516
+ entities=entities,
517
+ entities_pair=entities_pair,
518
+ add_special_tokens=add_special_tokens,
519
+ padding=padding,
520
+ truncation=truncation,
521
+ max_length=max_length,
522
+ max_entity_length=max_entity_length,
523
+ stride=stride,
524
+ is_split_into_words=is_split_into_words,
525
+ pad_to_multiple_of=pad_to_multiple_of,
526
+ padding_side=padding_side,
527
+ return_tensors=return_tensors,
528
+ return_token_type_ids=return_token_type_ids,
529
+ return_attention_mask=return_attention_mask,
530
+ return_overflowing_tokens=return_overflowing_tokens,
531
+ return_special_tokens_mask=return_special_tokens_mask,
532
+ return_offsets_mapping=return_offsets_mapping,
533
+ return_length=return_length,
534
+ verbose=verbose,
535
+ **kwargs,
536
+ )
537
+
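A usage sketch for the call path above. The sentence and spans are illustrative, the repo id is a placeholder, and `return_tensors="pt"` assumes PyTorch is installed:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("KenyaNonaka0210/<model-name>", trust_remote_code=True)

text = "東京は日本の首都です。"
entity_spans = [(0, 2), (3, 5)]  # character spans of "東京" and "日本"
encoding = tokenizer(text, entity_spans=entity_spans, return_tensors="pt")

# Expected keys, based on the LUKE tokenizer family: input_ids, attention_mask,
# position_ids, entity_ids, entity_position_ids, entity_attention_mask.
print({key: value.shape for key, value in encoding.items()})
```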
538
+ ## Copied from LukeTokenizer
539
+ def _encode_plus(
540
+ self,
541
+ text: Union[TextInput],
542
+ text_pair: Optional[Union[TextInput]] = None,
543
+ entity_spans: Optional[EntitySpanInput] = None,
544
+ entity_spans_pair: Optional[EntitySpanInput] = None,
545
+ entities: Optional[EntityInput] = None,
546
+ entities_pair: Optional[EntityInput] = None,
547
+ add_special_tokens: bool = True,
548
+ padding_strategy: PaddingStrategy = PaddingStrategy.DO_NOT_PAD,
549
+ truncation_strategy: TruncationStrategy = TruncationStrategy.DO_NOT_TRUNCATE,
550
+ max_length: Optional[int] = None,
551
+ max_entity_length: Optional[int] = None,
552
+ stride: int = 0,
553
+ is_split_into_words: Optional[bool] = False,
554
+ pad_to_multiple_of: Optional[int] = None,
555
+ padding_side: Optional[str] = None,
556
+ return_tensors: Optional[Union[str, TensorType]] = None,
557
+ return_token_type_ids: Optional[bool] = None,
558
+ return_attention_mask: Optional[bool] = None,
559
+ return_overflowing_tokens: bool = False,
560
+ return_special_tokens_mask: bool = False,
561
+ return_offsets_mapping: bool = False,
562
+ return_length: bool = False,
563
+ verbose: bool = True,
564
+ **kwargs,
565
+ ) -> BatchEncoding:
566
+ if return_offsets_mapping:
567
+ raise NotImplementedError(
568
+ "return_offset_mapping is not available when using Python tokenizers. "
569
+ "To use this feature, change your tokenizer to one deriving from "
570
+ "transformers.PreTrainedTokenizerFast. "
571
+ "More information on available tokenizers at "
572
+ "https://github.com/huggingface/transformers/pull/2674"
573
+ )
574
+
575
+ if is_split_into_words:
576
+ raise NotImplementedError("is_split_into_words is not supported in this tokenizer.")
577
+
578
+ (
579
+ first_ids,
580
+ second_ids,
581
+ first_entity_ids,
582
+ second_entity_ids,
583
+ first_entity_token_spans,
584
+ second_entity_token_spans,
585
+ ) = self._create_input_sequence(
586
+ text=text,
587
+ text_pair=text_pair,
588
+ entities=entities,
589
+ entities_pair=entities_pair,
590
+ entity_spans=entity_spans,
591
+ entity_spans_pair=entity_spans_pair,
592
+ **kwargs,
593
+ )
594
+
595
+ # prepare_for_model will create the attention_mask and token_type_ids
596
+ return self.prepare_for_model(
597
+ first_ids,
598
+ pair_ids=second_ids,
599
+ entity_ids=first_entity_ids,
600
+ pair_entity_ids=second_entity_ids,
601
+ entity_token_spans=first_entity_token_spans,
602
+ pair_entity_token_spans=second_entity_token_spans,
603
+ add_special_tokens=add_special_tokens,
604
+ padding=padding_strategy.value,
605
+ truncation=truncation_strategy.value,
606
+ max_length=max_length,
607
+ max_entity_length=max_entity_length,
608
+ stride=stride,
609
+ pad_to_multiple_of=pad_to_multiple_of,
610
+ padding_side=padding_side,
611
+ return_tensors=return_tensors,
612
+ prepend_batch_axis=True,
613
+ return_attention_mask=return_attention_mask,
614
+ return_token_type_ids=return_token_type_ids,
615
+ return_overflowing_tokens=return_overflowing_tokens,
616
+ return_special_tokens_mask=return_special_tokens_mask,
617
+ return_length=return_length,
618
+ verbose=verbose,
619
+ )
620
+
621
+ ## Copied from LukeTokenizer
622
+ def _batch_encode_plus(
623
+ self,
624
+ batch_text_or_text_pairs: Union[List[TextInput], List[TextInputPair]],
625
+ batch_entity_spans_or_entity_spans_pairs: Optional[
626
+ Union[List[EntitySpanInput], List[Tuple[EntitySpanInput, EntitySpanInput]]]
627
+ ] = None,
628
+ batch_entities_or_entities_pairs: Optional[
629
+ Union[List[EntityInput], List[Tuple[EntityInput, EntityInput]]]
630
+ ] = None,
631
+ add_special_tokens: bool = True,
632
+ padding_strategy: PaddingStrategy = PaddingStrategy.DO_NOT_PAD,
633
+ truncation_strategy: TruncationStrategy = TruncationStrategy.DO_NOT_TRUNCATE,
634
+ max_length: Optional[int] = None,
635
+ max_entity_length: Optional[int] = None,
636
+ stride: int = 0,
637
+ is_split_into_words: Optional[bool] = False,
638
+ pad_to_multiple_of: Optional[int] = None,
639
+ padding_side: Optional[str] = None,
640
+ return_tensors: Optional[Union[str, TensorType]] = None,
641
+ return_token_type_ids: Optional[bool] = None,
642
+ return_attention_mask: Optional[bool] = None,
643
+ return_overflowing_tokens: bool = False,
644
+ return_special_tokens_mask: bool = False,
645
+ return_offsets_mapping: bool = False,
646
+ return_length: bool = False,
647
+ verbose: bool = True,
648
+ **kwargs,
649
+ ) -> BatchEncoding:
650
+ if return_offsets_mapping:
651
+ raise NotImplementedError(
652
+ "return_offset_mapping is not available when using Python tokenizers. "
653
+ "To use this feature, change your tokenizer to one deriving from "
654
+ "transformers.PreTrainedTokenizerFast."
655
+ )
656
+
657
+ if is_split_into_words:
658
+ raise NotImplementedError("is_split_into_words is not supported in this tokenizer.")
659
+
660
+ # input_ids is a list of tuples (one for each example in the batch)
661
+ input_ids = []
662
+ entity_ids = []
663
+ entity_token_spans = []
664
+ for index, text_or_text_pair in enumerate(batch_text_or_text_pairs):
665
+ if not isinstance(text_or_text_pair, (list, tuple)):
666
+ text, text_pair = text_or_text_pair, None
667
+ else:
668
+ text, text_pair = text_or_text_pair
669
+
670
+ entities, entities_pair = None, None
671
+ if batch_entities_or_entities_pairs is not None:
672
+ entities_or_entities_pairs = batch_entities_or_entities_pairs[index]
673
+ if entities_or_entities_pairs:
674
+ if isinstance(entities_or_entities_pairs[0], str):
675
+ entities, entities_pair = entities_or_entities_pairs, None
676
+ else:
677
+ entities, entities_pair = entities_or_entities_pairs
678
+
679
+ entity_spans, entity_spans_pair = None, None
680
+ if batch_entity_spans_or_entity_spans_pairs is not None:
681
+ entity_spans_or_entity_spans_pairs = batch_entity_spans_or_entity_spans_pairs[index]
682
+ if len(entity_spans_or_entity_spans_pairs) > 0 and isinstance(
683
+ entity_spans_or_entity_spans_pairs[0], list
684
+ ):
685
+ entity_spans, entity_spans_pair = entity_spans_or_entity_spans_pairs
686
+ else:
687
+ entity_spans, entity_spans_pair = entity_spans_or_entity_spans_pairs, None
688
+
689
+ (
690
+ first_ids,
691
+ second_ids,
692
+ first_entity_ids,
693
+ second_entity_ids,
694
+ first_entity_token_spans,
695
+ second_entity_token_spans,
696
+ ) = self._create_input_sequence(
697
+ text=text,
698
+ text_pair=text_pair,
699
+ entities=entities,
700
+ entities_pair=entities_pair,
701
+ entity_spans=entity_spans,
702
+ entity_spans_pair=entity_spans_pair,
703
+ **kwargs,
704
+ )
705
+ input_ids.append((first_ids, second_ids))
706
+ entity_ids.append((first_entity_ids, second_entity_ids))
707
+ entity_token_spans.append((first_entity_token_spans, second_entity_token_spans))
708
+
709
+ batch_outputs = self._batch_prepare_for_model(
710
+ input_ids,
711
+ batch_entity_ids_pairs=entity_ids,
712
+ batch_entity_token_spans_pairs=entity_token_spans,
713
+ add_special_tokens=add_special_tokens,
714
+ padding_strategy=padding_strategy,
715
+ truncation_strategy=truncation_strategy,
716
+ max_length=max_length,
717
+ max_entity_length=max_entity_length,
718
+ stride=stride,
719
+ pad_to_multiple_of=pad_to_multiple_of,
720
+ padding_side=padding_side,
721
+ return_attention_mask=return_attention_mask,
722
+ return_token_type_ids=return_token_type_ids,
723
+ return_overflowing_tokens=return_overflowing_tokens,
724
+ return_special_tokens_mask=return_special_tokens_mask,
725
+ return_length=return_length,
726
+ return_tensors=return_tensors,
727
+ verbose=verbose,
728
+ )
729
+
730
+ return BatchEncoding(batch_outputs)
731
+
732
+ ## Copied from LukeTokenizer
733
+ def _check_entity_input_format(self, entities: Optional[EntityInput], entity_spans: Optional[EntitySpanInput]):
734
+ if not isinstance(entity_spans, list):
735
+ raise TypeError("entity_spans should be given as a list")
736
+ elif len(entity_spans) > 0 and not isinstance(entity_spans[0], tuple):
737
+ raise ValueError(
738
+ "entity_spans should be given as a list of tuples containing the start and end character indices"
739
+ )
740
+
741
+ if entities is not None:
742
+ if not isinstance(entities, list):
743
+ raise ValueError("If you specify entities, they should be given as a list")
744
+
745
+ if len(entities) > 0 and not isinstance(entities[0], str):
746
+ raise ValueError("If you specify entities, they should be given as a list of entity names")
747
+
748
+ if len(entities) != len(entity_spans):
749
+ raise ValueError("If you specify entities, entities and entity_spans must be the same length")
750
+
751
+ ## Copied from LukeTokenizer
752
+ def _create_input_sequence(
753
+ self,
754
+ text: Union[TextInput],
755
+ text_pair: Optional[Union[TextInput]] = None,
756
+ entities: Optional[EntityInput] = None,
757
+ entities_pair: Optional[EntityInput] = None,
758
+ entity_spans: Optional[EntitySpanInput] = None,
759
+ entity_spans_pair: Optional[EntitySpanInput] = None,
760
+ **kwargs,
761
+ ) -> Tuple[list, list, list, list, list, list]:
762
+ def get_input_ids(text):
763
+ tokens = self.tokenize(text, **kwargs)
764
+ return self.convert_tokens_to_ids(tokens)
765
+
766
+ def get_input_ids_and_entity_token_spans(text, entity_spans):
767
+ if entity_spans is None:
768
+ return get_input_ids(text), None
769
+
770
+ cur = 0
771
+ input_ids = []
772
+ entity_token_spans = [None] * len(entity_spans)
773
+
774
+ split_char_positions = sorted(frozenset(itertools.chain(*entity_spans)))
775
+ char_pos2token_pos = {}
776
+
777
+ for split_char_position in split_char_positions:
778
+ orig_split_char_position = split_char_position
779
+ if (
780
+ split_char_position > 0 and text[split_char_position - 1] == " "
781
+ ): # whitespace should be prepended to the following token
782
+ split_char_position -= 1
783
+ if cur != split_char_position:
784
+ input_ids += get_input_ids(text[cur:split_char_position])
785
+ cur = split_char_position
786
+ char_pos2token_pos[orig_split_char_position] = len(input_ids)
787
+
788
+ input_ids += get_input_ids(text[cur:])
789
+
790
+ entity_token_spans = [
791
+ (char_pos2token_pos[char_start], char_pos2token_pos[char_end]) for char_start, char_end in entity_spans
792
+ ]
793
+
794
+ return input_ids, entity_token_spans
795
+
796
+ first_ids, second_ids = None, None
797
+ first_entity_ids, second_entity_ids = None, None
798
+ first_entity_token_spans, second_entity_token_spans = None, None
799
+
800
+ if self.task is None:
801
+ if entity_spans is None:
802
+ first_ids = get_input_ids(text)
803
+ else:
804
+ self._check_entity_input_format(entities, entity_spans)
805
+
806
+ first_ids, first_entity_token_spans = get_input_ids_and_entity_token_spans(text, entity_spans)
807
+ if entities is None:
808
+ first_entity_ids = [self.entity_mask_token_id] * len(entity_spans)
809
+ else:
810
+ first_entity_ids = [self.entity_vocab.get(entity, self.entity_unk_token_id) for entity in entities]
811
+
812
+ if text_pair is not None:
813
+ if entity_spans_pair is None:
814
+ second_ids = get_input_ids(text_pair)
815
+ else:
816
+ self._check_entity_input_format(entities_pair, entity_spans_pair)
817
+
818
+ second_ids, second_entity_token_spans = get_input_ids_and_entity_token_spans(
819
+ text_pair, entity_spans_pair
820
+ )
821
+ if entities_pair is None:
822
+ second_entity_ids = [self.entity_mask_token_id] * len(entity_spans_pair)
823
+ else:
824
+ second_entity_ids = [
825
+ self.entity_vocab.get(entity, self.entity_unk_token_id) for entity in entities_pair
826
+ ]
827
+
828
+ elif self.task == "entity_classification":
829
+ if not (isinstance(entity_spans, list) and len(entity_spans) == 1 and isinstance(entity_spans[0], tuple)):
830
+ raise ValueError(
831
+ "Entity spans should be a list containing a single tuple "
832
+ "containing the start and end character indices of an entity"
833
+ )
834
+ first_entity_ids = [self.entity_mask_token_id]
835
+ first_ids, first_entity_token_spans = get_input_ids_and_entity_token_spans(text, entity_spans)
836
+
837
+ # add special tokens to input ids
838
+ entity_token_start, entity_token_end = first_entity_token_spans[0]
839
+ first_ids = (
840
+ first_ids[:entity_token_end] + [self.additional_special_tokens_ids[0]] + first_ids[entity_token_end:]
841
+ )
842
+ first_ids = (
843
+ first_ids[:entity_token_start]
844
+ + [self.additional_special_tokens_ids[0]]
845
+ + first_ids[entity_token_start:]
846
+ )
847
+ first_entity_token_spans = [(entity_token_start, entity_token_end + 2)]
848
+
849
+ elif self.task == "entity_pair_classification":
850
+ if not (
851
+ isinstance(entity_spans, list)
852
+ and len(entity_spans) == 2
853
+ and isinstance(entity_spans[0], tuple)
854
+ and isinstance(entity_spans[1], tuple)
855
+ ):
856
+ raise ValueError(
857
+ "Entity spans should be provided as a list of two tuples, "
858
+ "each tuple containing the start and end character indices of an entity"
859
+ )
860
+
861
+ head_span, tail_span = entity_spans
862
+ first_entity_ids = [self.entity_mask_token_id, self.entity_mask2_token_id]
863
+ first_ids, first_entity_token_spans = get_input_ids_and_entity_token_spans(text, entity_spans)
864
+
865
+ head_token_span, tail_token_span = first_entity_token_spans
866
+ token_span_with_special_token_ids = [
867
+ (head_token_span, self.additional_special_tokens_ids[0]),
868
+ (tail_token_span, self.additional_special_tokens_ids[1]),
869
+ ]
870
+ if head_token_span[0] < tail_token_span[0]:
871
+ first_entity_token_spans[0] = (head_token_span[0], head_token_span[1] + 2)
872
+ first_entity_token_spans[1] = (tail_token_span[0] + 2, tail_token_span[1] + 4)
873
+ token_span_with_special_token_ids = reversed(token_span_with_special_token_ids)
874
+ else:
875
+ first_entity_token_spans[0] = (head_token_span[0] + 2, head_token_span[1] + 4)
876
+ first_entity_token_spans[1] = (tail_token_span[0], tail_token_span[1] + 2)
877
+
878
+ for (entity_token_start, entity_token_end), special_token_id in token_span_with_special_token_ids:
879
+ first_ids = first_ids[:entity_token_end] + [special_token_id] + first_ids[entity_token_end:]
880
+ first_ids = first_ids[:entity_token_start] + [special_token_id] + first_ids[entity_token_start:]
881
+
882
+ elif self.task == "entity_span_classification":
883
+ if not (isinstance(entity_spans, list) and len(entity_spans) > 0 and isinstance(entity_spans[0], tuple)):
884
+ raise ValueError(
885
+ "Entity spans should be provided as a list of tuples, "
886
+ "each tuple containing the start and end character indices of an entity"
887
+ )
888
+
889
+ first_ids, first_entity_token_spans = get_input_ids_and_entity_token_spans(text, entity_spans)
890
+ first_entity_ids = [self.entity_mask_token_id] * len(entity_spans)
891
+
892
+ else:
893
+ raise ValueError(f"Task {self.task} not supported")
894
+
895
+ return (
896
+ first_ids,
897
+ second_ids,
898
+ first_entity_ids,
899
+ second_entity_ids,
900
+ first_entity_token_spans,
901
+ second_entity_token_spans,
902
+ )
903
+
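To see the `entity_pair_classification` branch above in action, a hedged sketch (placeholder repo id; the `task` keyword is forwarded to the constructor, which caps `max_entity_length` at 2 and wraps the two spans with the `<ent>`/`<ent2>` markers):

```python
from transformers import AutoTokenizer

pair_tokenizer = AutoTokenizer.from_pretrained(
    "KenyaNonaka0210/<model-name>",
    task="entity_pair_classification",
    trust_remote_code=True,
)
enc = pair_tokenizer("東京は日本の首都です。", entity_spans=[(0, 2), (3, 5)])
# The decoded word sequence should show the markers around both mentions, roughly:
# [CLS] <ent> 東京 <ent> は <ent2> 日本 <ent2> の 首都 です 。 [SEP]
print(pair_tokenizer.convert_ids_to_tokens(enc["input_ids"]))
```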
904
+ ## Copied from LukeTokenizer
905
+ @add_end_docstrings(ENCODE_KWARGS_DOCSTRING, ENCODE_PLUS_ADDITIONAL_KWARGS_DOCSTRING)
906
+ def _batch_prepare_for_model(
907
+ self,
908
+ batch_ids_pairs: List[Tuple[List[int], None]],
909
+ batch_entity_ids_pairs: List[Tuple[Optional[List[int]], Optional[List[int]]]],
910
+ batch_entity_token_spans_pairs: List[Tuple[Optional[List[Tuple[int, int]]], Optional[List[Tuple[int, int]]]]],
911
+ add_special_tokens: bool = True,
912
+ padding_strategy: PaddingStrategy = PaddingStrategy.DO_NOT_PAD,
913
+ truncation_strategy: TruncationStrategy = TruncationStrategy.DO_NOT_TRUNCATE,
914
+ max_length: Optional[int] = None,
915
+ max_entity_length: Optional[int] = None,
916
+ stride: int = 0,
917
+ pad_to_multiple_of: Optional[int] = None,
918
+ padding_side: Optional[str] = None,
919
+ return_tensors: Optional[str] = None,
920
+ return_token_type_ids: Optional[bool] = None,
921
+ return_attention_mask: Optional[bool] = None,
922
+ return_overflowing_tokens: bool = False,
923
+ return_special_tokens_mask: bool = False,
924
+ return_length: bool = False,
925
+ verbose: bool = True,
926
+ ) -> BatchEncoding:
927
+ """
928
+ Prepares a sequence of input ids, or a pair of sequences of input ids, so that it can be used by the model. It
929
+ adds special tokens, truncates sequences if overflowing while taking into account the special tokens and
930
+ manages a moving window (with user defined stride) for overflowing tokens
931
+
932
+
933
+ Args:
934
+ batch_ids_pairs: list of tokenized input ids or input ids pairs
935
+ batch_entity_ids_pairs: list of entity ids or entity ids pairs
936
+ batch_entity_token_spans_pairs: list of entity spans or entity spans pairs
937
+ max_entity_length: The maximum length of the entity sequence.
938
+ """
939
+
940
+ batch_outputs = {}
941
+ for input_ids, entity_ids, entity_token_span_pairs in zip(
942
+ batch_ids_pairs, batch_entity_ids_pairs, batch_entity_token_spans_pairs
943
+ ):
944
+ first_ids, second_ids = input_ids
945
+ first_entity_ids, second_entity_ids = entity_ids
946
+ first_entity_token_spans, second_entity_token_spans = entity_token_span_pairs
947
+ outputs = self.prepare_for_model(
948
+ first_ids,
949
+ second_ids,
950
+ entity_ids=first_entity_ids,
951
+ pair_entity_ids=second_entity_ids,
952
+ entity_token_spans=first_entity_token_spans,
953
+ pair_entity_token_spans=second_entity_token_spans,
954
+ add_special_tokens=add_special_tokens,
955
+ padding=PaddingStrategy.DO_NOT_PAD.value, # we pad in batch afterward
956
+ truncation=truncation_strategy.value,
957
+ max_length=max_length,
958
+ max_entity_length=max_entity_length,
959
+ stride=stride,
960
+ pad_to_multiple_of=None, # we pad in batch afterward
961
+ padding_side=None, # we pad in batch afterward
962
+ return_attention_mask=False, # we pad in batch afterward
963
+ return_token_type_ids=return_token_type_ids,
964
+ return_overflowing_tokens=return_overflowing_tokens,
965
+ return_special_tokens_mask=return_special_tokens_mask,
966
+ return_length=return_length,
967
+ return_tensors=None, # We convert the whole batch to tensors at the end
968
+ prepend_batch_axis=False,
969
+ verbose=verbose,
970
+ )
971
+
972
+ for key, value in outputs.items():
973
+ if key not in batch_outputs:
974
+ batch_outputs[key] = []
975
+ batch_outputs[key].append(value)
976
+
977
+ batch_outputs = self.pad(
978
+ batch_outputs,
979
+ padding=padding_strategy.value,
980
+ max_length=max_length,
981
+ pad_to_multiple_of=pad_to_multiple_of,
982
+ padding_side=padding_side,
983
+ return_attention_mask=return_attention_mask,
984
+ )
985
+
986
+ batch_outputs = BatchEncoding(batch_outputs, tensor_type=return_tensors)
987
+
988
+ return batch_outputs
989
+
990
+ ## Copied from LukeTokenizer with some lines added
991
+ @add_end_docstrings(ENCODE_KWARGS_DOCSTRING, ENCODE_PLUS_ADDITIONAL_KWARGS_DOCSTRING)
992
+ def prepare_for_model(
993
+ self,
994
+ ids: List[int],
995
+ pair_ids: Optional[List[int]] = None,
996
+ entity_ids: Optional[List[int]] = None,
997
+ pair_entity_ids: Optional[List[int]] = None,
998
+ entity_token_spans: Optional[List[Tuple[int, int]]] = None,
999
+ pair_entity_token_spans: Optional[List[Tuple[int, int]]] = None,
1000
+ add_special_tokens: bool = True,
1001
+ padding: Union[bool, str, PaddingStrategy] = False,
1002
+ truncation: Union[bool, str, TruncationStrategy] = None,
1003
+ max_length: Optional[int] = None,
1004
+ max_entity_length: Optional[int] = None,
1005
+ stride: int = 0,
1006
+ pad_to_multiple_of: Optional[int] = None,
1007
+ padding_side: Optional[str] = None,
1008
+ return_tensors: Optional[Union[str, TensorType]] = None,
1009
+ return_token_type_ids: Optional[bool] = None,
1010
+ return_attention_mask: Optional[bool] = None,
1011
+ return_overflowing_tokens: bool = False,
1012
+ return_special_tokens_mask: bool = False,
1013
+ return_offsets_mapping: bool = False,
1014
+ return_length: bool = False,
1015
+ verbose: bool = True,
1016
+ prepend_batch_axis: bool = False,
1017
+ **kwargs,
1018
+ ) -> BatchEncoding:
1019
+ """
1020
+ Prepares a sequence of input ids, entity ids and entity spans, or a pair of sequences of input ids, entity ids,
1021
+ entity spans so that it can be used by the model. It adds special tokens, truncates sequences if overflowing
1022
+ while taking into account the special tokens and manages a moving window (with user defined stride) for
1023
+ overflowing tokens. Please note, for *pair_ids* different from `None` and *truncation_strategy = longest_first*
1024
+ or `True`, it is not possible to return overflowing tokens. Such a combination of arguments will raise an
1025
+ error.
1026
+
1027
+ Args:
1028
+ ids (`List[int]`):
1029
+ Tokenized input ids of the first sequence.
1030
+ pair_ids (`List[int]`, *optional*):
1031
+ Tokenized input ids of the second sequence.
1032
+ entity_ids (`List[int]`, *optional*):
1033
+ Entity ids of the first sequence.
1034
+ pair_entity_ids (`List[int]`, *optional*):
1035
+ Entity ids of the second sequence.
1036
+ entity_token_spans (`List[Tuple[int, int]]`, *optional*):
1037
+ Entity spans of the first sequence.
1038
+ pair_entity_token_spans (`List[Tuple[int, int]]`, *optional*):
1039
+ Entity spans of the second sequence.
1040
+ max_entity_length (`int`, *optional*):
1041
+ The maximum length of the entity sequence.
1042
+ """
1043
+
1044
+ # Backward compatibility for 'truncation_strategy', 'pad_to_max_length'
1045
+ padding_strategy, truncation_strategy, max_length, kwargs = self._get_padding_truncation_strategies(
1046
+ padding=padding,
1047
+ truncation=truncation,
1048
+ max_length=max_length,
1049
+ pad_to_multiple_of=pad_to_multiple_of,
1050
+ verbose=verbose,
1051
+ **kwargs,
1052
+ )
1053
+
1054
+ # Compute lengths
1055
+ pair = bool(pair_ids is not None)
1056
+ len_ids = len(ids)
1057
+ len_pair_ids = len(pair_ids) if pair else 0
1058
+
1059
+ if return_token_type_ids and not add_special_tokens:
1060
+ raise ValueError(
1061
+ "Asking to return token_type_ids while setting add_special_tokens to False "
1062
+ "results in an undefined behavior. Please set add_special_tokens to True or "
1063
+ "set return_token_type_ids to None."
1064
+ )
1065
+ if (
1066
+ return_overflowing_tokens
1067
+ and truncation_strategy == TruncationStrategy.LONGEST_FIRST
1068
+ and pair_ids is not None
1069
+ ):
1070
+ raise ValueError(
1071
+ "Not possible to return overflowing tokens for pair of sequences with the "
1072
+ "`longest_first`. Please select another truncation strategy than `longest_first`, "
1073
+ "for instance `only_second` or `only_first`."
1074
+ )
1075
+
1076
+ # Load from model defaults
1077
+ if return_token_type_ids is None:
1078
+ return_token_type_ids = "token_type_ids" in self.model_input_names
1079
+ if return_attention_mask is None:
1080
+ return_attention_mask = "attention_mask" in self.model_input_names
1081
+
1082
+ encoded_inputs = {}
1083
+
1084
+ # Compute the total size of the returned word encodings
1085
+ total_len = len_ids + len_pair_ids + (self.num_special_tokens_to_add(pair=pair) if add_special_tokens else 0)
1086
+
1087
+ # Truncation: Handle max sequence length and max_entity_length
1088
+ overflowing_tokens = []
1089
+ if truncation_strategy != TruncationStrategy.DO_NOT_TRUNCATE and max_length and total_len > max_length:
1090
+ # truncate words up to max_length
1091
+ ids, pair_ids, overflowing_tokens = self.truncate_sequences(
1092
+ ids,
1093
+ pair_ids=pair_ids,
1094
+ num_tokens_to_remove=total_len - max_length,
1095
+ truncation_strategy=truncation_strategy,
1096
+ stride=stride,
1097
+ )
1098
+
1099
+ if return_overflowing_tokens:
1100
+ encoded_inputs["overflowing_tokens"] = overflowing_tokens
1101
+ encoded_inputs["num_truncated_tokens"] = total_len - max_length
1102
+
1103
+ # Add special tokens
1104
+ if add_special_tokens:
1105
+ sequence = self.build_inputs_with_special_tokens(ids, pair_ids)
1106
+ token_type_ids = self.create_token_type_ids_from_sequences(ids, pair_ids)
1107
+ entity_token_offset = 1 # 1 * <s> token
1108
+ pair_entity_token_offset = len(ids) + 3 # 1 * <s> token & 2 * <sep> tokens
1109
+ else:
1110
+ sequence = ids + pair_ids if pair else ids
1111
+ token_type_ids = [0] * len(ids) + ([0] * len(pair_ids) if pair else [])
1112
+ entity_token_offset = 0
1113
+ pair_entity_token_offset = len(ids)
1114
+
1115
+ # Build output dictionary
1116
+ encoded_inputs["input_ids"] = sequence
1117
+ encoded_inputs["position_ids"] = list(range(len(sequence))) ## Added
1118
+ if return_token_type_ids:
1119
+ encoded_inputs["token_type_ids"] = token_type_ids
1120
+ if return_special_tokens_mask:
1121
+ if add_special_tokens:
1122
+ encoded_inputs["special_tokens_mask"] = self.get_special_tokens_mask(ids, pair_ids)
1123
+ else:
1124
+ encoded_inputs["special_tokens_mask"] = [0] * len(sequence)
1125
+
1126
+ # Set max entity length
1127
+ if not max_entity_length:
1128
+ max_entity_length = self.max_entity_length
1129
+
1130
+ if entity_ids is not None:
1131
+ total_entity_len = 0
1132
+ num_invalid_entities = 0
1133
+ valid_entity_ids = [ent_id for ent_id, span in zip(entity_ids, entity_token_spans) if span[1] <= len(ids)]
1134
+ valid_entity_token_spans = [span for span in entity_token_spans if span[1] <= len(ids)]
1135
+
1136
+ total_entity_len += len(valid_entity_ids)
1137
+ num_invalid_entities += len(entity_ids) - len(valid_entity_ids)
1138
+
1139
+ valid_pair_entity_ids, valid_pair_entity_token_spans = None, None
1140
+ if pair_entity_ids is not None:
1141
+ valid_pair_entity_ids = [
1142
+ ent_id
1143
+ for ent_id, span in zip(pair_entity_ids, pair_entity_token_spans)
1144
+ if span[1] <= len(pair_ids)
1145
+ ]
1146
+ valid_pair_entity_token_spans = [span for span in pair_entity_token_spans if span[1] <= len(pair_ids)]
1147
+ total_entity_len += len(valid_pair_entity_ids)
1148
+ num_invalid_entities += len(pair_entity_ids) - len(valid_pair_entity_ids)
1149
+
1150
+ if num_invalid_entities != 0:
1151
+ logger.warning(
1152
+ f"{num_invalid_entities} entities are ignored because their entity spans are invalid due to the"
1153
+ " truncation of input tokens"
1154
+ )
1155
+
1156
+ if truncation_strategy != TruncationStrategy.DO_NOT_TRUNCATE and total_entity_len > max_entity_length:
1157
+ # truncate entities up to max_entity_length
1158
+ valid_entity_ids, valid_pair_entity_ids, overflowing_entities = self.truncate_sequences(
1159
+ valid_entity_ids,
1160
+ pair_ids=valid_pair_entity_ids,
1161
+ num_tokens_to_remove=total_entity_len - max_entity_length,
1162
+ truncation_strategy=truncation_strategy,
1163
+ stride=stride,
1164
+ )
1165
+ valid_entity_token_spans = valid_entity_token_spans[: len(valid_entity_ids)]
1166
+ if valid_pair_entity_token_spans is not None:
1167
+ valid_pair_entity_token_spans = valid_pair_entity_token_spans[: len(valid_pair_entity_ids)]
1168
+
1169
+ if return_overflowing_tokens:
1170
+ encoded_inputs["overflowing_entities"] = overflowing_entities
1171
+ encoded_inputs["num_truncated_entities"] = total_entity_len - max_entity_length
1172
+
1173
+ final_entity_ids = valid_entity_ids + valid_pair_entity_ids if valid_pair_entity_ids else valid_entity_ids
1174
+ encoded_inputs["entity_ids"] = list(final_entity_ids)
1175
+ entity_position_ids = []
1176
+ entity_start_positions = []
1177
+ entity_end_positions = []
1178
+ for token_spans, offset in (
1179
+ (valid_entity_token_spans, entity_token_offset),
1180
+ (valid_pair_entity_token_spans, pair_entity_token_offset),
1181
+ ):
1182
+ if token_spans is not None:
1183
+ for start, end in token_spans:
1184
+ start += offset
1185
+ end += offset
1186
+ position_ids = list(range(start, end))[: self.max_mention_length]
1187
+ position_ids += [-1] * (self.max_mention_length - end + start)
1188
+ entity_position_ids.append(position_ids)
1189
+ entity_start_positions.append(start)
1190
+ entity_end_positions.append(end - 1)
1191
+
1192
+ encoded_inputs["entity_position_ids"] = entity_position_ids
1193
+ if self.task == "entity_span_classification":
1194
+ encoded_inputs["entity_start_positions"] = entity_start_positions
1195
+ encoded_inputs["entity_end_positions"] = entity_end_positions
1196
+
1197
+ if return_token_type_ids:
1198
+ encoded_inputs["entity_token_type_ids"] = [0] * len(encoded_inputs["entity_ids"])
1199
+
1200
+ # Check lengths
1201
+ self._eventual_warn_about_too_long_sequence(encoded_inputs["input_ids"], max_length, verbose)
1202
+
1203
+ # Padding
1204
+ if padding_strategy != PaddingStrategy.DO_NOT_PAD or return_attention_mask:
1205
+ encoded_inputs = self.pad(
1206
+ encoded_inputs,
1207
+ max_length=max_length,
1208
+ max_entity_length=max_entity_length,
1209
+ padding=padding_strategy.value,
1210
+ pad_to_multiple_of=pad_to_multiple_of,
1211
+ padding_side=padding_side,
1212
+ return_attention_mask=return_attention_mask,
1213
+ )
1214
+
1215
+ if return_length:
1216
+ encoded_inputs["length"] = len(encoded_inputs["input_ids"])
1217
+
1218
+ batch_outputs = BatchEncoding(
1219
+ encoded_inputs, tensor_type=return_tensors, prepend_batch_axis=prepend_batch_axis
1220
+ )
1221
+
1222
+ return batch_outputs
1223
+
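The method above is normally driven by the tokenizer's `__call__`/`encode_plus` machinery, but it can also be called directly once you have token ids and an entity annotation. A minimal sketch (not part of this repository) is shown below; `tokenizer` is assumed to be an instance of this class, and the entity lookup and token span are illustrative values assuming the mention "東京" is covered by the first token.

```python
# Minimal sketch; `tokenizer` is assumed to be an instance of this class.
text = "東京は日本の首都です"
ids = tokenizer.convert_tokens_to_ids(tokenizer.tokenize(text))

# Assumption: the mention "東京" is present in the entity vocabulary;
# otherwise fall back to the entity [UNK] id.
entity_id = tokenizer.entity_vocab.get("東京", tokenizer.entity_vocab["[UNK]"])

encoding = tokenizer.prepare_for_model(
    ids,
    entity_ids=[entity_id],
    entity_token_spans=[(0, 1)],  # token-level span, counted before special tokens are added
    add_special_tokens=True,
    padding="max_length",
    max_length=32,
    return_tensors="pt",
)
print(encoding["input_ids"].shape, encoding["entity_ids"])
```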
1224
+ ## Copied from LukeTokenizer
1225
+ def pad(
1226
+ self,
1227
+ encoded_inputs: Union[
1228
+ BatchEncoding,
1229
+ List[BatchEncoding],
1230
+ Dict[str, EncodedInput],
1231
+ Dict[str, List[EncodedInput]],
1232
+ List[Dict[str, EncodedInput]],
1233
+ ],
1234
+ padding: Union[bool, str, PaddingStrategy] = True,
1235
+ max_length: Optional[int] = None,
1236
+ max_entity_length: Optional[int] = None,
1237
+ pad_to_multiple_of: Optional[int] = None,
1238
+ padding_side: Optional[str] = None,
1239
+ return_attention_mask: Optional[bool] = None,
1240
+ return_tensors: Optional[Union[str, TensorType]] = None,
1241
+ verbose: bool = True,
1242
+ ) -> BatchEncoding:
1243
+ """
1244
+ Pad a single encoded input or a batch of encoded inputs up to a predefined length or to the max sequence length
1245
+ in the batch. The padding side (left/right) and the padding token ids are defined at the tokenizer level (with
1246
+ `self.padding_side`, `self.pad_token_id` and `self.pad_token_type_id`). Note that if the `encoded_inputs` passed
1247
+ are dictionaries of numpy arrays, PyTorch tensors or TensorFlow tensors, the result will use the same type unless
1248
+ you provide a different tensor type with `return_tensors`. In the case of PyTorch tensors, you will however lose
1249
+ the specific device of your tensors.
1250
+
1251
+ Args:
1252
+ encoded_inputs ([`BatchEncoding`], list of [`BatchEncoding`], `Dict[str, List[int]]`, `Dict[str, List[List[int]]]` or `List[Dict[str, List[int]]]`):
1253
+ Tokenized inputs. Can represent one input ([`BatchEncoding`] or `Dict[str, List[int]]`) or a batch of
1254
+ tokenized inputs (list of [`BatchEncoding`], *Dict[str, List[List[int]]]* or *List[Dict[str,
1255
+ List[int]]]*) so you can use this method during preprocessing as well as in a PyTorch Dataloader
1256
+ collate function. Instead of `List[int]` you can have tensors (numpy arrays, PyTorch tensors or
1257
+ TensorFlow tensors), see the note above for the return type.
1258
+ padding (`bool`, `str` or [`~utils.PaddingStrategy`], *optional*, defaults to `True`):
1259
+ Select a strategy to pad the returned sequences (according to the model's padding side and padding
1260
+ index) among:
1261
+
1262
+ - `True` or `'longest'`: Pad to the longest sequence in the batch (or no padding if only a single
1263
+ sequence is provided).
1264
+ - `'max_length'`: Pad to a maximum length specified with the argument `max_length` or to the maximum
1265
+ acceptable input length for the model if that argument is not provided.
1266
+ - `False` or `'do_not_pad'`: No padding (i.e., can output a batch with sequences of different
1267
+ lengths).
1268
+ max_length (`int`, *optional*):
1269
+ Maximum length of the returned list and optionally padding length (see above).
1270
+ max_entity_length (`int`, *optional*):
1271
+ The maximum length of the entity sequence.
1272
+ pad_to_multiple_of (`int`, *optional*):
1273
+ If set will pad the sequence to a multiple of the provided value. This is especially useful to enable
1274
+ the use of Tensor Cores on NVIDIA hardware with compute capability `>= 7.5` (Volta).
1275
+ padding_side:
1276
+ The side on which the model should have padding applied. Should be selected between ['right', 'left'].
1277
+ Default value is picked from the class attribute of the same name.
1278
+ return_attention_mask (`bool`, *optional*):
1279
+ Whether to return the attention mask. If left to the default, will return the attention mask according
1280
+ to the specific tokenizer's default, defined by the `return_outputs` attribute. [What are attention
1281
+ masks?](../glossary#attention-mask)
1282
+ return_tensors (`str` or [`~utils.TensorType`], *optional*):
1283
+ If set, will return tensors instead of list of python integers. Acceptable values are:
1284
+
1285
+ - `'tf'`: Return TensorFlow `tf.constant` objects.
1286
+ - `'pt'`: Return PyTorch `torch.Tensor` objects.
1287
+ - `'np'`: Return Numpy `np.ndarray` objects.
1288
+ verbose (`bool`, *optional*, defaults to `True`):
1289
+ Whether or not to print more information and warnings.
1290
+ """
1291
+ # If we have a list of dicts, let's convert it in a dict of lists
1292
+ # We do this to allow using this method as a collate_fn function in PyTorch Dataloader
1293
+ if isinstance(encoded_inputs, (list, tuple)) and isinstance(encoded_inputs[0], Mapping):
1294
+ encoded_inputs = {key: [example[key] for example in encoded_inputs] for key in encoded_inputs[0].keys()}
1295
+
1296
+ # The model's main input name, usually `input_ids`, has to be passed for padding
1297
+ if self.model_input_names[0] not in encoded_inputs:
1298
+ raise ValueError(
1299
+ "You should supply an encoding or a list of encodings to this method "
1300
+ f"that includes {self.model_input_names[0]}, but you provided {list(encoded_inputs.keys())}"
1301
+ )
1302
+
1303
+ required_input = encoded_inputs[self.model_input_names[0]]
1304
+
1305
+ if not required_input:
1306
+ if return_attention_mask:
1307
+ encoded_inputs["attention_mask"] = []
1308
+ return encoded_inputs
1309
+
1310
+ # If we have PyTorch/TF/NumPy tensors/arrays as inputs, we cast them as python objects
1311
+ # and rebuild them afterwards if no return_tensors is specified
1312
+ # Note that we lose the specific device the tensor may be on for PyTorch
1313
+
1314
+ first_element = required_input[0]
1315
+ if isinstance(first_element, (list, tuple)):
1316
+ # first_element might be an empty list/tuple in some edge cases so we grab the first non empty element.
1317
+ index = 0
1318
+ while index < len(required_input) and len(required_input[index]) == 0:
1319
+ index += 1
1320
+ if index < len(required_input):
1321
+ first_element = required_input[index][0]
1322
+ # At this state, if `first_element` is still a list/tuple, it's an empty one so there is nothing to do.
1323
+ if not isinstance(first_element, (int, list, tuple)):
1324
+ if is_tf_tensor(first_element):
1325
+ return_tensors = "tf" if return_tensors is None else return_tensors
1326
+ elif is_torch_tensor(first_element):
1327
+ return_tensors = "pt" if return_tensors is None else return_tensors
1328
+ elif isinstance(first_element, np.ndarray):
1329
+ return_tensors = "np" if return_tensors is None else return_tensors
1330
+ else:
1331
+ raise ValueError(
1332
+ f"type of {first_element} unknown: {type(first_element)}. "
1333
+ "Should be one of a python, numpy, pytorch or tensorflow object."
1334
+ )
1335
+
1336
+ for key, value in encoded_inputs.items():
1337
+ encoded_inputs[key] = to_py_obj(value)
1338
+
1339
+ # Convert padding_strategy in PaddingStrategy
1340
+ padding_strategy, _, max_length, _ = self._get_padding_truncation_strategies(
1341
+ padding=padding, max_length=max_length, verbose=verbose
1342
+ )
1343
+
1344
+ if max_entity_length is None:
1345
+ max_entity_length = self.max_entity_length
1346
+
1347
+ required_input = encoded_inputs[self.model_input_names[0]]
1348
+ if required_input and not isinstance(required_input[0], (list, tuple)):
1349
+ encoded_inputs = self._pad(
1350
+ encoded_inputs,
1351
+ max_length=max_length,
1352
+ max_entity_length=max_entity_length,
1353
+ padding_strategy=padding_strategy,
1354
+ pad_to_multiple_of=pad_to_multiple_of,
1355
+ padding_side=padding_side,
1356
+ return_attention_mask=return_attention_mask,
1357
+ )
1358
+ return BatchEncoding(encoded_inputs, tensor_type=return_tensors)
1359
+
1360
+ batch_size = len(required_input)
1361
+ if any(len(v) != batch_size for v in encoded_inputs.values()):
1362
+ raise ValueError("Some items in the output dictionary have a different batch size than others.")
1363
+
1364
+ if padding_strategy == PaddingStrategy.LONGEST:
1365
+ max_length = max(len(inputs) for inputs in required_input)
1366
+ max_entity_length = (
1367
+ max(len(inputs) for inputs in encoded_inputs["entity_ids"]) if "entity_ids" in encoded_inputs else 0
1368
+ )
1369
+ padding_strategy = PaddingStrategy.MAX_LENGTH
1370
+
1371
+ batch_outputs = {}
1372
+ for i in range(batch_size):
1373
+ inputs = {k: v[i] for k, v in encoded_inputs.items()}
1374
+ outputs = self._pad(
1375
+ inputs,
1376
+ max_length=max_length,
1377
+ max_entity_length=max_entity_length,
1378
+ padding_strategy=padding_strategy,
1379
+ pad_to_multiple_of=pad_to_multiple_of,
1380
+ padding_side=padding_side,
1381
+ return_attention_mask=return_attention_mask,
1382
+ )
1383
+
1384
+ for key, value in outputs.items():
1385
+ if key not in batch_outputs:
1386
+ batch_outputs[key] = []
1387
+ batch_outputs[key].append(value)
1388
+
1389
+ return BatchEncoding(batch_outputs, tensor_type=return_tensors)
1390
+
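Because `pad` accepts a list of feature dicts, it can be used directly as a `collate_fn` for a PyTorch `DataLoader`, as the docstring above notes. A minimal sketch follows, assuming `tokenizer` is an instance of this class and `encoded_dataset` yields un-padded encodings produced with `padding=False`.

```python
from torch.utils.data import DataLoader

def collate_fn(features):
    # Pads input_ids/position_ids and the entity_* fields to the longest example
    # in the batch and returns PyTorch tensors.
    return tokenizer.pad(
        features,
        padding="longest",
        pad_to_multiple_of=8,   # optional: round sequence lengths up to a multiple of 8
        return_tensors="pt",
    )

loader = DataLoader(encoded_dataset, batch_size=16, collate_fn=collate_fn)
batch = next(iter(loader))
print({k: v.shape for k, v in batch.items()})
```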
1391
+ ## Copied from LukeTokenizer with some lines added
1392
+ def _pad(
1393
+ self,
1394
+ encoded_inputs: Union[Dict[str, EncodedInput], BatchEncoding],
1395
+ max_length: Optional[int] = None,
1396
+ max_entity_length: Optional[int] = None,
1397
+ padding_strategy: PaddingStrategy = PaddingStrategy.DO_NOT_PAD,
1398
+ pad_to_multiple_of: Optional[int] = None,
1399
+ padding_side: Optional[str] = None,
1400
+ return_attention_mask: Optional[bool] = None,
1401
+ ) -> dict:
1402
+ """
1403
+ Pad encoded inputs (on left/right and up to predefined length or max length in the batch)
1404
+
1405
+
1406
+ Args:
1407
+ encoded_inputs:
1408
+ Dictionary of tokenized inputs (`List[int]`) or batch of tokenized inputs (`List[List[int]]`).
1409
+ max_length: maximum length of the returned list and optionally padding length (see below).
1410
+ Will truncate by taking into account the special tokens.
1411
+ max_entity_length: The maximum length of the entity sequence.
1412
+ padding_strategy: PaddingStrategy to use for padding.
1413
+
1414
+
1415
+ - PaddingStrategy.LONGEST: Pad to the longest sequence in the batch
1416
+ - PaddingStrategy.MAX_LENGTH: Pad to the max length (default)
1417
+ - PaddingStrategy.DO_NOT_PAD: Do not pad
1418
+ The tokenizer padding sides are defined in self.padding_side:
1419
+
1420
+
1421
+ - 'left': pads on the left of the sequences
1422
+ - 'right': pads on the right of the sequences
1423
+ pad_to_multiple_of: (optional) Integer if set will pad the sequence to a multiple of the provided value.
1424
+ This is especially useful to enable the use of Tensor Core on NVIDIA hardware with compute capability
1425
+ `>= 7.5` (Volta).
1426
+ padding_side:
1427
+ The side on which the model should have padding applied. Should be selected between ['right', 'left'].
1428
+ Default value is picked from the class attribute of the same name.
1429
+ return_attention_mask:
1430
+ (optional) Set to False to avoid returning attention mask (default: set to model specifics)
1431
+ """
1432
+ entities_provided = bool("entity_ids" in encoded_inputs)
1433
+
1434
+ # Load from model defaults
1435
+ if return_attention_mask is None:
1436
+ return_attention_mask = "attention_mask" in self.model_input_names
1437
+
1438
+ if padding_strategy == PaddingStrategy.LONGEST:
1439
+ max_length = len(encoded_inputs["input_ids"])
1440
+ if entities_provided:
1441
+ max_entity_length = len(encoded_inputs["entity_ids"])
1442
+
1443
+ if max_length is not None and pad_to_multiple_of is not None and (max_length % pad_to_multiple_of != 0):
1444
+ max_length = ((max_length // pad_to_multiple_of) + 1) * pad_to_multiple_of
1445
+
1446
+ if (
1447
+ entities_provided
1448
+ and max_entity_length is not None
1449
+ and pad_to_multiple_of is not None
1450
+ and (max_entity_length % pad_to_multiple_of != 0)
1451
+ ):
1452
+ max_entity_length = ((max_entity_length // pad_to_multiple_of) + 1) * pad_to_multiple_of
1453
+
1454
+ needs_to_be_padded = padding_strategy != PaddingStrategy.DO_NOT_PAD and (
1455
+ len(encoded_inputs["input_ids"]) != max_length
1456
+ or (entities_provided and len(encoded_inputs["entity_ids"]) != max_entity_length)
1457
+ )
1458
+
1459
+ # Initialize attention mask if not present.
1460
+ if return_attention_mask and "attention_mask" not in encoded_inputs:
1461
+ encoded_inputs["attention_mask"] = [1] * len(encoded_inputs["input_ids"])
1462
+ if entities_provided and return_attention_mask and "entity_attention_mask" not in encoded_inputs:
1463
+ encoded_inputs["entity_attention_mask"] = [1] * len(encoded_inputs["entity_ids"])
1464
+
1465
+ if needs_to_be_padded:
1466
+ difference = max_length - len(encoded_inputs["input_ids"])
1467
+ padding_side = padding_side if padding_side is not None else self.padding_side
1468
+ if entities_provided:
1469
+ entity_difference = max_entity_length - len(encoded_inputs["entity_ids"])
1470
+ if padding_side == "right":
1471
+ if return_attention_mask:
1472
+ encoded_inputs["attention_mask"] = encoded_inputs["attention_mask"] + [0] * difference
1473
+ if entities_provided:
1474
+ encoded_inputs["entity_attention_mask"] = (
1475
+ encoded_inputs["entity_attention_mask"] + [0] * entity_difference
1476
+ )
1477
+ if "token_type_ids" in encoded_inputs:
1478
+ encoded_inputs["token_type_ids"] = encoded_inputs["token_type_ids"] + [0] * difference
1479
+ if entities_provided:
1480
+ encoded_inputs["entity_token_type_ids"] = (
1481
+ encoded_inputs["entity_token_type_ids"] + [0] * entity_difference
1482
+ )
1483
+ if "special_tokens_mask" in encoded_inputs:
1484
+ encoded_inputs["special_tokens_mask"] = encoded_inputs["special_tokens_mask"] + [1] * difference
1485
+ encoded_inputs["input_ids"] = encoded_inputs["input_ids"] + [self.pad_token_id] * difference
1486
+ encoded_inputs["position_ids"] = encoded_inputs["position_ids"] + [0] * difference ## Added
1487
+ if entities_provided:
1488
+ encoded_inputs["entity_ids"] = (
1489
+ encoded_inputs["entity_ids"] + [self.entity_pad_token_id] * entity_difference
1490
+ )
1491
+ encoded_inputs["entity_position_ids"] = (
1492
+ encoded_inputs["entity_position_ids"] + [[-1] * self.max_mention_length] * entity_difference
1493
+ )
1494
+ if self.task == "entity_span_classification":
1495
+ encoded_inputs["entity_start_positions"] = (
1496
+ encoded_inputs["entity_start_positions"] + [0] * entity_difference
1497
+ )
1498
+ encoded_inputs["entity_end_positions"] = (
1499
+ encoded_inputs["entity_end_positions"] + [0] * entity_difference
1500
+ )
1501
+
1502
+ elif padding_side == "left":
1503
+ if return_attention_mask:
1504
+ encoded_inputs["attention_mask"] = [0] * difference + encoded_inputs["attention_mask"]
1505
+ if entities_provided:
1506
+ encoded_inputs["entity_attention_mask"] = [0] * entity_difference + encoded_inputs[
1507
+ "entity_attention_mask"
1508
+ ]
1509
+ if "token_type_ids" in encoded_inputs:
1510
+ encoded_inputs["token_type_ids"] = [0] * difference + encoded_inputs["token_type_ids"]
1511
+ if entities_provided:
1512
+ encoded_inputs["entity_token_type_ids"] = [0] * entity_difference + encoded_inputs[
1513
+ "entity_token_type_ids"
1514
+ ]
1515
+ if "special_tokens_mask" in encoded_inputs:
1516
+ encoded_inputs["special_tokens_mask"] = [1] * difference + encoded_inputs["special_tokens_mask"]
1517
+ encoded_inputs["input_ids"] = [self.pad_token_id] * difference + encoded_inputs["input_ids"]
1518
+ encoded_inputs["position_ids"] = [0] * difference + encoded_inputs["position_ids"] ## Added
1519
+ if entities_provided:
1520
+ encoded_inputs["entity_ids"] = [self.entity_pad_token_id] * entity_difference + encoded_inputs[
1521
+ "entity_ids"
1522
+ ]
1523
+ encoded_inputs["entity_position_ids"] = [
1524
+ [-1] * self.max_mention_length
1525
+ ] * entity_difference + encoded_inputs["entity_position_ids"]
1526
+ if self.task == "entity_span_classification":
1527
+ encoded_inputs["entity_start_positions"] = [0] * entity_difference + encoded_inputs[
1528
+ "entity_start_positions"
1529
+ ]
1530
+ encoded_inputs["entity_end_positions"] = [0] * entity_difference + encoded_inputs[
1531
+ "entity_end_positions"
1532
+ ]
1533
+ else:
1534
+ raise ValueError("Invalid padding side: " + str(padding_side))
1535
+
1536
+ return encoded_inputs
1537
+
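As a small standalone check of the `pad_to_multiple_of` arithmetic used in `_pad` above: lengths that are not already a multiple are rounded up to the next multiple. The sketch below simply mirrors the in-code formula.

```python
def round_up_to_multiple(length: int, multiple: int) -> int:
    # Same formula as in _pad: only round when length is not already a multiple.
    if length % multiple != 0:
        length = ((length // multiple) + 1) * multiple
    return length

assert round_up_to_multiple(10, 8) == 16
assert round_up_to_multiple(16, 8) == 16
assert round_up_to_multiple(33, 8) == 40
```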
1538
+ def save_vocabulary(self, save_directory: str, filename_prefix: Optional[str] = None) -> Tuple[str]:
1539
+ ## Start of block copied from BertJapaneseTokenizer.save_vocabulary
1540
+ if os.path.isdir(save_directory):
1541
+ if self.subword_tokenizer_type == "sentencepiece":
1542
+ vocab_file = os.path.join(
1543
+ save_directory, (filename_prefix + "-" if filename_prefix else "") + VOCAB_FILES_NAMES["spm_file"]
1544
+ )
1545
+ else:
1546
+ vocab_file = os.path.join(
1547
+ save_directory,
1548
+ (filename_prefix + "-" if filename_prefix else "") + VOCAB_FILES_NAMES["vocab_file"],
1549
+ )
1550
+ else:
1551
+ vocab_file = (filename_prefix + "-" if filename_prefix else "") + save_directory
1552
+
1553
+ if self.subword_tokenizer_type == "sentencepiece":
1554
+ with open(vocab_file, "wb") as writer:
1555
+ content_spiece_model = self.subword_tokenizer.sp_model.serialized_model_proto()
1556
+ writer.write(content_spiece_model)
1557
+ else:
1558
+ with open(vocab_file, "w", encoding="utf-8") as writer:
1559
+ index = 0
1560
+ for token, token_index in sorted(self.vocab.items(), key=lambda kv: kv[1]):
1561
+ if index != token_index:
1562
+ logger.warning(
1563
+ f"Saving vocabulary to {vocab_file}: vocabulary indices are not consecutive."
1564
+ " Please check that the vocabulary is not corrupted!"
1565
+ )
1566
+ index = token_index
1567
+ writer.write(token + "\n")
1568
+ index += 1
1569
+ ## End of block copied from BertJapaneseTokenizer.save_vocabulary
1570
+
1571
+ ## Start of block copied from LukeTokenizer.save_vocabulary
1572
+ entity_vocab_file = os.path.join(
1573
+ save_directory, (filename_prefix + "-" if filename_prefix else "") + VOCAB_FILES_NAMES["entity_vocab_file"]
1574
+ )
1575
+
1576
+ with open(entity_vocab_file, "w", encoding="utf-8") as f:
1577
+ f.write(json.dumps(self.entity_vocab, indent=2, sort_keys=True, ensure_ascii=False) + "\n")
1578
+ ## End of block copied from LukeTokenizer.save_vocabulary
1579
+
1580
+ return vocab_file, entity_vocab_file
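For reference, a minimal sketch of calling `save_vocabulary` directly; in normal use `save_pretrained` invokes it and also writes `tokenizer_config.json` (shown below). The output directory is illustrative, and the file names shown in the comments are the usual defaults rather than values confirmed by this diff.

```python
import os

os.makedirs("./saved_tokenizer", exist_ok=True)  # the directory must exist for the isdir branch above
vocab_file, entity_vocab_file = tokenizer.save_vocabulary("./saved_tokenizer")
print(vocab_file)         # e.g. ./saved_tokenizer/vocab.txt (wordpiece) or a serialized sentencepiece model
print(entity_vocab_file)  # e.g. ./saved_tokenizer/entity_vocab.json
```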
tokenizer_config.json ADDED
@@ -0,0 +1,105 @@
1
+ {
2
+ "added_tokens_decoder": {
3
+ "0": {
4
+ "content": "[PAD]",
5
+ "lstrip": false,
6
+ "normalized": false,
7
+ "rstrip": false,
8
+ "single_word": false,
9
+ "special": true
10
+ },
11
+ "1": {
12
+ "content": "[UNK]",
13
+ "lstrip": false,
14
+ "normalized": false,
15
+ "rstrip": false,
16
+ "single_word": false,
17
+ "special": true
18
+ },
19
+ "2": {
20
+ "content": "[CLS]",
21
+ "lstrip": false,
22
+ "normalized": false,
23
+ "rstrip": false,
24
+ "single_word": false,
25
+ "special": true
26
+ },
27
+ "3": {
28
+ "content": "[SEP]",
29
+ "lstrip": false,
30
+ "normalized": false,
31
+ "rstrip": false,
32
+ "single_word": false,
33
+ "special": true
34
+ },
35
+ "4": {
36
+ "content": "[MASK]",
37
+ "lstrip": false,
38
+ "normalized": false,
39
+ "rstrip": false,
40
+ "single_word": false,
41
+ "special": true
42
+ },
43
+ "32768": {
44
+ "content": "<ent>",
45
+ "lstrip": false,
46
+ "normalized": true,
47
+ "rstrip": false,
48
+ "single_word": false,
49
+ "special": true
50
+ },
51
+ "32769": {
52
+ "content": "<ent2>",
53
+ "lstrip": false,
54
+ "normalized": true,
55
+ "rstrip": false,
56
+ "single_word": false,
57
+ "special": true
58
+ }
59
+ },
60
+ "additional_special_tokens": [
61
+ "<ent>",
62
+ "<ent2>",
63
+ "<ent>",
64
+ "<ent2>",
65
+ "<ent>",
66
+ "<ent2>",
67
+ "<ent>",
68
+ "<ent2>"
69
+ ],
70
+ "auto_map": {
71
+ "AutoTokenizer": [
72
+ "tokenization_luke_bert_japanese.LukeBertJapaneseTokenizer",
73
+ null
74
+ ]
75
+ },
76
+ "clean_up_tokenization_spaces": true,
77
+ "cls_token": "[CLS]",
78
+ "do_lower_case": false,
79
+ "do_subword_tokenize": true,
80
+ "do_word_tokenize": true,
81
+ "entity_mask2_token": "[MASK2]",
82
+ "entity_mask_token": "[MASK]",
83
+ "entity_pad_token": "[PAD]",
84
+ "entity_token_1": "<ent>",
85
+ "entity_token_2": "<ent2>",
86
+ "entity_unk_token": "[UNK]",
87
+ "extra_special_tokens": {},
88
+ "jumanpp_kwargs": null,
89
+ "mask_token": "[MASK]",
90
+ "max_entity_length": 32,
91
+ "max_mention_length": 30,
92
+ "mecab_kwargs": {
93
+ "mecab_dic": "unidic_lite"
94
+ },
95
+ "model_max_length": 512,
96
+ "never_split": null,
97
+ "pad_token": "[PAD]",
98
+ "sep_token": "[SEP]",
99
+ "subword_tokenizer_type": "wordpiece",
100
+ "sudachi_kwargs": null,
101
+ "task": null,
102
+ "tokenizer_class": "LukeBertJapaneseTokenizer",
103
+ "unk_token": "[UNK]",
104
+ "word_tokenizer_type": "mecab"
105
+ }
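Since `auto_map` maps `AutoTokenizer` to the custom `LukeBertJapaneseTokenizer` class in `tokenization_luke_bert_japanese.py`, loading the tokenizer requires `trust_remote_code=True`. A minimal sketch follows; the repo id is a placeholder, and the `entity_spans` argument is an assumption based on the LukeTokenizer calling convention.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("<repo-id>", trust_remote_code=True)

enc = tokenizer(
    "東京タワーは日本にあります",
    entity_spans=[(0, 5)],  # character span of the mention "東京タワー" (assumed LukeTokenizer-style API)
    return_tensors="pt",
)
print(enc.keys())  # expected to include input_ids, position_ids, entity_ids, entity_position_ids, ...
```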
vocab.txt ADDED
The diff for this file is too large to render. See raw diff