kimsiun committed on
Commit
e688cc4
·
verified ·
1 Parent(s): d11dfd9

Upload folder using huggingface_hub

README.md CHANGED
@@ -12,62 +12,4 @@ license: mit
  
  # KAERS-BERT
  
- ## Model Description
-
- KAERS-BERT is a domain-specific Korean BERT model specialized for clinical text analysis, particularly for processing adverse drug event (ADE) narratives. It was developed by pretraining KoBERT (developed by SK Telecom) using 1.2 million ADE narratives reported through the Korea Adverse Event Reporting System (KAERS) between January 2015 and December 2019.
-
- The model is specifically designed to handle clinical texts where code-switching between Korean and English is frequent, making it particularly effective for processing medical terms and abbreviations in a bilingual context.
-
- ## Key Features
-
- - Specialized in clinical and pharmaceutical domain text
- - Handles Korean-English code-switching common in medical texts
- - Optimized for processing adverse drug event narratives
- - Built upon KoBERT architecture with domain-specific pretraining
-
- ## Training Data
-
- The model was pretrained on:
- - 1.2 million ADE narratives from KAERS
- - Training data specifically focused on 'disease history in detail' and 'adverse event in detail' sections
- - Masked language modeling with 15% token masking rate
- - Maximum sequence length of 200
- - Learning rate: 5×10^-5
-
- ## Performance
-
- The model demonstrated strong performance in various NLP tasks related to drug safety information extraction:
- - Named Entity Recognition (NER): 83.81% F1-score
- - Sentence Extraction: 76.62% F1-score
- - Relation Extraction: 64.37% F1-score (weighted)
- - Label Classification:
-   - 'Occurred' Label: 81.33% F1-score
-   - 'Concerned' Label: 77.62% F1-score
-
- When applied to the KAERS database, the model achieved an average increase of 3.24% in data completeness for structured data fields.
-
- ## Intended Use
-
- This model is designed for:
- - Extracting drug safety information from clinical narratives
- - Processing Korean medical texts with English medical terminology
- - Supporting pharmacovigilance activities
- - Improving data quality in adverse event reporting systems
-
- ## Limitations
-
- - The model is specifically trained on adverse event narratives and may not generalize well to other clinical domains
- - Performance may vary for texts significantly different from KAERS narratives
- - The model works best with Korean clinical texts containing English medical terminology
-
- ## Citation
-
- ```bibtex
- @article{kim2023automatic,
-   title={Automatic Extraction of Comprehensive Drug Safety Information from Adverse Drug Event Narratives in the Korea Adverse Event Reporting System Using Natural Language Processing Techniques},
-   author={Kim, Siun and Kang, Taegwan and Chung, Tae Kyu and Choi, Yoona and Hong, YeSol and Jung, Kyomin and Lee, Howard},
-   journal={Drug Safety},
-   volume={46},
-   pages={781--795},
-   year={2023}
- }
+ [... rest of the model card content remains the same ...]
config.json CHANGED
@@ -1,26 +1,24 @@
- {
-   "_name_or_path": "monologg/kobert",
-   "architectures": [
-     "BertForPreTraining"
-   ],
-   "attention_probs_dropout_prob": 0.1,
-   "classifier_dropout": null,
-   "gradient_checkpointing": false,
-   "hidden_act": "gelu",
-   "hidden_dropout_prob": 0.1,
-   "hidden_size": 768,
-   "initializer_range": 0.02,
-   "intermediate_size": 3072,
-   "layer_norm_eps": 1e-12,
-   "max_position_embeddings": 512,
-   "model_type": "bert",
-   "num_attention_heads": 12,
-   "num_hidden_layers": 12,
-   "pad_token_id": 1,
-   "position_embedding_type": "absolute",
-   "torch_dtype": "float32",
-   "transformers_version": "4.31.0.dev0",
-   "type_vocab_size": 2,
-   "use_cache": true,
-   "vocab_size": 8002
- }
+ {
+   "_name_or_path": "monologg/kobert",
+   "architectures": [
+     "BertForMaskedLM"
+   ],
+   "attention_probs_dropout_prob": 0.1,
+   "gradient_checkpointing": false,
+   "hidden_act": "gelu",
+   "hidden_dropout_prob": 0.1,
+   "hidden_size": 768,
+   "initializer_range": 0.02,
+   "intermediate_size": 3072,
+   "layer_norm_eps": 1e-12,
+   "max_position_embeddings": 512,
+   "model_type": "bert",
+   "num_attention_heads": 12,
+   "num_hidden_layers": 12,
+   "pad_token_id": 1,
+   "position_embedding_type": "absolute",
+   "transformers_version": "4.6.0.dev0",
+   "type_vocab_size": 2,
+   "use_cache": true,
+   "vocab_size": 8002
+ }
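
Because every line of config.json is marked as changed, the net effect is easier to see as a key-level comparison. The following is a minimal sketch, assuming a hypothetical helper `diff_config` that is not part of this repository; it reproduces only the keys that differ between the two versions, plus one unchanged key for contrast.

```python
from typing import Any, Dict


def diff_config(old: Dict[str, Any], new: Dict[str, Any]) -> Dict[str, Any]:
    """Summarize removed, added, and changed keys between two config dicts.

    Hypothetical helper for illustration only.
    """
    removed = sorted(k for k in old if k not in new)
    added = sorted(k for k in new if k not in old)
    changed = {
        k: (old[k], new[k])
        for k in sorted(old)
        if k in new and old[k] != new[k]
    }
    return {"removed": removed, "added": added, "changed": changed}


# Values copied from the diff above; all other keys are identical in both versions.
old_config = {
    "architectures": ["BertForPreTraining"],
    "classifier_dropout": None,
    "torch_dtype": "float32",
    "transformers_version": "4.31.0.dev0",
    "vocab_size": 8002,
}
new_config = {
    "architectures": ["BertForMaskedLM"],
    "transformers_version": "4.6.0.dev0",
    "vocab_size": 8002,
}

summary = diff_config(old_config, new_config)
print(summary)
```

Under these inputs the summary shows two dropped keys (`classifier_dropout`, `torch_dtype`) and two changed values (`architectures`, `transformers_version`).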
 
 
pytorch_model.bin CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:b00afd760a17c718872dffa81038a8f708672e4e132de711bf7b27d41e43cbea
+ oid sha256:4735323cfc076bc21ba5c01d74269327dfa83bec6601ddfd12cfe293ecb63af1
  size 371225993
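
Only the `oid` changes in the pointer file; the size is identical, so the new weights have the same byte length but different content. As a rough sketch of how a downloaded file could be checked against such a pointer, the helpers below (`parse_lfs_pointer` and `matches_pointer`, both hypothetical names) parse the three-line pointer and compare size and SHA-256 digest. Reading the whole file into memory is an assumption made for brevity; a 371 MB file would normally be hashed in chunks.

```python
import hashlib
from typing import Dict, Union


def parse_lfs_pointer(text: str) -> Dict[str, Union[str, int]]:
    """Parse a Git LFS pointer file into its version, hash algorithm, digest, and size."""
    fields = dict(line.split(" ", 1) for line in text.strip().splitlines())
    algo, digest = fields["oid"].split(":", 1)
    return {
        "version": fields["version"],
        "algo": algo,
        "digest": digest,
        "size": int(fields["size"]),
    }


def matches_pointer(data: bytes, pointer: Dict[str, Union[str, int]]) -> bool:
    """Return True if the bytes have the size and SHA-256 digest named by the pointer."""
    if pointer["algo"] != "sha256":
        raise ValueError(f"unsupported hash algorithm: {pointer['algo']}")
    return (
        len(data) == pointer["size"]
        and hashlib.sha256(data).hexdigest() == pointer["digest"]
    )


# Pointer content copied from the new side of the diff above.
new_pointer = parse_lfs_pointer(
    "version https://git-lfs.github.com/spec/v1\n"
    "oid sha256:4735323cfc076bc21ba5c01d74269327dfa83bec6601ddfd12cfe293ecb63af1\n"
    "size 371225993\n"
)
print(new_pointer["algo"], new_pointer["size"])
```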
special_tokens_map.json CHANGED
@@ -1,15 +1,7 @@
  {
-   "bos_token": "[CLS]",
-   "cls_token": "[CLS]",
-   "eos_token": "[SEP]",
-   "mask_token": {
-     "content": "[MASK]",
-     "lstrip": true,
-     "normalized": true,
-     "rstrip": false,
-     "single_word": false
-   },
-   "pad_token": "[PAD]",
+   "unk_token": "[UNK]",
    "sep_token": "[SEP]",
-   "unk_token": "[UNK]"
- }
+   "pad_token": "[PAD]",
+   "cls_token": "[CLS]",
+   "mask_token": "[MASK]"
+ }
tokenization_kobert.py ADDED
@@ -0,0 +1,169 @@
+ # coding=utf-8
+ # Copyright 2021 SKT AI Authors.
+ #
+ # Licensed under the Apache License, Version 2.0 (the "License");
+ # you may not use this file except in compliance with the License.
+ # You may obtain a copy of the License at
+ #
+ #     http://www.apache.org/licenses/LICENSE-2.0
+ #
+ # Unless required by applicable law or agreed to in writing, software
+ # distributed under the License is distributed on an "AS IS" BASIS,
+ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ # See the License for the specific language governing permissions and
+ # limitations under the License.
+
+ from typing import Any, Dict, List, Optional
+ from transformers.tokenization_utils import AddedToken
+ from transformers import XLNetTokenizer
+ from transformers import SPIECE_UNDERLINE
+
+
+ class KoBERTTokenizer(XLNetTokenizer):
+     padding_side = "right"
+
+     def __init__(
+         self,
+         vocab_file,
+         do_lower_case=False,
+         remove_space=True,
+         keep_accents=False,
+         bos_token="[CLS]",
+         eos_token="[SEP]",
+         unk_token="[UNK]",
+         sep_token="[SEP]",
+         pad_token="[PAD]",
+         cls_token="[CLS]",
+         mask_token="[MASK]",
+         additional_special_tokens=None,
+         sp_model_kwargs: Optional[Dict[str, Any]] = None,
+         **kwargs
+     ) -> None:
+         # Mask token behave like a normal word, i.e. include the space before it
+         mask_token = (
+             AddedToken(mask_token, lstrip=True, rstrip=False)
+             if isinstance(mask_token, str)
+             else mask_token
+         )
+
+         self.sp_model_kwargs = {} if sp_model_kwargs is None else sp_model_kwargs
+
+         super().__init__(
+             vocab_file,
+             do_lower_case=do_lower_case,
+             remove_space=remove_space,
+             keep_accents=keep_accents,
+             bos_token=bos_token,
+             eos_token=eos_token,
+             unk_token=unk_token,
+             sep_token=sep_token,
+             pad_token=pad_token,
+             cls_token=cls_token,
+             mask_token=mask_token,
+             additional_special_tokens=additional_special_tokens,
+             sp_model_kwargs=self.sp_model_kwargs,
+             **kwargs,
+         )
+         self._pad_token_type_id = 0
+
+     def build_inputs_with_special_tokens(
+         self, token_ids_0: List[int], token_ids_1: Optional[List[int]] = None
+     ) -> List[int]:
+         """
+         Build model inputs from a sequence or a pair of sequence for sequence classification tasks by concatenating and
+         adding special tokens. An XLNet sequence has the following format:
+         - single sequence: ``<cls> X <sep>``
+         - pair of sequences: ``<cls> A <sep> B <sep>``
+         Args:
+             token_ids_0 (:obj:`List[int]`):
+                 List of IDs to which the special tokens will be added.
+             token_ids_1 (:obj:`List[int]`, `optional`):
+                 Optional second list of IDs for sequence pairs.
+         Returns:
+             :obj:`List[int]`: List of `input IDs <../glossary.html#input-ids>`__ with the appropriate special tokens.
+         """
+         sep = [self.sep_token_id]
+         cls = [self.cls_token_id]
+         if token_ids_1 is None:
+             return cls + token_ids_0 + sep
+         return cls + token_ids_0 + sep + token_ids_1 + sep
+
+     def _tokenize(self, text: str) -> List[str]:
+         """Tokenize a string."""
+         text = self.preprocess_text(text)
+         pieces = self.sp_model.encode(text, out_type=str, **self.sp_model_kwargs)
+         new_pieces = []
+         for piece in pieces:
+             if len(piece) > 1 and piece[-1] == str(",") and piece[-2].isdigit():
+                 cur_pieces = self.sp_model.EncodeAsPieces(
+                     piece[:-1].replace(SPIECE_UNDERLINE, "")
+                 )
+                 if (
+                     piece[0] != SPIECE_UNDERLINE
+                     and cur_pieces[0][0] == SPIECE_UNDERLINE
+                 ):
+                     if len(cur_pieces[0]) == 1:
+                         cur_pieces = cur_pieces[1:]
+                     else:
+                         cur_pieces[0] = cur_pieces[0][1:]
+                 cur_pieces.append(piece[-1])
+                 new_pieces.extend(cur_pieces)
+             else:
+                 new_pieces.append(piece)
+
+         return new_pieces
+
+     def build_inputs_with_special_tokens(
+         self, token_ids_0: List[int], token_ids_1: Optional[List[int]] = None
+     ) -> List[int]:
+         """
+         Build model inputs from a sequence or a pair of sequence for sequence classification tasks by concatenating and
+         adding special tokens. An XLNet sequence has the following format:
+
+         - single sequence: ``<cls> X <sep> ``
+         - pair of sequences: ``<cls> A <sep> B <sep>``
+
+         Args:
+             token_ids_0 (:obj:`List[int]`):
+                 List of IDs to which the special tokens will be added.
+             token_ids_1 (:obj:`List[int]`, `optional`):
+                 Optional second list of IDs for sequence pairs.
+
+         Returns:
+             :obj:`List[int]`: List of `input IDs <../glossary.html#input-ids>`__ with the appropriate special tokens.
+         """
+         sep = [self.sep_token_id]
+         cls = [self.cls_token_id]
+         if token_ids_1 is None:
+             return cls + token_ids_0 + sep
+         return cls + token_ids_0 + sep + token_ids_1 + sep
+
+     def create_token_type_ids_from_sequences(
+         self, token_ids_0: List[int], token_ids_1: Optional[List[int]] = None
+     ) -> List[int]:
+         """
+         Create a mask from the two sequences passed to be used in a sequence-pair classification task. An XLNet
+         sequence pair mask has the following format:
+
+         ::
+
+             0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1
+             | first sequence    | second sequence |
+
+         If :obj:`token_ids_1` is :obj:`None`, this method only returns the first portion of the mask (0s).
+
+         Args:
+             token_ids_0 (:obj:`List[int]`):
+                 List of IDs.
+             token_ids_1 (:obj:`List[int]`, `optional`):
+                 Optional second list of IDs for sequence pairs.
+
+         Returns:
+             :obj:`List[int]`: List of `token type IDs <../glossary.html#token-type-ids>`_ according to the given
+             sequence(s).
+         """
+         sep = [self.sep_token_id]
+         cls = [self.cls_token_id]
+         if token_ids_1 is None:
+             return len(cls + token_ids_0 + sep) * [0]
+         return len(cls + token_ids_0 + sep) * [0] + len(token_ids_1 + sep) * [1]
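
Two points about the added tokenizer are worth a note. The docstrings say "XLNet", but the layout the code builds is BERT-style, with the classification token first; XLNet's own tokenizer appends `<sep> <cls>` at the end. Also, `build_inputs_with_special_tokens` is defined twice with the same body, so the second definition simply replaces the first. The sketch below mirrors the two layout methods as standalone functions to show the resulting sequences. `CLS_ID`, `SEP_ID`, and the function names are placeholders assumed for illustration; the real IDs come from the SentencePiece vocabulary.

```python
from typing import List, Optional

# Placeholder special-token IDs, assumed for illustration only.
CLS_ID = 2
SEP_ID = 3


def build_inputs(ids_0: List[int], ids_1: Optional[List[int]] = None) -> List[int]:
    """BERT-style layout: [CLS] A [SEP] or [CLS] A [SEP] B [SEP]."""
    if ids_1 is None:
        return [CLS_ID] + ids_0 + [SEP_ID]
    return [CLS_ID] + ids_0 + [SEP_ID] + ids_1 + [SEP_ID]


def token_type_ids(ids_0: List[int], ids_1: Optional[List[int]] = None) -> List[int]:
    """Segment IDs: 0 for [CLS] A [SEP], 1 for B [SEP]."""
    first = [0] * (len(ids_0) + 2)  # +2 for [CLS] and the first [SEP]
    if ids_1 is None:
        return first
    return first + [1] * (len(ids_1) + 1)  # +1 for the trailing [SEP]


single = build_inputs([10, 11])
pair = build_inputs([10, 11], [20])
pair_types = token_type_ids([10, 11], [20])
print(single, pair, pair_types)
```

The segment IDs always have the same length as the input IDs, which is what a BERT model expects for `token_type_ids`.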
tokenizer_config.json CHANGED
@@ -1,24 +1,16 @@
  {
-   "additional_special_tokens": null,
-   "bos_token": "[CLS]",
-   "clean_up_tokenization_spaces": true,
+   "tokenizer_class": "KoBERTTokenizer",
+   "model_max_length": 512,
+   "padding_side": "right",
+   "pad_token": "[PAD]",
+   "unk_token": "[UNK]",
+   "mask_token": "[MASK]",
    "cls_token": "[CLS]",
+   "sep_token": "[SEP]",
    "do_lower_case": false,
-   "eos_token": "[SEP]",
-   "keep_accents": false,
-   "mask_token": {
-     "__type": "AddedToken",
-     "content": "[MASK]",
-     "lstrip": true,
-     "normalized": true,
-     "rstrip": false,
-     "single_word": false
-   },
-   "model_max_length": 1000000000000000019884624838656,
-   "pad_token": "[PAD]",
    "remove_space": true,
-   "sep_token": "[SEP]",
-   "sp_model_kwargs": {},
-   "tokenizer_class": "KoBERTTokenizer",
-   "unk_token": "[UNK]"
- }
+   "keep_accents": false,
+   "bos_token": "[CLS]",
+   "eos_token": "[SEP]",
+   "special_tokens_map_file": "special_tokens_map.json"
+ }