emanuelaboros committed
Commit ffbff05 · 1 Parent(s): 075c00d
README.md ADDED
@@ -0,0 +1,66 @@
1
+ ---
2
+ library_name: transformers
3
+ language:
4
+ - en
5
+ - fr
6
+ - de
7
+ tags:
8
+ - v1.0.0
9
+ ---
10
+
11
+ The **Impresso NER model** is based on the stacked Transformer architecture published at [CoNLL 2020](https://aclanthology.org/2020.conll-1.35/) and was trained on the Impresso HIPE-2020 portion of the [HIPE-2022 dataset](https://github.com/hipe-eval/HIPE-2022-data). It recognizes entity types such as person, location, and organization, and supports the complete [HIPE typology](https://github.com/hipe-eval/HIPE-2022-data/blob/main/documentation/README-hipe2020.md), including coarse and fine-grained entity types as well as components such as names, titles, and roles. The model's backbone ([dbmdz/bert-medium-historic-multilingual-cased](https://huggingface.co/dbmdz/bert-medium-historic-multilingual-cased)) was trained on historical collections from Europeana and the British Library covering German, French, English, Finnish, and Swedish, which gives it broader language coverage. Thanks to this multilingual backbone, the model may also recognize entities in languages beyond French and German.
12
+
13
+ #### How to use
14
+
15
+ You can use this model with the Transformers *pipeline* for NER.
16
+
18
+ ```python
19
+ # Import necessary Python modules from the Transformers library
20
+ from transformers import AutoModelForTokenClassification, AutoTokenizer
21
+ from transformers import pipeline
22
+
23
+ # Define the model name for token classification; we use the Impresso NER model,
24
+ # available at https://huggingface.co/impresso-project/ner-stacked-bert-multilingual
25
+ MODEL_NAME = "impresso-project/ner-stacked-bert-multilingual"
26
+
27
+ # Load the tokenizer corresponding to the specified model name
28
+ ner_tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
29
+
30
+ ner_pipeline = pipeline("generic-ner", model=MODEL_NAME,
31
+ tokenizer=ner_tokenizer,
32
+ trust_remote_code=True,
33
+ device='cpu')
34
+
35
+ sentence = "En l'an 1348, au plus fort des ravages de la peste noire à travers l'Europe, le Royaume de France se trouvait à la fois au bord du désespoir et face à une opportunité. À la cour du roi Philippe VI, les murs du Louvre étaient animés par les rapports sombres venus de Paris et des villes environnantes. La peste ne montrait aucun signe de répit, et le chancelier Guillaume de Nogaret, le conseiller le plus fidèle du roi, portait le lourd fardeau de gérer la survie du royaume."
36
+
37
+ entities = ner_pipeline(sentence)
38
+ print(entities)
39
+ ```
40
+
41
+ ```
42
+ [
43
+ {'type': 'time', 'confidence_ner': 85.0, 'surface': 'an 1348', 'lOffset': 0, 'rOffset': 12},
44
+ {'type': 'loc', 'confidence_ner': 90.75, 'surface': 'Europe', 'lOffset': 69, 'rOffset': 75},
45
+ {'type': 'loc', 'confidence_ner': 75.45, 'surface': 'Royaume de France', 'lOffset': 80, 'rOffset': 97},
46
+ {'type': 'pers', 'confidence_ner': 85.27, 'surface': 'roi Philippe VI', 'lOffset': 181, 'rOffset': 196, 'title': 'roi', 'name': 'roi Philippe VI'},
47
+ {'type': 'loc', 'confidence_ner': 30.59, 'surface': 'Louvre', 'lOffset': 210, 'rOffset': 216},
48
+ {'type': 'loc', 'confidence_ner': 94.46, 'surface': 'Paris', 'lOffset': 266, 'rOffset': 271},
49
+ {'type': 'pers', 'confidence_ner': 96.1, 'surface': 'chancelier Guillaume de Nogaret', 'lOffset': 350, 'rOffset': 381, 'title': 'chancelier', 'name': 'chancelier Guillaume de Nogaret'},
50
+ {'type': 'loc', 'confidence_ner': 49.35, 'surface': 'Royaume', 'lOffset': 80, 'rOffset': 87},
51
+ {'type': 'loc', 'confidence_ner': 24.18, 'surface': 'France', 'lOffset': 91, 'rOffset': 97}
52
+ ]
53
+ ```
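+ Each prediction is a plain dictionary, so the output can be post-processed with standard Python. Below is a minimal sketch (assuming the list-of-dicts format shown above, with `entities` produced by the snippet higher up) that keeps only higher-confidence predictions and groups their surface forms by coarse type:
+
+ ```python
+ from collections import defaultdict
+
+ # Group surface forms by entity type, dropping low-confidence predictions.
+ def group_entities(entities, min_confidence=50.0):
+     grouped = defaultdict(list)
+     for entity in entities:
+         if entity["confidence_ner"] >= min_confidence:
+             grouped[entity["type"]].append(entity["surface"])
+     return dict(grouped)
+
+ print(group_entities(entities))
+ # e.g. {'time': ['an 1348'], 'loc': ['Europe', 'Royaume de France', 'Paris'], 'pers': [...]}
+ ```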
54
+
55
+
56
+ #### BibTeX entry and citation info
57
+
58
+ ```bibtex
59
+ @inproceedings{boros2020alleviating,
60
+ title={Alleviating digitization errors in named entity recognition for historical documents},
61
+ author={Boros, Emanuela and Hamdi, Ahmed and Pontes, Elvys Linhares and Cabrera-Diego, Luis-Adri{\'a}n and Moreno, Jose G and Sidere, Nicolas and Doucet, Antoine},
62
+ booktitle={Proceedings of the 24th Conference on Computational Natural Language Learning},
63
+ pages={431--441},
64
+ year={2020}
65
+ }
66
+ ```
__init__.py ADDED
File without changes
config.json ADDED
@@ -0,0 +1,233 @@
1
+ {
2
+ "_name_or_path": "experiments_final/model_dbmdz_bert_medium_historic_multilingual_cased_max_sequence_length_512_epochs_5_run_extended_suffix_baseline/checkpoint-450",
3
+ "architectures": [
4
+ "ExtendedMultitaskModelForTokenClassification"
5
+ ],
6
+ "attention_probs_dropout_prob": 0.1,
7
+ "auto_map": {
8
+ "AutoConfig": "configuration_stacked.ImpressoConfig",
9
+ "AutoModelForTokenClassification": "modeling_stacked.ExtendedMultitaskModelForTokenClassification"
10
+ },
11
+ "classifier_dropout": null,
12
+ "custom_pipelines": {
13
+ "generic-ner": {
14
+ "impl": "generic_ner.MultitaskTokenClassificationPipeline",
15
+ "pt": "AutoModelForTokenClassification"
16
+ }
17
+ },
18
+ "hidden_act": "gelu",
19
+ "hidden_dropout_prob": 0.1,
20
+ "hidden_size": 512,
21
+ "initializer_range": 0.02,
22
+ "intermediate_size": 2048,
23
+ "label_map": {
24
+ "NE-COARSE-LIT": {
25
+ "B-loc": 8,
26
+ "B-org": 0,
27
+ "B-pers": 7,
28
+ "B-prod": 4,
29
+ "B-time": 5,
30
+ "I-loc": 1,
31
+ "I-org": 2,
32
+ "I-pers": 9,
33
+ "I-prod": 10,
34
+ "I-time": 6,
35
+ "O": 3
36
+ },
37
+ "NE-COARSE-METO": {
38
+ "B-loc": 3,
39
+ "B-org": 0,
40
+ "B-time": 5,
41
+ "I-loc": 4,
42
+ "I-org": 2,
43
+ "O": 1
44
+ },
45
+ "NE-FINE-COMP": {
46
+ "B-comp.demonym": 8,
47
+ "B-comp.function": 5,
48
+ "B-comp.name": 1,
49
+ "B-comp.qualifier": 9,
50
+ "B-comp.title": 2,
51
+ "I-comp.demonym": 7,
52
+ "I-comp.function": 3,
53
+ "I-comp.name": 0,
54
+ "I-comp.qualifier": 10,
55
+ "I-comp.title": 4,
56
+ "O": 6
57
+ },
58
+ "NE-FINE-LIT": {
59
+ "B-loc.add.elec": 32,
60
+ "B-loc.add.phys": 5,
61
+ "B-loc.adm.nat": 34,
62
+ "B-loc.adm.reg": 39,
63
+ "B-loc.adm.sup": 12,
64
+ "B-loc.adm.town": 33,
65
+ "B-loc.fac": 36,
66
+ "B-loc.oro": 19,
67
+ "B-loc.phys.geo": 13,
68
+ "B-loc.phys.hydro": 28,
69
+ "B-loc.unk": 4,
70
+ "B-org.adm": 3,
71
+ "B-org.ent": 24,
72
+ "B-org.ent.pressagency": 37,
73
+ "B-pers.coll": 9,
74
+ "B-pers.ind": 0,
75
+ "B-pers.ind.articleauthor": 20,
76
+ "B-prod.doctr": 2,
77
+ "B-prod.media": 10,
78
+ "B-time.date.abs": 23,
79
+ "I-loc.add.elec": 22,
80
+ "I-loc.add.phys": 6,
81
+ "I-loc.adm.nat": 11,
82
+ "I-loc.adm.reg": 35,
83
+ "I-loc.adm.sup": 15,
84
+ "I-loc.adm.town": 8,
85
+ "I-loc.fac": 27,
86
+ "I-loc.oro": 21,
87
+ "I-loc.phys.geo": 25,
88
+ "I-loc.phys.hydro": 17,
89
+ "I-loc.unk": 40,
90
+ "I-org.adm": 29,
91
+ "I-org.ent": 1,
92
+ "I-org.ent.pressagency": 14,
93
+ "I-pers.coll": 26,
94
+ "I-pers.ind": 16,
95
+ "I-pers.ind.articleauthor": 31,
96
+ "I-prod.doctr": 30,
97
+ "I-prod.media": 38,
98
+ "I-time.date.abs": 7,
99
+ "O": 18
100
+ },
101
+ "NE-FINE-METO": {
102
+ "B-loc.adm.town": 6,
103
+ "B-loc.fac": 3,
104
+ "B-loc.oro": 5,
105
+ "B-org.adm": 1,
106
+ "B-org.ent": 7,
107
+ "B-time.date.abs": 9,
108
+ "I-loc.fac": 8,
109
+ "I-org.adm": 2,
110
+ "I-org.ent": 0,
111
+ "O": 4
112
+ },
113
+ "NE-NESTED": {
114
+ "B-loc.adm.nat": 13,
115
+ "B-loc.adm.reg": 15,
116
+ "B-loc.adm.sup": 10,
117
+ "B-loc.adm.town": 9,
118
+ "B-loc.fac": 18,
119
+ "B-loc.oro": 17,
120
+ "B-loc.phys.geo": 11,
121
+ "B-loc.phys.hydro": 1,
122
+ "B-org.adm": 4,
123
+ "B-org.ent": 20,
124
+ "B-pers.coll": 7,
125
+ "B-pers.ind": 2,
126
+ "B-prod.media": 23,
127
+ "I-loc.adm.nat": 8,
128
+ "I-loc.adm.reg": 14,
129
+ "I-loc.adm.town": 6,
130
+ "I-loc.fac": 0,
131
+ "I-loc.oro": 19,
132
+ "I-loc.phys.geo": 21,
133
+ "I-loc.phys.hydro": 22,
134
+ "I-org.adm": 5,
135
+ "I-org.ent": 3,
136
+ "I-pers.ind": 12,
137
+ "I-prod.media": 24,
138
+ "O": 16
139
+ }
140
+ },
141
+ "layer_norm_eps": 1e-12,
142
+ "max_position_embeddings": 512,
143
+ "model_type": "stacked_bert",
144
+ "num_attention_heads": 8,
145
+ "num_hidden_layers": 8,
146
+ "pad_token_id": 0,
147
+ "position_embedding_type": "absolute",
148
+ "pretrained_config": {
149
+ "_name_or_path": "dbmdz/bert-medium-historic-multilingual-cased",
150
+ "add_cross_attention": false,
151
+ "architectures": [
152
+ "BertForMaskedLM"
153
+ ],
154
+ "attention_probs_dropout_prob": 0.1,
155
+ "bad_words_ids": null,
156
+ "begin_suppress_tokens": null,
157
+ "bos_token_id": null,
158
+ "chunk_size_feed_forward": 0,
159
+ "classifier_dropout": null,
160
+ "cross_attention_hidden_size": null,
161
+ "decoder_start_token_id": null,
162
+ "diversity_penalty": 0.0,
163
+ "do_sample": false,
164
+ "early_stopping": false,
165
+ "encoder_no_repeat_ngram_size": 0,
166
+ "eos_token_id": null,
167
+ "exponential_decay_length_penalty": null,
168
+ "finetuning_task": null,
169
+ "forced_bos_token_id": null,
170
+ "forced_eos_token_id": null,
171
+ "hidden_act": "gelu",
172
+ "hidden_dropout_prob": 0.1,
173
+ "hidden_size": 512,
174
+ "id2label": {
175
+ "0": "LABEL_0",
176
+ "1": "LABEL_1"
177
+ },
178
+ "initializer_range": 0.02,
179
+ "intermediate_size": 2048,
180
+ "is_decoder": false,
181
+ "is_encoder_decoder": false,
182
+ "label2id": {
183
+ "LABEL_0": 0,
184
+ "LABEL_1": 1
185
+ },
186
+ "layer_norm_eps": 1e-12,
187
+ "length_penalty": 1.0,
188
+ "max_length": 20,
189
+ "max_position_embeddings": 512,
190
+ "min_length": 0,
191
+ "model_type": "bert",
192
+ "no_repeat_ngram_size": 0,
193
+ "num_attention_heads": 8,
194
+ "num_beam_groups": 1,
195
+ "num_beams": 1,
196
+ "num_hidden_layers": 8,
197
+ "num_return_sequences": 1,
198
+ "output_attentions": false,
199
+ "output_hidden_states": false,
200
+ "output_scores": false,
201
+ "pad_token_id": 0,
202
+ "position_embedding_type": "absolute",
203
+ "prefix": null,
204
+ "problem_type": null,
205
+ "pruned_heads": {},
206
+ "remove_invalid_values": false,
207
+ "repetition_penalty": 1.0,
208
+ "return_dict": true,
209
+ "return_dict_in_generate": false,
210
+ "sep_token_id": null,
211
+ "suppress_tokens": null,
212
+ "task_specific_params": null,
213
+ "temperature": 1.0,
214
+ "tf_legacy_loss": false,
215
+ "tie_encoder_decoder": false,
216
+ "tie_word_embeddings": true,
217
+ "tokenizer_class": null,
218
+ "top_k": 50,
219
+ "top_p": 1.0,
220
+ "torch_dtype": null,
221
+ "torchscript": false,
222
+ "type_vocab_size": 2,
223
+ "typical_p": 1.0,
224
+ "use_bfloat16": false,
225
+ "use_cache": true,
226
+ "vocab_size": 32000
227
+ },
228
+ "torch_dtype": "float32",
229
+ "transformers_version": "4.40.0.dev0",
230
+ "type_vocab_size": 2,
231
+ "use_cache": true,
232
+ "vocab_size": 32000
233
+ }
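The `auto_map` and `custom_pipelines` entries above are what let Transformers load this repository's custom code at runtime. A minimal sketch (assuming the remote-code files ship with the model, as in this commit) of resolving the custom classes through the Auto API:

```python
from transformers import AutoConfig, AutoModelForTokenClassification

MODEL_NAME = "impresso-project/ner-stacked-bert-multilingual"

# auto_map points AutoConfig at configuration_stacked.ImpressoConfig and
# AutoModelForTokenClassification at modeling_stacked.ExtendedMultitaskModelForTokenClassification.
config = AutoConfig.from_pretrained(MODEL_NAME, trust_remote_code=True)
model = AutoModelForTokenClassification.from_pretrained(MODEL_NAME, trust_remote_code=True)

print(type(config).__name__)  # ImpressoConfig
print(type(model).__name__)   # ExtendedMultitaskModelForTokenClassification
```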
configuration_stacked.py ADDED
@@ -0,0 +1,99 @@
1
+ from transformers import PretrainedConfig
2
+ import torch
3
+
4
+ class ImpressoConfig(PretrainedConfig):
5
+ model_type = "stacked_bert"
6
+
7
+ def __init__(
8
+ self,
9
+ vocab_size=30522,
10
+ hidden_size=768,
11
+ num_hidden_layers=12,
12
+ num_attention_heads=12,
13
+ intermediate_size=3072,
14
+ hidden_act="gelu",
15
+ hidden_dropout_prob=0.1,
16
+ attention_probs_dropout_prob=0.1,
17
+ max_position_embeddings=512,
18
+ type_vocab_size=2,
19
+ initializer_range=0.02,
20
+ layer_norm_eps=1e-12,
21
+ pad_token_id=0,
22
+ position_embedding_type="absolute",
23
+ use_cache=True,
24
+ classifier_dropout=None,
25
+ pretrained_config=None,
26
+ values_override=None,
27
+ label_map=None,
28
+ **kwargs,
29
+ ):
30
+ super().__init__(pad_token_id=pad_token_id, **kwargs)
31
+
32
+ self.vocab_size = vocab_size
33
+ self.hidden_size = hidden_size
34
+ self.num_hidden_layers = num_hidden_layers
35
+ self.num_attention_heads = num_attention_heads
36
+ self.hidden_act = hidden_act
37
+ self.intermediate_size = intermediate_size
38
+ self.hidden_dropout_prob = hidden_dropout_prob
39
+ self.attention_probs_dropout_prob = attention_probs_dropout_prob
40
+ self.max_position_embeddings = max_position_embeddings
41
+ self.type_vocab_size = type_vocab_size
42
+ self.initializer_range = initializer_range
43
+ self.layer_norm_eps = layer_norm_eps
44
+ self.position_embedding_type = position_embedding_type
45
+ self.use_cache = use_cache
46
+ self.classifier_dropout = classifier_dropout
47
+ self.pretrained_config = pretrained_config
48
+ self.label_map = label_map
49
+
50
+ self.values_override = values_override or {}
51
+ self.outputs = {
52
+ "logits": {"shape": [None, None, self.hidden_size], "dtype": "float32"}
53
+ }
54
+
55
+ @classmethod
56
+ def is_torch_support_available(cls):
57
+ """
58
+ Indicate whether Torch support is available for this configuration.
59
+ Required for compatibility with certain parts of the Transformers library.
60
+ """
61
+ return True
62
+
63
+ @classmethod
64
+ def patch_ops(cls):
65
+ """
66
+ A no-op hook expected by some Hugging Face utilities that patch operator mappings.
67
+ It is kept for compatibility only; it takes no arguments and returns None.
72
+ """
73
+ return None
74
+
75
+ def generate_dummy_inputs(self, tokenizer, batch_size=1, seq_length=8, framework="pt"):
76
+ """
77
+ Generate dummy inputs for testing or export.
78
+ Args:
79
+ tokenizer: The tokenizer used to tokenize inputs.
80
+ batch_size: Number of input samples in the batch.
81
+ seq_length: Length of each sequence.
82
+ framework: Framework ("pt" for PyTorch, "tf" for TensorFlow).
83
+ Returns:
84
+ Dummy inputs as a dictionary.
85
+ """
86
+ if framework == "pt":
87
+ input_ids = torch.randint(
88
+ low=0,
89
+ high=self.vocab_size,
90
+ size=(batch_size, seq_length),
91
+ dtype=torch.long
92
+ )
93
+ attention_mask = torch.ones((batch_size, seq_length), dtype=torch.long)
94
+ return {"input_ids": input_ids, "attention_mask": attention_mask}
95
+ else:
96
+ raise ValueError("Framework '{}' not supported.".format(framework))
97
+
98
+ # Register the configuration with the transformers library
99
+ ImpressoConfig.register_for_auto_class()
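For reference, a small usage sketch for `generate_dummy_inputs` (hypothetical values; note that the `tokenizer` argument is accepted for interface compatibility, while the input ids themselves are drawn at random from the configured vocabulary):

```python
from transformers import AutoTokenizer

# Assumes this file is importable as `configuration_stacked`, as in this repository.
from configuration_stacked import ImpressoConfig

# Hypothetical example: instantiate the config with the backbone's dimensions.
config = ImpressoConfig(vocab_size=32000, hidden_size=512)
tokenizer = AutoTokenizer.from_pretrained("dbmdz/bert-medium-historic-multilingual-cased")

dummy = config.generate_dummy_inputs(tokenizer, batch_size=2, seq_length=16)
print(dummy["input_ids"].shape)       # torch.Size([2, 16])
print(dummy["attention_mask"].shape)  # torch.Size([2, 16])
```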
generic_ner.py CHANGED
@@ -4,7 +4,6 @@ import numpy as np
4
  import torch
5
  import nltk
6
 
7
- # new test
8
  nltk.download("averaged_perceptron_tagger")
9
  nltk.download("averaged_perceptron_tagger_eng")
10
  nltk.download("stopwords")
@@ -688,7 +687,6 @@ def remove_trailing_stopwords(entities):
688
  print(f"Remained entities: {len(new_entities)}")
689
  return new_entities
690
 
691
-
692
  class MultitaskTokenClassificationPipeline(Pipeline):
693
 
694
  def _sanitize_parameters(self, **kwargs):
label_map.json ADDED
@@ -0,0 +1 @@
1
+ {"NE-COARSE-LIT": {"B-org": 0, "I-loc": 1, "I-org": 2, "O": 3, "B-prod": 4, "B-time": 5, "I-time": 6, "B-pers": 7, "B-loc": 8, "I-pers": 9, "I-prod": 10}, "NE-COARSE-METO": {"B-org": 0, "O": 1, "I-org": 2, "B-loc": 3, "I-loc": 4, "B-time": 5}, "NE-FINE-LIT": {"B-pers.ind": 0, "I-org.ent": 1, "B-prod.doctr": 2, "B-org.adm": 3, "B-loc.unk": 4, "B-loc.add.phys": 5, "I-loc.add.phys": 6, "I-time.date.abs": 7, "I-loc.adm.town": 8, "B-pers.coll": 9, "B-prod.media": 10, "I-loc.adm.nat": 11, "B-loc.adm.sup": 12, "B-loc.phys.geo": 13, "I-org.ent.pressagency": 14, "I-loc.adm.sup": 15, "I-pers.ind": 16, "I-loc.phys.hydro": 17, "O": 18, "B-loc.oro": 19, "B-pers.ind.articleauthor": 20, "I-loc.oro": 21, "I-loc.add.elec": 22, "B-time.date.abs": 23, "B-org.ent": 24, "I-loc.phys.geo": 25, "I-pers.coll": 26, "I-loc.fac": 27, "B-loc.phys.hydro": 28, "I-org.adm": 29, "I-prod.doctr": 30, "I-pers.ind.articleauthor": 31, "B-loc.add.elec": 32, "B-loc.adm.town": 33, "B-loc.adm.nat": 34, "I-loc.adm.reg": 35, "B-loc.fac": 36, "B-org.ent.pressagency": 37, "I-prod.media": 38, "B-loc.adm.reg": 39, "I-loc.unk": 40}, "NE-FINE-METO": {"I-org.ent": 0, "B-org.adm": 1, "I-org.adm": 2, "B-loc.fac": 3, "O": 4, "B-loc.oro": 5, "B-loc.adm.town": 6, "B-org.ent": 7, "I-loc.fac": 8, "B-time.date.abs": 9}, "NE-FINE-COMP": {"I-comp.name": 0, "B-comp.name": 1, "B-comp.title": 2, "I-comp.function": 3, "I-comp.title": 4, "B-comp.function": 5, "O": 6, "I-comp.demonym": 7, "B-comp.demonym": 8, "B-comp.qualifier": 9, "I-comp.qualifier": 10}, "NE-NESTED": {"I-loc.fac": 0, "B-loc.phys.hydro": 1, "B-pers.ind": 2, "I-org.ent": 3, "B-org.adm": 4, "I-org.adm": 5, "I-loc.adm.town": 6, "B-pers.coll": 7, "I-loc.adm.nat": 8, "B-loc.adm.town": 9, "B-loc.adm.sup": 10, "B-loc.phys.geo": 11, "I-pers.ind": 12, "B-loc.adm.nat": 13, "I-loc.adm.reg": 14, "B-loc.adm.reg": 15, "O": 16, "B-loc.oro": 17, "B-loc.fac": 18, "I-loc.oro": 19, "B-org.ent": 20, "I-loc.phys.geo": 21, "I-loc.phys.hydro": 22, "B-prod.media": 23, "I-prod.media": 24}}
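The label map is keyed task → label → id. To decode per-task predictions from the model you typically need the inverse mapping; a minimal sketch (assuming the file is read from disk as `label_map.json`):

```python
import json

with open("label_map.json") as f:
    label_map = json.load(f)

# Invert each task's label->id mapping into id->label for decoding logits.
id2label = {
    task: {idx: label for label, idx in labels.items()}
    for task, labels in label_map.items()
}

print(id2label["NE-COARSE-LIT"][8])  # B-loc
print(id2label["NE-FINE-COMP"][2])   # B-comp.title
```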
model.safetensors ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:03a807b124debff782406c816eacb7ced1f2e25b9a5198b27e1616a41faa0662
3
+ size 193971960
modeling_stacked.py ADDED
@@ -0,0 +1,136 @@
1
+ from transformers.modeling_outputs import TokenClassifierOutput
2
+ import torch
3
+ import torch.nn as nn
4
+ from transformers import PreTrainedModel, AutoModel, AutoConfig, BertConfig
5
+ from torch.nn import CrossEntropyLoss
6
+ from typing import Optional, Tuple, Union
7
+ import logging
8
+
9
+ from .configuration_stacked import ImpressoConfig
10
+
11
+ logger = logging.getLogger(__name__)
12
+
13
+
14
+ def get_info(label_map):
15
+ num_token_labels_dict = {task: len(labels) for task, labels in label_map.items()}
16
+ return num_token_labels_dict
17
+
18
+
19
+ class ExtendedMultitaskModelForTokenClassification(PreTrainedModel):
20
+
21
+ config_class = ImpressoConfig
22
+ _keys_to_ignore_on_load_missing = [r"position_ids"]
23
+
24
+ def __init__(self, config):
25
+ super().__init__(config)
26
+ self.num_token_labels_dict = get_info(config.label_map)
27
+ self.config = config
28
+
29
+ self.bert = AutoModel.from_pretrained(
30
+ config.pretrained_config["_name_or_path"], config=config.pretrained_config
31
+ )
32
+ if "classifier_dropout" not in config.__dict__:
33
+ classifier_dropout = 0.1
34
+ else:
35
+ classifier_dropout = (
36
+ config.classifier_dropout
37
+ if config.classifier_dropout is not None
38
+ else config.hidden_dropout_prob
39
+ )
40
+ self.dropout = nn.Dropout(classifier_dropout)
41
+
42
+ # Additional transformer layers
43
+ self.transformer_encoder = nn.TransformerEncoder(
44
+ nn.TransformerEncoderLayer(
45
+ d_model=config.hidden_size, nhead=config.num_attention_heads
46
+ ),
47
+ num_layers=2,
48
+ )
49
+
50
+ # For token classification, create a classifier for each task
51
+ self.token_classifiers = nn.ModuleDict(
52
+ {
53
+ task: nn.Linear(config.hidden_size, num_labels)
54
+ for task, num_labels in self.num_token_labels_dict.items()
55
+ }
56
+ )
57
+
58
+ # Initialize weights and apply final processing
59
+ self.post_init()
60
+
61
+ def forward(
62
+ self,
63
+ input_ids: Optional[torch.Tensor] = None,
64
+ attention_mask: Optional[torch.Tensor] = None,
65
+ token_type_ids: Optional[torch.Tensor] = None,
66
+ position_ids: Optional[torch.Tensor] = None,
67
+ head_mask: Optional[torch.Tensor] = None,
68
+ inputs_embeds: Optional[torch.Tensor] = None,
69
+ labels: Optional[torch.Tensor] = None,
70
+ token_labels: Optional[dict] = None,
71
+ output_attentions: Optional[bool] = None,
72
+ output_hidden_states: Optional[bool] = None,
73
+ return_dict: Optional[bool] = None,
74
+ ) -> Union[Tuple[torch.Tensor], TokenClassifierOutput]:
75
+ r"""
76
+ token_labels (`dict` of `torch.LongTensor` of shape `(batch_size, seq_length)`, *optional*):
77
+ Labels for computing the token classification loss. Keys should match the tasks.
78
+ """
79
+ return_dict = (
80
+ return_dict if return_dict is not None else self.config.use_return_dict
81
+ )
82
+
83
+ bert_kwargs = {
84
+ "input_ids": input_ids,
85
+ "attention_mask": attention_mask,
86
+ "token_type_ids": token_type_ids,
87
+ "position_ids": position_ids,
88
+ "head_mask": head_mask,
89
+ "inputs_embeds": inputs_embeds,
90
+ "output_attentions": output_attentions,
91
+ "output_hidden_states": output_hidden_states,
92
+ "return_dict": return_dict,
93
+ }
94
+
95
+ if any(
96
+ keyword in self.config.name_or_path.lower()
97
+ for keyword in ["llama", "deberta"]
98
+ ):
99
+ bert_kwargs.pop("token_type_ids")
100
+ bert_kwargs.pop("head_mask")
101
+
102
+ outputs = self.bert(**bert_kwargs)
103
+
104
+ # For token classification
105
+ token_output = outputs[0]
106
+ token_output = self.dropout(token_output)
107
+
108
+ # Pass through additional transformer layers
109
+ token_output = self.transformer_encoder(token_output.transpose(0, 1)).transpose(
110
+ 0, 1
111
+ )
112
+
113
+ # Collect the logits and compute the loss for each task
114
+ task_logits = {}
115
+ total_loss = 0
116
+ for task, classifier in self.token_classifiers.items():
117
+ logits = classifier(token_output)
118
+ task_logits[task] = logits
119
+ if token_labels and task in token_labels:
120
+ loss_fct = CrossEntropyLoss()
121
+ loss = loss_fct(
122
+ logits.view(-1, self.num_token_labels_dict[task]),
123
+ token_labels[task].view(-1),
124
+ )
125
+ total_loss += loss
126
+
127
+ if not return_dict:
128
+ output = (task_logits,) + outputs[2:]
129
+ return ((total_loss,) + output) if total_loss != 0 else output
130
+
131
+ return TokenClassifierOutput(
132
+ loss=total_loss,
133
+ logits=task_logits,
134
+ hidden_states=outputs.hidden_states,
135
+ attentions=outputs.attentions,
136
+ )
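Unlike a standard token-classification head, `forward` returns `logits` as a dictionary keyed by task rather than a single tensor. A minimal decoding sketch (assuming `model` and `tokenizer` were loaded with `AutoModelForTokenClassification` / `AutoTokenizer` and `trust_remote_code=True`, and ignoring subword-to-word alignment, which the packaged pipeline handles):

```python
import torch

text = "Le Royaume de France"
inputs = tokenizer(text, return_tensors="pt")

model.eval()
with torch.no_grad():
    outputs = model(**inputs)

# outputs.logits maps each task to a tensor of shape (batch, seq_len, num_labels_for_task).
for task, logits in outputs.logits.items():
    pred_ids = logits.argmax(dim=-1)[0].tolist()
    id2label = {idx: label for label, idx in model.config.label_map[task].items()}
    print(task, [id2label[idx] for idx in pred_ids])
```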
push_to_hf.py ADDED
@@ -0,0 +1,145 @@
1
+ import os
2
+ import shutil
3
+ import argparse
4
+ from transformers import (
5
+ AutoTokenizer,
6
+ AutoConfig,
7
+ AutoModelForTokenClassification,
8
+ BertConfig,
9
+ )
10
+ from huggingface_hub import HfApi, Repository
11
+
13
+ from .configuration_stacked import ImpressoConfig
14
+ from .modeling_stacked import ExtendedMultitaskModelForTokenClassification
15
+ import subprocess
16
+
17
+
18
+ def get_latest_checkpoint(checkpoint_dir):
19
+ checkpoints = [
20
+ d
21
+ for d in os.listdir(checkpoint_dir)
22
+ if os.path.isdir(os.path.join(checkpoint_dir, d))
23
+ and d.startswith("checkpoint-")
24
+ ]
25
+ checkpoints = sorted(checkpoints, key=lambda x: int(x.split("-")[-1]), reverse=True)
26
+ return os.path.join(checkpoint_dir, checkpoints[0])
27
+
28
+
29
+ def get_info(label_map):
30
+ num_token_labels_dict = {task: len(labels) for task, labels in label_map.items()}
31
+ return num_token_labels_dict
32
+
33
+
34
+ def push_model_to_hub(checkpoint_dir, repo_name, script_path):
35
+ checkpoint_path = get_latest_checkpoint(checkpoint_dir)
36
+ config = ImpressoConfig.from_pretrained(checkpoint_path)
37
+ config.pretrained_config = AutoConfig.from_pretrained(config.name_or_path)
38
+ config.save_pretrained("stacked_bert")
39
+ config = ImpressoConfig.from_pretrained("stacked_bert")
40
+
41
+ model = ExtendedMultitaskModelForTokenClassification.from_pretrained(
42
+ checkpoint_path, config=config
43
+ )
44
+ tokenizer = AutoTokenizer.from_pretrained(checkpoint_path)
45
+ local_repo_path = "./repo"
46
+ repo_url = HfApi().create_repo(repo_id=repo_name, exist_ok=True)
47
+ repo = Repository(local_dir=local_repo_path, clone_from=repo_url)
48
+
49
+ try:
50
+ # Try to pull the latest changes from the remote repository using subprocess
51
+ subprocess.run(["git", "pull"], check=True, cwd=local_repo_path)
52
+ except subprocess.CalledProcessError as e:
53
+ # If fast-forward is not possible, reset the local branch to match the remote branch
54
+ subprocess.run(
55
+ ["git", "reset", "--hard", "origin/main"],
56
+ check=True,
57
+ cwd=local_repo_path,
58
+ )
59
+
60
+ # Copy all Python files to the local repository directory
61
+ current_dir = os.path.dirname(os.path.abspath(__file__))
62
+ for filename in os.listdir(current_dir):
63
+ if filename.endswith(".py") or filename.endswith(".json"):
64
+ shutil.copy(
65
+ os.path.join(current_dir, filename),
66
+ os.path.join(local_repo_path, filename),
67
+ )
68
+
69
+ ImpressoConfig.register_for_auto_class()
70
+ AutoConfig.register("stacked_bert", ImpressoConfig)
71
+ AutoModelForTokenClassification.register(
72
+ ImpressoConfig, ExtendedMultitaskModelForTokenClassification
73
+ )
74
+ ExtendedMultitaskModelForTokenClassification.register_for_auto_class(
75
+ "AutoModelForTokenClassification"
76
+ )
77
+
78
+ model.save_pretrained(local_repo_path)
79
+ tokenizer.save_pretrained(local_repo_path)
80
+
81
+ # Add, commit and push the changes to the repository
82
+ subprocess.run(["git", "add", "."], check=True, cwd=local_repo_path)
83
+ subprocess.run(
84
+ ["git", "commit", "-m", "Initial commit including model and configuration"],
85
+ check=True,
86
+ cwd=local_repo_path,
87
+ )
88
+ subprocess.run(["git", "push"], check=True, cwd=local_repo_path)
89
+
90
+ # Push the model to the hub (this includes the README template)
91
+ model.push_to_hub(repo_name)
92
+ tokenizer.push_to_hub(repo_name)
93
+
94
+ print(f"Model and repo pushed to: {repo_url}")
95
+
96
+
97
+ if __name__ == "__main__":
98
+ parser = argparse.ArgumentParser(description="Push NER model to Hugging Face Hub")
99
+ parser.add_argument(
100
+ "--model_type",
101
+ type=str,
102
+ required=True,
103
+ help="Type of the model (e.g., stacked-bert)",
104
+ )
105
+ parser.add_argument(
106
+ "--language",
107
+ type=str,
108
+ required=True,
109
+ help="Language of the model (e.g., multilingual)",
110
+ )
111
+ parser.add_argument(
112
+ "--checkpoint_dir",
113
+ type=str,
114
+ required=True,
115
+ help="Directory containing checkpoint folders",
116
+ )
117
+ parser.add_argument(
118
+ "--script_path", type=str, required=True, help="Path to the models.py script"
119
+ )
120
+ args = parser.parse_args()
121
+ repo_name = f"impresso-project/ner-{args.model_type}-{args.language}"
122
+ push_model_to_hub(args.checkpoint_dir, repo_name, args.script_path)
123
+ # PIPELINE_REGISTRY.register_pipeline(
124
+ # "generic-ner",
125
+ # pipeline_class=MultitaskTokenClassificationPipeline,
126
+ # pt_model=ExtendedMultitaskModelForTokenClassification,
127
+ # )
128
+ # model.config.custom_pipelines = {
129
+ # "generic-ner": {
130
+ # "impl": "generic_ner.MultitaskTokenClassificationPipeline",
131
+ # "pt": ["ExtendedMultitaskModelForTokenClassification"],
132
+ # "tf": [],
133
+ # }
134
+ # }
135
+ # classifier = pipeline(
136
+ # "generic-ner", model=model, tokenizer=tokenizer, label_map=label_map
137
+ # )
138
+ # from pprint import pprint
139
+ #
140
+ # pprint(
141
+ # classifier(
142
+ # "1. Le public est averti que Charlotte née Bourgoin, femme-de Joseph Digiez, et Maurice Bourgoin, enfant mineur représenté par le sieur Jaques Charles Gicot son curateur, ont été admis par arrêt du Conseil d'Etat du 5 décembre 1797, à solliciter une renonciation générale et absolue aux biens et aux dettes présentes et futures de Jean-Baptiste Bourgoin leur père."
143
+ # )
144
+ # )
145
+ # repo.push_to_hub(commit_message="Initial commit of the trained NER model with code")
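For reference, the script expects four arguments (`--model_type`, `--language`, `--checkpoint_dir`, `--script_path`) and assembles the target repository as `impresso-project/ner-{model_type}-{language}`; with the example values from the argument help, `--model_type stacked-bert --language multilingual`, this resolves to `impresso-project/ner-stacked-bert-multilingual`. Note that the relative imports at the top mean the script has to be run as a module from within its package.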
special_tokens_map.json ADDED
@@ -0,0 +1,37 @@
1
+ {
2
+ "cls_token": {
3
+ "content": "[CLS]",
4
+ "lstrip": false,
5
+ "normalized": false,
6
+ "rstrip": false,
7
+ "single_word": false
8
+ },
9
+ "mask_token": {
10
+ "content": "[MASK]",
11
+ "lstrip": false,
12
+ "normalized": false,
13
+ "rstrip": false,
14
+ "single_word": false
15
+ },
16
+ "pad_token": {
17
+ "content": "[PAD]",
18
+ "lstrip": false,
19
+ "normalized": false,
20
+ "rstrip": false,
21
+ "single_word": false
22
+ },
23
+ "sep_token": {
24
+ "content": "[SEP]",
25
+ "lstrip": false,
26
+ "normalized": false,
27
+ "rstrip": false,
28
+ "single_word": false
29
+ },
30
+ "unk_token": {
31
+ "content": "[UNK]",
32
+ "lstrip": false,
33
+ "normalized": false,
34
+ "rstrip": false,
35
+ "single_word": false
36
+ }
37
+ }
test.py ADDED
@@ -0,0 +1,46 @@
1
+ # Import necessary modules from the transformers library
2
+ from transformers import pipeline
3
+ from transformers import AutoModelForTokenClassification, AutoTokenizer
4
+
5
+ # Define the model name for token classification; we use the Impresso NER model,
6
+ # available at https://huggingface.co/impresso-project/ner-stacked-bert-multilingual
7
+ MODEL_NAME = "impresso-project/ner-stacked-bert-multilingual"
8
+
9
+ # Load the tokenizer corresponding to the specified model name
10
+ ner_tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
11
+
12
+ ner_pipeline = pipeline(
13
+ "generic-ner",
14
+ model=MODEL_NAME,
15
+ tokenizer=ner_tokenizer,
16
+ trust_remote_code=True,
17
+ device="cpu",
18
+ )
19
+ sentences = [
20
+ """In the year 1789, King Louis XVI, ruler of France, convened the Estates-General at the Palace of Versailles,
21
+ where Marie Antoinette, the Queen of France, alongside Maximilien Robespierre, a leading member of the National Assembly,
22
+ debated with Jean-Jacques Rousseau, the famous philosopher, and Charles de Talleyrand, the Bishop of Autun,
23
+ regarding the future of the French monarchy. At the same time, across the Atlantic in Philadelphia,
24
+ George Washington, the first President of the United States, and Thomas Jefferson, the nation's Secretary of State,
25
+ were drafting policies for the newly established American government following the signing of the Constitution."""
26
+ ]
27
+
28
+ print(sentences[0])
29
+
30
+
31
+ # Helper function to print entities one per row
32
+ def print_nicely(entities):
33
+ for entity in entities:
34
+ print(
35
+ f"Entity: {entity['entity']} | Confidence: {entity['score']:.2f}% | Text: {entity['word'].strip()} | Start: {entity['start']} | End: {entity['end']}"
36
+ )
37
+
38
+
39
+ # Visualize stacked entities for each sentence
40
+ for sentence in sentences:
41
+ results = ner_pipeline(sentence)
42
+
43
+ # Extract coarse and fine entities
44
+ for key in results.keys():
45
+ # Visualize the coarse entities
46
+ print_nicely(results[key])
tokenizer.json ADDED
The diff for this file is too large to render. See raw diff
 
tokenizer_config.json ADDED
@@ -0,0 +1,58 @@
1
+ {
2
+ "added_tokens_decoder": {
3
+ "0": {
4
+ "content": "[PAD]",
5
+ "lstrip": false,
6
+ "normalized": false,
7
+ "rstrip": false,
8
+ "single_word": false,
9
+ "special": true
10
+ },
11
+ "1": {
12
+ "content": "[UNK]",
13
+ "lstrip": false,
14
+ "normalized": false,
15
+ "rstrip": false,
16
+ "single_word": false,
17
+ "special": true
18
+ },
19
+ "2": {
20
+ "content": "[CLS]",
21
+ "lstrip": false,
22
+ "normalized": false,
23
+ "rstrip": false,
24
+ "single_word": false,
25
+ "special": true
26
+ },
27
+ "3": {
28
+ "content": "[SEP]",
29
+ "lstrip": false,
30
+ "normalized": false,
31
+ "rstrip": false,
32
+ "single_word": false,
33
+ "special": true
34
+ },
35
+ "4": {
36
+ "content": "[MASK]",
37
+ "lstrip": false,
38
+ "normalized": false,
39
+ "rstrip": false,
40
+ "single_word": false,
41
+ "special": true
42
+ }
43
+ },
44
+ "clean_up_tokenization_spaces": true,
45
+ "cls_token": "[CLS]",
46
+ "do_basic_tokenize": true,
47
+ "do_lower_case": false,
48
+ "mask_token": "[MASK]",
49
+ "max_len": 512,
50
+ "model_max_length": 512,
51
+ "never_split": null,
52
+ "pad_token": "[PAD]",
53
+ "sep_token": "[SEP]",
54
+ "strip_accents": false,
55
+ "tokenize_chinese_chars": true,
56
+ "tokenizer_class": "BertTokenizer",
57
+ "unk_token": "[UNK]"
58
+ }
vocab.txt ADDED
The diff for this file is too large to render. See raw diff