jgrivolla committed · Commit e1ac646 · verified · 1 parent: fb1f685

Update README.md

---
license: apache-2.0
tags:
- flair
- token-classification
- sequence-tagger-model
language: es
datasets:
- conll2003
- BSC-LT/NextProcurement-NER-Spanish-UTE-Company-annotated
widget:
- text: "PODACESA OBRAS Y SERVICIOS, S.A, y ECR INFRAESTRUCTURAS Y SERVICIOS HIDRÁULICOS S.L., constituidos en UTE PODACESA-ECR realizan la siguiente oferta:"
---
## Recognition of UTEs and company mentions in Flair

This is a model trained using [Flair](https://github.com/flairNLP/flair/) to recognise mentions of UTEs (Unión Temporal de Empresas) and companies in public tenders.

It is a fine-tune of the flair/ner-spanish-large model (retrained from scratch to include additional tags).
```
Results:
- F-score (micro) 0.7431
- F-score (macro) 0.7429
- Accuracy 0.5944

By class:
                precision    recall  f1-score   support

           UTE     0.7568    0.7887    0.7724        71
SINGLE_COMPANY     0.6538    0.7846    0.7133        65

     micro avg     0.7039    0.7868    0.7431       136
     macro avg     0.7053    0.7867    0.7429       136
  weighted avg     0.7076    0.7868    0.7442       136
```
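The per-class and averaged scores in the table above are internally consistent; a quick sanity check (values copied from the table; an illustrative snippet, not part of the model card's evaluation code):

```python
def f1(p, r):
    """F1 is the harmonic mean of precision and recall."""
    return 2 * p * r / (p + r)

# Per-class precision/recall/support copied from the table above.
ute_p, ute_r, ute_n = 0.7568, 0.7887, 71
co_p, co_r, co_n = 0.6538, 0.7846, 65

print(round(f1(ute_p, ute_r), 4))  # 0.7724, the reported UTE f1-score
print(round(f1(co_p, co_r), 4))    # 0.7133, the reported SINGLE_COMPANY f1-score

# Macro average: unweighted mean over classes.
print(round((ute_p + co_p) / 2, 4))  # 0.7053, the reported macro-avg precision

# Weighted average: mean weighted by class support (71 + 65 = 136).
print(round((ute_p * ute_n + co_p * co_n) / (ute_n + co_n), 4))  # 0.7076
```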

Based on document-level XLM-R embeddings and [FLERT](https://arxiv.org/pdf/2011.06993v1.pdf/).

---
### Demo: How to use in Flair

Requires: **[Flair](https://github.com/flairNLP/flair/)** (`pip install flair`)

```python
from flair.data import Sentence
from flair.models import SequenceTagger

# load tagger
tagger = SequenceTagger.load("BSC-LT/NextProcurement-NER-Spanish-UTE-Company")

# make example sentence
sentence = Sentence("PODACESA OBRAS Y SERVICIOS, S.A, y ECR INFRAESTRUCTURAS Y SERVICIOS HIDRÁULICOS S.L., constituidos en UTE PODACESA-ECR realizan la siguiente oferta:")

# predict NER tags
tagger.predict(sentence)

# print sentence
print(sentence)

# print predicted NER spans
print('The following NER tags are found:')

# iterate over entities and print
for entity in sentence.get_spans('ner'):
    print(entity)
```

This yields the following output (**TODO: update**):

```
Span [1,2]: "George Washington" [− Labels: PER (1.0)]
Span [5]: "Washington" [− Labels: LOC (1.0)]
```

So, the entities "*George Washington*" (labeled as a **person**) and "*Washington*" (labeled as a **location**) are found in the sentence "*George Washington fue a Washington*".

---
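Under the hood, a sequence tagger like this one predicts one BIO-style tag per token and merges consecutive tags into the spans shown above. A minimal, self-contained sketch of that merging step (the `bio_to_spans` helper is hypothetical, for illustration only, and not part of Flair's API):

```python
def bio_to_spans(tokens, tags):
    """Merge per-token BIO tags into (text, label) spans.

    Standard BIO scheme: 'B-X' begins a span of type X, 'I-X'
    continues it, and 'O' marks tokens outside any span.
    """
    spans, current, label = [], [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):  # a new span begins
            if current:
                spans.append((" ".join(current), label))
            current, label = [token], tag[2:]
        elif tag.startswith("I-") and current and tag[2:] == label:
            current.append(token)  # continue the open span
        else:  # 'O' (or an inconsistent tag) closes any open span
            if current:
                spans.append((" ".join(current), label))
            current, label = [], None
    if current:  # close a span that reaches the end of the sentence
        spans.append((" ".join(current), label))
    return spans

tokens = ["UTE", "PODACESA-ECR", "realizan", "la", "siguiente", "oferta"]
tags = ["B-UTE", "I-UTE", "O", "O", "O", "O"]
print(bio_to_spans(tokens, tags))  # [('UTE PODACESA-ECR', 'UTE')]
```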
### Training: Script to train this model

The following Flair script was used to train this model (**TODO: update**):

```python
import torch

# 1. get the corpus
from flair.datasets import CONLL_03_SPANISH
corpus = CONLL_03_SPANISH()

# 2. what tag do we want to predict?
tag_type = 'ner'

# 3. make the tag dictionary from the corpus
tag_dictionary = corpus.make_tag_dictionary(tag_type=tag_type)

# 4. initialize fine-tuneable transformer embeddings WITH document context
from flair.embeddings import TransformerWordEmbeddings
embeddings = TransformerWordEmbeddings(
    model='xlm-roberta-large',
    layers="-1",
    subtoken_pooling="first",
    fine_tune=True,
    use_context=True,
)

# 5. initialize bare-bones sequence tagger (no CRF, no RNN, no reprojection)
from flair.models import SequenceTagger
tagger = SequenceTagger(
    hidden_size=256,
    embeddings=embeddings,
    tag_dictionary=tag_dictionary,
    tag_type='ner',
    use_crf=False,
    use_rnn=False,
    reproject_embeddings=False,
)

# 6. initialize trainer with AdamW optimizer
from flair.trainers import ModelTrainer
trainer = ModelTrainer(tagger, corpus, optimizer=torch.optim.AdamW)

# 7. run training with XLM parameters (20 epochs, small LR)
from torch.optim.lr_scheduler import OneCycleLR
trainer.train('resources/taggers/ner-spanish-large',
              learning_rate=5.0e-6,
              mini_batch_size=4,
              mini_batch_chunk_size=1,
              max_epochs=20,
              scheduler=OneCycleLR,
              embeddings_storage_mode='none',
              weight_decay=0.,
              )
```

---