---
license: apache-2.0
tags:
- flair
- token-classification
- sequence-tagger-model
language: es
datasets:
- conll2003
- BSC-LT/NextProcurement-NER-Spanish-UTE-Company-annotated
widget:
- text: "PODACESA OBRAS Y SERVICIOS, S.A, y ECR INFRAESTRUCTURAS Y SERVICIOS HIDRÁULICOS S.L., constituidos en UTE PODACESA-ECR realizan la siguiente oferta:"
---

## Recognition of UTEs and company mentions in Flair

This model was trained with [Flair](https://github.com/flairNLP/flair/) to recognise mentions of UTEs (Unión Temporal de Empresas, temporary joint ventures between companies) and of individual companies in public tenders. It is fine-tuned from the flair/ner-spanish-large model (retrained from scratch to include additional tags).

```
Results:
- F-score (micro) 0.7431
- F-score (macro) 0.7429
- Accuracy 0.5944

By class:
                precision    recall  f1-score   support

           UTE     0.7568    0.7887    0.7724        71
SINGLE_COMPANY     0.6538    0.7846    0.7133        65

     micro avg     0.7039    0.7868    0.7431       136
     macro avg     0.7053    0.7867    0.7429       136
  weighted avg     0.7076    0.7868    0.7442       136
```

Based on document-level XLM-R embeddings and [FLERT](https://arxiv.org/pdf/2011.06993v1.pdf/).

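As a reading aid for the results table above, the sketch below recomputes the F1 columns from the reported precision, recall, and support values. It assumes the usual definitions (F1 as the harmonic mean of precision and recall, macro as the unweighted class mean, weighted as the support-weighted mean); the variable and function names are illustrative and not part of Flair. Because the inputs are already rounded to four decimals, the recomputed values should agree with the table only to about three decimal places.

```python
# Per-class precision, recall and support copied from the results table.
per_class = {
    "UTE": {"precision": 0.7568, "recall": 0.7887, "support": 71},
    "SINGLE_COMPANY": {"precision": 0.6538, "recall": 0.7846, "support": 65},
}


def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


# F1 per class, from that class's own precision and recall.
class_f1 = {name: f1(m["precision"], m["recall"]) for name, m in per_class.items()}

# Macro average: unweighted mean of the per-class F1 scores.
macro_f1 = sum(class_f1.values()) / len(class_f1)

# Weighted average: per-class F1 weighted by the number of gold spans.
total_support = sum(m["support"] for m in per_class.values())
weighted_f1 = (
    sum(class_f1[name] * per_class[name]["support"] for name in per_class)
    / total_support
)

# Micro average: F1 of the pooled (micro) precision and recall from the table.
micro_f1 = f1(0.7039, 0.7868)

for name, value in class_f1.items():
    print(f"{name:>14}  f1 = {value:.3f}")
print(f"{'micro avg':>14}  f1 = {micro_f1:.3f}")
print(f"{'macro avg':>14}  f1 = {macro_f1:.3f}")
print(f"{'weighted avg':>14}  f1 = {weighted_f1:.3f}")
```

The micro and macro scores are close here because the two classes have similar support (71 and 65 spans).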

---

### Demo: How to use in Flair

Requires: **[Flair](https://github.com/flairNLP/flair/)** (`pip install flair`)

```python
from flair.data import Sentence
from flair.models import SequenceTagger

# load tagger
tagger = SequenceTagger.load("BSC-LT/NextProcurement-NER-Spanish-UTE-Company")

# make example sentence
sentence = Sentence("PODACESA OBRAS Y SERVICIOS, S.A, y ECR INFRAESTRUCTURAS Y SERVICIOS HIDRÁULICOS S.L., constituidos en UTE PODACESA-ECR realizan la siguiente oferta:")

# predict NER tags
tagger.predict(sentence)

# print sentence
print(sentence)

# print predicted NER spans
print('The following NER tags are found:')

# iterate over entities and print
for entity in sentence.get_spans('ner'):
    print(entity)
```

This yields the following output (**TODO: update**):

```
Span [1,2]: "George Washington"   [− Labels: PER (1.0)]
Span [5]: "Washington"   [− Labels: LOC (1.0)]
```

In this placeholder output, the entities "*George Washington*" (labeled as a **person**) and "*Washington*" (labeled as a **location**) are found in the sentence "*George Washington fue a Washington*".

---

### Training: Script to train this model

The following Flair script was used to train this model (**TODO: update**):

```python
import torch

# 1. get the corpus
from flair.datasets import CONLL_03_SPANISH

corpus = CONLL_03_SPANISH()

# 2. what tag do we want to predict?
tag_type = 'ner'

# 3. make the tag dictionary from the corpus
tag_dictionary = corpus.make_tag_dictionary(tag_type=tag_type)

# 4. initialize fine-tuneable transformer embeddings WITH document context
from flair.embeddings import TransformerWordEmbeddings

embeddings = TransformerWordEmbeddings(
    model='xlm-roberta-large',
    layers="-1",
    subtoken_pooling="first",
    fine_tune=True,
    use_context=True,
)

# 5. initialize bare-bones sequence tagger (no CRF, no RNN, no reprojection)
from flair.models import SequenceTagger

tagger = SequenceTagger(
    hidden_size=256,
    embeddings=embeddings,
    tag_dictionary=tag_dictionary,
    tag_type='ner',
    use_crf=False,
    use_rnn=False,
    reproject_embeddings=False,
)

# 6. initialize trainer with AdamW optimizer
from flair.trainers import ModelTrainer

trainer = ModelTrainer(tagger, corpus, optimizer=torch.optim.AdamW)

# 7. run training with XLM parameters (20 epochs, small LR)
from torch.optim.lr_scheduler import OneCycleLR

trainer.train('resources/taggers/ner-spanish-large',
              learning_rate=5.0e-6,
              mini_batch_size=4,
              mini_batch_chunk_size=1,
              max_epochs=20,
              scheduler=OneCycleLR,
              embeddings_storage_mode='none',
              weight_decay=0.,
              )
```

---
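
The tagger is trained on token-level tags and reports results as labelled spans (`UTE`, `SINGLE_COMPANY`). As a rough illustration of how the two views relate, the sketch below collapses a BIO-tagged token sequence into spans. It assumes a plain BIO scheme; the dataset's actual file layout and tagging scheme are not described in this card, the helper `bio_to_spans` is hypothetical, and the example tags are made up for illustration rather than taken from the annotated data.

```python
from typing import List, Tuple


def bio_to_spans(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Collapse BIO tags into (label, text) spans.

    Hypothetical helper: handles only the B-/I-/O prefixes. An I- tag that
    does not continue an open span of the same label starts a new span.
    """
    spans: List[Tuple[str, str]] = []
    label, words = None, []
    for token, tag in zip(tokens, tags):
        prefix, _, name = tag.partition("-")
        starts_new = prefix == "B" or (prefix == "I" and name != label)
        if prefix == "O" or starts_new:
            # Close the span that was open, if any.
            if label is not None:
                spans.append((label, " ".join(words)))
            label, words = (name, [token]) if starts_new else (None, [])
        else:
            # I- tag continuing the currently open span.
            words.append(token)
    # Close a span that runs to the end of the sentence.
    if label is not None:
        spans.append((label, " ".join(words)))
    return spans


# Illustrative tokens and tags (not taken from the annotated dataset).
tokens = ["ECR", "INFRAESTRUCTURAS", "y", "UTE", "PODACESA-ECR"]
tags = ["B-SINGLE_COMPANY", "I-SINGLE_COMPANY", "O", "B-UTE", "I-UTE"]

spans = bio_to_spans(tokens, tags)
for span_label, span_text in spans:
    print(f"{span_label}: {span_text}")
```

Span-level metrics such as the per-class scores reported for this model count a prediction as correct only when the whole span and its label match, which is why a span view of the data is useful when inspecting errors.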