license: apache-2.0
language:
  - it
widget:
  - text: >-
      Mi chiamo Marco Rossi, vivo a Roma e lavoro per l'Agenzia Spaziale
      Italiana
    example_title: Example 1

Task: Named Entity Recognition
Model: BERT
Lang: IT

Model description

This is a cased BERT [1] model for the Italian language, fine-tuned for Named Entity Recognition (Person, Location, Organization and Miscellaneous classes) on the WikiNER dataset [2], starting from the pre-trained BERT-ITALIAN model (bert-base-italian-cased).

This is a cased, base-size BERT model. If you are looking for a lighter (but slightly less accurate) cased model, see: https://huggingface.co/osiria/distilbert-italian-cased-ner

If you are looking for an uncased model, see: https://huggingface.co/osiria/bert-italian-uncased-ner

Training and Performance

The model is trained to perform entity recognition over 4 classes: PER (persons), LOC (locations), ORG (organizations) and MISC (miscellaneous, mainly events, products and services). It was fine-tuned on the Italian WikiNER dataset plus an additional custom dataset of manually annotated Wikipedia paragraphs. The WikiNER dataset was split into 102,352 training instances and 25,588 test instances, and the model was trained for 1 epoch with a constant learning rate of 1e-5.
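The reported split sizes add up to 127,940 instances, which corresponds to an 80/20 train/test split. The sketch below only illustrates that arithmetic on stand-in data; the shuffling, the `seed` parameter and the `split_instances` helper are assumptions for illustration, since the source does not describe how the split was drawn.

```python
import random

TRAIN_SIZE = 102_352  # training instances reported above
TEST_SIZE = 25_588    # test instances reported above


def split_instances(instances, train_size, seed=0):
    """Shuffle a copy of `instances` and cut it into train/test parts.

    Hypothetical helper: the shuffling and `seed` are assumptions,
    not the author's documented procedure.
    """
    shuffled = list(instances)
    random.Random(seed).shuffle(shuffled)
    return shuffled[:train_size], shuffled[train_size:]


total = TRAIN_SIZE + TEST_SIZE
train_fraction = TRAIN_SIZE / total

# Integer ids stand in for the real WikiNER sentences.
train, test = split_instances(range(total), TRAIN_SIZE)
print(total, train_fraction, len(train), len(test))
```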

Performance on the test set is reported in the following table:

Recall	Precision	F1
93.35	92.22	92.78

The metrics have been computed at the token level and then macro-averaged over the 4 classes.
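As an illustration of token-level macro-averaging, the sketch below computes precision, recall and F1 per class from one label per token and then takes the unweighted mean over the 4 classes. It is not the author's evaluation script: the function name, the use of plain class labels without B-/I- prefixes, and averaging the per-class F1 values (rather than deriving F1 from the averaged precision and recall) are all assumptions.

```python
LABELS = ["PER", "LOC", "ORG", "MISC"]


def macro_token_metrics(gold, pred, labels=LABELS):
    """Return macro-averaged (precision, recall, f1) over `labels`.

    `gold` and `pred` are equal-length lists holding one label per token;
    "O" marks tokens outside any entity.
    """
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")

    precisions, recalls, f1s = [], [], []
    for label in labels:
        pairs = list(zip(gold, pred))
        tp = sum(g == label and p == label for g, p in pairs)
        fp = sum(g != label and p == label for g, p in pairs)
        fn = sum(g == label and p != label for g, p in pairs)

        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0

        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)

    n = len(labels)
    return sum(precisions) / n, sum(recalls) / n, sum(f1s) / n


# Toy data: one ORG token is missed by the prediction.
gold = ["PER", "PER", "O", "LOC", "O", "ORG", "ORG", "ORG", "O", "MISC"]
pred = ["PER", "PER", "O", "LOC", "O", "ORG", "ORG", "O", "O", "MISC"]

precision, recall, f1 = macro_token_metrics(gold, pred)
print(f"P={precision:.4f} R={recall:.4f} F1={f1:.4f}")
```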

Because WikiNER is an automatically annotated (silver-standard) dataset that sometimes contains imperfect annotations, the model was additionally fine-tuned on ~3,500 manually annotated paragraphs.

Quick usage

from transformers import BertForTokenClassification, BertTokenizerFast, pipeline

# Load the fine-tuned tokenizer and token-classification model from the Hub
tokenizer = BertTokenizerFast.from_pretrained("osiria/bert-italian-cased-ner")
model = BertForTokenClassification.from_pretrained("osiria/bert-italian-cased-ner")

# aggregation_strategy="first" merges sub-tokens into words and labels each
# word with the tag of its first sub-token
ner = pipeline("ner", model=model, tokenizer=tokenizer, aggregation_strategy="first")

ner("Mi chiamo Marco Rossi, vivo a Roma e lavoro per l'Agenzia Spaziale Italiana nella missione Prisma")

[{'entity_group': 'PER',
  'score': 0.99910736,
  'word': 'Marco Rossi',
  'start': 10,
  'end': 21},
 {'entity_group': 'LOC',
  'score': 0.9973786,
  'word': 'Roma',
  'start': 30,
  'end': 34},
 {'entity_group': 'ORG',
  'score': 0.9987071,
  'word': 'Agenzia Spaziale Italiana',
  'start': 50,
  'end': 75},
 {'entity_group': 'MISC',
  'score': 0.9625836,
  'word': 'Prisma',
  'start': 91,
  'end': 97}]
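The `start` and `end` fields are character offsets into the input string, so the entity text can be recovered by slicing. The following minimal sketch reuses the output shown above to group entities by class; the `group_by_type` helper and its `min_score` threshold are hypothetical names introduced for illustration, not part of the pipeline API.

```python
from collections import defaultdict

text = ("Mi chiamo Marco Rossi, vivo a Roma e lavoro per l'Agenzia Spaziale "
        "Italiana nella missione Prisma")

# Pipeline output copied from the example above.
entities = [
    {"entity_group": "PER", "score": 0.99910736, "word": "Marco Rossi", "start": 10, "end": 21},
    {"entity_group": "LOC", "score": 0.9973786, "word": "Roma", "start": 30, "end": 34},
    {"entity_group": "ORG", "score": 0.9987071, "word": "Agenzia Spaziale Italiana", "start": 50, "end": 75},
    {"entity_group": "MISC", "score": 0.9625836, "word": "Prisma", "start": 91, "end": 97},
]


def group_by_type(text, entities, min_score=0.0):
    """Map each entity class to the text spans found for it.

    Spans are sliced from `text` using the character offsets; entities
    scoring below `min_score` (a hypothetical threshold) are skipped.
    """
    grouped = defaultdict(list)
    for ent in entities:
        if ent["score"] < min_score:
            continue
        grouped[ent["entity_group"]].append(text[ent["start"]:ent["end"]])
    return dict(grouped)


by_type = group_by_type(text, entities)
confident = group_by_type(text, entities, min_score=0.99)
print(by_type)
print(confident)
```

Raising `min_score` trades recall for precision; a suitable value depends on the application and is not suggested by the source.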

You can also try the model online using this web app: https://huggingface.co/spaces/osiria/bert-italian-cased-ner

References

[1] https://arxiv.org/abs/1810.04805

[2] https://www.sciencedirect.com/science/article/pii/S0004370212000276

Limitations

This model is mainly trained on Wikipedia, so it is particularly suitable for natively digital text from the World Wide Web that is written in a correct and fluent form (wikis, web pages, news, etc.). However, it may show limitations on chaotic text containing errors and slang expressions (such as social media posts) or on domain-specific text (such as medical, financial or legal content).

License

The model is released under the Apache-2.0 license.