# Model Card for impresso-project/nel-mgenre-multilingual
The Impresso multilingual named entity linking (NEL) model is based on mGENRE (Multilingual Generative ENtity REtrieval), proposed by De Cao et al., a sequence-to-sequence entity-disambiguation architecture built on mBART. It uses constrained generation to produce entity names that are mapped to Wikidata QIDs.
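The core idea of constrained generation is that, at every decoding step, beam search may only emit tokens that extend a valid entity name. mGENRE implements this with a trie over tokenized Wikipedia titles; the sketch below illustrates the mechanism with a toy trie of made-up token-id sequences (the data and function names are illustrative, not part of the released model):

```python
# Sketch of the constrained-decoding idea behind mGENRE: a trie of valid
# entity-name token sequences restricts which tokens the decoder may emit.

def build_trie(sequences):
    """Build a nested-dict trie from token-id sequences."""
    trie = {}
    for seq in sequences:
        node = trie
        for tok in seq:
            node = node.setdefault(tok, {})
    return trie

def allowed_next_tokens(trie, prefix):
    """Return the token ids that may follow `prefix`.

    This mirrors the contract of `prefix_allowed_tokens_fn` in
    `transformers.generate`, which mGENRE-style constrained decoding relies on.
    """
    node = trie
    for tok in prefix:
        node = node.get(tok)
        if node is None:
            return []  # prefix left the trie: no valid entity continues it
    return sorted(node.keys())

# Toy "entity names" encoded as token-id sequences.
entity_trie = build_trie([[5, 8, 2], [5, 9, 2]])
print(allowed_next_tokens(entity_trie, []))      # → [5]
print(allowed_next_tokens(entity_trie, [5]))     # → [8, 9]
print(allowed_next_tokens(entity_trie, [5, 8]))  # → [2]
```

In the real model the trie is built over the tokenizer's ids for every candidate title, so the decoder can never generate a string that is not a linkable entity name.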
This model was adapted for historical texts and fine-tuned on the HIPE-2022 dataset, which includes a variety of historical document types and languages.
## Model Details

### Model Description
- Developed by: the Impresso team at EPFL. Impresso is an interdisciplinary research project focused on historical media analysis across languages, time, and modalities. It is funded by the Swiss National Science Foundation (CRSII5_173719, CRSII5_213585) and the Luxembourg National Research Fund (grant No. 17498891).
- Model type: mBART-based sequence-to-sequence model with constrained beam search for named entity linking
- Languages: Multilingual (100+ languages, optimized for French, German, and English)
- License: AGPL v3+
- Finetuned from: facebook/mgenre-wiki
### Model Architecture
- Architecture: mBART-based seq2seq with constrained beam search
## Training Details

### Training Data
The model was trained on the following datasets:
| Dataset alias | README | Document type | Languages | Suitable for | Project | License |
|---|---|---|---|---|---|---|
| ajmc | link | classical commentaries | de, fr, en | NERC-Coarse, NERC-Fine, EL | AjMC | |
| hipe2020 | link | historical newspapers | de, fr, en | NERC-Coarse, NERC-Fine, EL | CLEF-HIPE-2020 | |
| topres19th | link | historical newspapers | en | NERC-Coarse, EL | Living with Machines | |
| newseye | link | historical newspapers | de, fi, fr, sv | NERC-Coarse, NERC-Fine, EL | NewsEye | |
| sonar | link | historical newspapers | de | NERC-Coarse, EL | SoNAR | |
## How to Use
```python
from transformers import AutoTokenizer, pipeline

NEL_MODEL_NAME = "impresso-project/nel-mgenre-multilingual"
nel_tokenizer = AutoTokenizer.from_pretrained(NEL_MODEL_NAME)

nel_pipeline = pipeline(
    "generic-nel",
    model=NEL_MODEL_NAME,
    tokenizer=nel_tokenizer,
    trust_remote_code=True,
    device="cpu",
)

# The mention to link is wrapped in [START] ... [END] markers. The OCR noise
# ("0ctobre", "Dreyfvs", "déch1ra", "fr4nçaise") is intentional and illustrates
# the model's robustness to noisy historical input.
sentence = "Le 0ctobre 1894, [START] Dreyfvs [END] est arrêté à Paris, accusé d'espionnage pour l'Allemagne — un événement qui déch1ra la société fr4nçaise pendant des années."
print(nel_pipeline(sentence))
```
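The pipeline expects the target mention to be wrapped in `[START]` / `[END]` markers. If you have character offsets for a mention (for instance from an upstream NER step), a small helper can produce the tagged input; `mark_mention` below is a hypothetical convenience function, not part of the released pipeline:

```python
def mark_mention(text, start, end):
    """Wrap text[start:end] in the [START]/[END] markers the pipeline expects.

    `start` and `end` are character offsets, e.g. produced by an NER step.
    """
    return f"{text[:start]}[START] {text[start:end]} [END]{text[end:]}"

raw = "Le 0ctobre 1894, Dreyfvs est arrêté à Paris."
tagged = mark_mention(raw, 17, 24)  # character offsets of "Dreyfvs"
print(tagged)  # → "Le 0ctobre 1894, [START] Dreyfvs [END] est arrêté à Paris."
```

The tagged string can then be passed to `nel_pipeline` exactly as in the example above.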
## Output Format
```python
[
    {
        'surface': 'Dreyfvs',
        'wkd_id': 'Q171826',
        'wkpedia_pagename': 'Alfred Dreyfus',
        'wkpedia_url': 'https://fr.wikipedia.org/wiki/Alfred_Dreyfus',
        'type': 'UNK',
        'confidence_nel': 99.98,
        'lOffset': 24,
        'rOffset': 33,
    }
]
```
The entity `type` is `UNK` because the model was not trained to predict entity types. The `confidence_nel` score indicates the model's confidence in the prediction.
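For downstream use, the list of prediction dicts can be filtered by confidence and reduced to a surface-form-to-QID mapping. The helper below is a hypothetical post-processing sketch over the output format shown above (the `"NIL"` check for unlinkable mentions is an assumption, not documented behaviour):

```python
def extract_links(predictions, min_confidence=90.0):
    """Keep high-confidence predictions and map each surface form to its QID.

    `predictions` is a list of dicts in the pipeline's output format.
    """
    return {
        p["surface"]: p["wkd_id"]
        for p in predictions
        if p.get("wkd_id", "NIL") != "NIL"
        and p.get("confidence_nel", 0.0) >= min_confidence
    }

sample = [{
    "surface": "Dreyfvs",
    "wkd_id": "Q171826",
    "wkpedia_pagename": "Alfred Dreyfus",
    "wkpedia_url": "https://fr.wikipedia.org/wiki/Alfred_Dreyfus",
    "type": "UNK",
    "confidence_nel": 99.98,
    "lOffset": 24,
    "rOffset": 33,
}]
print(extract_links(sample))  # → {'Dreyfvs': 'Q171826'}
```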
## Use Cases
- Entity disambiguation in noisy OCR settings
- Linking historical names to modern Wikidata entities
- Assisting downstream event extraction and biography generation from historical archives
## Limitations
- Sensitive to tokenisation and malformed spans
- Accuracy degrades on non-Wikidata entities or in highly ambiguous contexts
- Focused on historical entity mentions — performance may vary on modern texts
## Environmental Impact
- Hardware: 1x A100 (80GB) for finetuning
- Training time: ~12 hours
- Estimated CO₂ Emissions: ~2.3 kg CO₂eq
## Contact
- Website: https://impresso-project.ch