Model Card for impresso-project/nel-mgenre-multilingual

The Impresso multilingual named entity linking (NEL) model is based on mGENRE (multilingual Generative ENtity REtrieval, De Cao et al.), a sequence-to-sequence entity-disambiguation architecture built on mBART. It uses constrained generation to output canonical entity names, which are mapped to Wikidata QIDs.

This model was adapted for historical texts and fine-tuned on the HIPE-2022 dataset, which includes a variety of historical document types and languages.

Model Details

Model Description

  • Developed by: EPFL, as part of the Impresso team. Impresso is an interdisciplinary research project focused on historical media analysis across languages, time, and modalities, funded by the Swiss National Science Foundation (grants CRSII5_173719 and CRSII5_213585) and the Luxembourg National Research Fund (grant No. 17498891).
  • Model type: mBART-based sequence-to-sequence model with constrained beam search for named entity linking
  • Languages: Multilingual (100+ languages, optimized for French, German, and English)
  • License: AGPL v3+
  • Finetuned from: facebook/mgenre-wiki

Model Architecture

  • Architecture: mBART-based seq2seq with constrained beam search

Training Details

Training Data

The model was trained on the following datasets:

| Dataset alias | README | Document type | Languages | Suitable for | Project | License |
|---------------|--------|---------------|-----------|--------------|---------|---------|
| ajmc | link | classical commentaries | de, fr, en | NERC-Coarse, NERC-Fine, EL | AjMC | CC BY 4.0 |
| hipe2020 | link | historical newspapers | de, fr, en | NERC-Coarse, NERC-Fine, EL | CLEF-HIPE-2020 | CC BY-NC-SA 4.0 |
| topres19th | link | historical newspapers | en | NERC-Coarse, EL | Living with Machines | CC BY-NC-SA 4.0 |
| newseye | link | historical newspapers | de, fi, fr, sv | NERC-Coarse, NERC-Fine, EL | NewsEye | CC BY 4.0 |
| sonar | link | historical newspapers | de | NERC-Coarse, EL | SoNAR | CC BY 4.0 |

How to Use

from transformers import AutoTokenizer, pipeline

NEL_MODEL_NAME = "impresso-project/nel-mgenre-multilingual"
nel_tokenizer = AutoTokenizer.from_pretrained(NEL_MODEL_NAME)

# "generic-nel" is a custom pipeline shipped in the model repository,
# which is why trust_remote_code=True is required.
nel_pipeline = pipeline("generic-nel", model=NEL_MODEL_NAME,
                        tokenizer=nel_tokenizer,
                        trust_remote_code=True,
                        device='cpu')

# The mention to disambiguate is marked with [START] ... [END]. The OCR
# noise ("0ctobre", "Dreyfvs", "déch1ra") is intentional: it mimics the
# historical input the model was tuned for.
sentence = "Le 0ctobre 1894, [START] Dreyfvs [END] est arrêté à Paris, accusé d'espionnage pour l'Allemagne — un événement qui déch1ra la société fr4nçaise pendant des années."
print(nel_pipeline(sentence))

Output Format

[
    {
        'surface': 'Dreyfvs',
        'wkd_id': 'Q171826',
        'wkpedia_pagename': 'Alfred Dreyfus',
        'wkpedia_url': 'https://fr.wikipedia.org/wiki/Alfred_Dreyfus',
        'type': 'UNK',
        'confidence_nel': 99.98,
        'lOffset': 24,
        'rOffset': 33
    }
]

The type field is UNK because the model was not trained to predict entity types; use a dedicated NER model if types are needed. The confidence_nel score indicates the model's confidence in the predicted link.
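Downstream code typically keys on wkd_id. A small, hypothetical post-processing helper (the field names follow the output above; the function name, the confidence threshold, and the derived wikidata_url field are illustrative, not part of the pipeline) might look like:

```python
# Hypothetical post-processing of the pipeline output shown above:
# keep confident links and derive a Wikidata URL from the QID.

def filter_links(entities, min_confidence=90.0):
    """Keep entities whose link confidence clears the threshold and
    attach the corresponding Wikidata URL."""
    kept = []
    for ent in entities:
        if ent.get("wkd_id") and ent.get("confidence_nel", 0.0) >= min_confidence:
            kept.append({
                **ent,
                "wikidata_url": f"https://www.wikidata.org/wiki/{ent['wkd_id']}",
            })
    return kept

example_output = [{
    "surface": "Dreyfvs",
    "wkd_id": "Q171826",
    "wkpedia_pagename": "Alfred Dreyfus",
    "confidence_nel": 99.98,
    "lOffset": 24,
    "rOffset": 33,
}]

filter_links(example_output)[0]["wikidata_url"]
# 'https://www.wikidata.org/wiki/Q171826'
```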

Use Cases

  • Entity disambiguation in noisy OCR settings
  • Linking historical names to modern Wikidata entities
  • Assisting downstream event extraction and biography generation from historical archives

Limitations

  • Sensitive to tokenisation and malformed spans
  • Accuracy degrades on non-Wikidata entities or in highly ambiguous contexts
  • Focused on historical entity mentions — performance may vary on modern texts

Environmental Impact

  • Hardware: 1x A100 (80GB) for finetuning
  • Training time: ~12 hours
  • Estimated CO₂ Emissions: ~2.3 kg CO₂eq

