language:
- fr
pipeline_tag: token-classification
tags:
- medical
- ner
- nlp
- pseudonymisation
license: bsd-3-clause
library_name: edsnlp
model-index:
- name: AP-HP/eds-pseudo-public
results:
- task:
type: token-classification
dataset:
name: AP-HP Pseudo Test
type: private
metrics:
- type: precision
name: Token Scores / ADRESSE / Precision
value: 0.981694715087097
- type: recall
name: Token Scores / ADRESSE / Recall
value: 0.9693877551020401
- type: f1
name: Token Scores / ADRESSE / F1
value: 0.975502420419539
- type: recall
name: Token Scores / ADRESSE / Redact
value: 0.9763848396501451
- type: accuracy
name: Token Scores / ADRESSE / Redact Full
value: 0.9665697674418601
- type: precision
name: Token Scores / DATE / Precision
value: 0.9899177066870131
- type: recall
name: Token Scores / DATE / Recall
value: 0.984285249810339
- type: f1
name: Token Scores / DATE / F1
value: 0.9870934434692821
- type: recall
name: Token Scores / DATE / Redact
value: 0.9884035981359051
- type: accuracy
name: Token Scores / DATE / Redact Full
value: 0.859011627906976
- type: precision
name: Token Scores / DATE_NAISSANCE / Precision
value: 0.9753867791842471
- type: recall
name: Token Scores / DATE_NAISSANCE / Recall
value: 0.968913726859937
- type: f1
name: Token Scores / DATE_NAISSANCE / F1
value: 0.972139477834238
- type: recall
name: Token Scores / DATE_NAISSANCE / Redact
value: 0.9933636046105481
- type: accuracy
name: Token Scores / DATE_NAISSANCE / Redact Full
value: 0.9941860465116271
- type: precision
name: Token Scores / IPP / Precision
value: 0.918987341772151
- type: recall
name: Token Scores / IPP / Recall
value: 0.9075000000000001
- type: f1
name: Token Scores / IPP / F1
value: 0.9132075471698111
- type: recall
name: Token Scores / IPP / Redact
value: 0.985
- type: accuracy
name: Token Scores / IPP / Redact Full
value: 0.9927325581395341
- type: precision
name: Token Scores / MAIL / Precision
value: 0.9609144542772861
- type: recall
name: Token Scores / MAIL / Recall
value: 0.9977029096477791
- type: f1
name: Token Scores / MAIL / F1
value: 0.978963185574755
- type: recall
name: Token Scores / MAIL / Redact
value: 0.9977029096477791
- type: accuracy
name: Token Scores / MAIL / Redact Full
value: 0.9970930232558141
- type: precision
name: Token Scores / NDA / Precision
value: 0.921428571428571
- type: recall
name: Token Scores / NDA / Recall
value: 0.834951456310679
- type: f1
name: Token Scores / NDA / F1
value: 0.8760611205432931
- type: recall
name: Token Scores / NDA / Redact
value: 0.87378640776699
- type: accuracy
name: Token Scores / NDA / Redact Full
value: 0.9723837209302321
- type: precision
name: Token Scores / NOM / Precision
value: 0.9439770896724531
- type: recall
name: Token Scores / NOM / Recall
value: 0.9525013545241101
- type: f1
name: Token Scores / NOM / F1
value: 0.948220064724919
- type: recall
name: Token Scores / NOM / Redact
value: 0.981578472096803
- type: accuracy
name: Token Scores / NOM / Redact Full
value: 0.895348837209302
- type: precision
name: Token Scores / PRENOM / Precision
value: 0.9348837209302321
- type: recall
name: Token Scores / PRENOM / Recall
value: 0.9663461538461531
- type: f1
name: Token Scores / PRENOM / F1
value: 0.950354609929078
- type: recall
name: Token Scores / PRENOM / Redact
value: 0.99002849002849
- type: accuracy
name: Token Scores / PRENOM / Redact Full
value: 0.9316860465116271
- type: precision
name: Token Scores / SECU / Precision
value: 0.882838283828382
- type: recall
name: Token Scores / SECU / Recall
value: 1
- type: f1
name: Token Scores / SECU / F1
value: 0.9377738825591581
- type: recall
name: Token Scores / SECU / Redact
value: 1
- type: accuracy
name: Token Scores / SECU / Redact Full
value: 1
- type: precision
name: Token Scores / TEL / Precision
value: 0.9746407438715131
- type: recall
name: Token Scores / TEL / Recall
value: 0.9993932564791541
- type: f1
name: Token Scores / TEL / F1
value: 0.9868618136688491
- type: recall
name: Token Scores / TEL / Redact
value: 0.999479934124989
- type: accuracy
name: Token Scores / TEL / Redact Full
value: 0.99563953488372
- type: precision
name: Token Scores / VILLE / Precision
value: 0.96684350132626
- type: recall
name: Token Scores / VILLE / Recall
value: 0.9376205787781351
- type: f1
name: Token Scores / VILLE / F1
value: 0.9520078354554351
- type: recall
name: Token Scores / VILLE / Redact
value: 0.9511254019292601
- type: accuracy
name: Token Scores / VILLE / Redact Full
value: 0.9113372093023251
- type: precision
name: Token Scores / ZIP / Precision
value: 0.9675036927621861
- type: recall
name: Token Scores / ZIP / Recall
value: 1
- type: f1
name: Token Scores / ZIP / F1
value: 0.983483483483483
- type: recall
name: Token Scores / ZIP / Redact
value: 1
- type: accuracy
name: Token Scores / ZIP / Redact Full
value: 1
- type: precision
name: Token Scores / micro / Precision
value: 0.970393736698084
- type: recall
name: Token Scores / micro / Recall
value: 0.9783320880510371
- type: f1
name: Token Scores / micro / F1
value: 0.9743467434960551
- type: recall
name: Token Scores / micro / Redact
value: 0.9884667701208881
- type: accuracy
name: Token Scores / micro / Redact Full
value: 0.6308139534883721
extra_gated_fields:
Organisation: text
Intended use of the model:
type: select
options:
- NLP Research
- Education
- Commercial Product
- Clinical Data Warehouse
- label: Other
value: other
EDS-Pseudo
This project aims at detecting identifying entities in documents, and was primarily tested on clinical reports at AP-HP's Clinical Data Warehouse (EDS).
The model is built on top of edsnlp and consists of a hybrid model (rule-based + deep learning), for which we provide rules (eds-pseudo/pipes) and a training recipe (train.py).
We also provide some fictitious templates (templates.txt) and a script to generate a synthetic dataset (generate_dataset.py).
The entities that are detected are listed below.
Label | Description |
---|---|
ADRESSE | Street address, e.g. 33 boulevard de Picpus |
DATE | Any absolute date other than a birthdate |
DATE_NAISSANCE | Birthdate |
HOPITAL | Hospital name, e.g. Hôpital Rothschild |
IPP | Internal AP-HP identifier for patients, displayed as a number |
MAIL | Email address |
NDA | Internal AP-HP identifier for visits, displayed as a number |
NOM | Any last name (patients, doctors, third parties) |
PRENOM | Any first name (patients, doctors, etc.) |
SECU | Social security number |
TEL | Any phone number |
VILLE | Any city |
ZIP | Any zip code |
Downloading the public pre-trained model
The public pretrained model is available on the HuggingFace model hub at AP-HP/eds-pseudo-public and was trained on synthetic data (see generate_dataset.py). You can also test it directly on the demo.
Install the latest version of edsnlp
pip install "edsnlp[ml]" -U
Get access to the model at AP-HP/eds-pseudo-public
Create and copy a huggingface token with permission "READ" at https://huggingface.co/settings/tokens?new_token=true
Register the token (only once) on your machine
import huggingface_hub
huggingface_hub.login(token=YOUR_TOKEN, new_session=False, add_to_git_credential=True)
Load the model
import edsnlp

nlp = edsnlp.load("AP-HP/eds-pseudo-public", auto_update=True)
doc = nlp(
    "En 2015, M. Charles-François-Bienvenu "
    "Myriel était évêque de Digne. C’était un vieillard "
    "d’environ soixante-quinze ans ; il occupait le "
    "siège de Digne depuis 2006."
)

for ent in doc.ents:
    print(ent, ent.label_, str(ent._.date))
To apply the model on many documents using one or more GPUs, refer to the documentation of edsnlp.
Metrics
AP-HP Pseudo Test Token Scores | Precision | Recall | F1 | Redact | Redact Full |
---|---|---|---|---|---|
ADRESSE | 98.2 | 96.9 | 97.6 | 97.6 | 96.7 |
DATE | 99 | 98.4 | 98.7 | 98.8 | 85.9 |
DATE_NAISSANCE | 97.5 | 96.9 | 97.2 | 99.3 | 99.4 |
IPP | 91.9 | 90.8 | 91.3 | 98.5 | 99.3 |
MAIL | 96.1 | 99.8 | 97.9 | 99.8 | 99.7 |
NDA | 92.1 | 83.5 | 87.6 | 87.4 | 97.2 |
NOM | 94.4 | 95.3 | 94.8 | 98.2 | 89.5 |
PRENOM | 93.5 | 96.6 | 95 | 99 | 93.2 |
SECU | 88.3 | 100 | 93.8 | 100 | 100 |
TEL | 97.5 | 99.9 | 98.7 | 99.9 | 99.6 |
VILLE | 96.7 | 93.8 | 95.2 | 95.1 | 91.1 |
ZIP | 96.8 | 100 | 98.3 | 100 | 100 |
micro | 97 | 97.8 | 97.4 | 98.8 | 63.1 |
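As a quick sanity check on the table above, the snippet below recomputes F1 from precision and recall. It assumes that the F1 column is the usual harmonic mean of the two, which the evaluation script may or may not compute in exactly this way. The `f1_score` helper is illustrative and not part of eds-pseudo. The input values are the unrounded ADRESSE and micro scores from the model card metadata.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Unrounded (precision, recall) pairs taken from the model card metadata
scores = {
    "ADRESSE": (0.981694715087097, 0.9693877551020401),
    "micro": (0.970393736698084, 0.9783320880510371),
}

for label, (precision, recall) in scores.items():
    print(f"{label}: F1 = {100 * f1_score(precision, recall):.1f}")
# ADRESSE: F1 = 97.6
# micro: F1 = 97.4
```

Both values agree with the F1 column of the table.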
Installation to reproduce
If you'd like to reproduce eds-pseudo's training or contribute to its development, you should first clone it:
git clone https://github.com/aphp/eds-pseudo.git
cd eds-pseudo
Then install the dependencies. We recommend pinning the library version in your projects, or using a strict package manager like Poetry.
poetry install
How to use without machine learning
import edsnlp
nlp = edsnlp.blank("eds")
# Some text cleaning
nlp.add_pipe("eds.normalizer")
# Various simple rules
nlp.add_pipe(
    "eds_pseudo.simple_rules",
    config={"pattern_keys": ["TEL", "MAIL", "SECU", "PERSON"]},
)
# Address detection
nlp.add_pipe("eds_pseudo.addresses")
# Date detection
nlp.add_pipe("eds_pseudo.dates")
# Contextual rules (requires a dict of info about the patient)
nlp.add_pipe("eds_pseudo.context")
# Apply it to a text
doc = nlp(
    "En 2015, M. Charles-François-Bienvenu "
    "Myriel était évêque de Digne. C’était un vieillard "
    "d’environ soixante-quinze ans ; il occupait le "
    "siège de Digne depuis 2006."
)
for ent in doc.ents:
    print(ent, ent.label_)
# 2015 DATE
# Charles-François-Bienvenu NOM
# Myriel PRENOM
# 2006 DATE
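Once entities are detected, a typical next step is to mask them in the text. The sketch below is one possible way to do this and is not part of eds-pseudo: the `redact` helper and the `<LABEL>` placeholder format are assumptions made for illustration. It works on plain `(start_char, end_char, label)` tuples, which could be built from the pipeline output with `[(e.start_char, e.end_char, e.label_) for e in doc.ents]`. Here the spans are written by hand so the example runs without any model.

```python
from typing import Iterable, Tuple

# (start_char, end_char, label), with end_char exclusive
Span = Tuple[int, int, str]


def redact(text: str, spans: Iterable[Span]) -> str:
    """Replace each character span with a "<LABEL>" placeholder.

    Spans are processed from left to right; a span that overlaps a
    previously replaced one is skipped.
    """
    pieces = []
    cursor = 0
    for start, end, label in sorted(spans):
        if start < cursor:
            # Overlaps the previous span: keep the earlier replacement only
            continue
        pieces.append(text[cursor:start])
        pieces.append(f"<{label}>")
        cursor = end
    pieces.append(text[cursor:])
    return "".join(pieces)


# Hand-written example spans (hypothetical, not actual model output)
text = "Consultation le 12/03/2021 à Paris."
date_start = text.index("12/03/2021")
city_start = text.index("Paris")
spans = [
    (date_start, date_start + len("12/03/2021"), "DATE"),
    (city_start, city_start + len("Paris"), "VILLE"),
]

redacted = redact(text, spans)
print(redacted)
# Consultation le <DATE> à <VILLE>.
```

Replacing spans with a fixed placeholder removes the identifying value but keeps the entity type visible. Substituting realistic surrogate values instead would need extra logic that this sketch does not cover.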
How to train
Before training a model, you should update the configs/config.cfg and pyproject.toml files to fit your needs.
Put your data in the data/dataset folder (or edit the paths in the configs/config.cfg file to point to data/gen_dataset/train.jsonl).
Then, run the training script
python scripts/train.py --config configs/config.cfg --seed 43
This will train a model and save it in artifacts/model-last. You can evaluate it on the test set (defaults to data/dataset/test.jsonl) with:
python scripts/evaluate.py --config configs/config.cfg
To package it, run:
python scripts/package.py
This will create a dist/eds-pseudo-aphp-***.whl file that you can install with pip install dist/eds-pseudo-aphp-***.
You can use it in your code:
import edsnlp
# Either from the model path directly
nlp = edsnlp.load("artifacts/model-last")
# Or from the wheel file
import eds_pseudo_aphp
nlp = eds_pseudo_aphp.load()
Documentation
Visit the documentation for more information!
Publication
Please find our publication at the following link: https://doi.org/mkfv.
If you use EDS-Pseudo, please cite us as below:
@article{eds_pseudo,
title={Development and validation of a natural language processing algorithm to pseudonymize documents in the context of a clinical data warehouse},
author={Tannier, Xavier and Wajsb{\"u}rt, Perceval and Calliger, Alice and Dura, Basile and Mouchet, Alexandre and Hilka, Martin and Bey, Romain},
journal={Methods of Information in Medicine},
year={2024},
publisher={Georg Thieme Verlag KG}
}
Acknowledgement
We would like to thank Assistance Publique – Hôpitaux de Paris and AP-HP Foundation for funding this project.