Token Classification
GLiNER
PyTorch
eriknovak committed on
Commit
4ac4f51
1 Parent(s): f6fb544

Update README.md

  1. README.md +47 -12
README.md CHANGED
@@ -22,11 +22,28 @@ GLiNER is a Named Entity Recognition (NER) model capable of identifying any enti
22
  This model has been trained by fine-tuning `urchade/gliner_multi_pii-v1` on a synthetic dataset covering PII for the domains `healthcare`, `finance`, `legal`, `banking`, and `general`.
23
 
24
  This model can recognize various types of *personally identifiable information* (PII), including but not limited to these entity types: `person`, `organization`, `phone number`, `address`, `passport number`, `email`, `credit card number`, `social security number`, `health insurance id number`, `date of birth`, `mobile phone number`, `bank account number`, `medication`, `cpf`, `driver's license number`, `tax identification number`, `medical condition`, `identity card number`, `national id number`, `ip address`, `email address`, `iban`, `credit card expiration date`, `username`, `health insurance number`, `registration number`, `student id number`, `insurance number`, `flight number`, `landline phone number`, `blood type`, `cvv`, `reservation number`, `digital signature`, `social media handle`, `license plate number`, `cnpj`, `postal code`, `serial number`, `vehicle registration number`, `credit card brand`, `fax number`, `visa number`, `insurance company`, `identity document number`, `transaction number`, `national health insurance number`, `cvc`, `birth certificate number`, `train ticket number`, and `passport expiration date`.
25
-
26
 
27
- ## English example
28
 
29
  ```python
30
  text = """
31
  Medical Record
32
 
@@ -46,17 +63,20 @@ Next Examination Date:
46
  15-11-2024
47
  """
48
 
49
- # Labels for entity prediction
 
50
  labels = ["name", "social security number", "date of birth", "date"]
51
 
52
- # Perform entity prediction
53
- entities = trained_model.predict_entities(text, labels, threshold=0.5)
54
 
55
- # Display predicted entities and their labels
56
  for entity in entities:
57
  print(entity["text"], "=>", entity["label"])
58
  ```
59
 
 
 
60
  ```text
61
  John Doe => name
62
  15-01-1985 => date of birth
@@ -66,9 +86,17 @@ John Doe => name
66
  15-11-2024 => date
67
  ```
68
 
69
- ## Dutch example
 
 
70
 
71
  ```python
72
  text = """
73
  Medisch dossier
74
 
@@ -89,17 +117,20 @@ Volgende onderzoekdatum:
89
  15-11-2024
90
  """
91
 
92
- # Labels for entity prediction
 
93
  labels = ["naam", "bmurgerservicenummer", "geboortedatum", "datum"]
94
 
95
- # Perform entity prediction
96
- entities = trained_model.predict_entities(text, labels, threshold=0.2)
97
 
98
- # Display predicted entities and their labels
99
  for entity in entities:
100
  print(entity["text"], "=>", entity["label"])
101
  ```
102
 
 
 
103
  ```text
104
  Jan de Vries => naam
105
  15-01-1985 => geboortedatum
@@ -107,4 +138,8 @@ Jan de Vries => naam
107
  987-65-4321 => bmurgerservicenummer
108
  Jan de Vries => naam
109
  15-11-2024 => datum
110
- ```
22
  This model has been trained by fine-tuning `urchade/gliner_multi_pii-v1` on a synthetic dataset covering PII for the domains `healthcare`, `finance`, `legal`, `banking`, and `general`.
23
 
24
  This model can recognize various types of *personally identifiable information* (PII), including but not limited to these entity types: `person`, `organization`, `phone number`, `address`, `passport number`, `email`, `credit card number`, `social security number`, `health insurance id number`, `date of birth`, `mobile phone number`, `bank account number`, `medication`, `cpf`, `driver's license number`, `tax identification number`, `medical condition`, `identity card number`, `national id number`, `ip address`, `email address`, `iban`, `credit card expiration date`, `username`, `health insurance number`, `registration number`, `student id number`, `insurance number`, `flight number`, `landline phone number`, `blood type`, `cvv`, `reservation number`, `digital signature`, `social media handle`, `license plate number`, `cnpj`, `postal code`, `serial number`, `vehicle registration number`, `credit card brand`, `fax number`, `visa number`, `insurance company`, `identity document number`, `transaction number`, `national health insurance number`, `cvc`, `birth certificate number`, `train ticket number`, and `passport expiration date`.
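
The entity types above are all written in lowercase, and the usage comments in this README note that the model should work best with lowercase entity types. As a minimal sketch (the `normalize_labels` helper is hypothetical and not part of the GLiNER library), user-supplied labels could be lowercased, whitespace-normalized, and de-duplicated before they are passed to the model:

```python
def normalize_labels(labels):
    """Lowercase labels, collapse whitespace, and drop duplicates (order preserved).

    Hypothetical helper for illustration; not part of the GLiNER API.
    """
    seen = set()
    result = []
    for label in labels:
        # Lowercase and collapse any run of whitespace into a single space
        cleaned = " ".join(label.lower().split())
        if cleaned and cleaned not in seen:
            seen.add(cleaned)
            result.append(cleaned)
    return result


labels = normalize_labels(["Name", "Social Security Number", "name ", "Date  of Birth"])
print(labels)
```

The resulting list can then be used as the `labels` argument in the examples of this README.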
 
25
 
 
26
 
27
+ ## Usage
28
+
29
+ To use the model, install the [GLiNER](https://github.com/urchade/GLiNER) library. Once it is installed, load the model and use it to extract entities from text.
30
+
31
+ ```bash
32
+ pip install gliner
33
+ ```
34
+
35
+ The following examples show its intended use.
36
+
37
+
38
+ ### Extract entities from English medical text
39
+
40
  ```python
41
+ from gliner import GLiNER
42
+
43
+ # initialize GLiNER with this model
44
+ model = GLiNER.from_pretrained("E3-JSI/gliner-multi-pii-domains-v1")
45
+
46
+ # prepare the text for entity extraction
47
  text = """
48
  Medical Record
49
 
 
63
  15-11-2024
64
  """
65
 
66
+ # prepare the labels/entities to be extracted
67
+ # this model should work best when entity types are in lowercase
68
  labels = ["name", "social security number", "date of birth", "date"]
69
 
70
+ # perform entity extraction
71
+ entities = model.predict_entities(text, labels, threshold=0.5)
72
 
73
+ # display predicted entities and their labels
74
  for entity in entities:
75
  print(entity["text"], "=>", entity["label"])
76
  ```
77
 
78
+ **Expected output**
79
+
80
  ```text
81
  John Doe => name
82
  15-01-1985 => date of birth
 
86
  15-11-2024 => date
87
  ```
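
Since the model targets PII, a common follow-up to the example above is masking the detected spans. The sketch below assumes each predicted entity is a dictionary that carries `start` and `end` character offsets alongside `text` and `label`, and that spans do not overlap. The `redact` helper, the sample sentence, and the scores are hypothetical and hand-written for illustration; they are not model output.

```python
def redact(text, entities, mask="[{label}]"):
    """Replace each entity span in `text` with a placeholder built from its label.

    Hypothetical helper; assumes non-overlapping spans with `start`/`end` offsets.
    """
    redacted = text
    # Work from the end of the text so earlier offsets stay valid after each replacement
    for entity in sorted(entities, key=lambda e: e["start"], reverse=True):
        placeholder = mask.format(label=entity["label"].upper())
        redacted = redacted[: entity["start"]] + placeholder + redacted[entity["end"]:]
    return redacted


# Hand-written entities in the assumed shape of the prediction results
sample_text = "Patient: John Doe, born 15-01-1985."
sample_entities = [
    {"start": 9, "end": 17, "text": "John Doe", "label": "name", "score": 0.98},
    {"start": 24, "end": 34, "text": "15-01-1985", "label": "date of birth", "score": 0.95},
]

print(redact(sample_text, sample_entities))
# prints: Patient: [NAME], born [DATE OF BIRTH].
```

Replacing from the last span backwards avoids recomputing offsets after each substitution.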
88
 
89
+
90
+
91
+ ### Extract entities from Dutch medical text
92
 
93
  ```python
94
+ from gliner import GLiNER
95
+
96
+ # initialize GLiNER with this model
97
+ model = GLiNER.from_pretrained("E3-JSI/gliner-multi-pii-domains-v1")
98
+
99
+ # prepare the text for entity extraction
100
  text = """
101
  Medisch dossier
102
 
 
117
  15-11-2024
118
  """
119
 
120
+ # prepare the labels/entities to be extracted
121
+ # this model should work best when entity types are in lowercase
122
  labels = ["naam", "bmurgerservicenummer", "geboortedatum", "datum"]
123
 
124
+ # perform entity extraction
125
+ entities = model.predict_entities(text, labels, threshold=0.2)
126
 
127
+ # display predicted entities and their labels
128
  for entity in entities:
129
  print(entity["text"], "=>", entity["label"])
130
  ```
131
 
132
+ **Expected output**
133
+
134
  ```text
135
  Jan de Vries => naam
136
  15-01-1985 => geboortedatum
 
138
  987-65-4321 => bmurgerservicenummer
139
  Jan de Vries => naam
140
  15-11-2024 => datum
141
+ ```
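
The English example uses `threshold=0.5` while the Dutch example uses `threshold=0.2`. One possible way to handle such differences, sketched here under the assumption that each predicted entity carries a `score` field, is to predict with a low threshold and then filter per label. The `filter_by_score` helper, the candidate list, and its scores are hypothetical and only illustrate the idea.

```python
def filter_by_score(entities, min_scores, default=0.5):
    """Keep entities whose score meets the minimum configured for their label.

    Hypothetical helper; labels missing from `min_scores` fall back to `default`.
    """
    return [
        entity
        for entity in entities
        if entity["score"] >= min_scores.get(entity["label"], default)
    ]


# Hand-written candidates standing in for predictions made with a low threshold
candidates = [
    {"text": "Jan de Vries", "label": "naam", "score": 0.91},
    {"text": "15-11-2024", "label": "datum", "score": 0.34},
    {"text": "dossier", "label": "naam", "score": 0.22},
]

# Accept dates from a score of 0.3, everything else from 0.5
kept = filter_by_score(candidates, {"datum": 0.3}, default=0.5)
for entity in kept:
    print(entity["text"], "=>", entity["label"])
```

Suitable per-label minimums depend on the language and the labels used, so they would need to be tuned on representative text.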
142
+
143
+ ## Acknowledgements
144
+
145
+ Funded by the European Union. UK participants in Horizon Europe Project PREPARE are supported by UKRI grant number 10086219 (Trilateral Research). Views and opinions expressed are however those of the author(s) only and do not necessarily reflect those of the European Union or European Health and Digital Executive Agency (HADEA) or UKRI. Neither the European Union nor the granting authority nor UKRI can be held responsible for them. Grant Agreement 101080288 PREPARE HORIZON-HLTH-2022-TOOL-12-01.