Update README.md
README.md CHANGED
@@ -58,21 +58,24 @@ The result is a medium-size biomedical corpus for Spanish composed of about 963M

 ## Evaluation and results

-The
+The models have been fine-tuned on three Named Entity Recognition (NER) tasks using three clinical NER datasets:

 - [PharmaCoNER](https://zenodo.org/record/4270158): is a track on chemical and drug mention recognition from Spanish medical texts (for more info see: https://temu.bsc.es/pharmaconer/).

 - [CANTEMIST](https://zenodo.org/record/3978041#.YTt5qH2xXbQ): is a shared task specifically focusing on named entity recognition of tumor morphology, in Spanish (for more info see: https://zenodo.org/record/3978041#.YTt5qH2xXbQ).

 - ICTUSnet: consists of 1,006 hospital discharge reports of patients admitted for stroke from 18 different Spanish hospitals. It contains more than 79,000 annotations for 51 different kinds of variables.

-
-
-|---------------------------|----------------------------|-------------------------------|-------------------------|
-| PharmaCoNER | **89.48** - **87.85** - **91.18** | 87.46 - 86.50 - 88.46 | 88.18 - 87.12 - 89.28 |
-| CANTEMIST | **83.87** - **81.70** - **86.17** | 82.61 - 81.12 - 84.15 | 82.42 - 80.91 - 84.00 |
-| ICTUSnet | **88.12** - **85.56** - **90.83** | 86.75 - 83.53 - 90.23 | 85.95 - 83.10 - 89.02 |
+
+We addressed the NER task as a token classification problem using a standard linear layer along with the BIO tagging schema. We compared our models with the general-domain Spanish [roberta-base-bne](https://huggingface.co/PlanTL-GOB-ES/roberta-base-bne), the general-domain multilingual model that supports Spanish [mBERT](https://huggingface.co/bert-base-multilingual-cased), the domain-specific English model [BioBERT](https://huggingface.co/dmis-lab/biobert-base-cased-v1.2), and three domain-specific models based on continual pre-training: [mBERT-Galén](https://ieeexplore.ieee.org/document/9430499), [XLM-R-Galén](https://ieeexplore.ieee.org/document/9430499) and [BETO-Galén](https://ieeexplore.ieee.org/document/9430499).
+
+The table below shows the F1 scores obtained:
+
+| Tasks/Models | bsc-bio-es | bsc-bio-ehr-es | XLM-R-Galén | BETO-Galén | mBERT-Galén | mBERT  | BioBERT | roberta-base-bne |
+|--------------|------------|----------------|-------------|------------|-------------|--------|---------|------------------|
+| PharmaCoNER  | 0.8907     | **0.8913**     | 0.8754      | 0.8537     | 0.8594      | 0.8671 | 0.8545  | 0.8474           |
+| CANTEMIST    | 0.8220     | **0.8340**     | 0.8078      | 0.8153     | 0.8168      | 0.8116 | 0.8070  | 0.7875           |
+| ICTUSnet     | 0.8727     | **0.8756**     | 0.8716      | 0.8498     | 0.8509      | 0.8631 | 0.8521  | 0.8677           |

+The fine-tuning scripts can be found in the official GitHub [repository](https://github.com/PlanTL-GOB-ES/lm-biomedical-clinical-es).

 ## Intended uses & limitations
@@ -86,57 +89,6 @@

 ---

-## How to use
-
-```python
-from transformers import AutoTokenizer, AutoModelForMaskedLM
-
-tokenizer = AutoTokenizer.from_pretrained("PlanTL-GOB-ES/roberta-base-biomedical-es")
-
-model = AutoModelForMaskedLM.from_pretrained("PlanTL-GOB-ES/roberta-base-biomedical-es")
-
-from transformers import pipeline
-
-unmasker = pipeline('fill-mask', model="PlanTL-GOB-ES/roberta-base-biomedical-es")
-
-unmasker("El único antecedente personal a reseñar era la <mask> arterial.")
-```
-```
-# Output
-[
-  {
-    "sequence": " El único antecedente personal a reseñar era la hipertensión arterial.",
-    "score": 0.9855039715766907,
-    "token": 3529,
-    "token_str": " hipertensión"
-  },
-  {
-    "sequence": " El único antecedente personal a reseñar era la diabetes arterial.",
-    "score": 0.0039140828885138035,
-    "token": 1945,
-    "token_str": " diabetes"
-  },
-  {
-    "sequence": " El único antecedente personal a reseñar era la hipotensión arterial.",
-    "score": 0.002484665485098958,
-    "token": 11483,
-    "token_str": " hipotensión"
-  },
-  {
-    "sequence": " El único antecedente personal a reseñar era la Hipertensión arterial.",
-    "score": 0.0023484621196985245,
-    "token": 12238,
-    "token_str": " Hipertensión"
-  },
-  {
-    "sequence": " El único antecedente personal a reseñar era la presión arterial.",
-    "score": 0.0008009297889657319,
-    "token": 2267,
-    "token_str": " presión"
-  }
-]
-```
-
 ## Funding
 This work was funded by the Spanish State Secretariat for Digitalization and Artificial Intelligence (SEDIA) within the framework of the Plan-TL.
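The added text describes addressing NER as token classification under the BIO tagging schema. The sketch below is not the project's fine-tuning code (that lives in the linked repository); it only illustrates how entity spans become per-token B/I/O labels that a linear classification layer would then predict. The tokens, spans, and the `QUIMICO` tag are hypothetical examples.

```python
# Minimal illustration of the BIO tagging schema used for NER fine-tuning.
# Hypothetical example; the real label set comes from each dataset's
# annotation guidelines, not from this sketch.

def spans_to_bio(tokens, spans):
    """Convert token-level entity spans into BIO labels.

    `spans` uses half-open token indices: (start, end, entity_type).
    """
    labels = ["O"] * len(tokens)  # every token starts as Outside
    for start, end, ent_type in spans:
        labels[start] = f"B-{ent_type}"       # first token of the entity
        for i in range(start + 1, end):
            labels[i] = f"I-{ent_type}"       # continuation tokens
    return labels

tokens = ["Se", "administró", "ácido", "acetilsalicílico", "."]
# One chemical mention covering tokens 2-3 ("ácido acetilsalicílico")
spans = [(2, 4, "QUIMICO")]

print(spans_to_bio(tokens, spans))
# ['O', 'O', 'B-QUIMICO', 'I-QUIMICO', 'O']
```

A fine-tuned model predicts one such label per token; the linear layer maps each token's hidden state to scores over this label set.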
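The scores in the new results table are entity-level F1 values. As a rough illustration of what that metric measures, here is a sketch of strict span-level F1, assuming an exact match on both span and entity type; the shared tasks ship their own official evaluation scripts, which this does not reproduce.

```python
# Sketch of strict entity-level F1: a prediction counts as correct only if
# its (start, end, type) triple exactly matches a gold annotation.
# Illustrative only; not the shared tasks' official scorers.

def entity_f1(gold, pred):
    """gold/pred: collections of (start, end, entity_type) tuples."""
    gold, pred = set(gold), set(pred)
    tp = len(gold & pred)                       # exact matches
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

gold = {(2, 4, "QUIMICO"), (7, 8, "QUIMICO")}   # two annotated mentions
pred = {(2, 4, "QUIMICO"), (9, 10, "QUIMICO")}  # one hit, one spurious
print(entity_f1(gold, pred))  # 0.5
```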