Update readme
README.md
CHANGED
@@ -14,3 +14,129 @@ widget:
- text: "En el <mask> toraco-abdómino-pélvico no se encontraron hallazgos patológicos de interés."
---
# Biomedical language model for Spanish

## BibTeX citation

If you use any of these resources (datasets or models) in your work, please cite our latest paper:

```bibtex
@misc{carrino2021biomedical,
      title={Biomedical and Clinical Language Models for Spanish: On the Benefits of Domain-Specific Pretraining in a Mid-Resource Scenario},
      author={Casimiro Pio Carrino and Jordi Armengol-Estapé and Asier Gutiérrez-Fandiño and Joan Llop-Palao and Marc Pàmies and Aitor Gonzalez-Agirre and Marta Villegas},
      year={2021},
      eprint={2109.03570},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
```

## Model and tokenization
This model is a [RoBERTa-based](https://github.com/pytorch/fairseq/tree/master/examples/roberta) model trained on a
biomedical-clinical corpus collected from several sources (see next section).

## Training corpora and preprocessing

The training corpus is composed of several biomedical corpora in Spanish, collected from publicly available corpora and crawlers:

| Name | No. tokens | Description |
|------|------------|-------------|
| [Medical crawler](https://zenodo.org/record/4561970) | 745,705,946 | Crawler of more than 3,000 URLs belonging to Spanish biomedical and health domains. |
| Clinical cases misc. | 102,855,267 | A miscellany of medical content, essentially clinical cases. Note that a clinical case report is different from a scientific publication where medical practitioners share patient cases, and it is also different from a clinical note or document. |
| [Scielo](https://github.com/PlanTL-SANIDAD/SciELO-Spain-Crawler) | 60,007,289 | Publications written in Spanish crawled from the Spanish SciELO server in 2017. |
| [BARR2_background](https://temu.bsc.es/BARR2/downloads/background_set.raw_text.tar.bz2) | 24,516,442 | Biomedical Abbreviation Recognition and Resolution (BARR2) containing Spanish clinical case study sections from a variety of clinical disciplines. |
| Wikipedia_life_sciences | 13,890,501 | Wikipedia articles belonging to the Life Sciences category crawled on 04/01/2021. |
| Patents | 13,463,387 | Google Patents in the medical domain for Spain (Spanish). The accepted codes (medical domain) for the JSON files of patents are: "A61B", "A61C", "A61F", "A61H", "A61K", "A61L", "A61M", "A61B", "A61P". |
| [EMEA](http://opus.nlpl.eu/download.php?f=EMEA/v3/moses/en-es.txt.zip) | 5,377,448 | Spanish-side documents extracted from parallel corpora made out of PDF documents from the European Medicines Agency. |
| [mespen_Medline](https://zenodo.org/record/3562536#.YTt1fH2xXbR) | 4,166,077 | Spanish-side articles extracted from a collection of Spanish-English parallel corpora consisting of biomedical scientific literature. The collection of parallel resources is aggregated from the MedlinePlus source. |
| PubMed | 1,858,966 | Open-access articles from the PubMed repository crawled in 2017. |

To obtain a high-quality training corpus, a cleaning pipeline with the following operations has been applied:

- data parsing in different formats
- sentence splitting
- language detection
- filtering of ill-formed sentences
- deduplication of repetitive contents
- preservation of the original document boundaries

Finally, the corpora are concatenated and a further global deduplication across the corpora has been applied.
The result is a medium-size biomedical corpus for Spanish composed of about 860M tokens.
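
The cleaning steps listed above can be illustrated with a minimal sketch. It is not the pipeline used to build this corpus: the regex sentence splitter, the stop-word language heuristic, the thresholds, and every function name below are assumptions chosen only to show how sentence-level filtering and global deduplication can coexist with preserved document boundaries.

```python
import re

# Hypothetical stop-word list for a crude Spanish check; a real pipeline
# would rely on a proper language-identification model.
SPANISH_HINTS = {"el", "la", "de", "que", "en", "los", "las", "se", "no",
                 "con", "un", "una", "por", "del"}


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def looks_spanish(sentence):
    """Assumed heuristic: the sentence has at least one common Spanish function word."""
    words = re.findall(r"\w+", sentence.lower())
    return any(w in SPANISH_HINTS for w in words)


def is_well_formed(sentence, min_words=3, min_alpha_ratio=0.6):
    """Assumed filter: enough words and mostly alphabetic characters."""
    if len(sentence.split()) < min_words:
        return False
    letters = sum(ch.isalpha() for ch in sentence)
    return letters / len(sentence) > min_alpha_ratio


def clean_corpus(documents):
    """Return one list of kept sentences per document.

    Sentences are deduplicated globally (across all documents), while each
    surviving sentence stays inside the document it came from.
    """
    seen = set()
    cleaned = []
    for doc in documents:
        kept = []
        for sentence in split_sentences(doc):
            key = sentence.lower()
            if key in seen:
                continue
            if looks_spanish(sentence) and is_well_formed(sentence):
                seen.add(key)
                kept.append(sentence)
        if kept:
            cleaned.append(kept)
    return cleaned


docs = [
    "El paciente no presenta fiebre. El paciente no presenta fiebre. 12345 ---",
    "The patient was discharged yesterday. Se inició tratamiento con antibióticos.",
]
result = clean_corpus(docs)
for doc_sentences in result:
    print(doc_sentences)
```

In this toy input the repeated sentence, the numeric debris, and the English sentence are dropped, and the two remaining sentences stay in separate documents.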

## Evaluation and results

The model has been evaluated on Named Entity Recognition (NER) using the following datasets:

- [PharmaCoNER](https://zenodo.org/record/4270158): a track on chemical and drug mention recognition from Spanish medical texts (for more info see: https://temu.bsc.es/pharmaconer/).

- [CANTEMIST](https://zenodo.org/record/3978041#.YTt5qH2xXbQ): a shared task specifically focusing on named entity recognition of tumor morphology in Spanish (for more info see: https://zenodo.org/record/3978041#.YTt5qH2xXbQ).

- ICTUSnet: consists of 1,006 hospital discharge reports of patients admitted for stroke from 18 different Spanish hospitals. It contains more than 79,000 annotations for 51 different kinds of variables.

The evaluation results are compared against the [mBERT](https://huggingface.co/bert-base-multilingual-cased) and [BETO](https://huggingface.co/dccuchile/bert-base-spanish-wwm-cased) models:

| F1 - Precision - Recall | roberta-base-biomedical-es | mBERT | BETO |
|-------------------------|----------------------------|-------|------|
| PharmaCoNER | **89.48** - **87.85** - **91.18** | 87.46 - 86.50 - 88.46 | 88.18 - 87.12 - 89.28 |
| CANTEMIST | **83.87** - **81.70** - **86.17** | 82.61 - 81.12 - 84.15 | 82.42 - 80.91 - 84.00 |
| ICTUSnet | **88.12** - **85.56** - **90.83** | 86.75 - 83.53 - 90.23 | 85.95 - 83.10 - 89.02 |
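
As a reading aid for the table, the sketch below recomputes F1 from the reported precision and recall. It assumes the F1 column is the harmonic mean of the other two values, which the source does not state explicitly; the function name is illustrative.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (same scale as the inputs)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) for roberta-base-biomedical-es,
# copied from the results table.
reported = {
    "PharmaCoNER": (87.85, 91.18, 89.48),
    "CANTEMIST": (81.70, 86.17, 83.87),
    "ICTUSnet": (85.56, 90.83, 88.12),
}

for dataset, (p, r, f1) in reported.items():
    print(f"{dataset}: recomputed F1 = {f1_score(p, r):.2f}, reported F1 = {f1:.2f}")
```

Differences in the last digit can appear because the precision and recall in the table are themselves rounded.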


## Intended uses & limitations

The model is ready to use only for masked language modelling, that is, to perform the Fill Mask task (try the inference API or read the next section).

However, it is intended to be fine-tuned on downstream tasks such as Named Entity Recognition or Text Classification.
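
For the NER case, one common preparation step is mapping word-level labels onto subword tokens. The sketch below is an assumption about how such fine-tuning could be set up, not the procedure used for the results above: `align_labels`, `load_for_token_classification`, and the label scheme are hypothetical names, while `AutoModelForTokenClassification` and the tokenizer's `word_ids()` output are part of the `transformers` library.

```python
def align_labels(word_ids, word_labels, ignore_index=-100):
    """Map word-level label ids onto subword tokens.

    word_ids: list in the format returned by a fast tokenizer's `word_ids()`,
              i.e. None for special tokens, otherwise the source word index.
    Only the first subword of each word keeps the label. Later subwords and
    special tokens get `ignore_index` (-100 is the default index ignored by
    PyTorch's cross-entropy loss).
    """
    aligned = []
    previous = None
    for word_id in word_ids:
        if word_id is None or word_id == previous:
            aligned.append(ignore_index)
        else:
            aligned.append(word_labels[word_id])
        previous = word_id
    return aligned


def load_for_token_classification(num_labels):
    """Load the checkpoint with a new, randomly initialised NER head.

    Needs the `transformers` package and network access, so it is defined
    here but not called in this sketch.
    """
    from transformers import AutoModelForTokenClassification, AutoTokenizer

    name = "BSC-TeMU/roberta-base-biomedical-es"
    # add_prefix_space=True lets a RoBERTa-style tokenizer accept
    # pre-split words (is_split_into_words=True).
    tokenizer = AutoTokenizer.from_pretrained(name, add_prefix_space=True)
    model = AutoModelForTokenClassification.from_pretrained(name, num_labels=num_labels)
    return tokenizer, model


# Hypothetical example: three words, the second split into two subwords.
example_word_ids = [None, 0, 1, 1, 2, None]
example_labels = [0, 1, 0]  # e.g. O, B-DRUG, O in an assumed label scheme
aligned = align_labels(example_word_ids, example_labels)
print(aligned)
```

Labelling only the first subword is one design choice among several; labelling every subword is also used in practice.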

---

## How to use

```python
from transformers import AutoModelForMaskedLM, AutoTokenizer, pipeline

model_name = "BSC-TeMU/roberta-base-biomedical-es"

# Load the tokenizer and the masked-language-model weights from the Hub.
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)

# Build a fill-mask pipeline that reuses the objects loaded above.
unmasker = pipeline("fill-mask", model=model, tokenizer=tokenizer)

# Rank the most likely fillers for the <mask> token.
unmasker("El único antecedente personal a reseñar era la <mask> arterial.")
```
```
# Output
[
  {
    "sequence": " El único antecedente personal a reseñar era la hipertensión arterial.",
    "score": 0.9855039715766907,
    "token": 3529,
    "token_str": " hipertensión"
  },
  {
    "sequence": " El único antecedente personal a reseñar era la diabetes arterial.",
    "score": 0.0039140828885138035,
    "token": 1945,
    "token_str": " diabetes"
  },
  {
    "sequence": " El único antecedente personal a reseñar era la hipotensión arterial.",
    "score": 0.002484665485098958,
    "token": 11483,
    "token_str": " hipotensión"
  },
  {
    "sequence": " El único antecedente personal a reseñar era la Hipertensión arterial.",
    "score": 0.0023484621196985245,
    "token": 12238,
    "token_str": " Hipertensión"
  },
  {
    "sequence": " El único antecedente personal a reseñar era la presión arterial.",
    "score": 0.0008009297889657319,
    "token": 2267,
    "token_str": " presión"
  }
]
```
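
The pipeline returns a list of dictionaries, one per candidate. As a small sketch of post-processing that list, the code below works on values copied from the output above (the `sequence` and `token` fields are omitted for brevity). The helper name `confident_fillers` and the 0.5 threshold are assumptions for illustration.

```python
# Predictions copied from the example output; two fields are omitted.
predictions = [
    {"score": 0.9855039715766907, "token_str": " hipertensión"},
    {"score": 0.0039140828885138035, "token_str": " diabetes"},
    {"score": 0.002484665485098958, "token_str": " hipotensión"},
    {"score": 0.0023484621196985245, "token_str": " Hipertensión"},
    {"score": 0.0008009297889657319, "token_str": " presión"},
]


def confident_fillers(preds, threshold=0.5):
    """Return candidate tokens (whitespace stripped) scoring at least `threshold`."""
    return [p["token_str"].strip() for p in preds if p["score"] >= threshold]


# Highest-scoring candidate, selected explicitly rather than relying on list order.
best = max(predictions, key=lambda p: p["score"])
top_token = best["token_str"].strip()

fillers = confident_fillers(predictions)

# Probability mass covered by the five returned candidates.
total_mass = sum(p["score"] for p in predictions)

print(top_token, fillers, round(total_mass, 4))
```

Each `token_str` carries a leading space, which appears to come from the tokenizer treating the preceding space as part of the token, so stripping it before display is usually convenient.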