widget:
- text: "En el <mask> toraco-abdómino-pélvico no se encontraron hallazgos patológicos de interés."
---

# Biomedical language model for Spanish

## BibTeX citation

If you use any of these resources (datasets or models) in your work, please cite our latest paper:

```bibtex
@misc{carrino2021biomedical,
      title={Biomedical and Clinical Language Models for Spanish: On the Benefits of Domain-Specific Pretraining in a Mid-Resource Scenario},
      author={Casimiro Pio Carrino and Jordi Armengol-Estapé and Asier Gutiérrez-Fandiño and Joan Llop-Palao and Marc Pàmies and Aitor Gonzalez-Agirre and Marta Villegas},
      year={2021},
      eprint={2109.03570},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
```

## Model and tokenization

This model is a [RoBERTa-based](https://github.com/pytorch/fairseq/tree/master/examples/roberta) model trained on a biomedical-clinical corpus collected from several sources (see the next section).
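
As a quick illustration of the tokenization, the tokenizer can be inspected directly from the Hub. A minimal sketch (the model ID is the one used in the "How to use" section below):

```python
from transformers import AutoTokenizer

# Load the model's tokenizer from the Hub
tokenizer = AutoTokenizer.from_pretrained("BSC-TeMU/roberta-base-biomedical-es")

# Having been learned on biomedical text, the vocabulary should segment
# domain-specific terms into relatively few subword pieces
print(tokenizer.tokenize("En el TAC toraco-abdómino-pélvico no se encontraron hallazgos patológicos."))
```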

## Training corpora and preprocessing

The training corpus is composed of several biomedical corpora in Spanish, collected from publicly available corpora and crawlers:

| Name | No. tokens | Description |
|------|------------|-------------|
| [Medical crawler](https://zenodo.org/record/4561970) | 745,705,946 | Crawler of more than 3,000 URLs belonging to Spanish biomedical and health domains. |
| Clinical cases misc. | 102,855,267 | A miscellany of medical content, essentially clinical cases. Note that a clinical case report is a scientific publication where medical practitioners share patient cases, and that it is different from a clinical note or document. |
| [Scielo](https://github.com/PlanTL-SANIDAD/SciELO-Spain-Crawler) | 60,007,289 | Publications written in Spanish crawled from the Spanish SciELO server in 2017. |
| [BARR2_background](https://temu.bsc.es/BARR2/downloads/background_set.raw_text.tar.bz2) | 24,516,442 | Biomedical Abbreviation Recognition and Resolution (BARR2) background set, containing Spanish clinical case study sections from a variety of clinical disciplines. |
| Wikipedia_life_sciences | 13,890,501 | Wikipedia articles belonging to the Life Sciences category, crawled on 04/01/2021. |
| Patents | 13,463,387 | Google Patents in the medical domain for Spain (in Spanish). The accepted medical-domain codes for the patent JSON files are: "A61B", "A61C", "A61F", "A61H", "A61K", "A61L", "A61M", "A61P". |
| [EMEA](http://opus.nlpl.eu/download.php?f=EMEA/v3/moses/en-es.txt.zip) | 5,377,448 | Spanish-side documents extracted from parallel corpora made out of PDF documents from the European Medicines Agency. |
| [mespen_Medline](https://zenodo.org/record/3562536#.YTt1fH2xXbR) | 4,166,077 | Spanish-side articles extracted from a collection of Spanish-English parallel corpora of biomedical scientific literature, aggregated from the MedlinePlus source. |
| PubMed | 1,858,966 | Open-access articles from the PubMed repository, crawled in 2017. |

To obtain a high-quality training corpus, a cleaning pipeline with the following operations was applied (an illustrative sketch is shown below):

- data parsing of different formats
- sentence splitting
- language detection
- filtering of ill-formed sentences
- deduplication of repetitive contents
- preservation of original document boundaries

Finally, the corpora were concatenated and a further global deduplication across corpora was applied. The result is a medium-sized biomedical corpus for Spanish composed of about 860M tokens.
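
The sketch below illustrates the kind of pipeline described above; it is not the authors' actual code. The regex-based sentence splitter, the choice of `langdetect` as language detector, and the minimum-length filter are all assumptions made for the example:

```python
# Illustrative cleaning sketch (NOT the authors' pipeline): sentence splitting,
# language filtering, ill-formed-sentence filtering and exact deduplication,
# while keeping one output document per input document.
import re

from langdetect import detect  # assumed language detector


def clean_corpus(documents):
    """Clean an iterable of raw document strings; returns cleaned documents."""
    seen = set()
    cleaned_docs = []
    for doc in documents:
        kept = []
        # Naive sentence splitting on end-of-sentence punctuation
        for sent in re.split(r"(?<=[.!?])\s+", doc):
            sent = sent.strip()
            if len(sent.split()) < 3:  # filter ill-formed / too-short sentences
                continue
            try:
                if detect(sent) != "es":  # keep Spanish text only
                    continue
            except Exception:  # langdetect raises on ambiguous/empty input
                continue
            if sent in seen:  # deduplicate repeated content
                continue
            seen.add(sent)
            kept.append(sent)
        if kept:  # preserve original document boundaries
            cleaned_docs.append(" ".join(kept))
    return cleaned_docs
```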

## Evaluation and results

The model has been evaluated on Named Entity Recognition (NER) using the following datasets:

- [PharmaCoNER](https://zenodo.org/record/4270158): a track on chemical and drug mention recognition from Spanish medical texts (for more info see: https://temu.bsc.es/pharmaconer/).

- [CANTEMIST](https://zenodo.org/record/3978041#.YTt5qH2xXbQ): a shared task specifically focused on named entity recognition of tumor morphology in Spanish (for more info see: https://zenodo.org/record/3978041#.YTt5qH2xXbQ).

- ICTUSnet: consists of 1,006 hospital discharge reports of patients admitted for stroke from 18 different Spanish hospitals. It contains more than 79,000 annotations for 51 different kinds of variables.

The evaluation results are compared against the [mBERT](https://huggingface.co/bert-base-multilingual-cased) and [BETO](https://huggingface.co/dccuchile/bert-base-spanish-wwm-cased) models:

| F1 - Precision - Recall | roberta-base-biomedical-es | mBERT | BETO |
|-------------------------|----------------------------|-------|------|
| PharmaCoNER | **89.48** - **87.85** - **91.18** | 87.46 - 86.50 - 88.46 | 88.18 - 87.12 - 89.28 |
| CANTEMIST | **83.87** - **81.70** - **86.17** | 82.61 - 81.12 - 84.15 | 82.42 - 80.91 - 84.00 |
| ICTUSnet | **88.12** - **85.56** - **90.83** | 86.75 - 83.53 - 90.23 | 85.95 - 83.10 - 89.02 |
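
The scores above are entity-level F1, precision and recall. A hedged sketch of how such scores are typically computed, using the `seqeval` library (not necessarily the evaluation code used here) and hypothetical BIO tagsets:

```python
from seqeval.metrics import f1_score, precision_score, recall_score

# Hypothetical gold and predicted BIO tag sequences for two sentences
gold = [["O", "B-DRUG", "I-DRUG", "O"], ["B-DISEASE", "O"]]
pred = [["O", "B-DRUG", "I-DRUG", "O"], ["O", "O"]]

# One of two gold entities is recovered exactly, so:
# precision = 1.0, recall = 0.5, F1 ≈ 0.67
print(f1_score(gold, pred), precision_score(gold, pred), recall_score(gold, pred))
```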

## Intended uses & limitations

The model is ready to use only for masked language modelling, i.e. to perform the Fill Mask task (try the inference API or see the next section).

However, it is intended to be fine-tuned on downstream tasks such as Named Entity Recognition or Text Classification.
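
A minimal sketch of loading the checkpoint for such fine-tuning (illustrative; the `num_labels` value is a placeholder that depends on the target dataset's tagset):

```python
from transformers import AutoTokenizer, AutoModelForTokenClassification

tokenizer = AutoTokenizer.from_pretrained("BSC-TeMU/roberta-base-biomedical-es")

# num_labels is a hypothetical value; set it to the size of your NER tagset
model = AutoModelForTokenClassification.from_pretrained(
    "BSC-TeMU/roberta-base-biomedical-es", num_labels=9
)
# The pretrained MLM head is replaced by a freshly initialized token-classification
# head, which is then trained on labelled data (e.g. with the Trainer API).
```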

---

## How to use

```python
from transformers import AutoTokenizer, AutoModelForMaskedLM, pipeline

# Load the tokenizer and the pretrained masked language model
tokenizer = AutoTokenizer.from_pretrained("BSC-TeMU/roberta-base-biomedical-es")
model = AutoModelForMaskedLM.from_pretrained("BSC-TeMU/roberta-base-biomedical-es")

# Build a fill-mask pipeline on the same checkpoint
unmasker = pipeline('fill-mask', model="BSC-TeMU/roberta-base-biomedical-es")

unmasker("El único antecedente personal a reseñar era la <mask> arterial.")
```
```
# Output
[
  {
    "sequence": " El único antecedente personal a reseñar era la hipertensión arterial.",
    "score": 0.9855039715766907,
    "token": 3529,
    "token_str": " hipertensión"
  },
  {
    "sequence": " El único antecedente personal a reseñar era la diabetes arterial.",
    "score": 0.0039140828885138035,
    "token": 1945,
    "token_str": " diabetes"
  },
  {
    "sequence": " El único antecedente personal a reseñar era la hipotensión arterial.",
    "score": 0.002484665485098958,
    "token": 11483,
    "token_str": " hipotensión"
  },
  {
    "sequence": " El único antecedente personal a reseñar era la Hipertensión arterial.",
    "score": 0.0023484621196985245,
    "token": 12238,
    "token_str": " Hipertensión"
  },
  {
    "sequence": " El único antecedente personal a reseñar era la presión arterial.",
    "score": 0.0008009297889657319,
    "token": 2267,
    "token_str": " presión"
  }
]
```