mmarimon committed on
Commit
e01cca2
·
1 Parent(s): e3bc7cb

Update README.md

  1. README.md +106 -66
README.md CHANGED
@@ -14,10 +14,94 @@ widget:
14
  ---
15
 
16
  # Biomedical language model for Spanish
17
  Biomedical pretrained language model for Spanish. For more details about the corpus, the pretraining and the evaluation, check the official [repository](https://github.com/PlanTL-SANIDAD/lm-biomedical-clinical-es) and read our [preprint](https://arxiv.org/abs/2109.03570) "_Carrino, C. P., Armengol-Estapé, J., Gutiérrez-Fandiño, A., Llop-Palao, J., Pàmies, M., Gonzalez-Agirre, A., & Villegas, M. (2021). Biomedical and Clinical Language Models for Spanish: On the Benefits of Domain-Specific Pretraining in a Mid-Resource Scenario._".
18
 
19
 
20
- ## Tokenization and model pretraining
21
 
22
  This model is a [RoBERTa-based](https://github.com/pytorch/fairseq/tree/master/examples/roberta) model trained on a
23
  **biomedical** corpus in Spanish collected from several sources (see next section).
@@ -25,7 +109,7 @@ The training corpus has been tokenized using a byte version of [Byte-Pair Encodi
25
  used in the original [RoBERTa](https://github.com/pytorch/fairseq/tree/master/examples/roberta) model with a vocabulary size of 52,000 tokens. The pretraining consists of masked language model training at the subword level, following the approach employed for the RoBERTa base model with the same hyperparameters as in the original work. The training lasted a total of 48 hours on 16 NVIDIA V100 GPUs with 16GB of memory each, using the Adam optimizer with a peak learning rate of 0.0005 and an effective batch size of 2,048 sentences.
26
 
27
 
28
- ## Training corpora and preprocessing
29
 
30
  The training corpus is composed of several biomedical corpora in Spanish, collected from publicly available corpora and crawlers.
31
  To obtain a high-quality training corpus, a cleaning pipeline with the following operations has been applied:
@@ -74,13 +158,27 @@ The evaluation results are compared against the [mBERT](https://huggingface.co/b
74
  | ICTUSnet | **88.12** - **85.56** - **90.83** | 86.75 - 83.53 - 90.23 | 85.95 - 83.10 - 89.02 |
75
 
76
 
77
- ## Intended uses & limitations
78
 
79
- The model is ready to use only for masked language modelling, that is, the Fill Mask task (try the inference API or read the next section).
80
 
81
- However, the model is intended to be fine-tuned on downstream tasks such as Named Entity Recognition or Text Classification.
82
 
83
- ## Cite
84
  If you use our models, please cite our latest preprint:
85
 
86
  ```bibtex
@@ -111,69 +209,11 @@ If you use our Medical Crawler corpus, please cite the preprint:
111
 
112
  ```
113
 
114
- ---
115
-
116
- ## How to use
117
-
118
- ```python
119
- from transformers import AutoTokenizer, AutoModelForMaskedLM
120
-
121
- tokenizer = AutoTokenizer.from_pretrained("BSC-TeMU/roberta-base-biomedical-es")
122
-
123
- model = AutoModelForMaskedLM.from_pretrained("BSC-TeMU/roberta-base-biomedical-es")
124
-
125
- from transformers import pipeline
126
-
127
- unmasker = pipeline('fill-mask', model="BSC-TeMU/roberta-base-biomedical-es")
128
-
129
- unmasker("El único antecedente personal a reseñar era la <mask> arterial.")
130
- ```
131
- ```
132
- # Output
133
- [
134
- {
135
- "sequence": " El único antecedente personal a reseñar era la hipertensión arterial.",
136
- "score": 0.9855039715766907,
137
- "token": 3529,
138
- "token_str": " hipertensión"
139
- },
140
- {
141
- "sequence": " El único antecedente personal a reseñar era la diabetes arterial.",
142
- "score": 0.0039140828885138035,
143
- "token": 1945,
144
- "token_str": " diabetes"
145
- },
146
- {
147
- "sequence": " El único antecedente personal a reseñar era la hipotensión arterial.",
148
- "score": 0.002484665485098958,
149
- "token": 11483,
150
- "token_str": " hipotensión"
151
- },
152
- {
153
- "sequence": " El único antecedente personal a reseñar era la Hipertensión arterial.",
154
- "score": 0.0023484621196985245,
155
- "token": 12238,
156
- "token_str": " Hipertensión"
157
- },
158
- {
159
- "sequence": " El único antecedente personal a reseñar era la presión arterial.",
160
- "score": 0.0008009297889657319,
161
- "token": 2267,
162
- "token_str": " presión"
163
- }
164
- ]
165
- ```
166
- ## Copyright
167
-
168
- Copyright by the Spanish State Secretariat for Digitalization and Artificial Intelligence (SEDIA) (2022)
169
-
170
- ## Licensing information
171
 
172
- [Apache License, Version 2.0](https://www.apache.org/licenses/LICENSE-2.0)
173
 
174
- ## Funding
175
 
176
- This work was funded by the Spanish State Secretariat for Digitalization and Artificial Intelligence (SEDIA) within the framework of the Plan-TL.
177
 
178
  ### Disclaimer
179
 
 
14
  ---
15
 
16
  # Biomedical language model for Spanish
17
+
18
+ ## Table of contents
19
+ <details>
20
+ <summary>Click to expand</summary>
21
+
22
+ - [Model Description](#model-description)
23
+ - [Intended Uses and Limitations](#intended-use)
24
+ - [How to Use](#how-to-use)
25
+ - [Limitations and bias](#limitations-and-bias)
26
+ - [Training](#training)
27
+ - [Tokenization and model pretraining](#Tokenization-pretraining)
28
+ - [Training corpora and preprocessing](#training-corpora-preprocessing)
29
+ - [Evaluation and results](#evaluation)
30
+ - [Additional Information](#additional-information)
31
+ - [Contact Information](#contact-information)
32
+ - [Copyright](#copyright)
33
+ - [Licensing Information](#licensing-information)
34
+ - [Funding](#funding)
35
+ - [Citation Information](#citation-information)
36
+ - [Contributions](#contributions)
37
+ - [Disclaimer](#disclaimer)
38
+
39
+ </details>
40
+
41
+ ## Model description
42
  Biomedical pretrained language model for Spanish. For more details about the corpus, the pretraining and the evaluation, check the official [repository](https://github.com/PlanTL-SANIDAD/lm-biomedical-clinical-es) and read our [preprint](https://arxiv.org/abs/2109.03570) "_Carrino, C. P., Armengol-Estapé, J., Gutiérrez-Fandiño, A., Llop-Palao, J., Pàmies, M., Gonzalez-Agirre, A., & Villegas, M. (2021). Biomedical and Clinical Language Models for Spanish: On the Benefits of Domain-Specific Pretraining in a Mid-Resource Scenario._".
43
 
44
+ ## Intended uses & limitations
45
+
46
+ The model is ready to use only for masked language modelling, that is, the Fill Mask task (try the inference API or read the next section).
47
+
48
+ However, the model is intended to be fine-tuned on downstream tasks such as Named Entity Recognition or Text Classification.
49
+
50
+
51
+ ## How to use
52
+
53
+ ```python
54
+ from transformers import AutoTokenizer, AutoModelForMaskedLM
55
+
56
+ tokenizer = AutoTokenizer.from_pretrained("BSC-TeMU/roberta-base-biomedical-es")
57
+
58
+ model = AutoModelForMaskedLM.from_pretrained("BSC-TeMU/roberta-base-biomedical-es")
59
 
60
+ from transformers import pipeline
61
+
62
+ unmasker = pipeline('fill-mask', model="BSC-TeMU/roberta-base-biomedical-es")
63
+
64
+ unmasker("El único antecedente personal a reseñar era la <mask> arterial.")
65
+ ```
66
+ ```
67
+ # Output
68
+ [
69
+ {
70
+ "sequence": " El único antecedente personal a reseñar era la hipertensión arterial.",
71
+ "score": 0.9855039715766907,
72
+ "token": 3529,
73
+ "token_str": " hipertensión"
74
+ },
75
+ {
76
+ "sequence": " El único antecedente personal a reseñar era la diabetes arterial.",
77
+ "score": 0.0039140828885138035,
78
+ "token": 1945,
79
+ "token_str": " diabetes"
80
+ },
81
+ {
82
+ "sequence": " El único antecedente personal a reseñar era la hipotensión arterial.",
83
+ "score": 0.002484665485098958,
84
+ "token": 11483,
85
+ "token_str": " hipotensión"
86
+ },
87
+ {
88
+ "sequence": " El único antecedente personal a reseñar era la Hipertensión arterial.",
89
+ "score": 0.0023484621196985245,
90
+ "token": 12238,
91
+ "token_str": " Hipertensión"
92
+ },
93
+ {
94
+ "sequence": " El único antecedente personal a reseñar era la presión arterial.",
95
+ "score": 0.0008009297889657319,
96
+ "token": 2267,
97
+ "token_str": " presión"
98
+ }
99
+ ]
100
+ ```
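
The pipeline call returns a list of dictionaries, one per candidate, and in the output above they appear in descending score order. As a minimal sketch of consuming that structure, the snippet below copies the first two entries as literal data (no model is loaded) and picks the highest-scoring candidate. The helper name `best_prediction` is hypothetical and not part of `transformers`.

```python
# Two entries copied from the output above; no model is loaded here.
predictions = [
    {
        "sequence": " El único antecedente personal a reseñar era la hipertensión arterial.",
        "score": 0.9855039715766907,
        "token": 3529,
        "token_str": " hipertensión",
    },
    {
        "sequence": " El único antecedente personal a reseñar era la diabetes arterial.",
        "score": 0.0039140828885138035,
        "token": 1945,
        "token_str": " diabetes",
    },
]


def best_prediction(candidates):
    """Return (word, score) for the highest-scoring candidate (hypothetical helper)."""
    top = max(candidates, key=lambda c: c["score"])
    # token_str keeps the leading space of the subword, so strip it for display.
    return top["token_str"].strip(), top["score"]


word, score = best_prediction(predictions)
print(f"{word} ({score:.3f})")
```

Using `max` rather than indexing the first element keeps the helper correct even if the candidates are later filtered or reordered.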
101
+
102
+ ## Training
103
+
104
+ ### Tokenization and model pretraining
105
 
106
  This model is a [RoBERTa-based](https://github.com/pytorch/fairseq/tree/master/examples/roberta) model trained on a
107
  **biomedical** corpus in Spanish collected from several sources (see next section).
 
109
  used in the original [RoBERTa](https://github.com/pytorch/fairseq/tree/master/examples/roberta) model with a vocabulary size of 52,000 tokens. The pretraining consists of masked language model training at the subword level, following the approach employed for the RoBERTa base model with the same hyperparameters as in the original work. The training lasted a total of 48 hours on 16 NVIDIA V100 GPUs with 16GB of memory each, using the Adam optimizer with a peak learning rate of 0.0005 and an effective batch size of 2,048 sentences.
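
To make the masked language modelling objective more concrete, here is a minimal sketch of BERT/RoBERTa-style token masking in plain Python. The 15% selection rate and the 80/10/10 replacement split are the values commonly described for BERT and RoBERTa and are assumed here, since this card does not list them. The whitespace tokens, the toy vocabulary, and the function name `mask_tokens` are illustrative only and do not reflect the actual fairseq pipeline or the 52,000-token BPE vocabulary.

```python
import random

MASK = "<mask>"


def mask_tokens(tokens, vocab, mask_prob=0.15, seed=None):
    """Return (masked_tokens, labels); labels are None where no prediction is required."""
    rng = random.Random(seed)
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            labels.append(tok)  # the model must recover the original token
            roll = rng.random()
            if roll < 0.8:
                masked.append(MASK)  # 80%: replace with the mask token
            elif roll < 0.9:
                masked.append(rng.choice(vocab))  # 10%: replace with a random token
            else:
                masked.append(tok)  # 10%: keep the token unchanged
        else:
            labels.append(None)
            masked.append(tok)
    return masked, labels


tokens = "El paciente presenta hipertensión arterial desde hace años".split()
masked, labels = mask_tokens(tokens, vocab=tokens, seed=0)
print(masked)
print(labels)
```

Calling the function without a fixed seed draws a fresh mask on every call, which loosely mirrors the dynamic masking described for RoBERTa.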
110
 
111
 
112
+ ### Training corpora and preprocessing
113
 
114
  The training corpus is composed of several biomedical corpora in Spanish, collected from publicly available corpora and crawlers.
115
  To obtain a high-quality training corpus, a cleaning pipeline with the following operations has been applied:
 
158
  | ICTUSnet | **88.12** - **85.56** - **90.83** | 86.75 - 83.53 - 90.23 | 85.95 - 83.10 - 89.02 |
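
Each cell in the ICTUSnet row holds three numbers. Assuming they are F1, precision, and recall in that order (an assumption, since the column header is not shown in this excerpt), the first value should be the harmonic mean of the other two. A quick check in Python, using the hypothetical helper `f1_from`:

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall (hypothetical helper)."""
    return 2 * precision * recall / (precision + recall)


# (first, second, third) value of each of the three ICTUSnet cells above.
ictusnet_cells = [
    (88.12, 85.56, 90.83),
    (86.75, 83.53, 90.23),
    (85.95, 83.10, 89.02),
]

for reported, precision, recall in ictusnet_cells:
    computed = f1_from(precision, recall)
    # The inputs are rounded to two decimals, so small differences are expected.
    print(f"reported={reported:.2f} computed={computed:.2f}")
```

The recomputed values agree with the reported ones to within about 0.01, which is consistent with the assumed ordering.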
159
 
160
 
 
161
 
162
+ ## Additional information
163
 
164
+ ### Contact Information
165
+
166
+ For further information, send an email to <[email protected]>
167
+
168
+ ### Copyright
169
 
170
+ Copyright by the Spanish State Secretariat for Digitalization and Artificial Intelligence (SEDIA) (2022)
171
+
172
+ ### Licensing information
173
+
174
+ [Apache License, Version 2.0](https://www.apache.org/licenses/LICENSE-2.0)
175
+
176
+ ### Funding
177
+
178
+ This work was funded by the Spanish State Secretariat for Digitalization and Artificial Intelligence (SEDIA) within the framework of the Plan-TL.
179
+
180
+
181
+ ## Citation Information
182
  If you use our models, please cite our latest preprint:
183
 
184
  ```bibtex
 
209
 
210
  ```
211
 
212
 
213
+ ### Contributions
214
 
215
+ [N/A]
216
 
 
217
 
218
  ### Disclaimer
219