Commit 1e543ad by mmarimon · 1 Parent(s): 4d374fe

Update README.md

  1. README.md +63 -28
README.md CHANGED
@@ -16,16 +16,51 @@ widget:
16
  ---
17
 
18
  # Biomedical-clinical language model for Spanish
19
  Biomedical pretrained language model for Spanish. For more details about the corpus, the pretraining and the evaluation, check the official [repository](https://github.com/PlanTL-GOB-ES/lm-biomedical-clinical-es).
20
 
21
- ## Tokenization and model pretraining
22
  This model is a [RoBERTa-based](https://github.com/pytorch/fairseq/tree/master/examples/roberta) model trained on a
23
  **biomedical-clinical** corpus in Spanish collected from several sources (see next section).
24
  The training corpus has been tokenized using a byte version of [Byte-Pair Encoding (BPE)](https://github.com/openai/gpt-2)
25
  used in the original [RoBERTa](https://github.com/pytorch/fairseq/tree/master/examples/roberta) model with a vocabulary size of 52,000 tokens. The pretraining consists of masked language model training at the subword level, following the approach employed for the RoBERTa base model with the same hyperparameters as in the original work. The training lasted a total of 48 hours on 16 NVIDIA V100 GPUs with 16 GB of memory each, using the Adam optimizer with a peak learning rate of 0.0005 and an effective batch size of 2,048 sentences.
26
 
27
- ## Training corpora and preprocessing
28
-
29
  The training corpus is composed of several biomedical corpora in Spanish, collected from publicly available corpora and crawlers, and a real-world clinical corpus collected from more than 278K clinical documents and notes. To obtain a high-quality training corpus while retaining the idiosyncrasies of the clinical language, a cleaning pipeline has been applied only to the biomedical corpora, keeping the clinical corpus uncleaned. Essentially, the cleaning operations used are:
30
 
31
  - data parsing in different formats
@@ -53,10 +88,7 @@ Eventually, the clinical corpus is concatenated to the cleaned biomedical corpus
53
  | PubMed | 1,858,966 | Open-access articles from the PubMed repository crawled in 2017. |
54
 
55
 
56
-
57
- ## Evaluation and results
58
-
59
-
60
  The model has been fine-tuned on three Named Entity Recognition (NER) tasks using three clinical NER datasets:
61
 
62
  - [PharmaCoNER](https://zenodo.org/record/4270158): a track on chemical and drug mention recognition from Spanish medical texts (for more information, see https://temu.bsc.es/pharmaconer/).
@@ -76,13 +108,26 @@ The table below shows the F1 scores obtained:
76
 
77
 
78
  The fine-tuning scripts can be found in the official GitHub [repository](https://github.com/PlanTL-GOB-ES/lm-biomedical-clinical-es).
79
- ## Intended uses & limitations
80
 
81
- The model is ready-to-use only for masked language modelling to perform the Fill Mask task (try the inference API or read the next section)
82
 
83
- However, the is intended to be fine-tuned on downstream tasks such as Named Entity Recognition or Text Classification.
84
 
85
- ## Cite
86
  If you use these models, please cite our work:
87
 
88
  ```bibtex
@@ -109,31 +154,21 @@ If you use these models, please cite our work:
109
  }
110
  ```
111
 
112
- ---
113
-
114
- ## Copyright
115
-
116
- Copyright by the Spanish State Secretariat for Digitalization and Artificial Intelligence (SEDIA) (2022)
117
-
118
- ## Licensing information
119
-
120
- [Apache License, Version 2.0](https://www.apache.org/licenses/LICENSE-2.0)
121
-
122
- ## Funding
123
-
124
- This work was funded by the Spanish State Secretariat for Digitalization and Artificial Intelligence (SEDIA) within the framework of the Plan-TL.
125
 
126
- ## Disclaimer
 
127
 
128
  The models published in this repository are intended for a generalist purpose and are available to third parties. These models may have bias and/or any other undesirable distortions.
129
 
130
- When third parties, deploy or provide systems and/or services to other parties using any of these models (or using systems based on these models) or become users of the models, they should note that it is their responsibility to mitigate the risks arising from their use and, in any event, to comply with applicable regulations, including regulations regarding the use of artificial intelligence.
131
 
132
- In no event shall the owner of the models (SEDIA – State Secretariat for digitalization and artificial intelligence) nor the creator (BSC – Barcelona Supercomputing Center) be liable for any results arising from the use made by third parties of these models.
133
 
134
 
135
  Los modelos publicados en este repositorio tienen una finalidad generalista y están a disposición de terceros. Estos modelos pueden tener sesgos y/u otro tipo de distorsiones indeseables.
136
 
137
  Cuando terceros desplieguen o proporcionen sistemas y/o servicios a otras partes usando alguno de estos modelos (o utilizando sistemas basados en estos modelos) o se conviertan en usuarios de los modelos, deben tener en cuenta que es su responsabilidad mitigar los riesgos derivados de su uso y, en todo caso, cumplir con la normativa aplicable, incluyendo la normativa en materia de uso de inteligencia artificial.
138
 
139
- En ningún caso el propietario de los modelos (SEDIA – Secretaría de Estado de Digitalización e Inteligencia Artificial) ni el creador (BSC – Barcelona Supercomputing Center) serán responsables de los resultados derivados del uso que hagan terceros de estos modelos.
 
 
16
  ---
17
 
18
  # Biomedical-clinical language model for Spanish
19
+
20
+ ## Table of contents
21
+ <details>
22
+ <summary>Click to expand</summary>
23
+
24
+ - [Model description](#model-description)
25
+ - [Intended uses and limitations](#intended-uses-and-limitations)
26
+ - [How to use](#how-to-use)
27
+ - [Limitations and bias](#limitations-and-bias)
28
+ - [Training](#training)
29
+ - [Evaluation](#evaluation)
30
+ - [Additional information](#additional-information)
31
+ - [Author](#author)
32
+ - [Contact information](#contact-information)
33
+ - [Copyright](#copyright)
34
+ - [Licensing information](#licensing-information)
35
+ - [Funding](#funding)
36
+ - [Citing information](#citing-information)
37
+ - [Disclaimer](#disclaimer)
38
+
39
+ </details>
40
+
41
+ ## Model description
42
  Biomedical pretrained language model for Spanish. For more details about the corpus, the pretraining and the evaluation, check the official [repository](https://github.com/PlanTL-GOB-ES/lm-biomedical-clinical-es).
43
 
44
+
45
+ ## Intended uses and limitations
46
+ Out of the box, the model supports only masked language modelling, that is, the Fill Mask task (try the inference API or read the next section). It is intended, however, to be fine-tuned on downstream tasks such as Named Entity Recognition or Text Classification.
47
+
48
+
49
+ ## How to use
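A minimal sketch of the Fill Mask usage, based on the `pipeline` API of the `transformers` library. The model identifier `PlanTL-GOB-ES/roberta-base-biomedical-clinical-es` is an assumption (it is not stated in this card; substitute the identifier of this repository), and the helper names `load_fill_mask` and `top_predictions` and the example sentence are hypothetical.

```python
def load_fill_mask(model_id="PlanTL-GOB-ES/roberta-base-biomedical-clinical-es"):
    """Build a fill-mask pipeline. The default model_id is an assumed identifier."""
    # Imported lazily so the helper below works without transformers installed.
    from transformers import pipeline
    return pipeline("fill-mask", model=model_id)


def top_predictions(fill_mask, text, k=5):
    """Return (token, score) pairs for the single <mask> token in `text`.

    `fill_mask` is any callable with the fill-mask pipeline interface: for a
    text with one mask it returns a list of dicts with `token_str` and `score`.
    """
    results = fill_mask(text, top_k=k)
    # RoBERTa-style tokens often carry a leading space, so strip it for display.
    return [(r["token_str"].strip(), round(r["score"], 4)) for r in results]


# Typical use (downloads the model on the first call):
#   unmasker = load_fill_mask()
#   top_predictions(unmasker, "El único antecedente personal a reseñar era la <mask> arterial.")
```

RoBERTa-based tokenizers use `<mask>` as the mask token, so the input text should contain exactly one `<mask>` for the helper above to return a flat list.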
50
+
51
+
52
+ ## Limitations and bias
53
+ At the time of submission, no measures have been taken to estimate the bias embedded in the model. However, we are well aware that our models may be biased since the corpora have been collected using crawling techniques on multiple web sources. We intend to conduct research in these areas in the future, and if completed, this model card will be updated.
54
+
55
+ ## Training
56
+
57
+ ### Tokenization and model pretraining
58
  This model is a [RoBERTa-based](https://github.com/pytorch/fairseq/tree/master/examples/roberta) model trained on a
59
  **biomedical-clinical** corpus in Spanish collected from several sources (see next section).
60
  The training corpus has been tokenized using a byte version of [Byte-Pair Encoding (BPE)](https://github.com/openai/gpt-2)
61
  used in the original [RoBERTa](https://github.com/pytorch/fairseq/tree/master/examples/roberta) model with a vocabulary size of 52,000 tokens. The pretraining consists of masked language model training at the subword level, following the approach employed for the RoBERTa base model with the same hyperparameters as in the original work. The training lasted a total of 48 hours on 16 NVIDIA V100 GPUs with 16 GB of memory each, using the Adam optimizer with a peak learning rate of 0.0005 and an effective batch size of 2,048 sentences.
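
To illustrate what subword-level masked language model training means, here is a minimal sketch of the input corruption step. It assumes the standard BERT/RoBERTa rule (15% of tokens selected; of those, 80% replaced by the mask token, 10% by a random token, 10% left unchanged). These rates come from the original RoBERTa setup, not from this card, and the function and variable names are hypothetical.

```python
import random

MASK = "<mask>"


def mask_tokens(tokens, vocab, mask_prob=0.15, rng=None):
    """Return (corrupted, labels) for one pass of masked language modelling.

    labels[i] holds the original token where a prediction is required and
    None elsewhere, so the loss is computed only on selected positions.
    """
    rng = rng or random.Random()
    corrupted, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            labels.append(tok)  # the model must recover this token
            roll = rng.random()
            if roll < 0.8:
                corrupted.append(MASK)  # 80%: replace with the mask token
            elif roll < 0.9:
                corrupted.append(rng.choice(vocab))  # 10%: random token
            else:
                corrupted.append(tok)  # 10%: keep the original token
        else:
            labels.append(None)
            corrupted.append(tok)
    return corrupted, labels
```

Calling `mask_tokens` again on the same sentence yields a different mask pattern, which mirrors the dynamic masking used by RoBERTa.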
62
 
63
+ ### Training corpora and preprocessing
 
64
  The training corpus is composed of several biomedical corpora in Spanish, collected from publicly available corpora and crawlers, and a real-world clinical corpus collected from more than 278K clinical documents and notes. To obtain a high-quality training corpus while retaining the idiosyncrasies of the clinical language, a cleaning pipeline has been applied only to the biomedical corpora, keeping the clinical corpus uncleaned. Essentially, the cleaning operations used are:
65
 
66
  - data parsing in different formats
 
88
  | PubMed | 1,858,966 | Open-access articles from the PubMed repository crawled in 2017. |
89
 
90
 
91
+ ## Evaluation
92
  The model has been fine-tuned on three Named Entity Recognition (NER) tasks using three clinical NER datasets:
93
 
94
  - [PharmaCoNER](https://zenodo.org/record/4270158): a track on chemical and drug mention recognition from Spanish medical texts (for more information, see https://temu.bsc.es/pharmaconer/).
 
108
 
109
 
110
  The fine-tuning scripts can be found in the official GitHub [repository](https://github.com/PlanTL-GOB-ES/lm-biomedical-clinical-es).
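
The exact metric behind the reported F1 scores is defined by those scripts. As a hedged illustration only, the sketch below assumes entity-level, exact-match scoring, which is a common choice for NER but may differ from the official evaluation; the function name, spans and labels are hypothetical.

```python
def entity_f1(gold, pred):
    """Micro-averaged precision, recall and F1 over exact-match entity spans.

    Each entity is a (start, end, label) tuple; a prediction counts as correct
    only if all three fields match a gold entity.
    """
    gold, pred = set(gold), set(pred)
    true_pos = len(gold & pred)
    precision = true_pos / len(pred) if pred else 0.0
    recall = true_pos / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if true_pos else 0.0
    return precision, recall, f1


# Hypothetical example: three gold entities, one predicted with the wrong label.
gold = [(0, 11, "DRUG"), (24, 31, "PROTEIN"), (40, 52, "DRUG")]
pred = [(0, 11, "DRUG"), (24, 31, "DRUG"), (40, 52, "DRUG")]
precision, recall, f1 = entity_f1(gold, pred)
```

In this example two of the three predictions match a gold entity exactly, so precision, recall and F1 are all 2/3.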
 
111
 
 
112
 
113
+ ## Additional information
114
+
115
+ ### Author
116
+ Text Mining Unit (TeMU) at the Barcelona Supercomputing Center ([email protected])
117
+
118
+ ### Contact information
119
+ For further information, send an email to <[email protected]>
120
+
121
+ ### Copyright
122
+ Copyright by the Spanish State Secretariat for Digitalization and Artificial Intelligence (SEDIA) (2022)
123
+
124
+ ### Licensing information
125
+ [Apache License, Version 2.0](https://www.apache.org/licenses/LICENSE-2.0)
126
+
127
+ ### Funding
128
+ This work was funded by the Spanish State Secretariat for Digitalization and Artificial Intelligence (SEDIA) within the framework of the Plan-TL.
129
 
130
+ ### Citing information
131
  If you use these models, please cite our work:
132
 
133
  ```bibtex
 
154
  }
155
  ```
156
 
157
+ ### Disclaimer
158
 
159
+ <details>
160
+ <summary>Click to expand</summary>
161
 
162
  The models published in this repository are intended for a generalist purpose and are available to third parties. These models may have bias and/or any other undesirable distortions.
163
 
164
+ When third parties deploy or provide systems and/or services to other parties using any of these models (or using systems based on these models) or become users of the models, they should note that it is their responsibility to mitigate the risks arising from their use and, in any event, to comply with applicable regulations, including regulations regarding the use of Artificial Intelligence.
165
 
166
+ In no event shall the owner of the models (SEDIA – State Secretariat for Digitalization and Artificial Intelligence) nor the creator (BSC – Barcelona Supercomputing Center) be liable for any results arising from the use made by third parties of these models.
167
 
168
 
169
  Los modelos publicados en este repositorio tienen una finalidad generalista y están a disposición de terceros. Estos modelos pueden tener sesgos y/u otro tipo de distorsiones indeseables.
170
 
171
  Cuando terceros desplieguen o proporcionen sistemas y/o servicios a otras partes usando alguno de estos modelos (o utilizando sistemas basados en estos modelos) o se conviertan en usuarios de los modelos, deben tener en cuenta que es su responsabilidad mitigar los riesgos derivados de su uso y, en todo caso, cumplir con la normativa aplicable, incluyendo la normativa en materia de uso de inteligencia artificial.
172
 
173
+ En ningún caso el propietario de los modelos (SEDIA – Secretaría de Estado de Digitalización e Inteligencia Artificial) ni el creador (BSC – Barcelona Supercomputing Center) serán responsables de los resultados derivados del uso que hagan terceros de estos modelos.
174
+ </details>