Li committed
Commit 803a396
1 Parent(s): 8b995fb

Upload README.md

Files changed (1): README.md (+108, -4)
README.md CHANGED

**The usage of Bioformer-16L is the same as a standard BERT model. The documentation of BERT can be found [here](https://huggingface.co/docs/transformers/model_doc/bert).**

## Vocabulary of Bioformer-16L
Bioformer-16L uses a cased WordPiece vocabulary trained on a biomedical corpus that includes all PubMed abstracts (33 million, as of Feb 1, 2021) and 1 million PMC full-text articles. PMC contains 3.6 million full-text articles, but we down-sampled them to 1 million so that the total size of the PubMed abstracts and the PMC full-text articles is approximately equal. To mitigate the out-of-vocabulary issue and to cover special symbols used in the biomedical literature (e.g., the male and female symbols), we trained Bioformer's vocabulary on the Unicode text of these two resources. The vocabulary size of Bioformer-16L is 32768 (2^15), which is similar to that of the original BERT.
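
You can inspect this vocabulary directly through the Hugging Face tokenizer. The snippet below is a minimal sketch using the standard `transformers` API; the example sentence is ours and purely illustrative:
```
from transformers import AutoTokenizer

# Load the cased WordPiece tokenizer shipped with Bioformer-16L
tokenizer = AutoTokenizer.from_pretrained("bioformers/bioformer-16L")

print(tokenizer.vocab_size)  # 32768 (2^15)

# Common biomedical terms should stay (mostly) whole; rarer ones are
# split into WordPiece sub-tokens prefixed with "##"
print(tokenizer.tokenize("Metformin is used to treat type 2 diabetes mellitus."))
```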
 
## Pre-training of Bioformer-16L
Bioformer-16L was pre-trained from scratch on the same corpus as the vocabulary (33 million PubMed abstracts + 1 million PMC full-text articles). For the masked language modeling (MLM) objective, we used whole-word masking with a masking rate of 15%. There is ongoing debate about whether the next sentence prediction (NSP) objective improves performance on downstream tasks; we included it in our pre-training in case end users need next-sentence prediction. Sentence segmentation of all training text was performed with [SciSpacy](https://allenai.github.io/scispacy/).
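
The original pre-training script is not part of this repository, but the whole-word-masking MLM setup can be sketched with the `transformers` data collator. Everything below (the toy sentence and variable names) is illustrative only, not the actual training code:
```
from transformers import AutoTokenizer, DataCollatorForWholeWordMask

tokenizer = AutoTokenizer.from_pretrained("bioformers/bioformer-16L")

# Whole-word masking with a 15% masking rate, as used during pre-training
collator = DataCollatorForWholeWordMask(tokenizer=tokenizer, mlm_probability=0.15)

# Toy stand-in for a PubMed/PMC sentence (placeholder data)
examples = [tokenizer("Insulin resistance is a hallmark of type 2 diabetes.",
                      truncation=True, max_length=512)]

batch = collator(examples)
print(batch["input_ids"].shape, batch["labels"].shape)  # masked inputs and MLM labels
```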
 
Pre-training of Bioformer-16L was performed on a single Cloud TPU device (TPUv2, 8 cores, 8 GB memory per core). The maximum input sequence length was fixed to 512, and the batch size was set to 256. We pre-trained Bioformer-16L for 2 million steps, which took about 11 days.
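
For a rough sense of scale, these hyperparameters imply an upper bound on the number of token positions processed during pre-training (our own back-of-the-envelope estimate; actual sequences may be shorter than 512 tokens):
```
# Upper bound on token positions seen during pre-training:
# 2,000,000 steps x 256 sequences/step x 512 tokens/sequence
steps, batch_size, max_seq_len = 2_000_000, 256, 512
print(steps * batch_size * max_seq_len)  # 262,144,000,000 (~262 billion)
```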
 
## Usage

Prerequisites: Python 3, PyTorch, and the `transformers` and `datasets` libraries.

We have tested the following commands with Python v3.9.16, PyTorch v1.13.1+cu117, Datasets v2.9.0, and Transformers v4.26.

To install PyTorch, please refer to the instructions [here](https://pytorch.org/get-started/locally).

To install the `transformers` and `datasets` libraries:
```
pip install transformers
pip install datasets
```

### Filling mask

```
from transformers import pipeline

unmasker8L = pipeline('fill-mask', model='bioformers/bioformer-8L')
unmasker8L("[MASK] refers to a group of diseases that affect how the body uses blood sugar (glucose)")

unmasker16L = pipeline('fill-mask', model='bioformers/bioformer-16L')
unmasker16L("[MASK] refers to a group of diseases that affect how the body uses blood sugar (glucose)")
```

Output of `bioformer-8L`:

```
[{'score': 0.3207533359527588,
  'token': 13473,
  'token_str': 'Diabetes',
  'sequence': 'Diabetes refers to a group of diseases that affect how the body uses blood sugar ( glucose )'},
 {'score': 0.19234347343444824,
  'token': 17740,
  'token_str': 'Obesity',
  'sequence': 'Obesity refers to a group of diseases that affect how the body uses blood sugar ( glucose )'},
 {'score': 0.09200277179479599,
  'token': 10778,
  'token_str': 'T2DM',
  'sequence': 'T2DM refers to a group of diseases that affect how the body uses blood sugar ( glucose )'},
 {'score': 0.08494312316179276,
  'token': 2228,
  'token_str': 'It',
  'sequence': 'It refers to a group of diseases that affect how the body uses blood sugar ( glucose )'},
 {'score': 0.0412776917219162,
  'token': 22263,
  'token_str': 'Hypertension',
  'sequence': 'Hypertension refers to a group of diseases that affect how the body uses blood sugar ( glucose )'}]
```

Output of `bioformer-16L`:

```
[{'score': 0.7262957692146301,
  'token': 13473,
  'token_str': 'Diabetes',
  'sequence': 'Diabetes refers to a group of diseases that affect how the body uses blood sugar ( glucose )'},
 {'score': 0.124954953789711,
  'token': 10778,
  'token_str': 'T2DM',
  'sequence': 'T2DM refers to a group of diseases that affect how the body uses blood sugar ( glucose )'},
 {'score': 0.04062706232070923,
  'token': 2228,
  'token_str': 'It',
  'sequence': 'It refers to a group of diseases that affect how the body uses blood sugar ( glucose )'},
 {'score': 0.022694870829582214,
  'token': 17740,
  'token_str': 'Obesity',
  'sequence': 'Obesity refers to a group of diseases that affect how the body uses blood sugar ( glucose )'},
 {'score': 0.009743048809468746,
  'token': 13960,
  'token_str': 'T2D',
  'sequence': 'T2D refers to a group of diseases that affect how the body uses blood sugar ( glucose )'}]
```
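
Because the usage of Bioformer-16L is the same as a standard BERT model, you can also load it directly for feature extraction or downstream fine-tuning. The snippet below is a minimal sketch; the example sentence is ours and purely illustrative:
```
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bioformers/bioformer-16L")
model = AutoModel.from_pretrained("bioformers/bioformer-16L")

inputs = tokenizer("EGFR mutations are common in lung adenocarcinoma.",
                   return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Last-layer hidden states: (batch_size, sequence_length, hidden_size)
print(outputs.last_hidden_state.shape)
```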

## Link
[Bioformer-8L](https://huggingface.co/bioformers/bioformer-8L)

## Acknowledgment

Training and evaluation of Bioformer-16L is supported by the Google TPU Research Cloud (TRC) program, the Intramural Research Program of the National Library of Medicine (NLM), National Institutes of Health (NIH), and NIH/NLM grants LM012895 and 1K99LM014024-01.

## Questions
If you have any questions, please submit an issue here: https://github.com/WGLab/bioformer/issues

You can also send an email to Li Fang ([email protected], https://fangli80.github.io/).

## Citation

You can cite our preprint on arXiv:

Fang L, Chen Q, Wei C-H, Lu Z, Wang K: Bioformer: an efficient transformer language model for biomedical text mining. arXiv preprint arXiv:2302.01588 (2023). DOI: https://doi.org/10.48550/arXiv.2302.01588

BibTeX format:
```
@ARTICLE{fangli2023bioformer,
  author  = {{Fang}, Li and {Chen}, Qingyu and {Wei}, Chih-Hsuan and {Lu}, Zhiyong and {Wang}, Kai},
  title   = "{Bioformer: an efficient transformer language model for biomedical text mining}",
  journal = {arXiv preprint arXiv:2302.01588},
  year    = {2023}
}
```