julien-c (HF staff) committed d40b3b7 (1 parent: cc2307d)

Migrate model card from transformers-repo

Read announcement at https://discuss.huggingface.co/t/announcement-all-model-cards-will-be-migrated-to-hf-co-model-repos/2755
Original file history: https://github.com/huggingface/transformers/commits/master/model_cards/bionlp/bluebert_pubmed_mimic_uncased_L-24_H-1024_A-16/README.md

Files changed (1): README.md (+80 lines)
---
language:
- en
tags:
- bert
- bluebert
license:
- PUBLIC DOMAIN NOTICE
datasets:
- PubMed
- MIMIC-III
---

# BlueBert-Large, Uncased, PubMed and MIMIC-III

## Model description

A BERT model pre-trained on PubMed abstracts and clinical notes ([MIMIC-III](https://mimic.physionet.org/)).

## Intended uses & limitations

#### How to use

Please see https://github.com/ncbi-nlp/bluebert
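
The checkpoint can also be loaded with the 🤗 Transformers library. The snippet below is a minimal sketch, not part of the original card; it assumes the model id matches this repository's path (`bionlp/bluebert_pubmed_mimic_uncased_L-24_H-1024_A-16`).

```python
from transformers import AutoModel, AutoTokenizer

# Model id assumed from this repository's path; adjust if your copy lives elsewhere.
model_id = "bionlp/bluebert_pubmed_mimic_uncased_L-24_H-1024_A-16"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id)

# Encode a clinical sentence and inspect the contextual embeddings.
inputs = tokenizer("The patient was discharged on oral antibiotics.", return_tensors="pt")
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # (batch size, sequence length, hidden size)
```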

## Training data

We provide [preprocessed PubMed texts](https://ftp.ncbi.nlm.nih.gov/pub/lu/Suppl/NCBI-BERT/pubmed_uncased_sentence_nltk.txt.tar.gz) that were used to pre-train the BlueBERT models.
The corpus contains ~4000M words extracted from the [PubMed ASCII code version](https://www.ncbi.nlm.nih.gov/research/bionlp/APIs/BioC-PubMed/).

Pre-trained model: https://huggingface.co/bert-large-uncased
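
To obtain the released corpus, a minimal download-and-extract sketch is shown below (not from the original card; the local file and directory names are illustrative, and the archive is large).

```python
import tarfile
import urllib.request

# URL of the preprocessed, sentence-split PubMed corpus linked above.
URL = "https://ftp.ncbi.nlm.nih.gov/pub/lu/Suppl/NCBI-BERT/pubmed_uncased_sentence_nltk.txt.tar.gz"

# Download the archive and extract it into a local directory.
archive_path, _ = urllib.request.urlretrieve(URL, "pubmed_uncased_sentence_nltk.txt.tar.gz")
with tarfile.open(archive_path, "r:gz") as archive:
    archive.extractall("pubmed_corpus")
```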

## Training procedure

The texts were preprocessed by:

* lowercasing the text
* removing special characters outside the `\x00`-`\x7F` (ASCII) range
* tokenizing the text using the [NLTK Treebank tokenizer](https://www.nltk.org/_modules/nltk/tokenize/treebank.html)

The code snippet below shows these steps in more detail.

```python
import re
from nltk.tokenize import TreebankWordTokenizer

# `value` holds one raw input string: lowercase it, collapse newlines,
# and replace non-ASCII characters with spaces.
value = value.lower()
value = re.sub(r'[\r\n]+', ' ', value)
value = re.sub(r'[^\x00-\x7F]+', ' ', value)

# Tokenize with the Treebank tokenizer, rejoin with spaces, and reattach possessive "'s".
tokenized = TreebankWordTokenizer().tokenize(value)
sentence = ' '.join(tokenized)
sentence = re.sub(r"\s's\b", "'s", sentence)
```

### BibTeX entry and citation info

```bibtex
@InProceedings{peng2019transfer,
  author    = {Yifan Peng and Shankai Yan and Zhiyong Lu},
  title     = {Transfer Learning in Biomedical Natural Language Processing: An Evaluation of BERT and ELMo on Ten Benchmarking Datasets},
  booktitle = {Proceedings of the 2019 Workshop on Biomedical Natural Language Processing (BioNLP 2019)},
  year      = {2019},
  pages     = {58--65},
}
```

### Acknowledgments

This work was supported by the Intramural Research Programs of the National Institutes of Health, National Library of Medicine and Clinical Center. It was also supported by the National Library of Medicine of the National Institutes of Health under award number 4R00LM013001-01.

We are also grateful to the authors of BERT and ELMo for making their data and code publicly available.

We would like to thank Dr. Sun Kim for processing the PubMed texts.

### Disclaimer

This tool shows the results of research conducted in the Computational Biology Branch, NCBI. The information produced on this website is not intended for direct diagnostic use or medical decision-making without review and oversight by a clinical professional. Individuals should not change their health behavior solely on the basis of information produced on this website. NIH does not independently verify the validity or utility of the information produced by this tool. If you have questions about the information produced on this website, please see a health care professional. More information about NCBI's disclaimer policy is available.