---
language:
- en
tags:
- bluebert
license: cc0-1.0
datasets:
- pubmed
---
# BlueBERT-Base, Uncased, PubMed

## Model description

A BERT model pre-trained on PubMed abstracts.
## Intended uses & limitations

### How to use

Please see https://github.com/ncbi-nlp/bluebert for usage details and the official checkpoints.
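As a minimal sketch, the checkpoint can also be loaded with the `transformers` library. The model id below is an assumption (BlueBERT checkpoints appear on the Hugging Face Hub under the `bionlp` organization); check the repository above for the exact name.

```python
from transformers import AutoModel, AutoTokenizer

# Model id is an assumption; see the BlueBERT repository for the official checkpoint names.
model_id = "bionlp/bluebert_pubmed_uncased_L-12_H-768_A-12"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id)

# Encode a biomedical sentence and extract contextual token embeddings.
inputs = tokenizer("Aspirin inhibits platelet aggregation.", return_tensors="pt")
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # (batch_size, sequence_length, hidden_size)
```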
## Training data

We provide [preprocessed PubMed texts](https://ftp.ncbi.nlm.nih.gov/pub/lu/Suppl/NCBI-BERT/pubmed_uncased_sentence_nltk.txt.tar.gz) that were used to pre-train the BlueBERT models.

The corpus contains approximately 4,000M (4 billion) words extracted from the [PubMed ASCII code version](https://www.ncbi.nlm.nih.gov/research/bionlp/APIs/BioC-PubMed/).
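A minimal sketch for fetching and unpacking the archive with the Python standard library (the archive is large, so the download takes a while):

```python
import tarfile
import urllib.request

# URL from the "Training data" section above; the archive is large.
url = "https://ftp.ncbi.nlm.nih.gov/pub/lu/Suppl/NCBI-BERT/pubmed_uncased_sentence_nltk.txt.tar.gz"
archive, _ = urllib.request.urlretrieve(url, "pubmed_uncased_sentence_nltk.txt.tar.gz")

with tarfile.open(archive, "r:gz") as tar:
    tar.extractall("pubmed_corpus")  # preprocessed text, one sentence per line
```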
Pre-training was initialized from [bert-base-uncased](https://huggingface.co/bert-base-uncased).
## Training procedure

The corpus was preprocessed by:

* lowercasing the text
* removing special characters outside the `\x00`-`\x7F` range (i.e., non-ASCII characters)
* tokenizing the text using the [NLTK Treebank tokenizer](https://www.nltk.org/_modules/nltk/tokenize/treebank.html)

The code snippet below shows these steps in more detail.
```python
import re

from nltk.tokenize import TreebankWordTokenizer

value = value.lower()                                # lowercase the text
value = re.sub(r'[\r\n]+', ' ', value)               # collapse line breaks into spaces
value = re.sub(r'[^\x00-\x7F]+', ' ', value)         # remove non-ASCII characters
tokenized = TreebankWordTokenizer().tokenize(value)  # Treebank tokenization
sentence = ' '.join(tokenized)
sentence = re.sub(r"\s's\b", "'s", sentence)         # reattach possessive 's split by the tokenizer
```
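Wrapped as a hypothetical helper (the function name and example input are ours, not from the BlueBERT repository), the full pipeline looks like this:

```python
import re

from nltk.tokenize import TreebankWordTokenizer

_TOKENIZER = TreebankWordTokenizer()

def preprocess(value: str) -> str:
    """Apply the preprocessing steps above to one text (hypothetical helper)."""
    value = value.lower()
    value = re.sub(r'[\r\n]+', ' ', value)
    value = re.sub(r'[^\x00-\x7F]+', ' ', value)
    sentence = ' '.join(_TOKENIZER.tokenize(value))
    return re.sub(r"\s's\b", "'s", sentence)

print(preprocess("BlueBERT's training data\ncomes from PubMed."))
# bluebert's training data comes from pubmed .
```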
### BibTeX entry and citation info

```bibtex
@InProceedings{peng2019transfer,
  author    = {Yifan Peng and Shankai Yan and Zhiyong Lu},
  title     = {Transfer Learning in Biomedical Natural Language Processing: An Evaluation of BERT and ELMo on Ten Benchmarking Datasets},
  booktitle = {Proceedings of the 2019 Workshop on Biomedical Natural Language Processing (BioNLP 2019)},
  year      = {2019},
  pages     = {58--65},
}
```