Charangan committed on
Commit
6aef6fc
1 Parent(s): 052777d

Update README.md

Files changed (1)
  1. README.md +23 -1
README.md CHANGED
@@ -9,8 +9,27 @@ tags:
 
 # MedBERT Model
 
- MedBERT is a newly pre-trained transformer-based language model for biomedical named entity recognition: initialised with Bio_ClinicalBERT & pre-trained on N2C2, BioNLP and CRAFT community datasets.
 
 
 ## How to use
 
@@ -20,6 +39,9 @@ tokenizer = AutoTokenizer.from_pretrained("Charangan/MedBERT")
 model = AutoModel.from_pretrained("Charangan/MedBERT")
 ```
 
 
 ## Citation
 ```
 
 
 # MedBERT Model
 
+ MedBERT is a newly pre-trained transformer-based language model for biomedical named entity recognition: initialized with [Bio_ClinicalBERT](https://arxiv.org/abs/1904.03323) & pre-trained on N2C2, BioNLP, and CRAFT community datasets.
 
+ ## Pretraining
+
+ ### Data
+ The `MedBERT` model was trained on N2C2, BioNLP, and CRAFT community datasets.
+
+ | Dataset | Description |
+ | ------------- | ------------- |
+ | [NLP Clinical Challenges (N2C2)](https://portal.dbmi.hms.harvard.edu/projects/n2c2-nlp/) | A collection of clinical notes released in the N2C2 2018 and N2C2 2022 challenges |
+ | [BioNLP](http://bionlp.sourceforge.net/index.shtml) | Articles released under the BioNLP project, covering multiple biomedical disciplines such as molecular biology, IE for protein and DNA modifications, biomolecular mechanisms of infectious diseases, habitats of bacteria mentioned, and bacterial molecular interactions and regulations |
+ | [CRAFT](https://www.researchgate.net/publication/318175988_The_Colorado_Richly_Annotated_Full_Text_CRAFT_Corpus_Multi-Model_Annotation_in_the_Biomedical_Domain) | 67 full-text open-access biomedical journal articles from PubMed Central covering a wide range of biomedical domains, including biochemistry and molecular biology, genetics, developmental biology, and computational biology |
+ | Wikipedia | Crawled medical-related articles |
+
+ ### Procedures
+ The model was trained using code from [Google's BERT repository](https://github.com/google-research/bert). Model parameters were initialized with Bio_ClinicalBERT.
+
+ ### Hyperparameters
+ We used a batch size of 32, a maximum sequence length of 128, and a learning rate of 1e-4 for pre-training, and trained the model for 200,000 steps. The dup factor for duplicating input data with different masks was set to 5. All other default parameters were used (specifically, masked language model probability = 0.15 and max predictions per sequence = 22).
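
To make these settings concrete, the following is a minimal continued-pretraining sketch that uses the Hugging Face `Trainer` API rather than the original TensorFlow BERT scripts; the corpus file, output directory, and tokenization details are illustrative assumptions, and only the hyperparameter values quoted above come from this card.

```python
# Illustrative sketch only: the released MedBERT checkpoint was produced with
# Google's TensorFlow BERT code, not with this script. The file paths and the
# dataset are placeholders; the hyperparameter values mirror the card above.
from datasets import load_dataset
from transformers import (
    AutoModelForMaskedLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

# Initialize from Bio_ClinicalBERT, as described under "Procedures".
tokenizer = AutoTokenizer.from_pretrained("emilyalsentzer/Bio_ClinicalBERT")
model = AutoModelForMaskedLM.from_pretrained("emilyalsentzer/Bio_ClinicalBERT")

# Placeholder corpus: one document per line from the N2C2 / BioNLP / CRAFT / Wikipedia text.
raw = load_dataset("text", data_files={"train": "pretraining_corpus.txt"})

def tokenize(batch):
    # Maximum sequence length of 128, as in the card.
    return tokenizer(batch["text"], truncation=True, max_length=128)

train_data = raw["train"].map(tokenize, batched=True, remove_columns=["text"])

# Masked language modeling with probability 0.15. This collator re-masks
# dynamically at every step, so no dup factor is needed (unlike the TF pipeline).
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True, mlm_probability=0.15)

args = TrainingArguments(
    output_dir="medbert-pretraining",   # placeholder output path
    per_device_train_batch_size=32,     # batch size of 32
    learning_rate=1e-4,                 # learning rate of 1e-4
    max_steps=200_000,                  # 200,000 training steps
)

trainer = Trainer(model=model, args=args, train_dataset=train_data, data_collator=collator)
trainer.train()
```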
 
 ## How to use
 
 model = AutoModel.from_pretrained("Charangan/MedBERT")
 ```
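
As a usage illustration (the example sentence and the way the output is read are arbitrary, not from the card), the sketch below loads the tokenizer and model and applies them to a clinical sentence to obtain token-level embeddings, e.g. as features for a downstream NER tagger.

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Charangan/MedBERT")
model = AutoModel.from_pretrained("Charangan/MedBERT")

# Placeholder clinical sentence.
text = "The patient was prescribed 500 mg of metformin twice daily."
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# Contextual embeddings for each token, shape (1, num_tokens, hidden_size).
token_embeddings = outputs.last_hidden_state
print(token_embeddings.shape)
```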
 
+ ## More Information
+
+ Refer to the original paper, [MedBERT: A Pre-trained Language Model for Biomedical Named Entity Recognition](https://ieeexplore.ieee.org/abstract/document/9980157) (APSIPA Conference 2022), for additional details and its performance on biomedical NER tasks.
 
 ## Citation
 ```