---
base_model:
- google-bert/bert-base-uncased
datasets:
- gayanin/pubmed-gastro-maskfilling
language:
- en
library_name: transformers
license: apache-2.0
pipeline_tag: fill-mask
tags:
- medical
---

# **medBERT-base**

This repository contains a BERT-based model, **medBERT-base**, fine-tuned on the *gayanin/pubmed-gastro-maskfilling* dataset for the task of **Masked Language Modeling (MLM)**. The model is trained to predict masked tokens in medical and gastroenterological texts, with the goal of improving its understanding of medical language in natural-language contexts.

## **Model Architecture**

- **Base Model**: `bert-base-uncased`
- **Task**: Masked Language Modeling (MLM) for medical texts
- **Tokenizer**: BERT's WordPiece tokenizer

## **Usage**

### **Loading the Pre-trained Model**

You can load the pre-trained **medBERT-base** model with the Hugging Face `transformers` library:

```python
from transformers import BertTokenizer, BertForMaskedLM
import torch

tokenizer = BertTokenizer.from_pretrained('suayptalha/medBERT-base')
model = BertForMaskedLM.from_pretrained('suayptalha/medBERT-base').to("cuda")

input_text = "The patient was diagnosed with gastric cancer after a thorough examination."
masked_text = input_text.replace("gastric cancer", tokenizer.mask_token)

inputs = tokenizer(masked_text, return_tensors='pt').to("cuda")
with torch.no_grad():
    outputs = model(**inputs)

# Locate the [MASK] position and take the highest-scoring token there
mask_index = (inputs['input_ids'] == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
predicted_token_id = outputs.logits[0, mask_index].argmax(dim=-1)
predicted_token = tokenizer.decode(predicted_token_id)
print(predicted_token)
```

### **Fine-tuning the Model**

To fine-tune **medBERT-base** on your own medical dataset, follow these steps:

1. Prepare your dataset (e.g., medical texts or gastroenterology-related information) in text format.
2. Tokenize the dataset and apply masking.
3. Train the model using a standard MLM training loop (a minimal `Trainer`-based sketch is included at the end of this card).

The full training code is available here: https://github.com/suayptalha/medBERT-base/blob/main/medBERT-base.ipynb

## **Training Details**

### **Hyperparameters**

- **Batch Size**: 16
- **Learning Rate**: 5e-5
- **Number of Epochs**: 1
- **Max Sequence Length**: 512 tokens

### **Dataset**

- **Dataset Name**: *gayanin/pubmed-gastro-maskfilling*
- **Task**: Masked Language Modeling (MLM) on medical texts

## **Acknowledgements**

- The *gayanin/pubmed-gastro-maskfilling* dataset is available on the Hugging Face dataset hub and provides a rich collection of medical and gastroenterology-related texts for training.
- This model uses the Hugging Face `transformers` library, which provides state-of-the-art tooling for NLP models.
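
## **Example: Minimal MLM Fine-tuning Sketch**

The notebook linked above contains the training code used for this model. For orientation only, the sketch below shows one way to run MLM fine-tuning with the Hugging Face `Trainer` and `DataCollatorForLanguageModeling`, reusing the hyperparameters listed under **Training Details**. The corpus file `medical_corpus.txt` and the `text` column are hypothetical placeholders; adjust them to your own data. This is a sketch under those assumptions, not the exact script used to train **medBERT-base**.

```python
# Minimal MLM fine-tuning sketch (not the original training script; see the linked notebook).
# Assumes a plain-text corpus file "medical_corpus.txt" with one passage per line -- a placeholder.
from datasets import load_dataset
from transformers import (
    BertTokenizer,
    BertForMaskedLM,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

tokenizer = BertTokenizer.from_pretrained("suayptalha/medBERT-base")
model = BertForMaskedLM.from_pretrained("suayptalha/medBERT-base")

# Load the raw corpus; each line becomes one example with a "text" column
dataset = load_dataset("text", data_files={"train": "medical_corpus.txt"})

def tokenize(batch):
    # Truncate to the 512-token limit used during the original training
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = dataset.map(tokenize, batched=True, remove_columns=["text"])

# Dynamic masking: 15% of tokens are masked on the fly for each batch
collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)

args = TrainingArguments(
    output_dir="medbert-finetuned",
    per_device_train_batch_size=16,
    learning_rate=5e-5,
    num_train_epochs=1,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized["train"],
    data_collator=collator,
)
trainer.train()
```

Because `DataCollatorForLanguageModeling` applies masking at batch-collation time, each pass over the data sees a different set of masked positions, which is the standard dynamic-masking setup for BERT-style MLM fine-tuning.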