---
base_model:
- google-bert/bert-base-uncased
datasets:
- gayanin/pubmed-gastro-maskfilling
language:
- en
library_name: transformers
license: apache-2.0
pipeline_tag: fill-mask
tags:
- medical
---

# **medBERT-base**

This repository contains a BERT-based model, **medBERT-base**, fine-tuned on the *gayanin/pubmed-gastro-maskfilling* dataset for the task of **Masked Language Modeling (MLM)**. The model is trained to predict masked tokens in medical and gastroenterological texts, with the goal of improving its understanding of medical language in natural-language contexts.

## **Model Architecture**

- **Base Model**: `bert-base-uncased`
- **Task**: Masked Language Modeling (MLM) for medical texts
- **Tokenizer**: BERT's WordPiece tokenizer

## **Usage**

### **Loading the Pre-trained Model**

You can load the pre-trained **medBERT-base** model with the Hugging Face `transformers` library:

```python
from transformers import BertTokenizer, BertForMaskedLM
import torch

tokenizer = BertTokenizer.from_pretrained('suayptalha/medBERT-base')
model = BertForMaskedLM.from_pretrained('suayptalha/medBERT-base').to("cuda")

input_text = "The patient was diagnosed with gastric cancer after a thorough examination."
masked_text = input_text.replace("gastric cancer", tokenizer.mask_token)

inputs = tokenizer(masked_text, return_tensors='pt').to("cuda")
with torch.no_grad():
    outputs = model(**inputs)

# Locate the [MASK] position and take the highest-scoring token there
mask_index = (inputs['input_ids'] == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
predicted_token_id = outputs.logits[0, mask_index].argmax(dim=-1)
predicted_token = tokenizer.decode(predicted_token_id)
print(predicted_token)
```

### **Fine-tuning the Model**

To fine-tune **medBERT-base** on your own medical dataset, follow these steps:

1. Prepare your dataset (e.g., medical texts or gastroenterology-related information) in text format.
2. Tokenize the dataset and apply masking.
3. Train the model using a standard MLM training loop (a minimal `Trainer`-based sketch is included at the end of this card).

The full training code is available here: https://github.com/suayptalha/medBERT-base/blob/main/medBERT-base.ipynb

## **Training Details**

### **Hyperparameters**

- **Batch Size**: 16
- **Learning Rate**: 5e-5
- **Number of Epochs**: 1
- **Max Sequence Length**: 512 tokens

### **Dataset**

- **Dataset Name**: *gayanin/pubmed-gastro-maskfilling*
- **Task**: Masked Language Modeling (MLM) on medical texts

## **Acknowledgements**

- The *gayanin/pubmed-gastro-maskfilling* dataset is available on the Hugging Face dataset hub and provides a rich collection of medical and gastroenterology-related texts for training.
- This model uses the Hugging Face `transformers` library, which provides state-of-the-art tooling for NLP models.
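
## **Example: Minimal MLM Fine-tuning Sketch**

The notebook linked above contains the training code used for this model. For orientation only, the sketch below shows one way to run MLM fine-tuning with the Hugging Face `Trainer` and `DataCollatorForLanguageModeling`, reusing the hyperparameters listed under **Training Details**. The corpus file `medical_corpus.txt` and the `text` column are hypothetical placeholders; adjust them to your own data. This is a sketch under those assumptions, not the exact script used to train **medBERT-base**.

```python
# Minimal MLM fine-tuning sketch (not the original training script; see the linked notebook).
# Assumes a plain-text corpus file "medical_corpus.txt" with one passage per line -- a placeholder.
from datasets import load_dataset
from transformers import (
    BertTokenizer,
    BertForMaskedLM,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

tokenizer = BertTokenizer.from_pretrained("suayptalha/medBERT-base")
model = BertForMaskedLM.from_pretrained("suayptalha/medBERT-base")

# Load the raw corpus; each line becomes one example with a "text" column
dataset = load_dataset("text", data_files={"train": "medical_corpus.txt"})

def tokenize(batch):
    # Truncate to the 512-token limit used during the original training
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = dataset.map(tokenize, batched=True, remove_columns=["text"])

# Dynamic masking: 15% of tokens are masked on the fly for each batch
collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)

args = TrainingArguments(
    output_dir="medbert-finetuned",
    per_device_train_batch_size=16,
    learning_rate=5e-5,
    num_train_epochs=1,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized["train"],
    data_collator=collator,
)
trainer.train()
```

Because `DataCollatorForLanguageModeling` applies masking at batch-collation time, each pass over the data sees a different set of masked positions, which is the standard dynamic-masking setup for BERT-style MLM fine-tuning.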