---
base_model:
- google-bert/bert-base-uncased
datasets:
- gayanin/pubmed-gastro-maskfilling
language:
- en
library_name: transformers
license: apache-2.0
pipeline_tag: fill-mask
tags:
- medical
---

![medBERT-logo](medBERT.png)

# **medBERT-base**

This repository contains **medBERT-base**, a BERT-based model fine-tuned on the *gayanin/pubmed-gastro-maskfilling* dataset for **Masked Language Modeling (MLM)**. The model is trained to predict masked tokens in medical and gastroenterological texts, with the goal of improving its understanding and generation of medical language in context.

## **Model Architecture**
- **Base Model**: `bert-base-uncased`
- **Task**: Masked Language Modeling (MLM) for medical texts
- **Tokenizer**: BERT's WordPiece tokenizer

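Because the tokenizer is standard WordPiece, rare medical terms are split into subword pieces rather than mapped to single tokens. A minimal sketch (the exact splits depend on the `bert-base-uncased` vocabulary):

```py
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('suayptalha/medBERT-base')

# Out-of-vocabulary medical terms are broken into '##'-prefixed WordPiece subwords
print(tokenizer.tokenize("Neoadjuvant chemotherapy for gastric cancer"))
```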
## **Usage**

### **Loading the Pre-trained Model**

You can load the pre-trained **medBERT-base** model using the Hugging Face `transformers` library:

```py
from transformers import BertTokenizer, BertForMaskedLM
import torch

# Use the GPU when available, otherwise fall back to CPU
device = "cuda" if torch.cuda.is_available() else "cpu"

tokenizer = BertTokenizer.from_pretrained('suayptalha/medBERT-base')
model = BertForMaskedLM.from_pretrained('suayptalha/medBERT-base').to(device)

input_text = "Response to neoadjuvant chemotherapy best predicts survival [MASK] curative resection of gastric cancer."
inputs = tokenizer(input_text, return_tensors='pt').to(device)

with torch.no_grad():
    outputs = model(**inputs)

# Find the position of the [MASK] token and take the top-k predictions for it
masked_index = (inputs['input_ids'][0] == tokenizer.mask_token_id).nonzero(as_tuple=True)[0].item()

top_k = 5
logits = outputs.logits[0, masked_index]
top_k_ids = torch.topk(logits, k=top_k).indices.tolist()
top_k_tokens = tokenizer.convert_ids_to_tokens(top_k_ids)

print("Top 5 predictions:")
for i, token in enumerate(top_k_tokens):
    print(f"{i + 1}: {token}")
```

_Top 5 predictions:_
_1: from_
_2: of_
_3: after_
_4: by_
_5: through_

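For quick experiments, the same check can be run with the `fill-mask` pipeline, which handles tokenization and decoding for you. A minimal sketch (`device=0` assumes a single GPU and can be dropped to run on CPU):

```py
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="suayptalha/medBERT-base", device=0)

results = fill_mask(
    "Response to neoadjuvant chemotherapy best predicts survival [MASK] curative resection of gastric cancer."
)

# Each result contains a candidate token and its probability
for r in results:
    print(r["token_str"], round(r["score"], 4))
```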
### **Fine-tuning the Model**

To fine-tune the **medBERT-base** model on your own medical dataset, follow these steps:

1. Prepare your dataset (e.g., medical texts or gastroenterology-related information) in text format.
2. Tokenize the dataset and apply masking.
3. Train the model using the provided training loop.

Here's the training code:

https://github.com/suayptalha/medBERT-base/blob/main/medBERT-base.ipynb

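If you prefer a standalone script to the notebook, steps 1 and 2 can be sketched as below. This is a minimal illustration, not the exact training setup: the file name `my_medical_texts.txt`, the one-passage-per-line layout, and the 15% masking probability are assumptions.

```py
from datasets import load_dataset
from transformers import BertTokenizer, DataCollatorForLanguageModeling

tokenizer = BertTokenizer.from_pretrained('suayptalha/medBERT-base')

# Step 1: load a plain-text corpus; each line becomes one training example
# (the file name is a placeholder for your own data)
dataset = load_dataset("text", data_files={"train": "my_medical_texts.txt"})

# Step 2a: tokenize and truncate to the model's maximum sequence length
def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = dataset.map(tokenize, batched=True, remove_columns=["text"])

# Step 2b: dynamic masking, 15% of tokens are masked on the fly during training
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)
```

Step 3 can then reuse `tokenized` and `data_collator` with the `Trainer` API, as sketched under **Training Details** below, or with the training loop from the notebook.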
## **Training Details**

### **Hyperparameters**
- **Batch Size**: 16
- **Learning Rate**: 5e-5
- **Number of Epochs**: 1
- **Max Sequence Length**: 512 tokens

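These values map onto the Hugging Face `Trainer` API roughly as follows. This is a sketch, not the original training script: it reuses `tokenized` and `data_collator` from the fine-tuning sketch above, and the maximum sequence length is enforced at tokenization time rather than through `TrainingArguments`.

```py
from transformers import BertForMaskedLM, Trainer, TrainingArguments

model = BertForMaskedLM.from_pretrained('google-bert/bert-base-uncased')

training_args = TrainingArguments(
    output_dir="medBERT-base",
    per_device_train_batch_size=16,  # Batch Size
    learning_rate=5e-5,              # Learning Rate
    num_train_epochs=1,              # Number of Epochs
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized["train"],  # from the fine-tuning sketch above
    data_collator=data_collator,       # from the fine-tuning sketch above
)
trainer.train()
```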
### **Dataset**
- **Dataset Name**: *gayanin/pubmed-gastro-maskfilling*
- **Task**: Masked Language Modeling (MLM) on medical texts

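The dataset can be pulled directly from the Hugging Face Hub for inspection. A minimal sketch (the available splits and columns are whatever the dataset repository defines):

```py
from datasets import load_dataset

# Download the corpus and print its splits and features
dataset = load_dataset("gayanin/pubmed-gastro-maskfilling")
print(dataset)
```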
## **Acknowledgements**

- The *gayanin/pubmed-gastro-maskfilling* dataset is available on the Hugging Face Hub and provides a rich collection of medical and gastroenterology-related texts for training.
- This model is built with the Hugging Face `transformers` library, a state-of-the-art library for NLP models.

<h3 align="left">Support:</h3>
<p><a href="https://www.buymeacoffee.com/suayptalha"> <img align="left" src="https://cdn.buymeacoffee.com/buttons/v2/default-yellow.png" height="50" width="210" alt="suayptalha" /></a></p><br><br>