---
license: mit
language:
- en
tags:
- medical
- vision
widget:
- src: "https://d168r5mdg5gtkq.cloudfront.net/medpix/img/full/synpic9078.jpg"
  candidate_labels: "Chest X-Ray, Brain MRI, Abdomen CT Scan"
  example_title: "Abdomen CT Scan"
---
# Model Card for PubMedCLIP

PubMedCLIP is a fine-tuned version of [CLIP](https://huggingface.co/docs/transformers/model_doc/clip) for the medical domain.

## Model Description
PubMedCLIP was trained on the [Radiology Objects in COntext (ROCO)](https://github.com/razorx89/roco-dataset) dataset, a large-scale multimodal medical imaging dataset.
The ROCO dataset includes diverse imaging modalities (X-Ray, MRI, ultrasound, fluoroscopy, etc.) from various human body regions (head, spine, chest, abdomen, etc.),
captured from open-access [PubMed](https://pubmed.ncbi.nlm.nih.gov/) articles.

PubMedCLIP was trained for 50 epochs with a batch size of 64, using the Adam optimizer with a learning rate of 1e-5.
The authors have released three different pre-trained models at this [link](https://1drv.ms/u/s!ApXgPqe9kykTgwD4Np3-f7ODAot8?e=zLVlJ2),
which use ResNet-50, ResNet-50x4, and ViT32 as image encoders. This repository includes only the ViT32 variant of the PubMedCLIP model.

- **Repository:** [PubMedCLIP Official GitHub Repository](https://github.com/sarahESL/PubMedCLIP)
- **Paper:** [Does CLIP Benefit Visual Question Answering in the Medical Domain as Much as it Does in the General Domain?](https://arxiv.org/abs/2112.13906)

## Usage

```python
import requests
from PIL import Image

from transformers import CLIPProcessor, CLIPModel

model = CLIPModel.from_pretrained("flaviagiammarino/pubmed-clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("flaviagiammarino/pubmed-clip-vit-base-patch32")

url = "https://d168r5mdg5gtkq.cloudfront.net/medpix/img/full/synpic9078.jpg"
image = Image.open(requests.get(url, stream=True).raw)

inputs = processor(text=["Chest X-Ray", "Brain MRI", "Abdomen CT Scan"], images=image, return_tensors="pt", padding=True)
outputs = model(**inputs)
logits_per_image = outputs.logits_per_image  # image-text similarity scores
probs = logits_per_image.softmax(dim=1)  # softmax over labels gives per-label probabilities
```
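The softmax in the last line maps the raw image-text similarity logits to a probability distribution over the candidate labels. A minimal standalone illustration of that step (the logit values below are made up for demonstration, not actual model outputs):

```python
import math

def softmax(logits):
    """Convert raw similarity scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical similarity logits for the three candidate labels above
labels = ["Chest X-Ray", "Brain MRI", "Abdomen CT Scan"]
logits = [18.9, 11.2, 25.4]

for label, p in zip(labels, softmax(logits)):
    print(f"{label}: {p:.4f}")
```

The label with the highest logit receives the highest probability, so ranking the candidate labels by `probs` gives the model's zero-shot prediction.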

## Additional Information

### Licensing Information
The authors have released the model code and pre-trained checkpoints under the [MIT License](https://github.com/sarahESL/PubMedCLIP/blob/main/LICENSE).

### Citation Information
```
@article{eslami2021does,
  title={Does {CLIP} benefit visual question answering in the medical domain as much as it does in the general domain?},
  author={Eslami, Sedigheh and de Melo, Gerard and Meinel, Christoph},
  journal={arXiv preprint arXiv:2112.13906},
  year={2021}
}
```