Commit 0199e15 · Parent: 699e237 · Update README.md

README.md:

PubMedCLIP is a fine-tuned version of [CLIP](https://huggingface.co/docs/transformers/model_doc/clip) for the medical domain.

## Model Description

PubMedCLIP was trained on the [Radiology Objects in COntext (ROCO)](https://github.com/razorx89/roco-dataset) dataset, a large-scale multimodal medical imaging dataset. The ROCO dataset includes diverse imaging modalities (such as ultrasound, X-Ray, and MRI) from various human body regions (such as the head, neck, and spine), all captured from open-access [PubMed](https://pubmed.ncbi.nlm.nih.gov/) articles.

The authors of PubMedCLIP have released three pre-trained models at this [link](https://1drv.ms/u/s!ApXgPqe9kykTgwD4Np3-f7ODAot8?e=zLVlJ2), which use ResNet-50, ResNet-50x4, and ViT32 as image encoders. This repository includes only the ViT32 variant of the PubMedCLIP model.

- **Repository:** [PubMedCLIP Official GitHub Repository](https://github.com/sarahESL/PubMedCLIP)
- **Paper:** [Does CLIP Benefit Visual Question Answering in the Medical Domain as Much as it Does in the General Domain?](https://arxiv.org/abs/2112.13906)
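
Since only the ViT32 variant is hosted here, it can be useful to confirm which vision backbone a loaded checkpoint uses. Below is a minimal sketch, assuming the checkpoint id of this repository (`flaviagiammarino/pubmed-clip-vit-base-patch32`):

```python
from transformers import CLIPModel

# Load the checkpoint and inspect the configuration of its vision tower.
model = CLIPModel.from_pretrained("flaviagiammarino/pubmed-clip-vit-base-patch32")
print(model.config.vision_config.patch_size)   # 32 for a ViT-B/32 image encoder
print(model.config.vision_config.hidden_size)  # 768 for a ViT-Base backbone
```
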
## Use with Transformers

```python
import requests
from PIL import Image
from transformers import CLIPProcessor, CLIPModel

# Load the ViT32 PubMedCLIP checkpoint and its matching processor.
model = CLIPModel.from_pretrained("flaviagiammarino/pubmed-clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("flaviagiammarino/pubmed-clip-vit-base-patch32")

# Fetch a sample radiology image.
url = "https://d168r5mdg5gtkq.cloudfront.net/medpix/img/full/synpic9078.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# Score the image against a set of candidate labels.
inputs = processor(text=["Chest X-Ray", "Brain MRI", "Abdominal CT Scan"], images=image, return_tensors="pt", padding=True)
outputs = model(**inputs)
logits_per_image = outputs.logits_per_image  # image-text similarity scores
probs = logits_per_image.softmax(dim=1)  # softmax over the labels gives probabilities
```
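
The returned `probs` follows the order of the text prompts, so the most likely label can be read off directly. A small follow-up sketch (the `labels` list simply repeats the prompts passed to the processor above):

```python
labels = ["Chest X-Ray", "Brain MRI", "Abdominal CT Scan"]
for label, p in zip(labels, probs[0].tolist()):
    print(f"{label}: {p:.3f}")  # probability that the image matches this label
```
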
## Additional Information