Update README.md
README.md
## NSFW Definition

In our work, taking inspiration from this [paper](https://arxiv.org/abs/2211.05105), we define NSFW as a finite and fixed set of concepts that are considered inappropriate, offensive, or harmful to individuals. These concepts are divided into twenty categories: _hate, harassment, violence, suffering, humiliation, harm, suicide, sexual, nudity, bodily fluids, blood, obscene gestures, illegal activity, drug use, theft, vandalism, weapons, child abuse, brutality and cruelty_.
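
For quick experimentation, the twenty categories can be kept as a plain Python tuple. This is only an illustrative sketch; the released model does not expose this list programmatically and the variable name is made up:

```python
# The twenty NSFW categories listed above (illustrative constant, not part of the model API).
NSFW_CATEGORIES = (
    "hate", "harassment", "violence", "suffering", "humiliation", "harm",
    "suicide", "sexual", "nudity", "bodily fluids", "blood", "obscene gestures",
    "illegal activity", "drug use", "theft", "vandalism", "weapons",
    "child abuse", "brutality", "cruelty",
)
assert len(NSFW_CATEGORIES) == 20
```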
#### Use with Transformers

See the snippet below for usage with Transformers:

```python
>>> from transformers import CLIPModel

>>> model_id = "aimagelab/safeclip_vit-h_14"
>>> model = CLIPModel.from_pretrained(model_id)
```
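
The matching preprocessor can be loaded from the same repo id (as in the zero-shot example further below), and the model exposes the usual CLIP feature-extraction API. A minimal sketch, with a placeholder prompt:

```python
>>> from transformers import CLIPProcessor

>>> processor = CLIPProcessor.from_pretrained(model_id)

>>> text_inputs = processor(text=["a photo of a dog"], padding=True, return_tensors="pt")
>>> text_features = model.get_text_features(**text_inputs)  # shape: (1, projection_dim)
```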
## Model Details

Safe-CLIP is a fine-tuned version of the [CLIP](https://huggingface.co/docs/transformers/en/model_doc/clip) model. The fine-tuning is performed on the ViSU (Visual Safe and Unsafe) Dataset, introduced in the same [paper](https://arxiv.org/abs/2311.16254).

**Variations** Safe-CLIP comes in four versions to improve compatibility with some of the most popular vision-and-language models employed for I2T and T2I generation tasks. More details are reported in the table below.

|                          | StableDiffusion compatibility | LLaVA compatibility                |
|--------------------------|:-----------------------------:|:----------------------------------:|
| safe-CLIP ViT-L-14       | 1.4                           | llama-2-13b-chat-lightning-preview |
| safe-CLIP ViT-L-14-336px | -                             | 1.5 - 1.6                          |
| safe-CLIP ViT-H-14       | -                             | -                                  |
| safe-CLIP SD 2.0         | 2.0                           | -                                  |
**Model Release Date** 9 July 2024.
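
As a concrete example of the compatibility column above, the text encoder of the ViT-L-14 variant can in principle replace the one used by Stable Diffusion 1.4. The snippet below is only a rough sketch with `diffusers`: the repo id `aimagelab/safeclip_vit-l_14` and the SD checkpoint id are assumptions, so check the official repository for the exact identifiers.

```python
import torch
from diffusers import StableDiffusionPipeline
from transformers import CLIPTextModel

# Assumed repo id for the SD-1.4-compatible variant; verify against the official release.
safe_text_encoder = CLIPTextModel.from_pretrained("aimagelab/safeclip_vit-l_14")

pipe = StableDiffusionPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4",   # Stable Diffusion 1.4 checkpoint (assumed id)
    text_encoder=safe_text_encoder,    # swap in the safe text encoder
    torch_dtype=torch.float32,
)
image = pipe("a crowded city street at night").images[0]
```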
## Applications
Safe-CLIP can be employed in various applications where safety and appropriateness are critical, including cross-modal retrieval, text-to-image, and image-to-text generation. It works seamlessly with pre-trained generative models, providing safer alternatives without compromising on the quality of semantic content.
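
For instance, safe cross-modal retrieval reduces to ranking images by their similarity to a text query in the shared embedding space. A minimal sketch along those lines (the file names and the prompt are placeholders):

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model_id = "aimagelab/safeclip_vit-h_14"
model = CLIPModel.from_pretrained(model_id)
processor = CLIPProcessor.from_pretrained(model_id)

images = [Image.open(p) for p in ["img0.jpg", "img1.jpg", "img2.jpg"]]  # placeholder paths
inputs = processor(text=["a family picnic in the park"], images=images,
                   return_tensors="pt", padding=True)

with torch.no_grad():
    text_emb = model.get_text_features(input_ids=inputs["input_ids"],
                                       attention_mask=inputs["attention_mask"])
    image_emb = model.get_image_features(pixel_values=inputs["pixel_values"])

# Cosine similarity between the query and each image; higher means a better match.
text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
image_emb = image_emb / image_emb.norm(dim=-1, keepdim=True)
ranking = (image_emb @ text_emb.T).squeeze(-1).argsort(descending=True)
print(ranking)
```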
## Downstream Use

More example code is available in the official Safe-CLIP [repo](https://github.com/aimagelab/safe-clip).
#### Zero-shot classification example
```python
>>> from transformers import CLIPModel, CLIPProcessor
>>> from PIL import Image

>>> model_id = "aimagelab/safeclip_vit-h_14"