Update README.md
README.md
## NSFW Definition

In our work, taking inspiration from this [paper](https://arxiv.org/abs/2211.05105), we define NSFW as a finite and fixed set of concepts that are considered inappropriate, offensive, or harmful to individuals. These concepts are divided into twenty categories: _hate, harassment, violence, suffering, humiliation, harm, suicide, sexual, nudity, bodily fluids, blood, obscene gestures, illegal activity, drug use, theft, vandalism, weapons, child abuse, brutality and cruelty_.
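
For quick experimentation, the twenty categories can be kept as a plain Python tuple. This is only an illustrative sketch; the released model does not expose this list programmatically and the variable name is made up:

```python
# The twenty NSFW categories listed above (illustrative constant, not part of the model API).
NSFW_CATEGORIES = (
    "hate", "harassment", "violence", "suffering", "humiliation", "harm",
    "suicide", "sexual", "nudity", "bodily fluids", "blood", "obscene gestures",
    "illegal activity", "drug use", "theft", "vandalism", "weapons",
    "child abuse", "brutality", "cruelty",
)
assert len(NSFW_CATEGORIES) == 20
```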
#### Use with Transformers

See the snippet below for usage with Transformers:

```python
>>> from transformers import CLIPModel

>>> model_id = "aimagelab/safeclip_vit-h_14"
>>> model = CLIPModel.from_pretrained(model_id)
```
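
The matching preprocessor can be loaded from the same repo id (as in the zero-shot example further below), and the model exposes the usual CLIP feature-extraction API. A minimal sketch, with a placeholder prompt:

```python
>>> from transformers import CLIPProcessor

>>> processor = CLIPProcessor.from_pretrained(model_id)

>>> text_inputs = processor(text=["a photo of a dog"], padding=True, return_tensors="pt")
>>> text_features = model.get_text_features(**text_inputs)  # shape: (1, projection_dim)
```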
## Model Details

Safe-CLIP is a fine-tuned version of the [CLIP](https://huggingface.co/docs/transformers/en/model_doc/clip) model. The fine-tuning is performed on the ViSU (Visual Safe and Unsafe) Dataset, introduced in the same [paper](https://arxiv.org/abs/2311.16254).

**Variations** Safe-CLIP comes in four versions to improve compatibility with some of the most popular vision-and-language models employed for I2T and T2I generation tasks. More details are reported in the table below.

|                          | StableDiffusion compatibility | LLaVA compatibility                |
|--------------------------|:-----------------------------:|:----------------------------------:|
| safe-CLIP ViT-L-14       | 1.4                           | llama-2-13b-chat-lightning-preview |
| safe-CLIP ViT-L-14-336px | -                             | 1.5 - 1.6                          |
| safe-CLIP ViT-H-14       | -                             | -                                  |
| safe-CLIP SD 2.0         | 2.0                           | -                                  |
**Model Release Date** 9 July 2024.
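
As a concrete example of the compatibility column above, the text encoder of the ViT-L-14 variant can in principle replace the one used by Stable Diffusion 1.4. The snippet below is only a rough sketch with `diffusers`: the repo id `aimagelab/safeclip_vit-l_14` and the SD checkpoint id are assumptions, so check the official repository for the exact identifiers.

```python
import torch
from diffusers import StableDiffusionPipeline
from transformers import CLIPTextModel

# Assumed repo id for the SD-1.4-compatible variant; verify against the official release.
safe_text_encoder = CLIPTextModel.from_pretrained("aimagelab/safeclip_vit-l_14")

pipe = StableDiffusionPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4",   # Stable Diffusion 1.4 checkpoint (assumed id)
    text_encoder=safe_text_encoder,    # swap in the safe text encoder
    torch_dtype=torch.float32,
)
image = pipe("a crowded city street at night").images[0]
```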
## Applications
Safe-CLIP can be employed in various applications where safety and appropriateness are critical, including cross-modal retrieval, text-to-image, and image-to-text generation. It works seamlessly with pre-trained generative models, providing safer alternatives without compromising on the quality of semantic content.
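
For instance, safe cross-modal retrieval reduces to ranking images by their similarity to a text query in the shared embedding space. A minimal sketch along those lines (the file names and the prompt are placeholders):

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model_id = "aimagelab/safeclip_vit-h_14"
model = CLIPModel.from_pretrained(model_id)
processor = CLIPProcessor.from_pretrained(model_id)

images = [Image.open(p) for p in ["img0.jpg", "img1.jpg", "img2.jpg"]]  # placeholder paths
inputs = processor(text=["a family picnic in the park"], images=images,
                   return_tensors="pt", padding=True)

with torch.no_grad():
    text_emb = model.get_text_features(input_ids=inputs["input_ids"],
                                       attention_mask=inputs["attention_mask"])
    image_emb = model.get_image_features(pixel_values=inputs["pixel_values"])

# Cosine similarity between the query and each image; higher means a better match.
text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
image_emb = image_emb / image_emb.norm(dim=-1, keepdim=True)
ranking = (image_emb @ text_emb.T).squeeze(-1).argsort(descending=True)
print(ranking)
```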
## Downstream Use

More example code is available in the official Safe-CLIP [repo](https://github.com/aimagelab/safe-clip).
#### Zero-shot classification example
```python
>>> from transformers import CLIPModel, CLIPProcessor
>>> from PIL import Image

>>> model_id = "aimagelab/safeclip_vit-h_14"