Commit b78ef94 (parent: ee2def5): Update README.md
# CLIPfa: Connecting Farsi Text and Images

OpenAI released the paper [`Learning Transferable Visual Models From Natural Language Supervision`](https://arxiv.org/abs/2103.00020), in which they present CLIP (Contrastive Language–Image Pre-training). The model is trained to connect text and images by matching their vector representations with a contrastive learning objective. CLIP consists of two separate models, a vision encoder and a text encoder, trained on 400 million image-caption pairs. We have trained a Farsi (Persian) version of OpenAI's CLIP on a dataset of 400,000 (image, text) pairs. We used [`Farahani's RoBERTa-fa`](https://huggingface.co/m3hrdadfi/roberta-zwnj-wnli-mean-tokens) as the text encoder and the [`ViT`](https://huggingface.co/openai/clip-vit-base-patch32) from the original CLIP as the vision encoder, and fine-tuned them.


Note that only 400K pairs were used to train CLIPfa, whereas the original CLIP was trained on 400 million pairs, a job that took 30 days on 592 V100 GPUs.
## How to use?
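The `demo` object used below wraps a vision encoder, a text encoder, and a tokenizer. As a reference point, a minimal sketch of loading those pieces from the Hugging Face Hub might look like the following (the checkpoint names `SajjadAyoubi/clip-fa-vision` and `SajjadAyoubi/clip-fa-text` are assumptions, not taken from this README):

```python
from transformers import AutoTokenizer, CLIPVisionModel, RobertaModel

# Assumed Hub checkpoint names for the CLIPfa encoders (not quoted from this README).
vision_encoder = CLIPVisionModel.from_pretrained('SajjadAyoubi/clip-fa-vision')
text_encoder = RobertaModel.from_pretrained('SajjadAyoubi/clip-fa-text')
tokenizer = AutoTokenizer.from_pretrained('SajjadAyoubi/clip-fa-text')
```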

```python
# CLIPDemo wraps the two encoders and the tokenizer
demo = CLIPDemo(vision_encoder, text_encoder, tokenizer)
demo.compute_text_embeddings(['گاو' ,'اسب' ,'ماهی'])  # 'cow', 'horse', 'fish'
demo.compute_image_embeddings(test_df.image_path.to_list())
```

### Image Search:

```python
demo.image_search(query='غروب خورشید')  # query: "sunset"
```



```python
demo.image_search(query='جنگل در زمستان برفی')  # query: "a forest in snowy winter"
```



### Analogy:

```python
demo.anology('sunset.jpg', additional_text='دریا')  # additional_text: "sea"
```



```python
demo.anology('sunset.jpg', additional_text='برف')  # additional_text: "snow"
```



### Zero Shot Image Classification:

```python
demo.zero_shot(image_path='apples.jpg')
```

- The model returns the probability of each provided label (گاو "cow", ماهی "fish", اسب "horse") for every image:

| گاو: 36, ماهی: 22, اسب: **42** | گاو: **41**, ماهی: 23, اسب: 36 | گاو: 26, ماهی: **45**, اسب: 27 |
| :----------------------------: | :----------------------------: | :----------------------------: |
|  |  |  |
## Online Demo: [CLIPfa at Huggingface🤗 spaces](https://huggingface.co/spaces/SajjadAyoubi/CLIPfa-Demo)
We used a small set of images (25K) to keep this app almost real-time, but the quality of image search obviously depends heavily on the size of the image database.



## Dataset: 400K

We started with the question of how much the original CLIP model depends on its huge training dataset and the wide range of concepts it covers. Our model shows that an acceptable result can be reached with only a small amount of data, even though it may not have seen enough concepts and subjects to be used broadly. Our model was trained on a dataset gathered from several sources, such as Flickr30k, MS-COCO 2017, and Google CC3M. We translated these datasets into Persian with a [`tool`](https://github.com/sajjjadayobi/CLIPfa/blob/main/clipfa/data/translation.py) we built ourselves: it takes a list of English captions, translates them with Google Translate, and keeps only the best translations using a multilingual similarity check.
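As an illustration of that filtering step, each candidate translation can be scored by embedding the English caption and its Persian translation with a multilingual sentence encoder and keeping only pairs whose similarity is high enough. A rough sketch of the idea (the `sentence-transformers` model name and the 0.8 threshold are illustrative assumptions, not details taken from this README):

```python
from sentence_transformers import SentenceTransformer, util

# Multilingual encoder and threshold are illustrative choices, not the repository's exact setup.
encoder = SentenceTransformer('paraphrase-multilingual-MiniLM-L12-v2')

def keep_good_translations(pairs, threshold=0.8):
    """Keep (english, farsi) caption pairs whose sentence embeddings are similar enough."""
    en = encoder.encode([p[0] for p in pairs], convert_to_tensor=True)
    fa = encoder.encode([p[1] for p in pairs], convert_to_tensor=True)
    scores = util.cos_sim(en, fa).diagonal()
    return [p for p, s in zip(pairs, scores) if s >= threshold]
```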

- Note: We used [`img2dataset`](https://github.com/rom1504/img2dataset), a great tool for downloading large-scale image datasets such as MS-COCO. It can download, resize, and package 100M URLs in 20 hours on a single machine, and it also supports saving captions for URL+caption datasets.
- [`coco-flickr-fa 130K on Kaggle`](https://www.kaggle.com/navidkanaani/coco-flickr-farsi)

## Training: <a href="https://colab.research.google.com/github/sajjjadayobi/CLIPfa/blob/main/notebook/CLIPfa_Training.ipynb"><img src="https://img.shields.io/static/v1?label=%F0%9F%A4%97%20Hugging%20Face&message=CLIPfa Training&color=white"></a>

Any dataset can be used with little change to the [`training code`](https://github.com/sajjjadayobi/CLIPfa/tree/main/clipfa). CLIPfa can be trained with other encoders as long as their final layers have the same hidden size. In [`this notebook`](https://github.com/sajjjadayobi/CLIPfa/blob/main/notebook/CLIPfa_Training.ipynb), I used the [`training code`](https://github.com/sajjjadayobi/CLIPfa/tree/main/clipfa) to train a small CLIP on the translated [`flickr30K`](https://www.kaggle.com/sajjadayobi360/flickrfa) dataset.
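For reference, the contrastive objective behind CLIP-style training is a symmetric cross-entropy over the image-text similarity matrix of a batch, as described in the CLIP paper linked above. A compact sketch of that loss (illustrative only, not this repository's exact training loop; the learnable temperature is simplified to a constant):

```python
import torch
import torch.nn.functional as F

def clip_loss(image_embeddings: torch.Tensor, text_embeddings: torch.Tensor,
              temperature: float = 0.07) -> torch.Tensor:
    """Symmetric contrastive loss over a batch of paired (image, text) embeddings."""
    image_embeddings = F.normalize(image_embeddings, dim=-1)
    text_embeddings = F.normalize(text_embeddings, dim=-1)
    logits = image_embeddings @ text_embeddings.T / temperature   # (batch, batch) similarity matrix
    targets = torch.arange(logits.size(0))                        # matching pairs lie on the diagonal
    return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.T, targets)) / 2
```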

## Citation: ↩️

If you have a technical question regarding the model, the code, or this publication, create an issue in the repository.
We have not published a paper on this work; however, if you use it in your research, please cite us with an entry like the one below.

```bibtex
@misc{CLIPfa,
  author = {Sajjad Ayoubi and Navid Kanaani},
  title = {CLIPfa: Connecting Farsi Text and Images},
  year = 2021,
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/SajjjadAyobi/CLIPfa}},
}
```
> Made with ❤️ in my basement🤫