# MultiLingual CLIP

Multilingual CLIP is a pre-trained model which can be used for multilingual semantic search and zero-shot image classification in 100 languages.

# Model Architecture

Multilingual CLIP was built on the [OpenAI CLIP](https://github.com/openai/CLIP) model. I kept the same vision encoder (ResNet 50x4) but replaced the original text encoder (a Transformer) with a multilingual text encoder ([XLM-RoBERTa](https://huggingface.co/xlm-roberta-large)) and a configurable number of projection heads, as shown below:

![Model Architecture](https://challengepost-s3-challengepost.netdna-ssl.com/photos/production/software_photos/001/858/046/datas/gallery.jpg)

The model was trained in a distributed fashion on 16 Habana Gaudi accelerators with mixed precision, in two phases (the COCO dataset for phase 1 and Google Conceptual Captions for phase 2). The training pipeline was built using PyTorch, PyTorch Lightning, and Distributed Data Parallel.

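For illustration, here is a minimal sketch of what the text side of this architecture can look like: XLM-RoBERTa followed by a stack of projection layers mapping into CLIP's embedding space. This is a hedged sketch, not the repository's `models.py`; the class name, the pooling choice, and the 640-dimensional RN50x4 embedding size are assumptions.

```python
import torch.nn as nn
from transformers import AutoModel


class TextEncoderSketch(nn.Module):
    """Illustrative only: a multilingual backbone plus a configurable stack of
    projection layers that maps its pooled output into CLIP's embedding space
    (assumed 640-d for the RN50x4 vision encoder)."""

    def __init__(self, model_name="xlm-roberta-base", num_layers=3, clip_dim=640):
        super().__init__()
        self.backbone = AutoModel.from_pretrained(model_name)
        hidden = self.backbone.config.hidden_size
        layers = []
        for _ in range(num_layers - 1):
            layers += [nn.Linear(hidden, hidden), nn.GELU()]
        layers.append(nn.Linear(hidden, clip_dim))
        self.projection = nn.Sequential(*layers)

    def forward(self, input_ids, attention_mask):
        out = self.backbone(input_ids=input_ids, attention_mask=attention_mask)
        pooled = out.last_hidden_state[:, 0]  # first-token ("CLS"-style) embedding
        embedding = self.projection(pooled)
        return embedding / embedding.norm(dim=-1, keepdim=True)
```
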
# Datasets

Three datasets have been used for building the model: COCO Captions for training phase 1, Google Conceptual Captions for training phase 2, and the Unsplash dataset for testing and inference.

## COCO Captions

COCO (Common Objects in Context) is a large-scale object detection, segmentation, and captioning dataset. The COCO Captions dataset has around 85,000 image-caption pairs.

Run the following to download the dataset:

```bash
./download_coco.sh
```

This dataset was used for the first pre-training phase.

## Google Conceptual Captions

Conceptual Captions is a dataset consisting of ~3.3 million images annotated with captions. In contrast with the curated style of other image caption annotations, Conceptual Captions images and their raw descriptions are harvested from the web, and therefore represent a wider variety of styles.

Download the dataset's URLs/captions from [here](https://storage.cloud.google.com/gcc-data/Train/GCC-training.tsv?_ga=2.191230122.-1896153081.1529438250) and save the file to `datasets/googlecc/googlecc.tsv`. The full dataset has over 3 million images, but you can select a subset by loading the `googlecc.tsv` file and keeping only the number of rows you want (I have used 1 million images for training), for example as sketched below.

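A quick way to do that with pandas (a sketch; it assumes the standard two-column, tab-separated layout of the GCC training file and overwrites the TSV in place):

```python
import pandas as pd

# The GCC training file is tab-separated with no header row: caption <TAB> image URL.
df = pd.read_csv("datasets/googlecc/googlecc.tsv", sep="\t", header=None,
                 names=["caption", "url"])

# Keep the first 1M rows (the subset size used here) and write them back.
df.head(1_000_000).to_csv("datasets/googlecc/googlecc.tsv", sep="\t",
                          header=False, index=False)
```
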
Then run the following commands to download each image listed in the `googlecc.tsv` file:

```bash
npm install
node download_build_googlecc.js
```

This dataset was used for the second pre-training phase.

## Unsplash

This dataset was used as the test set during inference.

Run `python3.8 download_unsplash.py` to download the dataset.

# Training

![Training phase 1](https://challengepost-s3-challengepost.netdna-ssl.com/photos/production/software_photos/001/858/047/datas/gallery.jpg)

![Training phase 2](https://challengepost-s3-challengepost.netdna-ssl.com/photos/production/software_photos/001/858/048/datas/gallery.jpg)

## Setup

Create two Habana instances ([AWS EC2 DL1](https://aws.amazon.com/ec2/instance-types/dl1/)) using the [Habana® Deep Learning Base AMI (Ubuntu 20.04)](https://aws.amazon.com/marketplace/pp/prodview-fw46rwuxrtfse).

Create the PyTorch Docker container by running:

```bash
docker run --name pytorch -td --runtime=habana -e HABANA_VISIBLE_DEVICES=all -e OMPI_MCA_btl_vader_single_copy_mechanism=none --cap-add=sys_nice --net=host --ipc=host vault.habana.ai/gaudi-docker/1.2.0/ubuntu20.04/habanalabs/pytorch-installer-1.10.0:1.2.0-585
```

Enter the container by running:

```bash
docker exec -it pytorch /bin/bash
```

#### Set up password-less SSH between all connected servers

1. Configure password-less SSH between all nodes.

Run the following in every node's Docker session:
```bash
mkdir ~/.ssh
cd ~/.ssh
ssh-keygen -t rsa -b 4096
```
Copy the contents of `id_rsa.pub` from every node's container into every other node's `~/.ssh/authorized_keys` (every host's public key needs to be in every host's `authorized_keys`):
```bash
cat id_rsa.pub > authorized_keys
vi authorized_keys
```

2. On each system, add all hosts (including itself) to `known_hosts`. The IP addresses used below are just for illustration:
```bash
ssh-keyscan -p 3022 -H $IP1 >> ~/.ssh/known_hosts
ssh-keyscan -p 3022 -H $IP2 >> ~/.ssh/known_hosts
```

3. Change the Docker SSH port to 3022:
```bash
sed -i 's/#Port 22/Port 3022/g' /etc/ssh/sshd_config
sed -i 's/#PermitRootLogin prohibit-password/PermitRootLogin yes/' /etc/ssh/sshd_config
service ssh restart
```

[Allow all TCP](https://docs.aws.amazon.com/vpc/latest/userguide/VPC_SecurityGroups.html) traffic between the nodes on AWS.

Clone the git repo:

```bash
git clone https://github.com/gzomer/clip-multilingual
```

Create a virtual environment:

```bash
python3.8 -m venv .env
```

Activate the environment:

```bash
source .env/bin/activate
```

Install the requirements:

```bash
python3.8 -m pip install -r requirements.txt
```

## Training params

- Learning rate: 1e-3
- Batch size: 64
- Phase 1 epochs: 100
- Phase 2 epochs: 15

## Train script arguments

```
--dataset-num-workers            Number of workers (default: 8)
--dataset-type                   Dataset type (coco or googlecc) (default: coco)
--dataset-dir                    Dataset dir (default: ./datasets/coco/)
--dataset-subset-size            Load only a subset of the dataset (useful for debugging)
--dataset-train-split            Dataset train split (default: 0.8)
--train-device                   Type of device to use (default: hpu)
--distributed-num-nodes          Number of nodes (machines) (default: 2)
--distributed-parallel-devices   Number of parallel devices per node (default: 8)
--distributed-master-address     Master node IP address
--distributed-master-port        Master node port (default: 12345)
--distributed-bucket-cap-mb      DDP bucket cap in MB (default: 200)
--checkpoint-dir                 Model checkpoint dir (default: ./models)
--checkpoint-save-every-n        Save a checkpoint every n epochs (default: 1)
--checkpoint-load-vision-path    Load vision encoder checkpoint
--checkpoint-load-text-path      Load text encoder checkpoint
--model-visual-name              Which visual model to use (default: RN50x4)
--model-textual-name             Which textual model to use (default: xlm-roberta-base)
--hyperparam-num-layers          Number of layers (default: 3)
--hyperparam-lr                  Model learning rate (default: 0.001)
--hyperparam-epochs              Max epochs (default: 100)
--hyperparam-precision           Precision (default: 16)
--hyperparam-batch-size          Batch size (default: 64)
--wandb-project                  W&B project name (default: clip)
--wandb-enabled                  Enable W&B logging (default: True)
```

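The distributed and hyperparameter flags above correspond roughly to PyTorch Lightning `Trainer` arguments. The following is a hedged sketch of that mapping, not the repository's actual `train.py`, and it assumes a Lightning version with Habana HPU support:

```python
import pytorch_lightning as pl

# Illustrative mapping of the CLI flags onto a Lightning Trainer (assumed, not the repo's code).
trainer = pl.Trainer(
    accelerator="hpu",   # --train-device (use "cpu" or "gpu" on other hardware)
    devices=8,           # --distributed-parallel-devices
    num_nodes=2,         # --distributed-num-nodes
    precision=16,        # --hyperparam-precision
    max_epochs=100,      # --hyperparam-epochs
    # --distributed-bucket-cap-mb would be passed through to DistributedDataParallel's
    # bucket_cap_mb by the distributed strategy.
)
# trainer.fit(model, datamodule) would then launch DDP training across the nodes.
```
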
## Habana Gaudi - 8 accelerators

### Phase 1 training

```bash
python3.8 train.py --train-device hpu --distributed-parallel-devices 8 --distributed-num-nodes 1
```

### Phase 2 training

```bash
python3.8 train.py --train-device hpu --distributed-parallel-devices 8 --distributed-num-nodes 1 --hyperparam-epochs 15 --checkpoint-load-text-path /home/models/text-last.ckpt --checkpoint-load-vision-path /home/models/vision-last.ckpt --checkpoint-dir ./models_phase2
```

## Habana Gaudi - 16 accelerators (multi-server training)

Change the master IP address based on your instances (use the local IP, not the public IP).

### Phase 1 training

On the master node:

```bash
NODE_RANK=0 python3.8 train.py --distributed-master-address 172.31.86.231 --train-device hpu --distributed-parallel-devices 8 --distributed-num-nodes 2
```

On the second node:

```bash
NODE_RANK=1 python3.8 train.py --distributed-master-address 172.31.86.231 --train-device hpu --distributed-parallel-devices 8 --distributed-num-nodes 2
```

### Phase 2 training

On the master node:

```bash
NODE_RANK=0 python3.8 train.py --distributed-master-address 172.31.86.231 --train-device hpu --distributed-parallel-devices 8 --distributed-num-nodes 2 --hyperparam-epochs 15 --checkpoint-load-text-path /home/models/text-last.ckpt --checkpoint-load-vision-path /home/models/vision-last.ckpt --checkpoint-dir ./models_phase2
```

On the second node:

```bash
NODE_RANK=1 python3.8 train.py --distributed-master-address 172.31.86.231 --train-device hpu --distributed-parallel-devices 8 --distributed-num-nodes 2 --hyperparam-epochs 15 --checkpoint-load-text-path /home/models/text-last.ckpt --checkpoint-load-vision-path /home/models/vision-last.ckpt --checkpoint-dir ./models_phase2
```

## Other devices

If you don't have access to a Habana Gaudi accelerator yet, you can also train on CPU/GPU, although it will be much slower.

To train on CPU, pass `--train-device=cpu` to the `train.py` script; to train on GPU, pass `--train-device=cuda`.

# Inference

## Loading the pre-trained model from the Hugging Face Hub

```python
from models import create_and_load_from_hub

model = create_and_load_from_hub()
```

## Loading the model from local checkpoints

```python
from models import MultiLingualCLIP, load_model

text_checkpoint_path = '/path/to/text model checkpoint'
vision_checkpoint_path = '/path/to/vision model checkpoint'

model = MultiLingualCLIP(num_layers=3)
load_model(model, vision_checkpoint_path, text_checkpoint_path)
```

## Generate embeddings

Run the following (after downloading the Unsplash dataset):

`python3.8 ./generate_embeddings.py`

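The script above handles this for the Unsplash set. Purely for illustration, producing image embeddings with the stock OpenAI CLIP RN50x4 vision encoder looks roughly like the sketch below; the folder path and output filename are placeholders, and the real script uses the fine-tuned checkpoints rather than the stock weights.

```python
from pathlib import Path

import clip
import numpy as np
import torch
from PIL import Image

# Embed every image in a folder and save the matrix for later semantic search.
device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("RN50x4", device=device)

image_paths = sorted(Path("datasets/unsplash/photos").glob("*.jpg"))
embeddings = []
with torch.no_grad():
    for path in image_paths:
        image = preprocess(Image.open(path)).unsqueeze(0).to(device)
        features = model.encode_image(image)
        embeddings.append((features / features.norm(dim=-1, keepdim=True)).cpu().numpy())

np.save("unsplash_embeddings.npy", np.concatenate(embeddings, axis=0))
```
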
## Searching images

```python
import numpy as np
from search import MultiLingualSearch

images_embeddings = np.load('/path/to/images_embeddings')
# One entry of image info (e.g. a URL, file path, or id) per row of the embeddings
# matrix; these entries are returned by the search function.
images_data = [...]
semantic_search = MultiLingualSearch(model, images_embeddings, images_data)

results = semantic_search.search('विद्यालय में')  # Hindi for "at school"
print(results)
```
```json
[{"image": "https://images.unsplash.com/photo-1557804506-669a67965ba0?crop=entropy&cs=tinysrgb&fit=max&fm=jpg&ixid=MnwyNDg3OTV8MHwxfHNlYXJjaHwxM3x8bWVldGluZ3N8ZW58MHx8fHwxNjQ1NjA2MjQz&ixlib=rb-1.2.1&q=80&w=400",
  "prob": 0.2461608648300171},
 {"image": "https://images.unsplash.com/photo-1558403194-611308249627?crop=entropy&cs=tinysrgb&fit=max&fm=jpg&ixid=MnwyNDg3OTV8MHwxfHNlYXJjaHwyMXx8cGVvcGxlJTIwd29ya2luZ3xlbnwwfHx8fDE2NDU2MDMyMjE&ixlib=rb-1.2.1&q=80&w=400",
  "prob": 0.16881239414215088},
 {"image": "https://images.unsplash.com/photo-1531497865144-0464ef8fb9a9?crop=entropy&cs=tinysrgb&fit=max&fm=jpg&ixid=MnwyNDg3OTV8MHwxfHNlYXJjaHw4Nnx8cGVvcGxlJTIwd29ya2luZ3xlbnwwfHx8fDE2NDU2MDY5ODc&ixlib=rb-1.2.1&q=80&w=400",
  "prob": 0.14744874835014343},
 {"image": "https://images.unsplash.com/photo-1561089489-f13d5e730d72?crop=entropy&cs=tinysrgb&fit=max&fm=jpg&ixid=MnwyNDg3OTV8MHwxfHNlYXJjaHw5MHx8ZWR1Y2F0aW9ufGVufDB8fHx8MTY0NTYwNjk1Nw&ixlib=rb-1.2.1&q=80&w=400",
  "prob": 0.095176100730896},
 {"image": "https://images.unsplash.com/photo-1580582932707-520aed937b7b?crop=entropy&cs=tinysrgb&fit=max&fm=jpg&ixid=MnwyNDg3OTV8MHwxfHNlYXJjaHwxMnx8ZWR1Y2F0aW9ufGVufDB8fHx8MTY0NTYwMzIwMA&ixlib=rb-1.2.1&q=80&w=400",
  "prob": 0.05218643322587013}]
```

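The `prob` values above can be read as a softmax over the similarities between the query's text embedding and the image embeddings. Here is a minimal sketch of that scoring, for illustration only (it is not the repository's `search.py`, and CLIP-style scoring also multiplies the similarities by a learned temperature before the softmax):

```python
import numpy as np


def rank_images(text_embedding, image_embeddings, images_data, top_k=5):
    """Toy version of the scoring behind the results above: cosine similarity
    between L2-normalized embeddings, turned into probabilities with a softmax."""
    text = text_embedding / np.linalg.norm(text_embedding)
    images = image_embeddings / np.linalg.norm(image_embeddings, axis=1, keepdims=True)
    sims = images @ text                       # cosine similarity per image
    probs = np.exp(sims) / np.exp(sims).sum()  # softmax over all images
    order = np.argsort(-probs)[:top_k]
    return [{"image": images_data[i], "prob": float(probs[i])} for i in order]
```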