Update README.md

# Model Card

## Details

This model underwent training using CLIP4Clip, a video retrieval method based on the CLIP framework, as described in the paper [CLIP4Clip: An Empirical Study of CLIP for End to End Video Clip Retrieval](https://arxiv.org/pdf/2104.08860.pdf) by Luo et al. and implemented in the accompanying [code](https://github.com/ArrowLuo/CLIP4Clip).

The training process involved 150,000 videos obtained from the [WebVid Dataset](https://m-bain.github.io/webvid-dataset/), a comprehensive collection of short videos with corresponding textual descriptions sourced from the web.

To adapt the CLIP model obtained during training, we adjusted the weights and integrated them into the implementation of [clip-vit-base-patch32](https://huggingface.co/openai/clip-vit-base-patch32).

### Use with Transformers

### Extracting Text Embeddings:

```python
import numpy as np
import torch
from transformers import CLIPTokenizer, CLIPTextModelWithProjection

search_sentence = "a basketball player performing a slam dunk"

# Load the fine-tuned text encoder and its tokenizer.
model = CLIPTextModelWithProjection.from_pretrained("Diangle/clip4clip-webvid")
tokenizer = CLIPTokenizer.from_pretrained("Diangle/clip4clip-webvid")

inputs = tokenizer(text=search_sentence, return_tensors="pt")
outputs = model(input_ids=inputs["input_ids"], attention_mask=inputs["attention_mask"])

# Normalizing the embeddings: outputs[0] is the projected, pooled text embedding,
# so after L2 normalization cosine similarity becomes a plain dot product.
final_output = outputs[0] / outputs[0].norm(dim=-1, keepdim=True)
final_output = final_output.cpu().detach().numpy()
print("final_output: ", final_output)
```

### Extracting Video Embeddings:

For video embeddings, there is an extra [notebook](https://huggingface.co/Diangle/clip4clip-webvid/blob/main/Notebooks/GSI_VideoRetrieval_EmbedVideos.ipynb) that describes how to embed videos.
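
As a rough orientation (the notebook is the authoritative reference), the sketch below shows one way to compute a video embedding with the Transformers CLIP classes, following CLIP4Clip's mean pooling of frame embeddings. The frame-sampling helper, the choice of 12 frames, and the assumption that this repository's vision-tower weights load via `CLIPVisionModelWithProjection` are illustrative rather than taken from the notebook.

```python
import cv2  # assumption: OpenCV is used here only for frame extraction
import numpy as np
import torch
from transformers import CLIPVisionModelWithProjection, CLIPImageProcessor

# Assumption: the repository also ships vision-tower weights; the image
# processor is taken from the base CLIP checkpoint the model builds on.
vision_model = CLIPVisionModelWithProjection.from_pretrained("Diangle/clip4clip-webvid")
processor = CLIPImageProcessor.from_pretrained("openai/clip-vit-base-patch32")

def embed_video(path, num_frames=12):
    # Uniformly sample `num_frames` frames from the clip.
    cap = cv2.VideoCapture(path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    indices = np.linspace(0, max(total - 1, 0), num_frames).astype(int)
    frames = []
    for idx in indices:
        cap.set(cv2.CAP_PROP_POS_FRAMES, int(idx))
        ok, frame = cap.read()
        if ok:
            frames.append(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
    cap.release()

    # Encode each frame and project it into the shared text-video space.
    inputs = processor(images=frames, return_tensors="pt")
    with torch.no_grad():
        frame_embeds = vision_model(**inputs).image_embeds
    frame_embeds = frame_embeds / frame_embeds.norm(dim=-1, keepdim=True)

    # Mean-pool the frame embeddings (CLIP4Clip's parameter-free "meanP" head)
    # and renormalize so cosine similarity reduces to a dot product.
    video_embed = frame_embeds.mean(dim=0)
    video_embed = video_embed / video_embed.norm()
    return video_embed.cpu().numpy()
```

Mean pooling is the parameter-free similarity header studied in the CLIP4Clip paper; the notebook may use a different frame count or preprocessing.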

## Model Intended Use

This model is intended to be used for video retrieval; see, for example, this [**SPACE**](https://huggingface.co/spaces/Diangle/Clip4Clip-webvid).
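
Retrieval then reduces to a cosine-similarity search between the normalized query embedding and the stored video embeddings. A minimal, illustrative sketch, reusing `final_output` from the text example above and the hypothetical `embed_video` helper sketched earlier (the paths are placeholders):

```python
import numpy as np

# Illustrative index: one row per video, built e.g. with the embed_video()
# sketch above.
video_embeddings = np.stack([embed_video(p) for p in ["clip1.mp4", "clip2.mp4"]])

# final_output is the normalized (1, 512) query embedding from the text example.
similarities = video_embeddings @ final_output.T   # cosine similarities, shape (num_videos, 1)
ranking = np.argsort(-similarities.squeeze(-1))    # best-matching videos first
print("Top matches:", ranking[:5])
```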

## Performance and Limitations

### Performance

We have evaluated the performance of different models on the last 10,000 video clips from the WebVid dataset.

| Model | R1 | R5 | R10 | MedianR | MeanR |
|-------|----|----|-----|---------|-------|
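
For reference, the columns report standard text-to-video retrieval metrics: R1/R5/R10 give the fraction (or percentage) of queries whose ground-truth clip is ranked in the top 1/5/10 results, while MedianR and MeanR are the median and mean rank of the ground-truth clip. A sketch of how they can be computed from a query-by-video similarity matrix (not the evaluation script used here):

```python
import numpy as np

def retrieval_metrics(sim):
    """sim: square (num_queries, num_videos) similarity matrix where the
    ground-truth video for query i sits at column i."""
    order = np.argsort(-sim, axis=1)                              # best match first
    ranks = np.where(order == np.arange(len(sim))[:, None])[1] + 1  # rank of the correct video
    return {
        "R1": float(np.mean(ranks <= 1)),    # multiply by 100 if reporting percentages
        "R5": float(np.mean(ranks <= 5)),
        "R10": float(np.mean(ranks <= 10)),
        "MedianR": float(np.median(ranks)),
        "MeanR": float(np.mean(ranks)),
    }
```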