---
tags:
- text
- vision
- video
datasets:
- HuggingFaceM4/webvid
pipeline_tag: text-to-video
---


# Model Card
## Details
This model was trained using CLIP4Clip, a video retrieval method built on the CLIP framework, as described in the paper [CLIP4Clip: An Empirical Study of CLIP for End to End Video Clip Retrieval](https://arxiv.org/pdf/2104.08860.pdf) by Luo et al. and implemented in the accompanying [code](https://github.com/ArrowLuo/CLIP4Clip).

The model was trained on 150,000 videos from the [WebVid Dataset](https://m-bain.github.io/webvid-dataset/), a large collection of short web videos paired with textual descriptions.

To integrate the trained CLIP4Clip model into the [clip-vit-base-patch32](https://huggingface.co/openai/clip-vit-base-patch32) implementation, we modified the weights accordingly.


### Use with Transformers
### Extracting Text Embeddings:

```python
import numpy as np
import torch
from transformers import CLIPTokenizer, CLIPTextModelWithProjection


search_sentence = "a basketball player performing a slam dunk"

model = CLIPTextModelWithProjection.from_pretrained("Diangle/clip4clip-webvid")
tokenizer = CLIPTokenizer.from_pretrained("Diangle/clip4clip-webvid")

inputs = tokenizer(text=search_sentence, return_tensors="pt")
outputs = model(input_ids=inputs["input_ids"], attention_mask=inputs["attention_mask"])

# Normalize the projected text embedding to unit length:
final_output = outputs[0] / outputs[0].norm(dim=-1, keepdim=True)
final_output = final_output.cpu().detach().numpy()
print("final_output: ", final_output)
```
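Here `outputs[0]` is the projected text embedding (`text_embeds`, a 512-dimensional vector for this architecture); normalizing it to unit length means similarity against video embeddings can later be computed as a plain dot product.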

### Extracting Video Embeddings: 

An additional [notebook](https://huggingface.co/Diangle/clip4clip-webvid/blob/main/Notebooks/GSI_VideoRetrieval_VideoEmbedding.ipynb) demonstrates how to extract video embeddings.
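If the model follows CLIP4Clip's default mean-pooling ("meanP") setting, a video embedding is simply the average of per-frame CLIP image embeddings. The sketch below is a minimal illustration under that assumption; the checkpoint names, the use of `CLIPVisionModelWithProjection`/`CLIPImageProcessor`, and the `embed_video` helper are illustrative choices, and the notebook above is the authoritative reference.

```python
import numpy as np
import torch
from PIL import Image
from transformers import CLIPImageProcessor, CLIPVisionModelWithProjection

# Assumption: the vision weights and preprocessing config load from this repo;
# they may instead need to come from openai/clip-vit-base-patch32.
processor = CLIPImageProcessor.from_pretrained("Diangle/clip4clip-webvid")
model = CLIPVisionModelWithProjection.from_pretrained("Diangle/clip4clip-webvid")
model.eval()


def embed_video(frames: list[Image.Image]) -> np.ndarray:
    """Embed a clip given a list of uniformly sampled RGB frames (PIL images)."""
    inputs = processor(images=frames, return_tensors="pt")
    with torch.no_grad():
        outputs = model(pixel_values=inputs["pixel_values"])
    frame_embeds = outputs.image_embeds                                    # (num_frames, 512)
    frame_embeds = frame_embeds / frame_embeds.norm(dim=-1, keepdim=True)  # per-frame L2 norm
    video_embed = frame_embeds.mean(dim=0)                                 # mean pooling over frames
    video_embed = video_embed / video_embed.norm()                         # renormalize the pooled vector
    return video_embed.numpy()
```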


## Model Intended Use

This model is intended for text-to-video retrieval; see, for example, this [**Space**](https://huggingface.co/spaces/Diangle/Clip4Clip-webvid).
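At query time, retrieval reduces to a nearest-neighbour search in the shared embedding space. Below is a minimal sketch, assuming you have already computed a normalized text embedding and a matrix of normalized video embeddings as described above (`rank_videos` and its arguments are illustrative names, not part of this repository).

```python
import numpy as np


def rank_videos(text_embed: np.ndarray, video_embeds: np.ndarray, top_k: int = 5):
    """Return the indices and scores of the top_k best-matching videos.

    text_embed:   (1, 512) L2-normalized text embedding (see the text example above).
    video_embeds: (num_videos, 512) L2-normalized, precomputed video embeddings.
    """
    # With unit-norm embeddings, cosine similarity is just a dot product.
    scores = video_embeds @ text_embed.reshape(-1)
    top = np.argsort(-scores)[:top_k]
    return list(zip(top.tolist(), scores[top].tolist()))
```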


## Performance

We evaluated the performance of different models on the last 10,000 video clips of the WebVid dataset. R@K denotes recall at K (higher is better); MedianR and MeanR are the median and mean rank of the correct video (lower is better).

| Model | R@1 | R@5 | R@10 | MedianR | MeanR |
|-------|-----|-----|------|---------|-------|
| Zero-shot CLIP weights | 37.16 | 62.10 | 71.16 | 3.0 | 42.2128 |
| CLIP4Clip weights trained on MSR-VTT | 38.38 | 62.89 | 72.01 | 3.0 | 39.3023 |
| **CLIP4Clip trained on 150k WebVid** | 50.74 | 77.30 | 85.05 | 1.0 | 14.9535 |
| Binarized CLIP4Clip trained on 150k WebVid with rerank100 | 50.56 | 76.39 | 83.51 | 1.0 | 43.2964 |

For more details on the evaluation, see this [notebook](https://huggingface.co/Diangle/clip4clip-webvid/blob/main/Notebooks/GSI_VideoRetrieval-Evaluation.ipynb).
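For orientation, the sketch below shows one common way to compute these metrics from a text-to-video similarity matrix whose diagonal holds the ground-truth pairs; the linked notebook is the authoritative evaluation and may differ in details such as tie handling.

```python
import numpy as np


def retrieval_metrics(sim: np.ndarray) -> dict:
    """Compute R@1/5/10, median and mean rank from a (num_queries, num_videos)
    similarity matrix where sim[i, i] is the score of the ground-truth pair."""
    order = np.argsort(-sim, axis=1)                                 # best match first
    ranks = np.where(order == np.arange(len(sim))[:, None])[1] + 1   # 1-based rank of the correct video
    return {
        "R1": float(np.mean(ranks <= 1) * 100),
        "R5": float(np.mean(ranks <= 5) * 100),
        "R10": float(np.mean(ranks <= 10) * 100),
        "MedianR": float(np.median(ranks)),
        "MeanR": float(np.mean(ranks)),
    }
```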