---
tags:
- text
- vision
- video
datasets:
- HuggingFaceM4/webvid
pipeline_tag: text-to-video
---
# Model Card
## Details
This model was trained with CLIP4Clip, a CLIP-based video retrieval method described in the paper [CLIP4Clip: An Empirical Study of CLIP for End to End Video Clip Retrieval](https://arxiv.org/pdf/2104.08860.pdf) by Luo et al. and implemented in the accompanying [code](https://github.com/ArrowLuo/CLIP4Clip).
The training process involved 150,000 videos obtained from the [WebVid Dataset](https://m-bain.github.io/webvid-dataset/), a comprehensive collection of short videos with corresponding textual descriptions sourced from the web.
To make the trained CLIP model loadable through the [clip-vit-base-patch32](https://huggingface.co/openai/clip-vit-base-patch32) implementation, we have modified the weights accordingly.
### Use with Transformers
### Extracting Text Embeddings:
```python
import torch
from transformers import CLIPTokenizer, CLIPTextModelWithProjection

search_sentence = "a basketball player performing a slam dunk"

model = CLIPTextModelWithProjection.from_pretrained("Diangle/clip4clip-webvid")
tokenizer = CLIPTokenizer.from_pretrained("Diangle/clip4clip-webvid")

inputs = tokenizer(text=search_sentence, return_tensors="pt")
outputs = model(input_ids=inputs["input_ids"], attention_mask=inputs["attention_mask"])

# L2-normalize the text embedding so that dot products give cosine similarities.
final_output = outputs[0] / outputs[0].norm(dim=-1, keepdim=True)
final_output = final_output.cpu().detach().numpy()
print("final_output:", final_output)
```
### Extracting Video Embeddings:
An additional [notebook](https://huggingface.co/Diangle/clip4clip-webvid/blob/main/Notebooks/GSI_VideoRetrieval_VideoEmbedding.ipynb) is available that provides instructions on how to perform video embedding.
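As a rough orientation (the notebook above documents the exact procedure), the sketch below shows one way to embed a video with this checkpoint's vision tower: sample frames, encode them with `CLIPVisionModelWithProjection`, and mean-pool the per-frame embeddings, which is CLIP4Clip's simplest similarity variant. Frame sampling with OpenCV, the `example.mp4` path, and `num_frames=8` are illustrative assumptions, as is the availability of a preprocessor config in this repository (otherwise the `openai/clip-vit-base-patch32` processor can be substituted).
```python
import cv2
import numpy as np
import torch
from transformers import CLIPVisionModelWithProjection, CLIPImageProcessor

model = CLIPVisionModelWithProjection.from_pretrained("Diangle/clip4clip-webvid")
processor = CLIPImageProcessor.from_pretrained("Diangle/clip4clip-webvid")

def sample_frames(video_path, num_frames=8):
    """Uniformly sample RGB frames from a video file (illustrative helper)."""
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    indices = np.linspace(0, total - 1, num_frames, dtype=int)
    frames = []
    for idx in indices:
        cap.set(cv2.CAP_PROP_POS_FRAMES, int(idx))
        ok, frame = cap.read()
        if ok:
            frames.append(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
    cap.release()
    return frames

frames = sample_frames("example.mp4")  # hypothetical local video file
inputs = processor(images=frames, return_tensors="pt")
with torch.no_grad():
    outputs = model(pixel_values=inputs["pixel_values"])

# Normalize per-frame embeddings, then mean-pool and re-normalize
# to obtain a single video-level embedding.
frame_embeds = outputs.image_embeds
frame_embeds = frame_embeds / frame_embeds.norm(dim=-1, keepdim=True)
video_embed = frame_embeds.mean(dim=0)
video_embed = video_embed / video_embed.norm()
```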
## Model Intended Use
This model is intended for video retrieval; see, for example, this [**SPACE**](https://huggingface.co/spaces/Diangle/Clip4Clip-webvid).
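Retrieval itself then reduces to scoring the query's text embedding against pre-computed video embeddings. A minimal sketch, assuming both sides are L2-normalized as in the snippets above and that `video_embeddings` is a hypothetical `(N, 512)` NumPy array of indexed videos:
```python
import numpy as np

def rank_videos(text_embedding, video_embeddings, top_k=5):
    # With normalized vectors, the dot product equals the cosine similarity.
    scores = video_embeddings @ text_embedding
    order = np.argsort(-scores)[:top_k]
    return order, scores[order]

# Example usage with the text embedding computed earlier:
# indices, scores = rank_videos(final_output[0], video_embeddings)
```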
## Performance
We have evaluated the performance of different models on the last 10,000 video clips of the WebVid dataset.
| Model | R@1 | R@5 | R@10 | Median Rank | Mean Rank |
|-------|-----|-----|------|-------------|-----------|
| Zero-shot CLIP weights | 37.16 | 62.10 | 71.16 | 3.0 | 42.2128 |
| CLIP4Clip weights trained on MSR-VTT | 38.38 | 62.89 | 72.01 | 3.0 | 39.3023 |
| **CLIP4Clip trained on 150k WebVid** | 50.74 | 77.30 | 85.05 | 1.0 | 14.9535 |
| Binarized CLIP4Clip trained on 150k WebVid with rerank100 | 50.56 | 76.39 | 83.51 | 1.0 | 43.2964 |
For more information about the evaluation you can look at this [notebook](https://huggingface.co/Diangle/clip4clip-webvid/blob/main/Notebooks/GSI_VideoRetrieval-Evaluation.ipynb).
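For reference, metrics like those in the table can be computed from a text-to-video similarity matrix along these lines (a sketch only; the notebook above is authoritative). It assumes a square matrix `sims` where row `i` scores query `i` against all candidate videos and the ground-truth video shares index `i`:
```python
import numpy as np

def retrieval_metrics(sims):
    # Rank of the ground-truth video for each query (1 = best).
    order = np.argsort(-sims, axis=1)
    ranks = np.where(order == np.arange(len(sims))[:, None])[1] + 1
    return {
        "R@1": np.mean(ranks <= 1) * 100,
        "R@5": np.mean(ranks <= 5) * 100,
        "R@10": np.mean(ranks <= 10) * 100,
        "Median Rank": float(np.median(ranks)),
        "Mean Rank": float(np.mean(ranks)),
    }
```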