Diangle committed on
Commit 331c19e · 1 Parent(s): 9d47939

Update README.md

Files changed (1)
  1. README.md +14 -22
README.md CHANGED
@@ -10,56 +10,48 @@ pipeline_tag: text-to-video
 
 # Model Card
 ## Details
- This model underwent training using CLIP4Clip, a video retrieval method based on the CLIP framework, as described in the paper [here](https://arxiv.org/pdf/2104.08860.pdf) and implemented in the accompanying [code](https://github.com/ArrowLuo/CLIP4Clip).
 
 The training process involved 150,000 videos obtained from the [WebVid Dataset](https://m-bain.github.io/webvid-dataset/), a comprehensive collection of short videos with corresponding textual descriptions sourced from the web.
 
- To adapt the clip model obtained during training, we adjusted the weights and integrated them into the implementation of [clip-vit-base-patch32](https://huggingface.co/openai/clip-vit-base-patch32), making certain modifications to the final layers.
 
 ### Use with Transformers
 
 ```python
 import numpy as np
 import torch
- from transformers import AutoTokenizer, CLIPTextModelWithProjection
 
 search_sentence = "a basketball player performing a slam dunk"
 
 model = CLIPTextModelWithProjection.from_pretrained("Diangle/clip4clip-webvid")
- tokenizer = AutoTokenizer.from_pretrained("Diangle/clip4clip-webvid")
 
- inputs = tokenizer(text=search_sentence , return_tensors="pt", padding=True)
- outputs = model(input_ids=inputs["input_ids"], attention_mask=inputs["attention_mask"], return_dict=False)
-
- # Special projection and changing last layers:
- text_projection = model.state_dict()['text_projection.weight']
- text_embeds = outputs[1] @ text_projection
- final_output = text_embeds[torch.arange(text_embeds.shape[0]), inputs["input_ids"].argmax(dim=-1)]
 
 # Normalizing the embeddings:
- final_output = final_output / final_output.norm(dim=-1, keepdim=True)
 final_output = final_output.cpu().detach().numpy()
- sequence_output = final_output / np.sum(final_output**2, axis=1, keepdims=True)
 print("sequence_output: ", sequence_output)
 ```
 
- ## Model Use
-
- ### Intended Use
-
- This model is intended to use for video retrival, look for example this [**space**](https://huggingface.co/spaces/Diangle/Clip4Clip-webvid).
-
- ### Extra Information
-
 For video embedding there is an extra [notebook](https://huggingface.co/Diangle/clip4clip-webvid/blob/main/Notebooks/GSI_VideoRetrieval_EmbedVideos.ipynb) that describes how to embed videos.
 
 ## Performance and Limitations
 
 ### Performance
 
- We have evaluated the performance of differnet models on the last 10k video clips from Webvid database.
 
 | Model | R1 | R5 | R10 | MedianR | MeanR
 |------------------------|-------|-------|-------|-----|---------|
 
 
 # Model Card
 ## Details
+ This model underwent training using CLIP4Clip, a video retrieval method based on the CLIP framework, as described in the paper [CLIP4Clip: An Empirical Study of CLIP for End to End Video Clip Retrieval](https://arxiv.org/pdf/2104.08860.pdf) by Luo et al., and implemented in the accompanying [code](https://github.com/ArrowLuo/CLIP4Clip).
 
 The training process involved 150,000 videos obtained from the [WebVid Dataset](https://m-bain.github.io/webvid-dataset/), a comprehensive collection of short videos with corresponding textual descriptions sourced from the web.
 
+ To adapt the CLIP model obtained during training, we adjusted its weights and integrated them into the [clip-vit-base-patch32](https://huggingface.co/openai/clip-vit-base-patch32) implementation.
 
 ### Use with Transformers
+ ### Extracting Text Embeddings
 
 ```python
 import numpy as np
 import torch
+ from transformers import CLIPTokenizer, CLIPTextModelWithProjection
 
 search_sentence = "a basketball player performing a slam dunk"
 
 model = CLIPTextModelWithProjection.from_pretrained("Diangle/clip4clip-webvid")
+ tokenizer = CLIPTokenizer.from_pretrained("Diangle/clip4clip-webvid")
 
+ inputs = tokenizer(text=search_sentence, return_tensors="pt")
+ outputs = model(input_ids=inputs["input_ids"], attention_mask=inputs["attention_mask"])
 
 # Normalizing the embeddings:
+ final_output = outputs[0] / outputs[0].norm(dim=-1, keepdim=True)
 final_output = final_output.cpu().detach().numpy()
 print("final_output: ", final_output)
 ```
 
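The resulting `final_output` is a single L2-normalized text embedding (512-dimensional, an assumption based on the projection size of the underlying clip-vit-base-patch32 configuration) that can be compared against video embeddings with a dot product.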
+ ### Extracting Video Embeddings
 
  For video embedding there is an extra [notebook](https://huggingface.co/Diangle/clip4clip-webvid/blob/main/Notebooks/GSI_VideoRetrieval_EmbedVideos.ipynb) that describes how to embed videos.
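
A minimal illustrative sketch is shown below. It assumes that the checkpoint's vision tower can be loaded with `CLIPVisionModelWithProjection` and `CLIPImageProcessor` (as for a standard clip-vit-base-patch32 checkpoint) and that frame embeddings are mean-pooled as in CLIP4Clip's mean-pooling setting; the notebook above is the authoritative reference, and the frame-sampling step here is only a stand-in.

```python
import numpy as np
import torch
from transformers import CLIPImageProcessor, CLIPVisionModelWithProjection

# Assumption: the repo exposes CLIP-compatible vision weights and an image processor config.
vision_model = CLIPVisionModelWithProjection.from_pretrained("Diangle/clip4clip-webvid")
processor = CLIPImageProcessor.from_pretrained("Diangle/clip4clip-webvid")

# Stand-in for frames sampled from a real video (e.g. with decord or OpenCV):
# a list of HxWx3 uint8 RGB arrays.
frames = [np.random.randint(0, 256, (360, 640, 3), dtype=np.uint8) for _ in range(8)]

inputs = processor(images=frames, return_tensors="pt")
with torch.no_grad():
    frame_embeds = vision_model(**inputs).image_embeds  # (num_frames, embed_dim)

# Normalize each frame embedding, mean-pool over frames, then re-normalize.
frame_embeds = frame_embeds / frame_embeds.norm(dim=-1, keepdim=True)
video_embedding = frame_embeds.mean(dim=0)
video_embedding = video_embedding / video_embedding.norm()
print(video_embedding.shape)
```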
 
+ ## Model Intended Use
+
+ This model is intended to be used for video retrieval; see for example this [**Space**](https://huggingface.co/spaces/Diangle/Clip4Clip-webvid).
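
Because both the text and the video embeddings are L2-normalized, retrieval reduces to a dot-product (cosine) ranking. A minimal sketch follows, assuming a matrix of video embeddings has already been computed offline (e.g. with the notebook above); the array shapes and the 512-dimensional embedding size are assumptions based on the ViT-B/32 configuration, not code from this repo:

```python
import numpy as np

# Stand-ins: a unit-norm query embedding and unit-norm video embeddings.
rng = np.random.default_rng(0)
text_embedding = rng.standard_normal(512).astype(np.float32)
text_embedding /= np.linalg.norm(text_embedding)
video_embeddings = rng.standard_normal((1000, 512)).astype(np.float32)
video_embeddings /= np.linalg.norm(video_embeddings, axis=1, keepdims=True)

# Cosine similarity is a plain dot product for unit-norm vectors.
similarities = video_embeddings @ text_embedding
top10 = np.argsort(-similarities)[:10]  # indices of the 10 best-matching videos
print(top10, similarities[top10])
```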
 
 ## Performance and Limitations
 
 ### Performance
 
+ We have evaluated the performance of different models on the last 10k video clips from the WebVid dataset.
 
 | Model | R1 | R5 | R10 | MedianR | MeanR
  |------------------------|-------|-------|-------|-----|---------|