Diangle committed on
Commit 331c19e · 1 Parent(s): 9d47939

Update README.md

Files changed (1)
  1. README.md +14 -22
README.md CHANGED
@@ -10,56 +10,48 @@ pipeline_tag: text-to-video
 
 # Model Card
 ## Details
- This model underwent training using CLIP4Clip, a video retrieval method based on the CLIP framework, as described in the paper [here](https://arxiv.org/pdf/2104.08860.pdf) and implemented in the accompanying [code](https://github.com/ArrowLuo/CLIP4Clip).
 
 The training process involved 150,000 videos obtained from the [WebVid Dataset](https://m-bain.github.io/webvid-dataset/), a comprehensive collection of short videos with corresponding textual descriptions sourced from the web.
 
- To adapt the clip model obtained during training, we adjusted the weights and integrated them into the implementation of [clip-vit-base-patch32](https://huggingface.co/openai/clip-vit-base-patch32), making certain modifications to the final layers.
 
 ### Use with Transformers
 
 ```python
 import numpy as np
 import torch
- from transformers import AutoTokenizer, CLIPTextModelWithProjection
 
 search_sentence = "a basketball player performing a slam dunk"
 
 model = CLIPTextModelWithProjection.from_pretrained("Diangle/clip4clip-webvid")
- tokenizer = AutoTokenizer.from_pretrained("Diangle/clip4clip-webvid")
 
- inputs = tokenizer(text=search_sentence , return_tensors="pt", padding=True)
- outputs = model(input_ids=inputs["input_ids"], attention_mask=inputs["attention_mask"], return_dict=False)
-
- # Special projection and changing last layers:
- text_projection = model.state_dict()['text_projection.weight']
- text_embeds = outputs[1] @ text_projection
- final_output = text_embeds[torch.arange(text_embeds.shape[0]), inputs["input_ids"].argmax(dim=-1)]
 
 # Normalizing the embeddings:
- final_output = final_output / final_output.norm(dim=-1, keepdim=True)
 final_output = final_output.cpu().detach().numpy()
- sequence_output = final_output / np.sum(final_output**2, axis=1, keepdims=True)
 print("sequence_output: ", sequence_output)
 ```
 
- ## Model Use
-
- ### Intended Use
-
- This model is intended to use for video retrival, look for example this [**space**](https://huggingface.co/spaces/Diangle/Clip4Clip-webvid).
-
- ### Extra Information
-
 For video embedding there is an extra [notebook](https://huggingface.co/Diangle/clip4clip-webvid/blob/main/Notebooks/GSI_VideoRetrieval_EmbedVideos.ipynb) that describes how to embed videos.
 
 ## Performance and Limitations
 
 ### Performance
 
- We have evaluated the performance of differnet models on the last 10k video clips from Webvid database.
 
 | Model | R1 | R5 | R10 | MedianR | MeanR
 |------------------------|-------|-------|-------|-----|---------|
 
 
 # Model Card
 ## Details
+ This model underwent training using CLIP4Clip, a video retrieval method based on the CLIP framework, as described in the paper [CLIP4Clip: An Empirical Study of CLIP for End to End Video Clip Retrieval](https://arxiv.org/pdf/2104.08860.pdf) by Luo et al., and implemented in the accompanying [code](https://github.com/ArrowLuo/CLIP4Clip).
 
 The training process involved 150,000 videos obtained from the [WebVid Dataset](https://m-bain.github.io/webvid-dataset/), a comprehensive collection of short videos with corresponding textual descriptions sourced from the web.
 
+ To adapt the CLIP model obtained during training, we adjusted its weights and integrated them into the [clip-vit-base-patch32](https://huggingface.co/openai/clip-vit-base-patch32) implementation.
 
 ### Use with Transformers
+ ### Extracting Text Embeddings
 
 ```python
 import numpy as np
 import torch
+ from transformers import CLIPTokenizer, CLIPTextModelWithProjection
 
 search_sentence = "a basketball player performing a slam dunk"
 
 model = CLIPTextModelWithProjection.from_pretrained("Diangle/clip4clip-webvid")
+ tokenizer = CLIPTokenizer.from_pretrained("Diangle/clip4clip-webvid")
 
+ inputs = tokenizer(text=search_sentence, return_tensors="pt")
+ outputs = model(input_ids=inputs["input_ids"], attention_mask=inputs["attention_mask"])
 
 # Normalizing the embeddings:
+ final_output = outputs[0] / outputs[0].norm(dim=-1, keepdim=True)
 final_output = final_output.cpu().detach().numpy()
 print("final_output: ", final_output)
 ```
 
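The resulting `final_output` is a single L2-normalized text embedding (512-dimensional, an assumption based on the projection size of the underlying clip-vit-base-patch32 configuration) that can be compared against video embeddings with a dot product.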
+ ### Extracting Video Embeddings
 
  For video embedding there is an extra [notebook](https://huggingface.co/Diangle/clip4clip-webvid/blob/main/Notebooks/GSI_VideoRetrieval_EmbedVideos.ipynb) that describes how to embed videos.
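
A minimal illustrative sketch is shown below. It assumes that the checkpoint's vision tower can be loaded with `CLIPVisionModelWithProjection` and `CLIPImageProcessor` (as for a standard clip-vit-base-patch32 checkpoint) and that frame embeddings are mean-pooled as in CLIP4Clip's mean-pooling setting; the notebook above is the authoritative reference, and the frame-sampling step here is only a stand-in.

```python
import numpy as np
import torch
from transformers import CLIPImageProcessor, CLIPVisionModelWithProjection

# Assumption: the repo exposes CLIP-compatible vision weights and an image processor config.
vision_model = CLIPVisionModelWithProjection.from_pretrained("Diangle/clip4clip-webvid")
processor = CLIPImageProcessor.from_pretrained("Diangle/clip4clip-webvid")

# Stand-in for frames sampled from a real video (e.g. with decord or OpenCV):
# a list of HxWx3 uint8 RGB arrays.
frames = [np.random.randint(0, 256, (360, 640, 3), dtype=np.uint8) for _ in range(8)]

inputs = processor(images=frames, return_tensors="pt")
with torch.no_grad():
    frame_embeds = vision_model(**inputs).image_embeds  # (num_frames, embed_dim)

# Normalize each frame embedding, mean-pool over frames, then re-normalize.
frame_embeds = frame_embeds / frame_embeds.norm(dim=-1, keepdim=True)
video_embedding = frame_embeds.mean(dim=0)
video_embedding = video_embedding / video_embedding.norm()
print(video_embedding.shape)
```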
 
+ ## Model Intended Use
+
+ This model is intended to be used for video retrieval; see for example this [**Space**](https://huggingface.co/spaces/Diangle/Clip4Clip-webvid).
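
Because both the text and the video embeddings are L2-normalized, retrieval reduces to a dot-product (cosine) ranking. A minimal sketch follows, assuming a matrix of video embeddings has already been computed offline (e.g. with the notebook above); the array shapes and the 512-dimensional embedding size are assumptions based on the ViT-B/32 configuration, not code from this repo:

```python
import numpy as np

# Stand-ins: a unit-norm query embedding and unit-norm video embeddings.
rng = np.random.default_rng(0)
text_embedding = rng.standard_normal(512).astype(np.float32)
text_embedding /= np.linalg.norm(text_embedding)
video_embeddings = rng.standard_normal((1000, 512)).astype(np.float32)
video_embeddings /= np.linalg.norm(video_embeddings, axis=1, keepdims=True)

# Cosine similarity is a plain dot product for unit-norm vectors.
similarities = video_embeddings @ text_embedding
top10 = np.argsort(-similarities)[:10]  # indices of the 10 best-matching videos
print(top10, similarities[top10])
```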
 
 ## Performance and Limitations
 
 ### Performance
 
+ We have evaluated the performance of different models on the last 10k video clips from the WebVid dataset.
 
 | Model | R1 | R5 | R10 | MedianR | MeanR
  |------------------------|-------|-------|-------|-----|---------|