Commit eec132e (verified), committed by tttoaster · 1 parent: 4894f0e

Update README.md

Files changed (1)
  1. README.md +2 -2
README.md CHANGED
@@ -3,11 +3,11 @@ license: apache-2.0
  ---

  # Divot: Diffusion Powers Video Tokenizer for Comprehension and Generation
-
+ [![arXiv](https://img.shields.io/badge/arXiv-2404.14396-b31b1b.svg)](https://arxiv.org/abs/2412.04432)
  [![Static Badge](https://img.shields.io/badge/Github-black)](https://github.com/TencentARC/Divot)


- >We introduce Divot, a **Di**ffusion-Powered **V**ide**o** **T**okenizer, which leverages the diffusion process for self-supervised video representation learning. We posit that if a video diffusion model can effectively de-noise video clips by taking the features of a video tokenizer as the condition, then the tokenizer has successfully captured robust spatial and temporal information. Additionally, the video diffusion model inherently functions as a de-tokenizer, decoding videos from their representations.
+ >We introduce [Divot](https://arxiv.org/abs/2412.04432), a **Di**ffusion-Powered **V**ide**o** **T**okenizer, which leverages the diffusion process for self-supervised video representation learning. We posit that if a video diffusion model can effectively de-noise video clips by taking the features of a video tokenizer as the condition, then the tokenizer has successfully captured robust spatial and temporal information. Additionally, the video diffusion model inherently functions as a de-tokenizer, decoding videos from their representations.
  Building upon the Divot tokenizer, we present **Divot-LLM** through video-to-text autoregression and text-to-video generation by modeling the distributions of continuous-valued Divot features with a Gaussian Mixture Model.

  All models, training code and inference code are released!
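
The README mentions modeling continuous-valued features with a Gaussian Mixture Model. As a rough illustration of that general idea only (this is not the Divot-LLM code, and every name and number below is hypothetical), the following sketch evaluates the negative log-likelihood of a feature vector under a diagonal-covariance mixture and draws a sample from it. In a real system the mixture parameters would presumably be predicted by a network; here they are fixed toy values.

```python
import math
import random


def gmm_log_likelihood(x, weights, means, variances):
    """Log-density of vector x under a diagonal-covariance Gaussian mixture."""
    component_logs = []
    for w, mu, var in zip(weights, means, variances):
        # Log-density of one diagonal Gaussian, summed over dimensions.
        log_gauss = sum(
            -0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
            for xi, m, v in zip(x, mu, var)
        )
        component_logs.append(math.log(w) + log_gauss)
    # Log-sum-exp over components for numerical stability.
    peak = max(component_logs)
    return peak + math.log(sum(math.exp(c - peak) for c in component_logs))


def gmm_sample(weights, means, variances, rng):
    """Draw one vector: pick a component by weight, then sample its Gaussian."""
    k = rng.choices(range(len(weights)), weights=weights, k=1)[0]
    return [rng.gauss(m, math.sqrt(v)) for m, v in zip(means[k], variances[k])]


# Toy 2-component mixture over 2-dimensional "features" (assumed values).
weights = [0.3, 0.7]
means = [[0.0, 0.0], [2.0, -1.0]]
variances = [[1.0, 1.0], [0.5, 0.5]]

rng = random.Random(0)
feature = gmm_sample(weights, means, variances, rng)
# Negative log-likelihood, the kind of quantity one would minimize in training.
nll = -gmm_log_likelihood(feature, weights, means, variances)
print(feature, nll)
```

Minimizing the negative log-likelihood fits the mixture to observed features, while sampling corresponds to generating a new continuous feature vector that a de-tokenizer could then decode.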