VideoPrism is a new video encoder that improves video understanding through a unique training strategy, using a vast dataset (36 million high-quality video-caption pairs and 582 million video clips) for comprehensive learning.
Key points: * It employs a two-stage training approach, initially aligning video and text encoders, followed by an enhanced video-only masked autoencoding process to learn appearance and motion. * It achieves superior performance in a wide array of tasks, such as general video understanding, zero-shot video-text retrieval, video captioning, QA, and computer vision for science, having top performance on 30 out of 33 benchmarks.