-
iVideoGPT: Interactive VideoGPTs are Scalable World Models
Paper • 2405.15223 • Published • 12 -
Meteor: Mamba-based Traversal of Rationale for Large Language and Vision Models
Paper • 2405.15574 • Published • 53 -
An Introduction to Vision-Language Modeling
Paper • 2405.17247 • Published • 87 -
Matryoshka Multimodal Models
Paper • 2405.17430 • Published • 31
Collections
Discover the best community collections!
Collections including paper arxiv:2501.03218
-
2.5 Years in Class: A Multimodal Textbook for Vision-Language Pretraining
Paper • 2501.00958 • Published • 90 -
CodeElo: Benchmarking Competition-level Code Generation of LLMs with Human-comparable Elo Ratings
Paper • 2501.01257 • Published • 44 -
Reconstruction vs. Generation: Taming Optimization Dilemma in Latent Diffusion Models
Paper • 2501.01423 • Published • 34 -
REDUCIO! Generating 1024times1024 Video within 16 Seconds using Extremely Compressed Motion Latents
Paper • 2411.13552 • Published
-
StreamChat: Chatting with Streaming Video
Paper • 2412.08646 • Published • 18 -
Divot: Diffusion Powers Video Tokenizer for Comprehension and Generation
Paper • 2412.04432 • Published • 14 -
VISTA: Enhancing Long-Duration and High-Resolution Video Understanding by Video Spatiotemporal Augmentation
Paper • 2412.00927 • Published • 26 -
InternLM-XComposer2.5-OmniLive: A Comprehensive Multimodal System for Long-term Streaming Video and Audio Interactions
Paper • 2412.09596 • Published • 92
-
Self-Play Fine-Tuning Converts Weak Language Models to Strong Language Models
Paper • 2401.01335 • Published • 64 -
Lumiere: A Space-Time Diffusion Model for Video Generation
Paper • 2401.12945 • Published • 86 -
Adding NVMe SSDs to Enable and Accelerate 100B Model Fine-tuning on a Single GPU
Paper • 2403.06504 • Published • 53 -
Transformer-Lite: High-efficiency Deployment of Large Language Models on Mobile Phone GPUs
Paper • 2403.20041 • Published • 34