Collections including paper arxiv:2412.10360

Papers - Video - Understanding

Video Mamba Suite: State Space Model as a Versatile Alternative for Video Understanding

Paper • 2403.09626 • Published Mar 14 • 13
VideoAgent: Long-form Video Understanding with Large Language Model as Agent

Paper • 2403.10517 • Published Mar 15 • 32
VSTAR: Generative Temporal Nursing for Longer Dynamic Video Synthesis

Paper • 2403.13501 • Published Mar 20 • 9
LITA: Language Instructed Temporal-Localization Assistant

Paper • 2403.19046 • Published Mar 27 • 18

Papers - Multimodal

TinyLLaVA: A Framework of Small-scale Large Multimodal Models

Paper • 2402.14289 • Published Feb 22 • 19
ImageBind: One Embedding Space To Bind Them All

Paper • 2305.05665 • Published May 9, 2023 • 5
DocLLM: A layout-aware generative language model for multimodal document understanding

Paper • 2401.00908 • Published Dec 31, 2023 • 181
Multimodal Contrastive Learning with LIMoE: the Language-Image Mixture of Experts

Paper • 2206.02770 • Published Jun 6, 2022 • 3

Video as the New Language for Real-World Decision Making

Paper • 2402.17139 • Published Feb 27 • 18
VideoCrafter1: Open Diffusion Models for High-Quality Video Generation

Paper • 2310.19512 • Published Oct 30, 2023 • 15
VideoMamba: State Space Model for Efficient Video Understanding

Paper • 2403.06977 • Published Mar 11 • 27
VideoCrafter2: Overcoming Data Limitations for High-Quality Video Diffusion Models

Paper • 2401.09047 • Published Jan 17 • 13

Modality: Video

LAVE: LLM-Powered Agent Assistance and Language Augmentation for Video Editing

Paper • 2402.10294 • Published Feb 15 • 24
Genie: Generative Interactive Environments

Paper • 2402.15391 • Published Feb 23 • 70
Apollo: An Exploration of Video Understanding in Large Multimodal Models

Paper • 2412.10360 • Published 12 days ago • 131
GenEx: Generating an Explorable World

Paper • 2412.09624 • Published 13 days ago • 84

A collection of Audio, Video and Visual LLMs.

myshell-ai/OpenVoice

Text-to-Speech • Updated about 23 hours ago • 404
OpenVoice

Running • 1k
dataautogpt3/ProteusV0.3

Text-to-Image • Updated Feb 12 • 113k • 93
ByteDance/SDXL-Lightning

Text-to-Image • Updated Apr 3 • 130k • 1.95k

DocLLM: A layout-aware generative language model for multimodal document understanding

Paper • 2401.00908 • Published Dec 31, 2023 • 181
COSMO: COntrastive Streamlined MultimOdal Model with Interleaved Pre-Training

Paper • 2401.00849 • Published Jan 1 • 15
LLaVA-Plus: Learning to Use Tools for Creating Multimodal Agents

Paper • 2311.05437 • Published Nov 9, 2023 • 48
LLaVA-Interactive: An All-in-One Demo for Image Chat, Segmentation, Generation and Editing

Paper • 2311.00571 • Published Nov 1, 2023 • 41

Gemini: A Family of Highly Capable Multimodal Models

Paper • 2312.11805 • Published Dec 19, 2023 • 44
Unlocking Pre-trained Image Backbones for Semantic Image Synthesis

Paper • 2312.13314 • Published Dec 20, 2023 • 7
LLM in a flash: Efficient Large Language Model Inference with Limited Memory

Paper • 2312.11514 • Published Dec 12, 2023 • 257
Amphion: An Open-Source Audio, Music and Speech Generation Toolkit

Paper • 2312.09911 • Published Dec 15, 2023 • 53

A Picture is Worth More Than 77 Text Tokens: Evaluating CLIP-Style Models on Dense Captions

Paper • 2312.08578 • Published Dec 14, 2023 • 16
ZeroQuant(4+2): Redefining LLMs Quantization with a New FP6-Centric Strategy for Diverse Generative Tasks

Paper • 2312.08583 • Published Dec 14, 2023 • 9
Vision-Language Models as a Source of Rewards

Paper • 2312.09187 • Published Dec 14, 2023 • 11
StemGen: A music generation model that listens

Paper • 2312.08723 • Published Dec 14, 2023 • 47
