@merve on Hugging Face: "NVIDIA just dropped a gigantic multimodal model called NVLM 72B 🦖…"

NVIDIA just dropped a gigantic multimodal model called NVLM 72B 🦖
Model: nvidia/NVLM-D-72B
Paper: NVLM: Open Frontier-Class Multimodal LLMs (2409.11402)

The paper contains many ablation studies on various ways to use the LLM backbone 👇🏻

🦩 Flamingo-like cross-attention (NVLM-X)
🌋 LLaVA-like concatenation of image and text embeddings fed to a decoder-only model (NVLM-D)
✨ a hybrid architecture (NVLM-H)
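To make the difference between the first two designs concrete, here is a toy NumPy sketch. It is not NVIDIA's implementation: the sizes, the random "embeddings", and the function names `concat_fusion` and `cross_attention_fusion` are all made up for illustration, and the learned projections, gating, and multi-head structure of a real model are left out.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8                                   # toy hidden size (assumption)
image_tokens = rng.normal(size=(4, d))  # 4 image tokens, already projected to size d
text_tokens = rng.normal(size=(3, d))   # 3 text token embeddings


def concat_fusion(image_tokens, text_tokens):
    """Decoder-only style: image tokens simply join the input sequence."""
    return np.concatenate([image_tokens, text_tokens], axis=0)


def cross_attention_fusion(image_tokens, text_tokens):
    """Cross-attention style: text queries attend to image keys/values."""
    scores = text_tokens @ image_tokens.T / np.sqrt(text_tokens.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    # Residual connection: the text sequence keeps its length.
    return text_tokens + weights @ image_tokens


seq = concat_fusion(image_tokens, text_tokens)
fused = cross_attention_fusion(image_tokens, text_tokens)
print(seq.shape)    # (7, 8): sequence grows by the number of image tokens
print(fused.shape)  # (3, 8): sequence length unchanged
```

The practical trade-off the sketch hints at: concatenation makes the decoder's sequence longer, while cross-attention keeps the text sequence length fixed and reads image features on the side.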

In the paper's evaluations, NVLM-D and NVLM-H come out best or second best against the other models compared 👏

The released model is NVLM-D, built on Qwen2-72B-Instruct and aligned with InternViT-6B using a huge mixture of different datasets

You can easily use this model by loading it through transformers' AutoModel 😍
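A minimal loading sketch, assuming a recent `transformers` and `torch` install and enough GPU/CPU memory for a 72B checkpoint. The helper name `load_nvlm` is hypothetical, and the choice of bfloat16 is an assumption; `trust_remote_code=True` is there because the repository ships its own modeling code. Check the model card for the recommended settings and the inference recipe.

```python
MODEL_ID = "nvidia/NVLM-D-72B"


def load_nvlm(model_id: str = MODEL_ID):
    """Load the checkpoint through AutoModel (downloads very large weights)."""
    # Imported lazily so defining this helper does not require torch/transformers.
    import torch
    from transformers import AutoModel

    model = AutoModel.from_pretrained(
        model_id,
        torch_dtype=torch.bfloat16,  # assumption: half precision to reduce memory
        low_cpu_mem_usage=True,      # avoid materializing a full extra copy on CPU
        trust_remote_code=True,      # the repo provides custom model code
    )
    return model.eval()


# Usage (not run here, since it fetches the full checkpoint):
# model = load_nvlm()
```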
