Penguin-VL

Exploring the Efficiency Limits of VLM with LLM-based Vision Encoders

GitHub with detailed usage: tencent-ailab/Penguin-VL

📰 News

2025.03 — PenguinVL-Encoder now available for general use.
2025.03 — Released PenguinVL-2B, PenguinVL-8B.

🌟 Model Overview

PenguinVL is a compact vision-language model (VLM) designed to explore the efficiency limits of small-scale VLMs. It is not just an instruction-tuned variant of an existing model: it is built from the ground up in three stages, namely LLM-based vision encoder construction, multimodal pretraining, and instruction tuning.

Unlike most existing VLMs, which rely on contrastive-pretrained vision encoders (e.g., CLIP/SigLIP), PenguinVL initializes its vision encoder directly from a text-only LLM. This design avoids the objective mismatch between contrastive learning and autoregressive language modeling, enabling tighter alignment between visual representations and the language backbone.

Key Characteristics

🧠 LLM-based Vision Encoder
The vision encoder is adapted from a pretrained text LLM (Qwen3-0.6B), modified with bidirectional attention and 2D-RoPE for spatial modeling.
Initializing from an LLM provides strong semantic priors and native compatibility with the downstream LLM.
🎥 Efficient Video Understanding
A Temporal Redundancy-Aware (TRA) token compression strategy dynamically allocates token budgets across frames, enabling long-video reasoning within a limited context window.
🏗 Unified Architecture
The model consists of:
1. LLM-initialized vision encoder
2. Lightweight MLP projector
3. Qwen3 language backbone
📊 Compact but Strong
At the 8B scale, Penguin-VL is competitive with similarly sized models such as Qwen3-VL 8B and InternVL3.5 8B across image, document, OCR, math, and video benchmarks (see Main Results), while remaining deployment-friendly.
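
As a rough illustration of the two encoder modifications named above, the sketch below shows a generic 2D-RoPE (half of each head vector rotated by the patch's row index, the other half by its column index) and the difference between a causal and a bidirectional attention mask. This is not Penguin-VL's actual code: the half-and-half split, the interleaved pairing, the `base=10000.0` frequency, and all function names are assumptions chosen for clarity.

```python
import math


def rope_angles(pos, dim, base=10000.0):
    """Rotation angles for one position over dim // 2 frequency pairs."""
    return [pos / (base ** (2 * i / dim)) for i in range(dim // 2)]


def rotate_pairs(vec, angles):
    """Rotate consecutive (x, y) pairs of vec by the matching angle."""
    out = []
    for i, theta in enumerate(angles):
        x, y = vec[2 * i], vec[2 * i + 1]
        out.append(x * math.cos(theta) - y * math.sin(theta))
        out.append(x * math.sin(theta) + y * math.cos(theta))
    return out


def rope_2d(vec, row, col, base=10000.0):
    """Assumed 2D-RoPE: first half encodes the row, second half the column."""
    half = len(vec) // 2
    if len(vec) % 2 or half % 2:
        raise ValueError("vector length must be a multiple of 4")
    return (rotate_pairs(vec[:half], rope_angles(row, half, base))
            + rotate_pairs(vec[half:], rope_angles(col, half, base)))


def attention_mask(n, bidirectional):
    """1 where query i may attend to key j; causal masks out j > i."""
    return [[1 if bidirectional or j <= i else 0 for j in range(n)]
            for i in range(n)]


def dot(a, b):
    """Plain dot product, used here as the unnormalized attention score."""
    return sum(x * y for x, y in zip(a, b))
```

With this formulation, the score `dot(rope_2d(q, r1, c1), rope_2d(k, r2, c2))` depends only on the offsets `r1 - r2` and `c1 - c2`, so patches at (1, 2) and (3, 5) score the same as patches at (4, 0) and (6, 3). A bidirectional mask lets every patch attend to every other patch, which suits images, where no patch comes "before" another.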

🧪 Quick Start — Transformers Inference

import torch
from transformers import AutoModelForCausalLM, AutoProcessor

model_name = "tencent/Penguin-VL-8B"

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    trust_remote_code=True,
    device_map="auto",
    torch_dtype=torch.bfloat16,
)

processor = AutoProcessor.from_pretrained(model_name, trust_remote_code=True)

# Example: Image + Text
inputs = processor(
    conversation=[
        {"role": "system", "content": "You are a helpful assistant."},
        {
            "role": "user",
            "content": [
                {"type": "image", "image": {"image_path": "assets/example.jpg"}},
                {"type": "text", "text": "Describe this image."}
            ],
        },
    ],
    return_tensors="pt",
)


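# Move tensor inputs to the model's device instead of hard-coding "cuda",
# since device_map="auto" decides where the weights are placed.
inputs = {k: v.to(model.device) for k, v in inputs.items() if isinstance(v, torch.Tensor)}

# Generate up to 128 new tokens, then decode the first returned sequence.
output_ids = model.generate(**inputs, max_new_tokens=128)
response = processor.decode(output_ids[0], skip_special_tokens=True)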

print(response)
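
The conversation structure in the Quick Start can also be built programmatically. The helper below is a hypothetical convenience function (`build_conversation` is not part of the Penguin-VL API); it reuses only the message schema shown above. Passing more than one image path assumes the processor accepts several image entries in a single user turn, which this README does not confirm.

```python
def build_conversation(image_paths, prompt,
                       system_prompt="You are a helpful assistant."):
    """Build a conversation list in the message schema used by the Quick Start.

    Hypothetical helper: each image path becomes one image entry, followed by
    a single text entry holding the prompt.
    """
    content = [
        {"type": "image", "image": {"image_path": path}}
        for path in image_paths
    ]
    content.append({"type": "text", "text": prompt})
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": content},
    ]


# Reproduces the single-image conversation from the Quick Start.
conversation = build_conversation(["assets/example.jpg"], "Describe this image.")
```

The resulting list can be passed as `conversation=conversation` to the `processor(...)` call shown above.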

🌎 Model Zoo

Model	Base Model	HF Link
PenguinVL-8B	Qwen3-8B	tencent/Penguin-VL-8B
PenguinVL-2B	Qwen3-1.7B	tencent/Penguin-VL-2B
PenguinVL-Encoder	Qwen3-0.6B	tencent/Penguin-Encoder

🚀 Main Results

Chart / OCR / Document Understanding

Benchmark	Penguin-VL 8B	Qwen3-VL 8B	InternVL3.5 8B	OpenAI GPT-5 nano
InfoVQA	86.8	83.1	79.1	49.2
ChartQA	90.5	89.6	86.7	48.6
DocVQA	96.2	96.1	92.3	78.3
CharXiv (DQ / RQ)	75.7 / 40.0	83.0 / 46.4	72.2 / 44.4	64.4 / 31.7
OCRBench	852	896	840	701

General Knowledge / Multi-Image / Math Reasoning

Benchmark	Penguin-VL 8B	Qwen3-VL 8B	InternVL3.5 8B	OpenAI GPT-5 nano
AI2D	86.1	85.7	84.0	65.7
RealWorldQA	75.8	71.5	67.5	60.7
V-star	90.2	90.1	70.7	63.4
MMMU-Pro	40.2	55.9	39.7	36.5
BLINK	58.2	69.1	59.5	42.2
MathVista	77.4	77.2	74.2	40.9
MathVerse	50.8	62.1	55.8	27.0
LogicVista	53.8	55.3	57.3	40.5

Video Understanding

Benchmark	Penguin-VL 8B	Qwen3-VL 8B	InternVL3.5 8B	OpenAI GPT-5 nano
MVBench	71.7	68.7	72.1	52.9
LongVideoBench	67.0	62.6	62.1	38.1
VideoMME	66.2	71.4	66.0	49.4
EgoSchema	67.0	70.2	61.0	34.8
MMVU	53.9	58.7	51.5	51.0
Charades-STA	61.4	56.0	32.8	5.0
NExT-QA	85.4	82.3	81.3	59.3
ActivityNet-QA	65.2	63.7	60.1	–
Perception Test	78.0	72.7	72.7	–