arxiv:2504.07089

OmniCaptioner: One Captioner to Rule Them All

Published on Apr 9 · Submitted by yeeeeeyy on Apr 10

Authors: Yiting Lu, Jiakang Yuan, Zhen Li, Shitian Zhao, Qi Qin, Xinyue Li, Licheng Wen, Dongyang Liu, Xiangchao Yan, Botian Shi, Zhibo Chen, Bo Zhang

Abstract

We propose OmniCaptioner, a versatile visual captioning framework for generating fine-grained textual descriptions across a wide variety of visual domains. Unlike prior methods limited to specific image types (e.g., natural images or geometric visuals), our framework provides a unified solution for captioning natural images, visual text (e.g., posters, UIs, textbooks), and structured visuals (e.g., documents, tables, charts). By converting low-level pixel information into semantically rich textual representations, our framework bridges the gap between visual and textual modalities. Our results highlight three key advantages: (i) Enhanced Visual Reasoning with LLMs, where long-context captions of visual modalities empower LLMs, particularly the DeepSeek-R1 series, to reason effectively in multimodal scenarios; (ii) Improved Image Generation, where detailed captions improve tasks like text-to-image generation and image transformation; and (iii) Efficient Supervised Fine-Tuning (SFT), which enables faster convergence with less data. We believe the versatility and adaptability of OmniCaptioner can offer a new perspective for bridging the gap between language and visual modalities.

Community

yeeeeeyy (Paper author, Paper submitter), 4 days ago

We propose OmniCaptioner, a versatile visual captioning framework for generating fine-grained textual descriptions across a wide variety of visual domains. Unlike prior methods limited to specific image types (e.g., natural images or geometric visuals), our framework provides a unified solution for captioning natural images, visual text (e.g., posters, UIs, textbooks), and structured visuals (e.g., documents, tables, charts). By converting low-level pixel information into semantically rich textual representations, our framework bridges the gap between visual and textual modalities. Our results highlight three key advantages: (i) Enhanced Visual Reasoning with LLMs, where long-context captions of visual modalities empower LLMs, particularly the DeepSeek-R1 series, to reason effectively in multimodal scenarios; (ii) Improved Image Generation, where detailed captions improve tasks like text-to-image generation and image transformation; and (iii) Efficient Supervised Fine-Tuning (SFT), which enables faster convergence with less data. We believe the versatility and adaptability of OmniCaptioner can offer a new perspective for bridging the gap between language and visual modalities.