Papers
arxiv:2504.05599

Skywork R1V: Pioneering Multimodal Reasoning with Chain-of-Thought

Published on Apr 8
· Submitted by xuchensong on Apr 9
#3 Paper of the day
Authors:
Chris, et al.

Abstract

We introduce Skywork R1V, a multimodal reasoning model that extends an R1-series large language model (LLM) to visual modalities via an efficient multimodal transfer method. Leveraging a lightweight visual projector, Skywork R1V enables seamless multimodal adaptation without retraining either the foundational language model or the vision encoder. To strengthen visual-text alignment, we propose a hybrid optimization strategy that combines iterative Supervised Fine-Tuning (SFT) with Group Relative Policy Optimization (GRPO), significantly improving cross-modal integration efficiency. Additionally, we introduce an adaptive-length Chain-of-Thought distillation approach for reasoning-data generation. This approach dynamically optimizes reasoning chain lengths, enhancing inference efficiency and preventing excessive overthinking. Empirical evaluations demonstrate that Skywork R1V, with only 38B parameters, delivers competitive performance, achieving scores of 69.0 on the MMMU benchmark and 67.5 on MathVista. Meanwhile, it maintains robust textual reasoning performance, evidenced by impressive scores of 72.0 on AIME and 94.0 on MATH500. The Skywork R1V model weights have been publicly released to promote openness and reproducibility.
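The "lightweight visual projector" mentioned above is the only new component trained to bridge the frozen vision encoder and the frozen LLM. A common realization of such a projector is a small MLP that maps vision-encoder patch embeddings into the LLM's embedding space; the sketch below illustrates that idea in NumPy. All dimensions, the two-layer shape, and the ReLU activation are illustrative assumptions, not the paper's exact configuration.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_projector(vision_dim, llm_dim, hidden_dim):
    """Initialize a two-layer MLP projector (shapes are assumptions)."""
    return {
        "w1": rng.standard_normal((vision_dim, hidden_dim)) * 0.02,
        "b1": np.zeros(hidden_dim),
        "w2": rng.standard_normal((hidden_dim, llm_dim)) * 0.02,
        "b2": np.zeros(llm_dim),
    }

def project(params, patches):
    """Map [num_patches, vision_dim] features to [num_patches, llm_dim]."""
    h = patches @ params["w1"] + params["b1"]
    h = np.maximum(h, 0.0)  # ReLU here for simplicity; the real choice may differ
    return h @ params["w2"] + params["b2"]

# Example: 256 image patches with 1024-dim vision features projected
# into a 4096-dim LLM token space (all sizes hypothetical).
params = make_projector(vision_dim=1024, llm_dim=4096, hidden_dim=2048)
tokens = project(params, rng.standard_normal((256, 1024)))
print(tokens.shape)  # (256, 4096)
```

Because only these projector weights are trained, the transfer is cheap relative to retraining either the vision encoder or the 38B-parameter LLM.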

Community

Paper author Paper submitter

Skywork R1V: an open-sourced 38B multimodal reasoning model extending R1-series LLMs to vision via efficient transfer, hybrid SFT+GRPO training, and adaptive CoT distillation—69.0 on MMMU, 67.5 on MathVista, with strong math reasoning. Model weights are open-sourced! #AI #LLM #Multimodal
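The "hybrid SFT+GRPO training" in the summary refers to alternating supervised fine-tuning with Group Relative Policy Optimization. GRPO's core idea is to score each sampled completion relative to the other completions in its group, normalizing rewards by the group mean and standard deviation instead of using a learned value critic. A minimal sketch of that advantage computation (the reward values are made up for illustration):

```python
import statistics

def group_relative_advantages(rewards):
    """Normalize each completion's reward against its group's statistics.

    This is the group-relative baseline at the heart of GRPO; the full
    algorithm additionally uses these advantages in a clipped policy
    gradient update, which is omitted here.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + 1e-8) for r in rewards]

# Four sampled answers to one prompt: two correct (reward 1), two wrong.
adv = group_relative_advantages([1.0, 0.0, 1.0, 0.0])
print([round(a, 3) for a in adv])  # [1.0, -1.0, 1.0, -1.0]
```

Correct completions get positive advantage and incorrect ones negative, so the policy is pushed toward answers that beat their own sampling group.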

https://github.com/SkyworkAI/Skywork-R1V


Paper author

Impressive work!

