Valley 2.0

Introduction

Valley is a cutting-edge multimodal large model developed by ByteDance, designed to handle a variety of tasks involving text, image, and video data. When evaluated against models of the same scale, our model:

  • Achieved the best results on in-house e-commerce and short-video benchmarks
  • Demonstrated comparatively outstanding performance on the OpenCompass benchmark (average score > 67)

Release

Valley-Eagle

The foundational version of Valley is a multimodal large model that aligns SigLIP and Qwen2.5, incorporating a LargeMLP and a ConvAdapter to construct the projector.

  • In the final version, we also referenced Eagle, introducing an additional VisionEncoder whose token count can be flexibly adjusted and whose tokens are processed in parallel with the original visual tokens.
  • This enhancement improves the model's performance in extreme scenarios; we chose the Qwen2-VL VisionEncoder for this purpose.

The model structure is shown below:

(Figure: Valley-Eagle model structure)
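As a rough illustration of the projector described above (a sketch, not the actual Valley implementation; all dimensions, strides, and module names here are assumptions), the ConvAdapter can be pictured as a strided convolution that reduces the visual token count, after which an MLP maps the tokens into the LLM embedding space:

```python
# Illustrative projector sketch (assumed design, not ByteDance's code).
import torch
import torch.nn as nn

class ConvAdapter(nn.Module):
    """Downsample the visual token sequence with a strided 1-D convolution,
    reducing the number of tokens handed to the LLM (assumed behavior)."""
    def __init__(self, dim=1152, stride=2):
        super().__init__()
        self.conv = nn.Conv1d(dim, dim, kernel_size=stride, stride=stride)

    def forward(self, x):             # x: (batch, tokens, dim)
        x = x.transpose(1, 2)         # -> (batch, dim, tokens) for Conv1d
        x = self.conv(x)
        return x.transpose(1, 2)      # -> (batch, tokens // stride, dim)

class MLPProjector(nn.Module):
    """Map vision-encoder features into the LLM embedding space."""
    def __init__(self, vision_dim=1152, llm_dim=3584):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, x):
        return self.proj(x)

# Example: 729 patch tokens of width 1152 (SigLIP-like), batch of 1.
tokens = torch.randn(1, 729, 1152)
reduced = ConvAdapter()(tokens)       # token count roughly halved
projected = MLPProjector()(reduced)   # now in LLM embedding width
```

The dimensions (1152 for the vision encoder, 3584 for the LLM hidden size) are chosen to be plausible for a SigLIP/Qwen2.5-7B pairing, but the real projector configuration lives in the released model code.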

Environment Setup

pip install torch==2.4.0 torchvision==0.19.0 torchaudio==2.4.0 --index-url https://download.pytorch.org/whl/cu121
pip install -r requirements.txt
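After installing, a quick sanity check confirms that PyTorch imported correctly and whether the CUDA build is usable (a generic check; version strings will depend on your install):

```python
# Environment sanity check: verify torch imports and report CUDA status.
import torch

print("torch version:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
```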

License Agreement

All of our open-source models are licensed under the Apache-2.0 license.

Citation

Coming Soon!

Valley-Eagle-7B (8.88B params, BF16 safetensors) is fine-tuned from the base model Qwen/Qwen2.5-7B.