What is Yi-VL?

Architecture

Yi-VL adopts the LLaVA architecture, which is composed of three primary components:

Vision Transformer (ViT): initialized from the CLIP ViT-H/14 model and used for image encoding.
Projection Module: aligns image features with the text feature space; it consists of a two-layer Multilayer Perceptron (MLP) with layer normalizations.
Large Language Model (LLM): initialized from Yi-34B-Chat or Yi-6B-Chat, which demonstrate exceptional proficiency in understanding and generating both English and Chinese.
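
To make the role of the projection module concrete, here is a minimal, dependency-free sketch of a two-layer MLP with layer normalization that maps per-patch image features into the LLM's embedding space. It is an illustration only, not the Yi-VL implementation: the layer ordering (Linear, LayerNorm, GELU, Linear, LayerNorm), the random weights, the toy dimensions, and all function names are assumptions made for this example.

```python
import math
import random


def linear(x, weight, bias):
    """Apply y = W x + b to a single feature vector."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weight, bias)]


def layer_norm(x, eps=1e-5):
    """Normalize a vector to zero mean and unit variance (no learned scale or shift)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def gelu(x):
    """Exact GELU activation, applied element-wise."""
    return [0.5 * v * (1.0 + math.erf(v / math.sqrt(2.0))) for v in x]


def init_linear(rng, in_dim, out_dim):
    """Random weights and zero bias; stands in for trained parameters."""
    scale = 1.0 / math.sqrt(in_dim)
    weight = [[rng.uniform(-scale, scale) for _ in range(in_dim)]
              for _ in range(out_dim)]
    return weight, [0.0] * out_dim


def project(patch_features, params):
    """Map each image patch feature to a vector of the LLM's embedding size."""
    (w1, b1), (w2, b2) = params
    tokens = []
    for x in patch_features:
        # Assumed ordering: Linear -> LayerNorm -> GELU -> Linear -> LayerNorm.
        hidden = gelu(layer_norm(linear(x, w1, b1)))
        tokens.append(layer_norm(linear(hidden, w2, b2)))
    return tokens


# Toy sizes chosen for readability; the real ViT and LLM widths are much larger.
VISION_DIM, TEXT_DIM, NUM_PATCHES = 8, 16, 4

rng = random.Random(0)
params = (init_linear(rng, VISION_DIM, TEXT_DIM),
          init_linear(rng, TEXT_DIM, TEXT_DIM))
patches = [[rng.gauss(0, 1) for _ in range(VISION_DIM)]
           for _ in range(NUM_PATCHES)]

tokens = project(patches, params)
print(len(tokens), len(tokens[0]))  # 4 16
```

Each output vector has the LLM's embedding width, so the projected patches can be placed alongside ordinary text token embeddings in the LLM's input sequence.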

How to use Yi-VL?

Quick start

Yi-VL is supported in the SGLang codebase, so you can call the model by defining a function like this:

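import sglang as sgl


@sgl.function
def image_qa(s, image_path, question):
    # Build one chat turn: the user sends an image plus a question,
    # and the assistant's reply is stored under the key "answer".
    s += sgl.user(sgl.image(image_path) + question)
    s += sgl.assistant(sgl.gen("answer"))


# Start a local runtime that loads the Yi-VL weights and tokenizer.
runtime = sgl.Runtime(model_path="BabyChou/Yi-VL-34B",
                      tokenizer_path="BabyChou/Yi-VL-34B")
sgl.set_default_backend(runtime)

# Ask a single question about one image.
state = image_qa.run(
    image_path="images/cat.jpeg",
    question="What is this?",
    max_new_tokens=64)
print(state["answer"], "\n")

# Stop the runtime and release its resources when finished.
runtime.shutdown()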

License

For the license of the source code, please refer to the acknowledgments and attributions as well as to the individual components.

The Yi series models are fully open for academic research and free for commercial use; commercial permission is granted automatically upon application.

All usage must adhere to the Yi Series Models Community License Agreement 2.1.

To obtain official permission for free commercial use, you only need to send an email.