Welcome to try DeepSeek-VL2~

#2
by CharlesCXK - opened

Hello, great work! We would also like to invite you to evaluate our DeepSeek-VL2 model (4.5B activated parameters), which was released in December 2024. DeepSeek-VL2 exhibits strong grounding capabilities and can handle various scenarios, including natural scenes, UI elements, and more.

The input format for our model is <|ref|>xxx<|/ref|>, where <|ref|> and <|/ref|> are special tokens, and xxx represents the object you want to query. The output format is <|ref|>xxx<|/ref|><|det|>[[x1, y1, x2, y2]]<|/det|>, where xxx is the queried object from the prompt, and [x1, y1, x2, y2] are the coordinates of the detected object. Here, x1 and y1 denote the top-left corner, and x2 and y2 denote the bottom-right corner of the bounding box, with the top-left corner of the image being (0, 0). These coordinates are normalized to the range [0, 999]. For example, if the original width of the image is W, the absolute coordinate of x1 can be calculated as x1 / 999 * W. If multiple objects are detected, there will be more than one list, separated by commas, such as <|det|>[[x1, y1, x2, y2], [m1, n1, m2, n2]]<|/det|>.
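For readers who want to consume this output programmatically, here is a minimal sketch of parsing the format described above and mapping the boxes to pixels. The function name `parse_grounding` and the regular expression are illustrative, not part of the DeepSeek-VL2 code base. The post only gives the formula for x1; the sketch assumes y coordinates scale with the image height in the same way.

```python
import json
import re

# One grounded answer: <|ref|>label<|/ref|><|det|>[[x1, y1, x2, y2], ...]<|/det|>
GROUNDING_RE = re.compile(
    r"<\|ref\|>(.*?)<\|/ref\|><\|det\|>(\[\[.*?\]\])<\|/det\|>", re.DOTALL
)


def parse_grounding(text, width, height):
    """Return (label, [x1, y1, x2, y2]) tuples in pixel coordinates."""
    results = []
    for label, raw_boxes in GROUNDING_RE.findall(text):
        # The box list is valid JSON: a list of four-element lists.
        for x1, y1, x2, y2 in json.loads(raw_boxes):
            results.append((label, [
                x1 / 999 * width,   # formula given in the post
                y1 / 999 * height,  # assumption: same scaling along the height
                x2 / 999 * width,
                y2 / 999 * height,
            ]))
    return results


sample = "<|ref|>button<|/ref|><|det|>[[0, 0, 999, 999], [100, 200, 300, 400]]<|/det|>"
boxes = parse_grounding(sample, width=1920, height=1080)
for label, box in boxes:
    print(label, box)
```

Each detected box is returned as its own tuple, so a query that matches several objects yields several entries with the same label.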

Our model is available for download on HuggingFace (https://huggingface.co/deepseek-ai/deepseek-vl2) and can also be accessed via API at https://cloud.siliconflow.cn.

Hello @CharlesCXK ,

Thanks for reaching out. We will certainly try your model and add it to our vision agent lib.

Do you have the training code available? If so, we could fine-tune your model on UI datasets and report the results back to you.

Hi, we do not provide training code. Our model has been trained on some UI data, so it can be tested directly. 😄

Hello @CharlesCXK ,

I tried to get the DeepSeek-VL2 model running in a Hugging Face Space, but I ran into multiple problems. Do you have a working Hugging Face Space I could use?

My error is:

Exit code: 1. Reason: Python version is above 3.10, patching the collections module.
Traceback (most recent call last):
  File "/home/user/app/app.py", line 11, in <module>
    from deepseek_vl2.models import DeepseekVLV2Processor, DeepseekVLV2ForCausalLM
  File "/usr/local/lib/python3.10/site-packages/deepseek_vl2/models/__init__.py", line 21, in <module>
    from .modeling_deepseek_vl_v2 import DeepseekVLV2ForCausalLM
  File "/usr/local/lib/python3.10/site-packages/deepseek_vl2/models/modeling_deepseek_vl_v2.py", line 28, in <module>
    from .modeling_deepseek import DeepseekV2ForCausalLM
  File "/usr/local/lib/python3.10/site-packages/deepseek_vl2/models/modeling_deepseek.py", line 37, in <module>
    from transformers.models.llama.modeling_llama import (
ImportError: cannot import name 'LlamaFlashAttention2' from 'transformers.models.llama.modeling_llama' (/usr/local/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py)

My requirements.txt is:

flash-attn @ https://github.com/Dao-AILab/flash-attention/releases/download/v2.7.3/flash_attn-2.7.3+cu12torch2.5cxx11abiFALSE-cp310-cp310-linux_x86_64.whl
xformers
deepseek-vl2 @ git+https://github.com/deepseek-ai/DeepSeek-VL2.git

And the script from your space:

import torch
from transformers import AutoModelForCausalLM

from deepseek_vl2.models import DeepseekVLV2Processor, DeepseekVLV2ForCausalLM
from deepseek_vl2.utils.io import load_pil_images


# specify the path to the model
model_path = "deepseek-ai/deepseek-vl2-tiny"
vl_chat_processor: DeepseekVLV2Processor = DeepseekVLV2Processor.from_pretrained(model_path)
tokenizer = vl_chat_processor.tokenizer

vl_gpt: DeepseekVLV2ForCausalLM = AutoModelForCausalLM.from_pretrained(model_path, trust_remote_code=True)
vl_gpt = vl_gpt.to(torch.bfloat16).cuda().eval()

## single image conversation example
conversation = [
    {
        "role": "<|User|>",
        "content": "<image>\n<|ref|>The giraffe at the back.<|/ref|>.",
        "images": ["./images/visual_grounding_1.jpeg"],
    },
    {"role": "<|Assistant|>", "content": ""},
]

# load images and prepare for inputs
pil_images = load_pil_images(conversation)
prepare_inputs = vl_chat_processor(
    conversations=conversation,
    images=pil_images,
    force_batchify=True,
    system_prompt=""
).to(vl_gpt.device)

# run image encoder to get the image embeddings
inputs_embeds = vl_gpt.prepare_inputs_embeds(**prepare_inputs)

# run the model to get the response
outputs = vl_gpt.language.generate(
    inputs_embeds=inputs_embeds,
    attention_mask=prepare_inputs.attention_mask,
    pad_token_id=tokenizer.eos_token_id,
    bos_token_id=tokenizer.bos_token_id,
    eos_token_id=tokenizer.eos_token_id,
    max_new_tokens=512,
    do_sample=False,
    use_cache=True
)

answer = tokenizer.decode(outputs[0].cpu().tolist(), skip_special_tokens=False)
print(f"{prepare_inputs['sft_format'][0]}", answer)

This appears to be an issue with the version of transformers. Please configure the environment according to the version specified here.
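A quick way to confirm this kind of mismatch before starting the app is to check whether the installed package still exposes the name being imported. The helper `has_symbol` below is a hypothetical diagnostic, not part of transformers or DeepSeek-VL2. It only reports whether the import would succeed and does not say which version to install.

```python
import importlib


def has_symbol(module_name, attr):
    """Return True if `module_name` can be imported and exposes `attr`."""
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        # The module itself is missing or cannot be imported.
        return False
    return hasattr(module, attr)


# The import that fails in the traceback above.
print(has_symbol("transformers.models.llama.modeling_llama", "LlamaFlashAttention2"))
```

If this prints False, the installed transformers version does not match what the DeepSeek-VL2 code expects.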

Hello @CharlesCXK ,

Thanks for pointing me to the right requirements.txt. I've set up a Space with DeepSeek-VL2. It would be nice if we could continue our discussion in this Space.

I have several questions:

  • I've only used the tiny model, because the Hugging Face Space runs out of memory with the small and the normal model. Do you have any suggestions?
  • In our experience, models perform best when we know how to prompt them in the right way. I looked in your paper and saw that you use "<|ref|>ActionGames<|/ref|>", "Pinpoint <|ref|>Notifications<|/ref|> in the image with its coordinates." or "Find <|ref|>The DeepThink button<|/ref|>".
    • How did you come up with these prompts? Do you have a list of them?
    • Do you plan to train another model? We could provide you with some UI datasets to include in training.

  1. In our recent update, we have incorporated the Incremental Prefilling strategy, which enables DeepSeek-VL2-Small to run on a 40G A100 GPU.
  2. 2.1 We used an LLM to help design some prompts for training, but ultimately found that directly using <|ref|>ActionGames<|/ref|> also yielded good results. Below are some prompt templates used during training. <object> is a placeholder and will be replaced with content similar to <|ref|>ActionGames<|/ref|>.
        "Locate <object> in the given image.",
        "Identify <object> in this picture.",
        "Pinpoint <object> in the image with its coordinates.",
        "Find <object> in this image with its bounding box.",
        "Show the position of <object> in the image.",
        "Trace <object> in the image and provide its bounding box.",
        "Locate and box <object> in the image.",
        "Detect <object> in the picture.",
        "Where can <object> be found in this image?",
        "Identify <object> in the image.",
        "Locate <object> and output its bounding box.",
    

    2.2. Thank you for your help. We have included a small amount of UI data during training, such as Wave-UI-25K and Mind2Web. If convenient, please share your data with us so we can check if there are any new additions. Thank you!
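As a small illustration of how the templates in 2.1 can be used at inference time, the sketch below fills the `<object>` placeholder with a `<|ref|>...<|/ref|>` span. The helper `build_prompt` is hypothetical. The `<image>\n` prefix is taken from the example script earlier in this thread, and only two of the listed templates are reproduced here.

```python
# Two of the training templates quoted above; <object> is the placeholder.
TEMPLATES = [
    "Locate <object> in the given image.",
    "Find <object> in this image with its bounding box.",
]


def build_prompt(target, template="<object>"):
    """Wrap `target` in <|ref|> tokens and substitute it into `template`.

    The default template is the bare reference, which the authors report
    also works well.
    """
    ref = f"<|ref|>{target}<|/ref|>"
    return "<image>\n" + template.replace("<object>", ref)


print(build_prompt("The DeepThink button"))
print(build_prompt("The DeepThink button", TEMPLATES[0]))
```

The returned string can be used as the "content" field of the user turn in the conversation list.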

Thanks for sharing the details. I will try to integrate them.

2.2. @maxiw has listed a lot of UI datasets. I think this is a good starting point for extending your VLM's capabilities in the UI domain.

@CharlesCXK ,

I tried your "Incremental Prefilling strategy" approach, but still had no success, even with an A100 80GB.

I'm not entirely sure what's going on. In our local testing, an 80G A100 can directly run DeepSeek-VL2 without needing the "Incremental Prefilling strategy." The "Incremental Prefilling strategy" was designed to enable a 40G A100 to run DeepSeek-VL2-Small.
