Demo example in the paper

#6
by willsky - opened

Great work!
May I ask how to reproduce the results in Figures 4 and 5 of the paper, i.e., retrieving the specific frames corresponding to the prompts?
Many thanks!

OpenGVLab org

To achieve more detailed video understanding than plain conversation, you need to load the third-party modules from TPO; see https://huggingface.co/OpenGVLab/VideoChat-TPO/tree/main
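A minimal loading sketch, assuming VideoChat-TPO follows the usual transformers `trust_remote_code` pattern (the dtype/device choices are only examples, and how the extra decoder weights are fetched is an assumption; check the repo README for the exact setup):

```python
import torch
from transformers import AutoModel, AutoTokenizer

model_path = "OpenGVLab/VideoChat-TPO"

# trust_remote_code=True loads the custom model code shipped with the repo
# (assumption: the third-party task decoders are set up through this code).
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
model = AutoModel.from_pretrained(
    model_path,
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
).eval().cuda()
```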

It seems the third-party modules in TPO are cgdetr and sam2. How should I proceed after loading these two modules?

OpenGVLab org

After the corresponding task decoder is loaded, the model identifies whether the decoder needs to be called and uses it to help produce the response.

Could you give some example code for this? I just want to get the specific frame numbers or timestamps corresponding to a prompt, for example, "In this video, in which frames does a man appear?" or "In this video, from which second to which second does a man appear?" Currently, the demo does not output the right frames/seconds.

OpenGVLab org

You can try this:

Based on the video content, determine the start and end times of **various activity events** in the video, accompanied by descriptions.
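If it helps, here is a rough inference sketch reusing the model and tokenizer loaded above. It assumes the repo's custom code exposes a chat-style interface similar to other VideoChat releases; the `chat()` signature and the `load_video()` helper are assumptions, so please adapt to the demo code shipped with the repo:

```python
# Hypothetical usage sketch -- the chat() signature and load_video() helper
# are assumptions; check the repo's demo/README for the real interface.
question = (
    "Based on the video content, determine the start and end times of "
    "various activity events in the video, accompanied by descriptions."
)

frames = load_video("example.mp4", num_frames=16)    # hypothetical video loader
response = model.chat(tokenizer, frames, question)   # assumed chat interface
print(response)
```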

I have tried this, but it cannot output the right time. For a 6-second video, it outputs "25 to 30 seconds".

I have been trying to replicate this for a while with no luck. I tried running TPO but couldn't figure out how to load this model with it, and I couldn't run the third-party modules alone with the new model either. Are there any updates on this? I want to understand how to get the third-party modules working with this model. If that's how the outputs in your paper were achieved, I would greatly appreciate any documentation or code showing how to replicate those experiments.
