Input Video length constraints

#6
by NikhilJoson - opened

Are there any limits on the length of the input video that can be provided to SmolVLM2 2.2B?
What is the maximum video length it can handle?

Hugging Face TB Research org

There is a limit of 64 frames.
SmolVLM2 samples frames at 1 FPS, up to a maximum of 64 frames.
If the video is longer than 64 seconds, it samples 64 evenly spaced frames instead.
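
To make the rule concrete, here is a minimal sketch of the sampling behavior described above. It is not the processor's actual code: `sample_timestamps` and its parameter names are hypothetical, and the exact timestamps the library picks may differ.

```python
def sample_timestamps(duration_s, target_fps=1.0, max_frames=64):
    """Return the timestamps (in seconds) of the frames that would be sampled.

    Hypothetical helper illustrating the rule from this thread:
    sample at target_fps unless that would exceed max_frames,
    in which case take max_frames evenly spaced frames.
    """
    frames_at_target = int(duration_s * target_fps)
    if frames_at_target <= max_frames:
        # Short video: keep the requested rate.
        n = max(frames_at_target, 1)
        step = 1.0 / target_fps
    else:
        # Long video: spread max_frames evenly over the whole duration.
        n = max_frames
        step = duration_s / max_frames
    return [i * step for i in range(n)]


short = sample_timestamps(30)   # 30 s video at 1 FPS
long = sample_timestamps(75)    # 75 s video, capped at 64 frames
print(len(short), len(long))
```

Under these assumptions, a 30-second video yields 30 frames one second apart, while a 75-second video yields 64 frames roughly 1.17 seconds apart.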

@mfarre 1 FPS is a bit low for my use case. Is there an option to increase the sample rate? Alternatively, could I sample the frames manually and use multi-image inference?
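
For the manual route asked about here, a possible starting point is to work out which source frames to keep for a given rate. This is only a sketch: `frame_indices` is a hypothetical helper, the 30 FPS and 300-frame values are made up for illustration, and reading the frames (for example with OpenCV) and passing them to the model as images is left out.

```python
def frame_indices(native_fps, total_frames, target_fps):
    """Return indices of the source frames to keep when resampling to target_fps.

    Hypothetical helper; it never returns more frames than the video has.
    """
    # Number of source frames between two kept frames (at least 1).
    step = max(native_fps / target_fps, 1.0)
    n = int(total_frames / step)
    return [int(i * step) for i in range(n)]


# Made-up example: a 10 s clip recorded at 30 FPS, resampled to 4 FPS.
indices = frame_indices(native_fps=30.0, total_frames=300, target_fps=4.0)
print(len(indices), indices[:3])
```

Note that the 64-frame limit mentioned above may still apply to whatever is passed to the model, so a higher rate only helps on short clips.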

Hugging Face TB Research org

@j0yk1ll
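You can adjust the FPS by passing target_fps in the video entry (this assumes processor, model, and torch are already set up, and that fps holds your desired sampling rate):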

messages = [
    {
        "role": "user",
        "content": [
            {"type": "video", "path": "path_to_video.mp4", "target_fps": fps},
            {"type": "text", "text": "Describe this video in detail"}
        ]
    },
]

inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device, dtype=torch.bfloat16)

generated_ids = model.generate(**inputs, do_sample=False, max_new_tokens=64)
generated_texts = processor.batch_decode(
    generated_ids,
    skip_special_tokens=True,
)

print(generated_texts[0])

Best,
Orr

Thanks for the awesome models!

I am trying to use the above snippet to control the number of frames the model can access.

To do this, I first calculate the desired FPS using:

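import cv2

def get_desired_fps_given_video_path_and_num_frames(video_path, num_frames):
    # Read the native frame rate and total frame count from the video metadata.
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS)
    num_frames_in_video = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    cap.release()
    # Equivalent to num_frames / duration_in_seconds.
    return fps * num_frames / num_frames_in_video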

and then configure SmolVLM2 with:

    processor.image_processor.video_sampling["max_frames"] = args.num_frames
    processor.image_processor.video_sampling["fps"] = desired_fps

    messages = [
        {
            "role": "user",
            "content": [
                {"type": "video", "path": args.clip_path, "target_fps": desired_fps},
                {"type": "text", "text": prompt},
            ],
        },
    ]
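
As a sanity check on the FPS formula above, the same arithmetic can be run without opening a video. The values here are assumptions for illustration: a 75-second clip (matching the 0:01:15 duration in the log below) stored at a hypothetical 30 FPS, with 16 requested frames, which is what the logged value of about 0.2133 would imply.

```python
def desired_fps(native_fps, total_frames, num_frames):
    """FPS that would yield num_frames frames over the whole video."""
    duration_s = total_frames / native_fps
    # Same as native_fps * num_frames / total_frames.
    return num_frames / duration_s


# Hypothetical metadata: 2250 frames at 30 FPS is a 75 s clip.
fps = desired_fps(native_fps=30.0, total_frames=2250, num_frames=16)
print(fps)  # roughly 0.2133
```

So the formula itself looks right; the question is whether the processor honors the value.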

However, according to the log the model prints when it is called, it still seems to be extracting 64 frames:

Desired FPS: 0.21333333333333335
User Prompt:
"User: You are provided the following series of sixty-four frames from a 0:01:15 [H:MM:SS] video.

Frame from 00:01:
Frame from 00:02:
Frame from 00:03:
Frame from 00:04:
Frame from 00:05:
Frame from 00:06:
Frame from 00:07:
Frame from 00:09:
Frame from 00:10:
Frame from 00:11:
Frame from 00:12:
Frame from 00:13:
Frame from 00:14:
Frame from 00:16:
Frame from 00:17:
Frame from 00:18:
Frame from 00:19:
Frame from 00:20:
Frame from 00:21:
Frame from 00:23:
Frame from 00:24:
Frame from 00:25:
Frame from 00:26:
Frame from 00:27:
Frame from 00:28:
Frame from 00:29:
Frame from 00:31:
Frame from 00:32:
Frame from 00:33:
Frame from 00:34:
Frame from 00:35:
Frame from 00:36:
Frame from 00:38:
Frame from 00:39:
Frame from 00:40:
Frame from 00:41:
Frame from 00:42:
Frame from 00:43:
Frame from 00:45:
Frame from 00:46:
Frame from 00:47:
Frame from 00:48:
Frame from 00:49:
Frame from 00:50:
Frame from 00:51:
Frame from 00:53:
Frame from 00:54:
Frame from 00:55:
Frame from 00:56:
Frame from 00:57:
Frame from 00:58:
Frame from 01:00:
Frame from 01:01:
Frame from 01:02:
Frame from 01:03:
Frame from 01:04:
Frame from 01:05:
Frame from 01:07:
Frame from 01:08:
Frame from 01:09:
Frame from 01:10:
Frame from 01:11:
Frame from 01:12:
Frame from 01:14:

Do you have any suggestions on where I might be going wrong?

Hugging Face TB Research org

In general, the SmolVLM processor will not sample more than max_frames frames. It only samples at target_fps if the resulting number of frames is fewer than max_frames.

Therefore, you only really need to set max_frames.

Just do:

    messages = [
        {
            "role": "user",
            "content": [
                {"type": "video", "path": args.clip_path, "max_frames": max_frames},
                {"type": "text", "text": prompt},
            ],
        },
    ]

I don't think you need to do:

    processor.image_processor.video_sampling["max_frames"] = args.num_frames
    processor.image_processor.video_sampling["fps"] = desired_fps

Thanks for the super quick reply! I ran with setting the "max_frames" parameter.

However, I still get the same output. The prompt again reports sixty-four frames from the 0:01:15 video, with the same "Frame from" timestamps (00:01 through 01:14) as in the log from my previous post.

Hugging Face TB Research org

OK, I tested it: you don't need to pass it in messages. Pass it as an argument to apply_chat_template instead:

max_frames = 2

inputs = processor.apply_chat_template(
    messages,
    max_frames=max_frames,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device, dtype=torch.bfloat16)
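
To confirm that the setting took effect, one option is to count the "Frame from MM:SS:" markers in the prompt text that gets logged. The helper below is a small sketch: `count_frame_markers` is a hypothetical name, and the pattern is based only on the log format shown earlier in this thread, so it may need adjusting for other video lengths or library versions.

```python
import re

# Matches lines such as "Frame from 00:38:" as seen in the logged prompt.
FRAME_MARKER = re.compile(r"^Frame from \d{2}:\d{2}:", re.MULTILINE)


def count_frame_markers(prompt_text):
    """Count how many frame markers appear in a logged prompt."""
    return len(FRAME_MARKER.findall(prompt_text))


# Shortened stand-in for a logged prompt with max_frames=2.
sample_prompt = (
    "User: You are provided the following series of two frames "
    "from a 0:01:15 [H:MM:SS] video.\n\n"
    "Frame from 00:01:\n"
    "Frame from 00:38:\n"
)
print(count_frame_markers(sample_prompt))
```

If the count matches max_frames, the argument reached the processor.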

This works! Thanks a lot for your swift responses! I appreciate it.
