Input video length constraints
Are there any limits on the length of the input video that can be provided to SmolVLM2 2.2B?
What is the maximum video length it can handle?
There is a limit of 64 frames.
SmolVLM2 samples frames at 1 FPS, with a maximum of 64 frames.
If the video is longer than 64 seconds, it will instead sample 64 evenly spaced frames.
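Illustratively, the sampling rule amounts to something like the sketch below (this is not the processor's actual code, just the rule restated in Python):

def sampled_timestamps(duration_s: float, fps: float = 1.0, max_frames: int = 64):
    # Sample at `fps`, but never more than `max_frames` frames; longer
    # clips fall back to evenly spaced timestamps across the full clip.
    n = max(1, min(int(duration_s * fps), max_frames))
    return [duration_s * i / n for i in range(n)]

print(len(sampled_timestamps(30)))   # 30 frames for a 30 s clip
print(len(sampled_timestamps(300)))  # capped at 64 for a 5 min clip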
@mfarre 1 FPS is a bit low for my desired use case. Is there an option to increase the sample rate? Alternatively, could I manually sample the frames and use multi-image inference? (A sketch of that alternative appears after the reply below.)
@j0yk1ll
You can adjust the FPS like this:
import torch

fps = 2.0  # example value: request sampling at 2 FPS

messages = [
    {
        "role": "user",
        "content": [
            {"type": "video", "path": "path_to_video.mp4", "target_fps": fps},
            {"type": "text", "text": "Describe this video in detail"}
        ]
    },
]

# Build the model inputs from the chat template.
inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device, dtype=torch.bfloat16)

# Generate and decode the answer.
generated_ids = model.generate(**inputs, do_sample=False, max_new_tokens=64)
generated_texts = processor.batch_decode(
    generated_ids,
    skip_special_tokens=True,
)
print(generated_texts[0])
Best,
Orr
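For reference, the multi-image alternative raised in the question would look roughly like this. The frame-extraction helper is illustrative, and it assumes the processor accepts in-memory PIL images via {"type": "image", "image": ...} content entries:

import cv2
from PIL import Image

def extract_frames(video_path, num_frames):
    # Grab `num_frames` evenly spaced frames from the clip.
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    frames = []
    for i in range(num_frames):
        cap.set(cv2.CAP_PROP_POS_FRAMES, i * total // num_frames)
        ok, frame = cap.read()
        if ok:
            # OpenCV returns BGR; convert to RGB for PIL.
            frames.append(Image.fromarray(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)))
    cap.release()
    return frames

frames = extract_frames("path_to_video.mp4", 16)
messages = [{
    "role": "user",
    "content": [
        *({"type": "image", "image": f} for f in frames),
        {"type": "text", "text": "Describe this video in detail"},
    ],
}]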
Thanks for the awesome models!
I am trying to use the above snippet to control the number of frames the model can access.
For this, I first calculate the desired FPS using:
import cv2

def get_desired_fps_given_video_path_and_num_frames(video_path, num_frames):
    # Read the clip's native frame rate and total frame count.
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS)
    num_frames_in_video = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    cap.release()
    # Scale the native FPS so that sampling yields num_frames frames.
    return fps * num_frames / num_frames_in_video
and then configure SmolVLM2 with:
desired_fps = get_desired_fps_given_video_path_and_num_frames(
    args.clip_path, args.num_frames
)
processor.image_processor.video_sampling["max_frames"] = args.num_frames
processor.image_processor.video_sampling["fps"] = desired_fps

messages = [
    {
        "role": "user",
        "content": [
            {"type": "video", "path": args.clip_path, "target_fps": desired_fps},
            {"type": "text", "text": prompt},
        ],
    },
]
However, in the log that the model prints internally when it's called, it still seems to extract 64 frames:
Desired FPS: 0.21333333333333335
User Prompt:
"User: You are provided the following series of sixty-four frames from a 0:01:15 [H:MM:SS] video.
Frame from 00:01:
Frame from 00:02:
Frame from 00:03:
Frame from 00:04:
Frame from 00:05:
Frame from 00:06:
Frame from 00:07:
Frame from 00:09:
Frame from 00:10:
Frame from 00:11:
Frame from 00:12:
Frame from 00:13:
Frame from 00:14:
Frame from 00:16:
Frame from 00:17:
Frame from 00:18:
Frame from 00:19:
Frame from 00:20:
Frame from 00:21:
Frame from 00:23:
Frame from 00:24:
Frame from 00:25:
Frame from 00:26:
Frame from 00:27:
Frame from 00:28:
Frame from 00:29:
Frame from 00:31:
Frame from 00:32:
Frame from 00:33:
Frame from 00:34:
Frame from 00:35:
Frame from 00:36:
Frame from 00:38:
Frame from 00:39:
Frame from 00:40:
Frame from 00:41:
Frame from 00:42:
Frame from 00:43:
Frame from 00:45:
Frame from 00:46:
Frame from 00:47:
Frame from 00:48:
Frame from 00:49:
Frame from 00:50:
Frame from 00:51:
Frame from 00:53:
Frame from 00:54:
Frame from 00:55:
Frame from 00:56:
Frame from 00:57:
Frame from 00:58:
Frame from 01:00:
Frame from 01:01:
Frame from 01:02:
Frame from 01:03:
Frame from 01:04:
Frame from 01:05:
Frame from 01:07:
Frame from 01:08:
Frame from 01:09:
Frame from 01:10:
Frame from 01:11:
Frame from 01:12:
Frame from 01:14:
Do you have any suggestions on where I might be going wrong?
In general, the SmolVLM processor will not sample more than max_frames frames. Sampling at target_fps only takes effect if the resulting number of frames is fewer than max_frames. Therefore, you only really need to set max_frames.
Just do:
messages = [
    {
        "role": "user",
        "content": [
            {"type": "video", "path": args.clip_path, "max_frames": max_frames},
            {"type": "text", "text": prompt},
        ],
    },
]
I don't think you need to do
processor.image_processor.video_sampling["max_frames"] = args.num_frames
processor.image_processor.video_sampling["fps"] = desired_fps
Thanks for the super quick reply! I reran with the max_frames parameter set.
However, I still get the same output: the identical sixty-four-frame prompt shown above.
OK, I tested it: you don't need to pass it in messages, but as an argument to apply_chat_template:
max_frames = 2

inputs = processor.apply_chat_template(
    messages,
    max_frames=max_frames,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device, dtype=torch.bfloat16)
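A quick way to confirm the cap took effect (assuming the returned batch carries the usual input_ids field) is to decode the constructed prompt and check the frame count it announces:

# The prompt should now mention two frames instead of sixty-four.
prompt_text = processor.batch_decode(
    inputs["input_ids"], skip_special_tokens=True
)[0]
print(prompt_text.splitlines()[0])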
This works! Thanks a lot for your swift responses! I appreciate it.