Input video length constraints
Are there any limits on the length of the input video that can be provided to SmolVLM2 2.2B?
What is the maximum video length it can handle?
There is a limit of 64 frames.
SmolVLM2 samples frames at 1 FPS, with a maximum of 64 frames.
If the video is longer than 64 seconds, it will instead sample 64 evenly spaced frames.
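Illustratively, the sampling rule amounts to something like the sketch below (this is not the processor's actual code, just the rule restated in Python):

def sampled_timestamps(duration_s: float, fps: float = 1.0, max_frames: int = 64):
    # Sample at `fps`, but never more than `max_frames` frames; longer
    # clips fall back to evenly spaced timestamps across the full clip.
    n = max(1, min(int(duration_s * fps), max_frames))
    return [duration_s * i / n for i in range(n)]

print(len(sampled_timestamps(30)))   # 30 frames for a 30 s clip
print(len(sampled_timestamps(300)))  # capped at 64 for a 5 min clip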
@mfarre 1 FPS is a bit low for my desired use case. Is there an option to increase the sample rate? Alternatively, could I manually sample the frames and use multi-image inference? (A sketch of that alternative appears after the reply below.)
@j0yk1ll
You can adjust the FPS like this:
import torch

fps = 2.0  # example value: request sampling at 2 FPS

messages = [
    {
        "role": "user",
        "content": [
            {"type": "video", "path": "path_to_video.mp4", "target_fps": fps},
            {"type": "text", "text": "Describe this video in detail"}
        ]
    },
]

# Build the model inputs from the chat template.
inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device, dtype=torch.bfloat16)

# Generate and decode the answer.
generated_ids = model.generate(**inputs, do_sample=False, max_new_tokens=64)
generated_texts = processor.batch_decode(
    generated_ids,
    skip_special_tokens=True,
)
print(generated_texts[0])
Best,
Orr
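For reference, the multi-image alternative raised in the question would look roughly like this. The frame-extraction helper is illustrative, and it assumes the processor accepts in-memory PIL images via {"type": "image", "image": ...} content entries:

import cv2
from PIL import Image

def extract_frames(video_path, num_frames):
    # Grab `num_frames` evenly spaced frames from the clip.
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    frames = []
    for i in range(num_frames):
        cap.set(cv2.CAP_PROP_POS_FRAMES, i * total // num_frames)
        ok, frame = cap.read()
        if ok:
            # OpenCV returns BGR; convert to RGB for PIL.
            frames.append(Image.fromarray(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)))
    cap.release()
    return frames

frames = extract_frames("path_to_video.mp4", 16)
messages = [{
    "role": "user",
    "content": [
        *({"type": "image", "image": f} for f in frames),
        {"type": "text", "text": "Describe this video in detail"},
    ],
}]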
Thanks for the awesome models!
I am trying to use the above snippet to control the number of frames the model can access.
For this, I first calculate the desired FPS using:
import cv2

def get_desired_fps_given_video_path_and_num_frames(video_path, num_frames):
    # Read the clip's native frame rate and total frame count.
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS)
    num_frames_in_video = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    cap.release()
    # Scale the native FPS so that sampling yields num_frames frames.
    return fps * num_frames / num_frames_in_video
and then configure SmolVLM2 with:
desired_fps = get_desired_fps_given_video_path_and_num_frames(
    args.clip_path, args.num_frames
)
processor.image_processor.video_sampling["max_frames"] = args.num_frames
processor.image_processor.video_sampling["fps"] = desired_fps

messages = [
    {
        "role": "user",
        "content": [
            {"type": "video", "path": args.clip_path, "target_fps": desired_fps},
            {"type": "text", "text": prompt},
        ],
    },
]
However, in the log that the model prints internally when it's called, it still seems to extract 64 frames:
Desired FPS: 0.21333333333333335
User Prompt:
"User: You are provided the following series of sixty-four frames from a 0:01:15 [H:MM:SS] video.
Frame from 00:01:
Frame from 00:02:
Frame from 00:03:
Frame from 00:04:
Frame from 00:05:
Frame from 00:06:
Frame from 00:07:
Frame from 00:09:
Frame from 00:10:
Frame from 00:11:
Frame from 00:12:
Frame from 00:13:
Frame from 00:14:
Frame from 00:16:
Frame from 00:17:
Frame from 00:18:
Frame from 00:19:
Frame from 00:20:
Frame from 00:21:
Frame from 00:23:
Frame from 00:24:
Frame from 00:25:
Frame from 00:26:
Frame from 00:27:
Frame from 00:28:
Frame from 00:29:
Frame from 00:31:
Frame from 00:32:
Frame from 00:33:
Frame from 00:34:
Frame from 00:35:
Frame from 00:36:
Frame from 00:38:
Frame from 00:39:
Frame from 00:40:
Frame from 00:41:
Frame from 00:42:
Frame from 00:43:
Frame from 00:45:
Frame from 00:46:
Frame from 00:47:
Frame from 00:48:
Frame from 00:49:
Frame from 00:50:
Frame from 00:51:
Frame from 00:53:
Frame from 00:54:
Frame from 00:55:
Frame from 00:56:
Frame from 00:57:
Frame from 00:58:
Frame from 01:00:
Frame from 01:01:
Frame from 01:02:
Frame from 01:03:
Frame from 01:04:
Frame from 01:05:
Frame from 01:07:
Frame from 01:08:
Frame from 01:09:
Frame from 01:10:
Frame from 01:11:
Frame from 01:12:
Frame from 01:14:
Do you have any suggestions on where I might be going wrong?
In general, the SmolVLM processor will not sample more than max_frames frames. Sampling at target_fps only takes effect if the resulting number of frames is fewer than max_frames. Therefore, you only really need to set max_frames.
Just do:
messages = [
    {
        "role": "user",
        "content": [
            {"type": "video", "path": args.clip_path, "max_frames": max_frames},
            {"type": "text", "text": prompt},
        ],
    },
]
I don't think you need to do
processor.image_processor.video_sampling["max_frames"] = args.num_frames
processor.image_processor.video_sampling["fps"] = desired_fps
Thanks for the super quick reply! I reran with the max_frames parameter set.
However, I still get the same output: the identical sixty-four-frame prompt shown above.
OK, I tested it: you don't need to pass it in messages, but as an argument to apply_chat_template:
max_frames = 2

inputs = processor.apply_chat_template(
    messages,
    max_frames=max_frames,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device, dtype=torch.bfloat16)
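A quick way to confirm the cap took effect (assuming the returned batch carries the usual input_ids field) is to decode the constructed prompt and check the frame count it announces:

# The prompt should now mention two frames instead of sixty-four.
prompt_text = processor.batch_decode(
    inputs["input_ids"], skip_special_tokens=True
)[0]
print(prompt_text.splitlines()[0])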
This works! Thanks a lot for your swift responses! I appreciate it.