ZhangYuanhan committed
Commit 013210b
1 parent: d5cd10a

Update README.md

Files changed (1)
README.md  +1 -1
README.md CHANGED
@@ -202,7 +202,7 @@ video = image_processor.preprocess(video, return_tensors="pt")["pixel_values"].c
 video = [video]
 conv_template = "qwen_1_5" # Make sure you use correct chat template for different models
 time_instruciton = f"The video lasts for {video_time:.2f} seconds, and {len(video[0])} frames are uniformly sampled from it. These frames are located at {frame_time}.Please answer the following questions related to this video."
-question = DEFAULT_IMAGE_TOKEN + f"{time_instruciton}\nPlease describe this video in detail."
+question = DEFAULT_IMAGE_TOKEN + f"\n{time_instruciton}\nPlease describe this video in detail."
 conv = copy.deepcopy(conv_templates[conv_template])
 conv.append_message(conv.roles[0], question)
 conv.append_message(conv.roles[1], None)
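
The single changed line inserts a newline between the image placeholder token and the timing instruction, so the placeholder ends up on its own line in the composed prompt. Below is a minimal, self-contained sketch of the effect on the prompt string; it assumes `DEFAULT_IMAGE_TOKEN` is the literal `"<image>"` placeholder used in the LLaVA code, and the timing values are illustrative stand-ins for what the video loader would produce.

```python
# Sketch of the prompt string before and after this commit.
# Assumption: DEFAULT_IMAGE_TOKEN is the literal "<image>" placeholder;
# video_time, frame_time, and the frame count below are illustrative values.
DEFAULT_IMAGE_TOKEN = "<image>"
video_time = 10.0
num_frames = 4
frame_time = "0.00s, 2.50s, 5.00s, 7.50s"

time_instruciton = (
    f"The video lasts for {video_time:.2f} seconds, and {num_frames} frames are "
    f"uniformly sampled from it. These frames are located at {frame_time}."
    "Please answer the following questions related to this video."
)

# Before this commit: the instruction text is glued directly to the image token.
old_question = DEFAULT_IMAGE_TOKEN + f"{time_instruciton}\nPlease describe this video in detail."

# After this commit: a newline separates the image token from the instruction.
new_question = DEFAULT_IMAGE_TOKEN + f"\n{time_instruciton}\nPlease describe this video in detail."

print(new_question)
# <image>
# The video lasts for 10.00 seconds, and 4 frames are uniformly sampled from it. ...
# Please describe this video in detail.
```

Everything else in the README's inference example is unchanged: the composed `question` is still passed to `conv.append_message(conv.roles[0], question)` exactly as before.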