Batch inference with LLaVA-Video generates '!' for samples with shorter prompts
I am trying to get video descriptions for a batch of videos to speed up inference, and my prompts have different lengths. With a batch size of 1 the outputs look great, but with batch_size > 1 the output for videos with shorter prompts is just '!'. I am padding the input IDs and passing an attention mask as shown below. What am I missing?
```python
# Prepare video tensor and prompt for each sample
# (all_frames / all_questions are the per-sample inputs)
for frames, question in zip(all_frames, all_questions):
    video = processor.preprocess(frames, return_tensors="pt")["pixel_values"].to(accelerator.device).to(torch.float16)
    batch_videos.append(video)

    conv = copy.deepcopy(conv_templates[conv_template])
    conv.append_message(conv.roles[0], question)
    conv.append_message(conv.roles[1], None)
    prompt_question = conv.get_prompt()
    batch_prompts.append(prompt_question)

    input_ids = tokenizer_image_token(prompt_question, tokenizer, IMAGE_TOKEN_INDEX, return_tensors="pt").to(accelerator.device)
    batch_input_ids.append(input_ids)

# Pad input_ids to the same length (right padding) and build the attention mask
max_length = max(ids.size(0) for ids in batch_input_ids)
padded_input_ids = torch.full((len(batch_input_ids), max_length), tokenizer.pad_token_id, dtype=torch.long).to(accelerator.device)
attention_mask = torch.zeros((len(batch_input_ids), max_length), dtype=torch.long).to(accelerator.device)
for i, ids in enumerate(batch_input_ids):
    padded_input_ids[i, :ids.size(0)] = ids
    attention_mask[i, :ids.size(0)] = 1

# Generate descriptions in batch
with torch.no_grad():
    outputs = model.generate(
        padded_input_ids,
        attention_mask=attention_mask,
        images=batch_videos,
        modalities=["video"] * len(batch_videos),
        do_sample=False,
        temperature=0,
        max_new_tokens=4096,
    )

# Decode and save outputs
descriptions = tokenizer.batch_decode(outputs, skip_special_tokens=True)
```
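For reference, here is a self-contained sketch of the padding logic in isolation, with both right- and left-padded variants. Decoder-only language models generally expect left padding when batching for generation, which may be relevant to the symptom above. The function name `pad_sequences` and the toy tensors are illustrative, not part of the LLaVA API:

```python
import torch

def pad_sequences(seqs, pad_id, side="right"):
    """Pad 1-D LongTensors to a common length and build the matching attention mask.

    side="left" left-pads, which decoder-only LMs generally expect when
    batching for generation. All names here are illustrative.
    """
    max_len = max(s.size(0) for s in seqs)
    input_ids = torch.full((len(seqs), max_len), pad_id, dtype=torch.long)
    mask = torch.zeros((len(seqs), max_len), dtype=torch.long)
    for i, s in enumerate(seqs):
        n = s.size(0)
        if side == "right":
            input_ids[i, :n] = s   # tokens first, pads at the end
            mask[i, :n] = 1
        else:
            input_ids[i, max_len - n:] = s  # pads first, tokens at the end
            mask[i, max_len - n:] = 1
    return input_ids, mask

# Toy example: two prompts of different lengths, pad_id = 0
a = torch.tensor([5, 6, 7])
b = torch.tensor([8, 9])
ids, mask = pad_sequences([a, b], pad_id=0, side="left")
# ids  → [[5, 6, 7], [0, 8, 9]]
# mask → [[1, 1, 1], [0, 1, 1]]
```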