
Batch inference with LLaVA-Video generates '!' for samples with shorter prompts

#12
by rranjan1 - opened

I am trying to generate video descriptions for a batch of videos to speed up inference. My prompts have different lengths. With a batch size of 1 the outputs look great, but with batch_size > 1 the output for the videos with shorter prompts is just '!'. I am padding the inputs and passing an attention mask as shown below. What am I missing?

Prepare video tensor (run once per sample in the batch):

```python
video = processor.preprocess(frames, return_tensors="pt")["pixel_values"].to(accelerator.device).to(torch.float16)
batch_videos.append(video)

conv = copy.deepcopy(conv_templates[conv_template])
conv.append_message(conv.roles[0], question)
conv.append_message(conv.roles[1], None)
prompt_question = conv.get_prompt()
batch_prompts.append(prompt_question)

input_ids = tokenizer_image_token(prompt_question, tokenizer, IMAGE_TOKEN_INDEX, return_tensors="pt").to(accelerator.device)
batch_input_ids.append(input_ids)
```

Pad input_ids to the same length:

```python
max_length = max(ids.size(0) for ids in batch_input_ids)
padded_input_ids = torch.ones((len(batch_input_ids), max_length), dtype=torch.long) * tokenizer.pad_token_id
padded_input_ids = padded_input_ids.to(accelerator.device)
attention_mask = torch.zeros((len(batch_input_ids), max_length), dtype=torch.long).to(accelerator.device)

for i, ids in enumerate(batch_input_ids):
    padded_input_ids[i, :ids.size(0)] = ids
    attention_mask[i, :ids.size(0)] = 1
```
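One thing worth checking here: the loop above pads on the right, but decoder-only models are typically left-padded for batched generation, because new tokens are appended directly after the last prompt position. With right padding, the pad tokens sit between the prompt and the generated tokens, which often produces junk output for the shorter sequences. A minimal pure-Python sketch of left padding (the `pad_id` and sample sequences are illustrative, not from this post):

```python
# Left-pad token-id lists so every prompt ends at the same position.
# pad_id and the sample sequences are made up for illustration.
pad_id = 0
seqs = [[5, 6, 7, 8], [9, 10]]

max_len = max(len(s) for s in seqs)
padded = [[pad_id] * (max_len - len(s)) + s for s in seqs]      # pads go on the LEFT
mask = [[0] * (max_len - len(s)) + [1] * len(s) for s in seqs]  # 0 over the padding

print(padded)  # [[5, 6, 7, 8], [0, 0, 9, 10]]
print(mask)    # [[1, 1, 1, 1], [0, 0, 1, 1]]
```

With `transformers` tokenizers the same effect is usually achieved by setting `tokenizer.padding_side = "left"` before batch-encoding, rather than padding by hand.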

Generate descriptions in batch:

```python
with torch.no_grad():
    outputs = model.generate(
        padded_input_ids,
        attention_mask=attention_mask,
        images=batch_videos,
        modalities=["video"] * len(batch_videos),
        do_sample=False,
        temperature=0,
        max_new_tokens=4096,
    )
```

Process and save outputs:

```python
descriptions = tokenizer.batch_decode(outputs, skip_special_tokens=True)
```
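A side note on decoding: some model versions return the prompt tokens together with the generated continuation (behavior varies across LLaVA releases), in which case `batch_decode` on the raw output also emits the prompt text. Since all padded inputs share one length, the prompt portion can be sliced off first. A sketch with toy ids (not real tokenizer output):

```python
# Illustrative only: toy token ids standing in for model output.
prompt_len = 4                  # padded_input_ids.shape[1] in the real code
outputs = [
    [1, 2, 3, 4, 50, 51],       # prompt ids followed by generated ids
    [0, 0, 7, 8, 60, 61],
]
generated = [seq[prompt_len:] for seq in outputs]
print(generated)  # [[50, 51], [60, 61]]
```

In the real code this would be `outputs[:, padded_input_ids.shape[1]:]` before calling `batch_decode`, but only if your model's `generate` actually echoes the prompt.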
