Multi-image input support

#34
by cynricfu - opened

Does Molmo currently support an input with one text prompt and multiple images?
And what about interleaved image-text input?

I am also wondering about this.

Hey @cynricfu and @Michael34234, Molmo does not support multiple images at the moment, but it is capable of responding to interleaved image-text input.

@amanrangapur obviously knows best, but I wanted to add a couple of thoughts to this convo:

  • I have had really good success putting two images side by side in a single image and having Molmo compare them.

example:

image.png

  • With regard to feeding 2 separate images... I was actually just trying an experiment with this, and it seems to me that it actually works very well?
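from io import BytesIO

import torch
from PIL import Image
from transformers import GenerationConfig

# image1/image2 are raw image bytes; model, processor, and a Flask-style request are assumed to exist already
img1 = Image.open(BytesIO(image1)).convert("RGB")
img2 = Image.open(BytesIO(image2)).convert("RGB")
# .get() lets the default prompt apply when the form has no 'prompt' field
prompt = request.form.get('prompt') or "These 2 images are from before and after. Describe the specific differences."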

with torch.no_grad():
    with torch.autocast('cuda', enabled=True, dtype=torch.bfloat16):
        print('Processing inputs')
        # process the image and text
        inputs = processor.process(
            images=[img1, img2],
            text=prompt
        )

        # move inputs to the correct device and make a batch of size 1
        inputs = {k: v.to(model.device).unsqueeze(0) for k, v in inputs.items()}

        print('Generating outputs')
        # generate output; maximum 500 new tokens; stop generation when <|endoftext|> is generated
        output = model.generate_from_batch(
            inputs,
            GenerationConfig(max_new_tokens=500, stop_strings="<|endoftext|>"),
            tokenizer=processor.tokenizer)
        
        generated_tokens = output[0,inputs['input_ids'].size(1):]
        generated_text = processor.tokenizer.decode(generated_tokens, skip_special_tokens=True)

It's also a bit hallucinate-y, but I am running it at bf16...

image.png
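
For anyone who wants to try the side-by-side approach from the first bullet, here is a minimal sketch using Pillow. The helper name `side_by_side` and its `gap` and `background` parameters are made up for illustration and are not part of Molmo or its processor. It also resizes the second image to match the height of the first, which is just one possible choice.

```python
from PIL import Image


def side_by_side(left, right, gap=0, background=(255, 255, 255)):
    """Paste two images next to each other on a single RGB canvas.

    Hypothetical helper for illustration; not part of Molmo.
    """
    left = left.convert("RGB")
    right = right.convert("RGB")

    # Scale the right image so both images share the same height
    if right.height != left.height:
        new_width = max(1, round(right.width * left.height / right.height))
        right = right.resize((new_width, left.height))

    # Canvas is wide enough for both images plus the optional gap
    canvas = Image.new(
        "RGB", (left.width + gap + right.width, left.height), background
    )
    canvas.paste(left, (0, 0))
    canvas.paste(right, (left.width + gap, 0))
    return canvas
```

The combined image could then be passed as a single image, e.g. `images=[side_by_side(img1, img2)]`, in the same `processor.process` call shown above.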

@mw44! That's a great workaround. Molmo currently only processes one image at a time in its official implementation. It is interesting that the processor function appears to concatenate the image embeddings properly.
