EPFL and Apple (at @EPFL-VILAB) just released 4M-21: a single any-to-any model that can do anything from text-to-image generation to generating depth masks!
4M is a multimodal training framework introduced by Apple and EPFL.
The resulting model takes image and text as input and outputs image and text.
Models: EPFL-VILAB/4m-models-660193abe3faf4b4d98a2742
Demo: EPFL-VILAB/4M
Paper: 4M-21: An Any-to-Any Vision Model for Tens of Tasks and Modalities (2406.09406)
This model consists of a transformer encoder and decoder, where the key to multimodality lies in the input and output data: both inputs and outputs are tokens, and the output tokens are decoded to generate bounding boxes, the generated image's pixels, captions and more!
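To make the token idea a bit more concrete, here is a toy sketch of how something like a bounding box could be turned into discrete tokens and decoded back. This is not the 4M code or its API: the bin count `NUM_BINS`, the helper names, and the `(modality, token)` sequence layout are all hypothetical, picked only for illustration.

```python
# Toy illustration only: NOT the 4M implementation.
# NUM_BINS and every function name here are hypothetical.

NUM_BINS = 1000  # assumed number of quantization bins per coordinate


def box_to_tokens(box, num_bins=NUM_BINS):
    """Quantize normalized (xmin, ymin, xmax, ymax) in [0, 1] to integer ids."""
    return [min(num_bins - 1, int(c * num_bins)) for c in box]


def tokens_to_box(tokens, num_bins=NUM_BINS):
    """Decode token ids back to coordinates (the center of each bin)."""
    return [(t + 0.5) / num_bins for t in tokens]


def build_sequence(modality_tokens):
    """Flatten {modality: [ids]} into one list of (modality, id) pairs."""
    return [(name, tok) for name, toks in modality_tokens.items() for tok in toks]


def split_sequence(sequence):
    """Group a flat (modality, id) sequence back by modality."""
    grouped = {}
    for name, tok in sequence:
        grouped.setdefault(name, []).append(tok)
    return grouped


if __name__ == "__main__":
    box = (0.1, 0.2, 0.5, 0.9)
    box_ids = box_to_tokens(box)
    # The caption ids below are made-up placeholders, not real vocabulary ids.
    sequence = build_sequence({"caption": [17, 4, 256], "bbox": box_ids})
    decoded = tokens_to_box(split_sequence(sequence)["bbox"])
    print(sequence)
    print(decoded)
```

The point is only that once every modality is a list of integers, a single encoder-decoder can read and write all of them, and a modality-specific decoder turns the ids back into boxes, pixels or text.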
This model also learnt to generate Canny edge maps, SAM edges and other modalities for steerable text-to-image generation.
The authors only added image-to-all capabilities for the demo, but you can try to use this model for text-to-image generation as well.