Dense Grounded Understanding of Images and Videos
Boltz-1
Generate spatial audio from images (and optionally text)