A tiny vision language model
Interact with videos!
Demo of the Transformers implementation of ColPali