ColPali marries the idea of modern vision language models with retrieval 🤝
The authors apply contrastive fine-tuning to SigLIP on documents and pool the outputs (they call it BiSigLIP). Then they feed the patch embedding outputs to PaliGemma and create BiPali 🖇️ BiPali natively feeds image patch embeddings to an LLM, which makes it possible to use ColBERT-like late interaction between text tokens and image patches (hence the name ColPali!) 🤩
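To make the late interaction idea concrete, here is a minimal sketch of ColBERT-style MaxSim scoring in plain Python. It is not the authors' code: the function names and the tiny 2-d embeddings are made up for illustration, and real token/patch embeddings would come from the model and be much larger.

```python
# Toy late-interaction (MaxSim) scoring, ColBERT-style.
# All names and vectors here are hypothetical, for illustration only.

def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def late_interaction_score(query_tokens, doc_patches):
    """For each query token embedding, take its best-matching patch
    embedding (max dot product), then sum those maxima over the query."""
    return sum(max(dot(q, p) for p in doc_patches) for q in query_tokens)

# Two query token embeddings (assumed already produced by the model).
query = [[1.0, 0.0], [0.0, 1.0]]

# Patch embeddings for two document pages (also made up).
page_a = [[0.9, 0.1], [0.2, 0.8], [0.5, 0.5]]
page_b = [[0.1, 0.1], [0.3, 0.2]]

scores = {
    "page_a": late_interaction_score(query, page_a),
    "page_b": late_interaction_score(query, page_b),
}
best = max(scores, key=scores.get)  # page with the highest score
print(scores, best)
```

The contrast with the pooled bi-encoder setup is that nothing is averaged into a single vector per page: every query token gets to pick its own best patch, so a small region of the page can still drive the match.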
The authors created the ViDoRe benchmark by collecting PDF documents and generating queries with Claude-3 Sonnet. ColPali seems to be the most performant model on ViDoRe. Not only that, but it is also way faster than traditional PDF parsers!