FLMR model card

FLMR is an open-source model for multimodal knowledge retrieval. It is a transformer-based model that uses a combination of text and image inputs to retrieve relevant documents from a large corpus.

Model Details

Model Description

Model type: FLMRModelForRetrieval
Language(s) (NLP): English
License: MIT License

Paper and resources for more detail

Blog Post for quick overview: https://www.jinghong-chen.net/fined-grained-late-interaction-multimodal-retrieval-flmr/
Paper: https://openreview.net/forum?id=IWWWulAX7g
Repository: https://github.com/LinWeizheDragon/FLMR

Uses

Direct Use

This model can be used directly to retrieve documents from a large corpus with queries that combine text and image inputs. Retrieval usage is documented in the official implementation (see the repository listed above).
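As the paper title indicates, retrieval here is based on late interaction. Below is a minimal pure-Python sketch of ColBERT-style MaxSim scoring, the general technique behind late-interaction retrieval. It is not the official implementation: the function names (`dot`, `late_interaction_score`, `rank_documents`) and the toy 2-d embeddings are hypothetical, and the sketch assumes the query-side vectors are simply the text token vectors followed by the image token vectors.

```python
from typing import Dict, List, Sequence

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def late_interaction_score(query_vecs: List[Vector], doc_vecs: List[Vector]) -> float:
    """For each query vector take its best-matching document vector, then sum."""
    return sum(max(dot(q, d) for d in doc_vecs) for q in query_vecs)


def rank_documents(query_vecs: List[Vector], corpus: Dict[str, List[Vector]]) -> List[str]:
    """Return document ids sorted by descending late-interaction score."""
    scores = {
        doc_id: late_interaction_score(query_vecs, doc_vecs)
        for doc_id, doc_vecs in corpus.items()
    }
    return sorted(scores, key=scores.get, reverse=True)


# Toy embeddings (hypothetical values, not produced by the real encoders).
text_tokens = [[1.0, 0.0]]    # stands in for question token embeddings
image_tokens = [[0.0, 1.0]]   # stands in for image-derived embeddings
query = text_tokens + image_tokens

corpus = {
    "doc_a": [[1.0, 0.0], [0.0, 1.0]],  # matches both the text and the image vector
    "doc_b": [[1.0, 0.0], [0.5, 0.0]],  # matches the text vector only
}

ranking = rank_documents(query, corpus)
print(ranking)
```

In this toy corpus, `doc_a` ranks first because it contains a good match for every query vector, whereas `doc_b` matches only the text part of the query.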

Downstream Use

This model can be combined with a language model to build a retrieval-augmented language model. Its use for knowledge-based VQA can be found in RAVQA.
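The retrieve-then-generate pattern described here can be sketched as follows. This is only an illustration under stated assumptions, not the RAVQA pipeline: `build_prompt`, `answer_with_retrieval`, the prompt layout, and the stub `fake_retrieve` and `fake_generate` callables are all hypothetical placeholders for a real retriever and a real language model.

```python
from typing import Callable, List


def build_prompt(question: str, passages: List[str]) -> str:
    """Format retrieved passages and the question into one prompt string."""
    context = "\n".join(f"[{i + 1}] {p}" for i, p in enumerate(passages))
    return f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"


def answer_with_retrieval(
    question: str,
    image_path: str,
    retrieve: Callable[[str, str, int], List[str]],
    generate: Callable[[str], str],
    top_k: int = 2,
) -> str:
    """Retrieve top_k passages for the (question, image) query, then generate."""
    passages = retrieve(question, image_path, top_k)
    return generate(build_prompt(question, passages))


# Hypothetical stand-ins so the sketch runs without any model weights.
def fake_retrieve(question: str, image_path: str, top_k: int) -> List[str]:
    passages = [
        "The Eiffel Tower is in Paris.",
        "Paris is the capital of France.",
        "An unrelated passage.",
    ]
    return passages[:top_k]


def fake_generate(prompt: str) -> str:
    # A real language model would go here; this stub just echoes the prompt.
    return prompt


output = answer_with_retrieval(
    "Where is this tower?", "tower.jpg", fake_retrieve, fake_generate
)
print(output)
```

Passing the retriever and generator as callables keeps the two components independent, so either one can be swapped without changing the surrounding logic.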

How to Get Started with the Model

For details on training, indexing, and performing retrieval, please refer to the repository listed above.

Training datasets

The model is pre-trained on the following datasets:

Image to Text retrieval: WIT
Image & Question to Text retrieval: OKVQA

For details on the dataset split and conversion process, please refer to the paper "Fine-grained Late-interaction Multi-modal Retrieval for Retrieval Augmented Visual Question Answering".

The processed datasets are:

Evaluation datasets

The model is evaluated on OKVQA, Infoseek, and FVQA.

Please find the evaluation results in the paper.

Citation

BibTeX:

@inproceedings{lin2023finegrained,
  title={Fine-grained Late-interaction Multi-modal Retrieval for Retrieval Augmented Visual Question Answering},
  author={Weizhe Lin and Jinghong Chen and Jingbiao Mei and Alexandru Coca and Bill Byrne},
  booktitle={Thirty-seventh Conference on Neural Information Processing Systems},
  year={2023},
  url={https://openreview.net/forum?id=IWWWulAX7g}
}