FLMR model card

FLMR is an open-source model for multimodal knowledge retrieval. It is a transformer-based model that uses a combination of text and image inputs to retrieve relevant documents from a large corpus.

Model Details

Model Description

  • Model type: FLMRModelForRetrieval
  • Language(s) (NLP): English
  • License: MIT License

Paper and resources for more detail

Uses

Direct Use

This model can be used directly to retrieve documents from a large corpus using a combination of text and image input queries. The retrieval usage can be found in the official implementation.
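As the paper title indicates, retrieval is based on late interaction: query and document are each encoded into a set of token-level embeddings, and relevance is computed from token-to-token similarities. The following is a minimal sketch of ColBERT-style MaxSim scoring with NumPy, not the official implementation. The function names and the toy embeddings are hypothetical, and the sketch assumes the embeddings have already been produced by the query and document encoders.

```python
import numpy as np


def late_interaction_score(query_emb: np.ndarray, doc_emb: np.ndarray) -> float:
    """Score one document against one query with MaxSim.

    query_emb: (num_query_tokens, dim) array of query token embeddings.
    doc_emb:   (num_doc_tokens, dim) array of document token embeddings.
    For each query token, take its best-matching document token,
    then sum those maxima over all query tokens.
    """
    sim = query_emb @ doc_emb.T  # (num_query_tokens, num_doc_tokens)
    return float(sim.max(axis=1).sum())


def rank_documents(query_emb, doc_embs, top_k=2):
    """Return the top_k (doc_index, score) pairs, best first.

    Hypothetical helper: a brute-force scan over all documents.
    A real system would use an index instead of scoring every document.
    """
    scores = [late_interaction_score(query_emb, d) for d in doc_embs]
    order = sorted(range(len(doc_embs)), key=lambda i: scores[i], reverse=True)
    return [(i, scores[i]) for i in order[:top_k]]


# Toy data (made up for illustration): 2 query tokens, 3 documents, dim = 2.
query = np.array([[1.0, 0.0], [0.0, 1.0]])
docs = [
    np.array([[1.0, 0.0], [0.0, 1.0]]),  # matches both query tokens
    np.array([[1.0, 0.0]]),              # matches only the first
    np.array([[0.0, 0.5]]),              # weakly matches the second
]
ranking = rank_documents(query, docs, top_k=2)
```

In the multimodal setting, the query-side embeddings would cover both the text tokens and the image-derived tokens; the scoring step itself is unchanged under that assumption.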

Downstream Use

This model can be combined with a language model to build a retrieval-augmented language model. Its use for knowledge-based VQA can be found in RAVQA.
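One common way to connect a retriever to a language model is to place the retrieved passages in the prompt ahead of the question. The sketch below illustrates that pattern only; it is not how RAVQA implements it, and the function name, prompt layout, and example strings are assumptions made for illustration.

```python
def build_rag_prompt(question: str, passages: list[str], max_passages: int = 3) -> str:
    """Assemble a prompt from retrieved passages and a question.

    Hypothetical helper: keeps the first max_passages passages
    (assumed to be sorted best-first by the retriever) and numbers them.
    """
    kept = passages[:max_passages]
    context = "\n".join(f"[{i + 1}] {p}" for i, p in enumerate(kept))
    return f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"


# Made-up passages standing in for retriever output.
prompt = build_rag_prompt(
    "What is shown in the image?",
    ["alpha passage", "beta passage", "gamma passage"],
    max_passages=2,
)
```

The resulting string would then be passed to whichever language model is used downstream; for VQA, the image (or features derived from it) would also need to reach the model in some form.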

How to Get Started with the Model

For details of training, indexing, and performing retrieval, please refer to here.

Training datasets

The model is pre-trained on the following tasks and datasets:

  1. Image to Text retrieval: WIT
  2. Image & Question to Text retrieval: OKVQA

For details on the dataset split and conversion process, please refer to the paper Fine-grained Late-interaction Multi-modal Retrieval for Retrieval Augmented Visual Question Answering.

The processed datasets are:

Evaluation datasets

The model is evaluated on OKVQA, Infoseek, and FVQA.

Please find the evaluation results in the paper.

Citation

BibTeX:

@inproceedings{lin2023finegrained,
  title={Fine-grained Late-interaction Multi-modal Retrieval for Retrieval Augmented Visual Question Answering},
  author={Weizhe Lin and Jinghong Chen and Jingbiao Mei and Alexandru Coca and Bill Byrne},
  booktitle={Thirty-seventh Conference on Neural Information Processing Systems},
  year={2023},
  url={https://openreview.net/forum?id=IWWWulAX7g}
}