---
library_name: transformers
license: mit
language:
- en
tags:
- retrieval
- multi-modal
- knowledge-based visual question answering
- FLMR
- PreFLMR
---

# FLMR model card
|
|
|
FLMR (Fine-grained Late-interaction Multi-modal Retriever) is an open-source, transformer-based model for multimodal knowledge retrieval: it combines text and image inputs in a query to retrieve relevant documents from a large corpus.
|
|
|
## Model Details

### Model Description

- **Model type:** FLMRModelForRetrieval
- **Language(s) (NLP):** English
- **License:** MIT License

### Paper and resources for more detail

- **Blog post for a quick overview:** https://www.jinghong-chen.net/fined-grained-late-interaction-multimodal-retrieval-flmr/
- **Paper:** https://openreview.net/forum?id=IWWWulAX7g
- **Repository:** https://github.com/LinWeizheDragon/FLMR
|
|
|
## Uses

### Direct Use

This model can be used directly to retrieve documents from a large corpus using a combination of text and image input queries. Retrieval usage is documented in the [official implementation](https://github.com/LinWeizheDragon/FLMR); a brief sketch is shown below.
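
As a minimal sketch of in-batch query-passage scoring, adapted from the usage pattern in the official repository: it assumes the `flmr` package (installed from that repository) exposes `FLMRQueryEncoderTokenizer`, `FLMRContextEncoderTokenizer`, and `FLMRModelForRetrieval` as in its README, and uses a placeholder checkpoint name and a zero image tensor purely for illustration. Please consult the repository for the exact, up-to-date interface and for corpus-scale indexing.

```python
import torch

# Classes below are assumed to come from the `flmr` package installed from the
# official repository (https://github.com/LinWeizheDragon/FLMR); names follow its README.
from flmr import (
    FLMRQueryEncoderTokenizer,
    FLMRContextEncoderTokenizer,
    FLMRModelForRetrieval,
)

checkpoint = "LinWeizheDragon/FLMR"  # placeholder: substitute the checkpoint you are using

# Queries and context passages use separate tokenizers stored in subfolders of the checkpoint.
query_tokenizer = FLMRQueryEncoderTokenizer.from_pretrained(checkpoint, subfolder="query_tokenizer")
context_tokenizer = FLMRContextEncoderTokenizer.from_pretrained(checkpoint, subfolder="context_tokenizer")

model = FLMRModelForRetrieval.from_pretrained(
    checkpoint,
    query_tokenizer=query_tokenizer,
    context_tokenizer=context_tokenizer,
)

# Two multimodal queries scored against two candidate passages (in-batch).
Q_encoding = query_tokenizer([
    "What is the capital of France?",
    "Which city hosted the 2012 Olympic Games?",
])
D_encoding = context_tokenizer([
    "Paris is the capital and most populous city of France.",
    "The 2012 Summer Olympics were held in London.",
])

# Zero image tensor for illustration only; in practice, run an image processor on the query images.
query_pixel_values = torch.zeros(2, 3, 224, 224)

outputs = model(
    query_input_ids=Q_encoding["input_ids"],
    query_attention_mask=Q_encoding["attention_mask"],
    query_pixel_values=query_pixel_values,
    context_input_ids=D_encoding["input_ids"],
    context_attention_mask=D_encoding["attention_mask"],
    use_in_batch_negatives=True,
)
print(outputs)  # contains the late-interaction relevance scores
```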
|
|
|
### Downstream Use

<!-- This section is for the model use when fine-tuned for a task, or when plugged into a larger ecosystem/app -->

This model can be combined with language models to build a retrieval-augmented language model. Its use for knowledge-based VQA can be found in [RAVQA](https://github.com/linweizhedragon/retrieval-augmented-visual-question-answering); a rough sketch of the idea is shown below.
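
As an illustration only (not the RAVQA pipeline itself), the following hypothetical sketch prepends passages retrieved by FLMR to the question and feeds the result to an off-the-shelf seq2seq generator from `transformers`; the passages and the generator checkpoint are placeholders.

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Hypothetical inputs: the question and the top-k passages returned by the retriever.
question = "What is the capital of France?"
retrieved_passages = [
    "Paris is the capital and most populous city of France.",
    "France is a country in Western Europe.",
]

generator_name = "google/flan-t5-base"  # placeholder generator checkpoint
tokenizer = AutoTokenizer.from_pretrained(generator_name)
generator = AutoModelForSeq2SeqLM.from_pretrained(generator_name)

# Prepend the retrieved evidence to the question and generate an answer.
prompt = "question: " + question + " context: " + " ".join(retrieved_passages)
inputs = tokenizer(prompt, return_tensors="pt")
answer_ids = generator.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(answer_ids[0], skip_special_tokens=True))
```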
|
|
|
## How to Get Started with the Model

For details on training, indexing, and performing retrieval, please refer to the [official repository](https://github.com/LinWeizheDragon/FLMR).
|
|
|
## Training datasets

The model is pre-trained on:

1. Image to text retrieval: WIT
2. Image & question to text retrieval: OKVQA

For details on the dataset split and conversion process, please refer to the paper [Fine-grained Late-interaction Multi-modal Retrieval for Retrieval Augmented Visual Question Answering](https://openreview.net/forum?id=IWWWulAX7g).

The processed datasets are listed below and can be loaded as sketched after the list:

- https://huggingface.co/datasets/BByrneLab/OKVQA_FLMR_preprocessed_data
- https://huggingface.co/datasets/BByrneLab/OKVQA_FLMR_preprocessed_GoogleSearch_passages
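
A minimal loading sketch, assuming these repositories can be loaded directly with the `datasets` library (split and column names may differ; consult the dataset cards):

```python
from datasets import load_dataset

# Preprocessed OKVQA queries/annotations and the Google Search passage collection used with FLMR.
okvqa_data = load_dataset("BByrneLab/OKVQA_FLMR_preprocessed_data")
passages = load_dataset("BByrneLab/OKVQA_FLMR_preprocessed_GoogleSearch_passages")

print(okvqa_data)  # inspect available splits and columns
print(passages)
```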
|
|
|
|
|
## Evaluation datasets

The model is evaluated on OKVQA, Infoseek, and FVQA.

Please find the evaluation results in [the paper](https://openreview.net/forum?id=IWWWulAX7g).
|
|
|
## Citation

**BibTeX:**

```
@inproceedings{lin2023finegrained,
  title={Fine-grained Late-interaction Multi-modal Retrieval for Retrieval Augmented Visual Question Answering},
  author={Weizhe Lin and Jinghong Chen and Jingbiao Mei and Alexandru Coca and Bill Byrne},
  booktitle={Thirty-seventh Conference on Neural Information Processing Systems},
  year={2023},
  url={https://openreview.net/forum?id=IWWWulAX7g}
}
```