---
library_name: transformers
license: mit
language:
- en
tags:
- retrieval
- multi-modal
- knowledge-based visual question answering
- FLMR
- PreFLMR
---

# FLMR model card
|
|
|
FLMR (Fine-grained Late-interaction Multi-modal Retriever) is an open-source, transformer-based model for multimodal knowledge retrieval: it combines text and image inputs in a query to retrieve relevant documents from a large corpus.
|
|
|
## Model Details

### Model Description

- **Model type:** FLMRModelForRetrieval
- **Language(s) (NLP):** English
- **License:** MIT License

### Paper and resources for more detail

- **Blog post for a quick overview:** https://www.jinghong-chen.net/fined-grained-late-interaction-multimodal-retrieval-flmr/
- **Paper:** https://openreview.net/forum?id=IWWWulAX7g
- **Repository:** https://github.com/LinWeizheDragon/FLMR
|
|
|
## Uses

### Direct Use

This model can be used directly to retrieve documents from a large corpus using a combination of text and image input queries. Retrieval usage is documented in the [official implementation](https://github.com/LinWeizheDragon/FLMR); a brief sketch is shown below.
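
As a minimal sketch of in-batch query-passage scoring, adapted from the usage pattern in the official repository: it assumes the `flmr` package (installed from that repository) exposes `FLMRQueryEncoderTokenizer`, `FLMRContextEncoderTokenizer`, and `FLMRModelForRetrieval` as in its README, and uses a placeholder checkpoint name and a zero image tensor purely for illustration. Please consult the repository for the exact, up-to-date interface and for corpus-scale indexing.

```python
import torch

# Classes below are assumed to come from the `flmr` package installed from the
# official repository (https://github.com/LinWeizheDragon/FLMR); names follow its README.
from flmr import (
    FLMRQueryEncoderTokenizer,
    FLMRContextEncoderTokenizer,
    FLMRModelForRetrieval,
)

checkpoint = "LinWeizheDragon/FLMR"  # placeholder: substitute the checkpoint you are using

# Queries and context passages use separate tokenizers stored in subfolders of the checkpoint.
query_tokenizer = FLMRQueryEncoderTokenizer.from_pretrained(checkpoint, subfolder="query_tokenizer")
context_tokenizer = FLMRContextEncoderTokenizer.from_pretrained(checkpoint, subfolder="context_tokenizer")

model = FLMRModelForRetrieval.from_pretrained(
    checkpoint,
    query_tokenizer=query_tokenizer,
    context_tokenizer=context_tokenizer,
)

# Two multimodal queries scored against two candidate passages (in-batch).
Q_encoding = query_tokenizer([
    "What is the capital of France?",
    "Which city hosted the 2012 Olympic Games?",
])
D_encoding = context_tokenizer([
    "Paris is the capital and most populous city of France.",
    "The 2012 Summer Olympics were held in London.",
])

# Zero image tensor for illustration only; in practice, run an image processor on the query images.
query_pixel_values = torch.zeros(2, 3, 224, 224)

outputs = model(
    query_input_ids=Q_encoding["input_ids"],
    query_attention_mask=Q_encoding["attention_mask"],
    query_pixel_values=query_pixel_values,
    context_input_ids=D_encoding["input_ids"],
    context_attention_mask=D_encoding["attention_mask"],
    use_in_batch_negatives=True,
)
print(outputs)  # contains the late-interaction relevance scores
```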
|
|
|
### Downstream Use

<!-- This section is for the model use when fine-tuned for a task, or when plugged into a larger ecosystem/app -->

This model can be combined with language models to build a retrieval-augmented language model. Its use for knowledge-based VQA can be found in [RAVQA](https://github.com/linweizhedragon/retrieval-augmented-visual-question-answering); a rough sketch of the idea is shown below.
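
As an illustration only (not the RAVQA pipeline itself), the following hypothetical sketch prepends passages retrieved by FLMR to the question and feeds the result to an off-the-shelf seq2seq generator from `transformers`; the passages and the generator checkpoint are placeholders.

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Hypothetical inputs: the question and the top-k passages returned by the retriever.
question = "What is the capital of France?"
retrieved_passages = [
    "Paris is the capital and most populous city of France.",
    "France is a country in Western Europe.",
]

generator_name = "google/flan-t5-base"  # placeholder generator checkpoint
tokenizer = AutoTokenizer.from_pretrained(generator_name)
generator = AutoModelForSeq2SeqLM.from_pretrained(generator_name)

# Prepend the retrieved evidence to the question and generate an answer.
prompt = "question: " + question + " context: " + " ".join(retrieved_passages)
inputs = tokenizer(prompt, return_tensors="pt")
answer_ids = generator.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(answer_ids[0], skip_special_tokens=True))
```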
|
|
|
## How to Get Started with the Model

For details on training, indexing, and performing retrieval, please refer to the [official repository](https://github.com/LinWeizheDragon/FLMR).
|
|
|
## Training datasets

The model is pre-trained on:

1. Image to text retrieval: WIT
2. Image & question to text retrieval: OKVQA

For details on the dataset split and conversion process, please refer to the paper [Fine-grained Late-interaction Multi-modal Retrieval for Retrieval Augmented Visual Question Answering](https://openreview.net/forum?id=IWWWulAX7g).

The processed datasets are listed below and can be loaded as sketched after the list:

- https://huggingface.co/datasets/BByrneLab/OKVQA_FLMR_preprocessed_data
- https://huggingface.co/datasets/BByrneLab/OKVQA_FLMR_preprocessed_GoogleSearch_passages
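
A minimal loading sketch, assuming these repositories can be loaded directly with the `datasets` library (split and column names may differ; consult the dataset cards):

```python
from datasets import load_dataset

# Preprocessed OKVQA queries/annotations and the Google Search passage collection used with FLMR.
okvqa_data = load_dataset("BByrneLab/OKVQA_FLMR_preprocessed_data")
passages = load_dataset("BByrneLab/OKVQA_FLMR_preprocessed_GoogleSearch_passages")

print(okvqa_data)  # inspect available splits and columns
print(passages)
```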
|
|
|
|
|
## Evaluation datasets

The model is evaluated on OKVQA, Infoseek, and FVQA.

Please find the evaluation results in [the paper](https://openreview.net/forum?id=IWWWulAX7g).
|
|
|
## Citation

**BibTeX:**

```
@inproceedings{lin2023finegrained,
  title={Fine-grained Late-interaction Multi-modal Retrieval for Retrieval Augmented Visual Question Answering},
  author={Weizhe Lin and Jinghong Chen and Jingbiao Mei and Alexandru Coca and Bill Byrne},
  booktitle={Thirty-seventh Conference on Neural Information Processing Systems},
  year={2023},
  url={https://openreview.net/forum?id=IWWWulAX7g}
}
```