Model Name: llava-v1.5-13b-dpo

[Arxiv paper] [GitHub] [Data] [Model]

Developers: Shengzhi Li (TIFIN), Rongyu Lin (KAUST), Shichao Pei (University of Massachusetts Boston)
Affiliations: TIFIN, KAUST, University of Massachusetts Boston
Contact Information: [email protected], [email protected], [email protected]

Overview

The llava-v1.5-13b-dpo model is designed to strengthen the instruction-following capabilities of multi-modal large language models (MLLMs), particularly in scenarios where visual instruction tuning degrades language proficiency. It is trained with Direct Preference Optimization (DPO) on a curated 6K-entry VQA preference dataset, and it improves over its LLaVA-1.5-13b baseline on MM-Vet, LLaVA-Bench, MT-Bench, and AlpacaEval.
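
For readers unfamiliar with DPO, the generic per-pair objective can be sketched as follows. This is the standard DPO loss, not necessarily the exact training code used for this model; the function name, its arguments, and the default `beta=0.1` are illustrative assumptions.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for a single preference pair.

    Inputs are sequence-level log-probabilities of the chosen and rejected
    answers under the trained policy and a frozen reference model.
    beta=0.1 is an illustrative value, not taken from this model card.
    """
    # How much more the policy favors each answer than the reference does.
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(logits)), written in a numerically stable form.
    return max(-logits, 0.0) + math.log1p(math.exp(-abs(logits)))

# Made-up log-probabilities: the policy prefers the chosen answer more than
# the reference does, so the loss falls below the neutral value log(2).
neutral = dpo_loss(-10.0, -10.0, -10.0, -10.0)
improved = dpo_loss(-5.0, -15.0, -10.0, -10.0)
print(neutral, improved)
```

The loss equals log(2) when the policy and the reference agree, and it decreases as the policy assigns relatively more probability to the chosen answer.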

Intended Use

Primary Applications: This model is intended for tasks requiring the integration of text and image modalities, including but not limited to visual question answering (VQA), image captioning, and multi-modal instruction following.
Target Audience: Researchers and practitioners in the fields of natural language processing, computer vision, and multi-modal AI.

Training Data

The llava-v1.5-13b-dpo model was trained on a lightweight VQA preference dataset of 6K entries, in which each answer was annotated on 5 fine-grained quality metrics. The dataset was designed to address the gap in diversity and complexity typically observed in VQA datasets.
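
One way to picture how per-answer metric annotations can become DPO training pairs is sketched below. The record layout, the placeholder metric names (`metric_1` to `metric_5`), the averaging step, and the `min_gap` threshold are all assumptions made for illustration; the actual dataset schema and pairing rule may differ.

```python
from itertools import combinations

def build_preference_pairs(question, answers, min_gap=1.0):
    """Turn per-answer metric scores into (chosen, rejected) pairs.

    `answers` is a list of dicts with a "text" string and a "scores" dict.
    Averaging the metrics and the `min_gap` cutoff are assumptions.
    """
    def mean_score(answer):
        scores = answer["scores"].values()
        return sum(scores) / len(scores)

    pairs = []
    for a, b in combinations(answers, 2):
        gap = mean_score(a) - mean_score(b)
        if abs(gap) < min_gap:
            continue  # scores too close; skip the ambiguous pair
        chosen, rejected = (a, b) if gap > 0 else (b, a)
        pairs.append({"prompt": question,
                      "chosen": chosen["text"],
                      "rejected": rejected["text"]})
    return pairs

# Made-up example entry with hypothetical metric names.
answers = [
    {"text": "A red bus parked by a curb.",
     "scores": {"metric_1": 4, "metric_2": 4, "metric_3": 3, "metric_4": 4, "metric_5": 5}},
    {"text": "A vehicle.",
     "scores": {"metric_1": 2, "metric_2": 3, "metric_3": 2, "metric_4": 1, "metric_5": 2}},
    {"text": "A red bus on a street.",
     "scores": {"metric_1": 4, "metric_2": 3, "metric_3": 4, "metric_4": 3, "metric_5": 4}},
]
pairs = build_preference_pairs("What is shown in the image?", answers)
```

Skipping near-tied answers keeps only pairs with a clear quality difference, which is a common design choice when preference labels are derived from scores rather than collected directly.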

Evaluation

The model improves over the Vicuna and LLaVA baselines on most of the benchmarks evaluated:

MT-Bench: Achieved a score of 6.73, surpassing Vicuna's 6.57 and LLaVA's 5.99.
Visual Instruction Performance: Gained 4.9 points on MM-Vet (36.3 to 41.2) and 6.0 points on LLaVA-Bench (73.1 to 79.1) relative to LLaVA-1.5-13b.

Model Name	MM-Vet	LLaVA-Bench	POPE	MM-Bench	MT-Bench	AlpacaEval
Vicuna-1.5-13b [16]	-	-	-	-	6.57	81.4
LLaVA-1.5-13b [10]	36.3	73.1	0.859	67.4	5.99	79.3
LLaVA-RLHF-13b [23]	37.2	76.8	0.869	60.1	6.18	81.0
Standard SFT	36.5	63.7	0.850	65.4	5.01	50.2
SteerLM	35.2	67.0	0.878	65.1	5.70	68.8
Rejection-sampling	38.0	70.6	0.883	67.6	6.22	74.9
llava-v1.5-13b-dpo	41.2	79.1	0.870	66.8	6.73	86.4

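*The last four rows compare four alignment methods: standard SFT, SteerLM, rejection sampling, and DPO. Among them, DPO scored highest on four of the six benchmarks (MM-Vet, LLaVA-Bench, MT-Bench, and AlpacaEval).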
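
The headline gains can be recomputed directly from the table. The short sketch below uses only scores copied from the LLaVA-1.5-13b and llava-v1.5-13b-dpo rows and reports both the absolute difference and the relative change; the helper name `improvements` is hypothetical.

```python
# Scores copied from the table above.
baseline = {"MM-Vet": 36.3, "LLaVA-Bench": 73.1, "MT-Bench": 5.99, "AlpacaEval": 79.3}
dpo = {"MM-Vet": 41.2, "LLaVA-Bench": 79.1, "MT-Bench": 6.73, "AlpacaEval": 86.4}

def improvements(before, after):
    """Return {benchmark: (absolute gain, relative gain in percent)}."""
    out = {}
    for name in before:
        delta = after[name] - before[name]
        out[name] = (round(delta, 2), round(100 * delta / before[name], 1))
    return out

gains = improvements(baseline, dpo)
for name, (abs_gain, rel_gain) in gains.items():
    print(f"{name}: {abs_gain:+.2f} absolute, {rel_gain:+.1f}% relative")
```

The 4.9 and 6.0 figures quoted for MM-Vet and LLaVA-Bench are absolute score differences; the relative changes are larger.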

Ethical Considerations

This model was developed with a focus on mitigating modality conflict and catastrophic forgetting in MLLMs. Users are encouraged to consider the potential biases and limitations inherent in the training data and model outputs, especially when deploying the model in diverse and sensitive contexts.

Limitations

The model's training dataset, while addressing key gaps in VQA datasets, is relatively small at 6k entries. This may limit the model's generalizability across broader or more diverse multi-modal tasks.
The reported gains, particularly in language instruction-following after visual tuning, were measured only on the benchmarks and datasets evaluated here. The model's efficacy may vary in different or more challenging contexts.

Acknowledgments

This work was made possible through the contributions of Shengzhi Li, Rongyu Lin, and Shichao Pei, and supported by their respective institutions.

Citation

Please cite this work as:

@misc{li2024multimodal,
    title={Multi-modal preference alignment remedies regression of visual instruction tuning on language model},
    author={Shengzhi Li and Rongyu Lin and Shichao Pei},
    year={2024},
    eprint={2402.10884},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}