Model Name: llava-v1.5-13b-dpo

[Arxiv paper] [GitHub] [Data] [Model]

Developers: Shengzhi Li (TIFIN), Rongyu Lin (KAUST), Shichao Pei (University of Massachusetts Boston)
Affiliations: TIFIN, KAUST, University of Massachusetts Boston
Contact Information: [email protected], [email protected], [email protected]

Overview

The llava-v1.5-13b-dpo model is designed to enhance the instruction-following capabilities of multi-modal large language models (MLLMs), particularly in scenarios where visual instruction tuning might degrade language proficiency. This model leverages a novel Direct Preference Optimization (DPO) method, along with a curated 6K-entry VQA preference dataset, to achieve superior performance on multi-modal tasks and benchmarks.
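
This card does not spell out the training objective. As a minimal sketch, the snippet below computes the standard DPO loss for a single preference pair from summed log-probabilities. It may differ from the exact variant the authors used. The function name `dpo_loss`, its argument names, and the default `beta=0.1` are illustrative assumptions, not values taken from this card.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) pair.

    Each argument is the summed token log-probability of a full answer
    under either the policy being trained or the frozen reference model.
    beta=0.1 is a placeholder, not a value reported in this card.
    """
    # How much more the policy favors each answer than the reference does.
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_logratio - rejected_logratio)

    # loss = -log(sigmoid(margin)), written in a numerically stable form.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# A policy identical to the reference gives a zero margin, so the loss is log(2).
baseline = dpo_loss(-12.0, -15.0, -12.0, -15.0)
# Shifting probability toward the chosen answer lowers the loss.
improved = dpo_loss(-11.0, -16.0, -12.0, -15.0)
```

The loss falls as the policy raises the likelihood of the preferred answer relative to the reference model and lowers that of the rejected answer.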

Intended Use

  • Primary Applications: This model is intended for tasks requiring the integration of text and image modalities, including but not limited to visual question answering (VQA), image captioning, and multi-modal instruction following.
  • Target Audience: Researchers and practitioners in the fields of natural language processing, computer vision, and multi-modal AI.

Training Data

The MM-LLM-DPO model was trained on a lightweight VQA preference dataset of 6K entries, in which each answer was annotated on five quality metrics at a fine granularity. The dataset was designed to address the diversity and complexity gap typically observed in VQA datasets.
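
To illustrate how per-metric annotations can be turned into preference pairs, here is a hedged sketch. The record layout, the metric names (`m1` to `m5`), the scores, and the rule of pairing the highest and lowest mean score are all hypothetical. This card does not describe the actual schema or pairing procedure.

```python
from statistics import mean

# Hypothetical record: field names, metric names, and scores are placeholders.
record = {
    "image": "example.jpg",
    "question": "What is the person holding?",
    "answers": [
        {"text": "A red umbrella.",
         "scores": {"m1": 4, "m2": 4, "m3": 3, "m4": 2, "m5": 1}},
        {"text": "Something.",
         "scores": {"m1": 1, "m2": 2, "m3": 2, "m4": 1, "m5": 1}},
        {"text": "An umbrella, I think.",
         "scores": {"m1": 3, "m2": 3, "m3": 3, "m4": 2, "m5": 2}},
    ],
}


def overall(answer):
    """Collapse the five metric scores into one number (assumed: plain mean)."""
    return mean(list(answer["scores"].values()))


def to_preference_pair(rec):
    """Pair the best- and worst-scored answers; return None on a tie."""
    ranked = sorted(rec["answers"], key=overall, reverse=True)
    chosen, rejected = ranked[0], ranked[-1]
    if overall(chosen) == overall(rejected):
        return None  # no usable preference signal
    return {
        "image": rec["image"],
        "prompt": rec["question"],
        "chosen": chosen["text"],
        "rejected": rejected["text"],
    }


pair = to_preference_pair(record)
```

A weighted combination of the metrics, or pairing every answer against every lower-scored one, would be equally plausible design choices.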

Evaluation

The model demonstrates significant improvements over baseline models like Vicuna and LLaVA on various benchmarks:

  • MT-Bench: Achieved a score of 6.73, surpassing Vicuna's 6.57 and LLaVA's 5.99.
  • Visual Instruction Performance: Gained +4.9 points on MM-Vet (36.3 to 41.2) and +6.0 points on LLaVA-Bench (73.1 to 79.1) over LLaVA-1.5-13b.

| Model Name | MM-Vet | LLaVA-bench | PoPe | MM-Bench | MT-bench | AlpacaEval |
|---|---|---|---|---|---|---|
| Vicuna-1.5-13b [16] | - | - | - | - | 6.57 | 81.4 |
| LLaVA-1.5-13b [10] | 36.3 | 73.1 | 0.859 | 67.4 | 5.99 | 79.3 |
| LLaVA-RLHF-13b [23] | 37.2 | 76.8 | 0.869 | 60.1 | 6.18 | 81.0 |
| Standard SFT | 36.5 | 63.7 | 0.850 | 65.4 | 5.01 | 50.2 |
| SteerLM | 35.2 | 67.0 | 0.878 | 65.1 | 5.70 | 68.8 |
| Rejection-sampling | 38.0 | 70.6 | 0.883 | 67.6 | 6.22 | 74.9 |
| llava-v1.5-13b-dpo | 41.2 | 79.1 | 0.870 | 66.8 | 6.73 | 86.4 |

*We applied the four alignment methods in the last four rows (standard SFT, SteerLM, rejection sampling, and DPO) and found DPO to be the most performant overall.
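
The headline gains can be checked against the table with a few lines of arithmetic. The sketch below copies three columns for the LLaVA-1.5-13b baseline and the DPO model from the table above. The helper name `gains` is illustrative.

```python
# Scores copied from the results table in this card.
scores = {
    "LLaVA-1.5-13b": {"MM-Vet": 36.3, "LLaVA-bench": 73.1, "MT-bench": 5.99},
    "llava-v1.5-13b-dpo": {"MM-Vet": 41.2, "LLaVA-bench": 79.1, "MT-bench": 6.73},
}


def gains(base, new):
    """Absolute score difference (new minus base) per benchmark."""
    return {name: round(new[name] - base[name], 2) for name in base}


delta = gains(scores["LLaVA-1.5-13b"], scores["llava-v1.5-13b-dpo"])
print(delta)  # {'MM-Vet': 4.9, 'LLaVA-bench': 6.0, 'MT-bench': 0.74}
```

These are absolute differences in benchmark score, not relative percentage changes.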

Ethical Considerations

This model was developed with a focus on mitigating modality conflict and catastrophic forgetting in MLLMs. Users are encouraged to consider the potential biases and limitations inherent in the training data and model outputs, especially when deploying the model in diverse and sensitive contexts.

Limitations

  • The model's training dataset, while addressing key gaps in VQA datasets, is relatively small at 6k entries. This may limit the model's generalizability across broader or more diverse multi-modal tasks.
  • Performance enhancements, particularly in language instruction capabilities post-visual tuning, are based on the current scope of evaluated benchmarks and datasets. The model's efficacy may vary in different or more challenging contexts.

Acknowledgments

This work was made possible through the contributions of Shengzhi Li, Rongyu Lin, and Shichao Pei, and supported by their respective institutions.

Citation

Please cite this work as:

@misc{li2024multimodal,
    title={Multi-modal preference alignment remedies regression of visual instruction tuning on language model},
    author={Shengzhi Li and Rongyu Lin and Shichao Pei},
    year={2024},
    eprint={2402.10884},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}