Adapting Multimodal Large Language Models to Domains via Post-Training

This project adapts general Multimodal Large Language Models (MLLMs) to specific domains, such as science and industry, via post-training to improve their effectiveness in real-world applications. It focuses on three main areas:

1. Data Synthesis

  • We develop a generate-then-filter pipeline that uses open-source models to synthesize diverse visual tasks from domain-specific image-caption pairs.
  • This synthetic data outperforms data crafted by hand or generated by closed-source models (e.g., GPT-4V).

2. Training Pipeline

  • Instead of the common two-stage training (image-caption pairs first, then visual tasks), we use a single-stage training pipeline that covers more diverse tasks for the target domain (see the sketch after this list).

3. Task Evaluation

  • We evaluate our approach in important domains, including biomedicine, food, and remote sensing.
  • We post-train MLLMs on domain-specific tasks and evaluate them on domain-specific benchmarks to measure the resulting performance.
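
The synthesis and training steps above reduce to a simple data flow: generate candidate tasks from each image-caption pair, filter out low-quality candidates, and mix the survivors with the original captioning data so that a single training stage covers both. The Python sketch below illustrates only this flow; `synthesize_candidate_tasks` and `passes_filter` are hypothetical stand-ins for the project's synthesizer and filtering rule, not its actual code.

```python
from typing import Dict, List, Tuple

# Hypothetical stand-in for the synthesizer: in practice, an open-source MLLM
# (e.g., the visual instruction synthesizer listed below) proposes diverse
# tasks for each domain image.
def synthesize_candidate_tasks(image: str, caption: str) -> List[Dict[str, str]]:
    return [{"question": "What does this image show?", "answer": caption}]

# Hypothetical stand-in for the filter: discard candidates that fail a quality check.
def passes_filter(task: Dict[str, str], caption: str) -> bool:
    return bool(task["answer"].strip())

def build_single_stage_mixture(pairs: List[Tuple[str, str]]) -> List[Dict[str, str]]:
    """Generate-then-filter synthetic tasks, then mix them with the original
    image-caption data so one training stage covers both kinds of tasks."""
    mixture = []
    for image, caption in pairs:
        # Keep the raw image-caption pair as a captioning task.
        mixture.append({"image": image, "instruction": "Describe the image.", "response": caption})
        # Add the filtered synthetic tasks for the same image.
        for task in synthesize_candidate_tasks(image, caption):
            if passes_filter(task, caption):
                mixture.append({"image": image, "instruction": task["question"], "response": task["answer"]})
    return mixture

print(build_single_stage_mixture([("cell_scan.png", "A microscopy image of stained cells.")]))
```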

Resources

🤗 We share our data and models with example usages; feel free to open any issues or discussions! 🤗 A minimal loading sketch is provided after the table below.

| Model | Repo ID in HF 🤗 | Domain | Base Model | Training Data | Evaluation Benchmark |
|---|---|---|---|---|---|
| Visual Instruction Synthesizer | AdaptLLM/visual-instruction-synthesizer | - | open-llava-next-llama3-8b | VisionFLAN and ALLaVA | - |
| AdaMLLM-med-2B | AdaptLLM/biomed-Qwen2-VL-2B-Instruct | Biomedicine | Qwen2-VL-2B-Instruct | biomed-visual-instructions | biomed-VQA-benchmark |
| AdaMLLM-food-2B | AdaptLLM/food-Qwen2-VL-2B-Instruct | Food | Qwen2-VL-2B-Instruct | food-visual-instructions | food-VQA-benchmark |
| AdaMLLM-remote-sensing-2B | AdaptLLM/remote-sensing-Qwen2-VL-2B-Instruct | Remote Sensing | Qwen2-VL-2B-Instruct | remote-sensing-visual-instructions | remote-sensing-VQA-benchmark |
| AdaMLLM-med-8B | AdaptLLM/biomed-LLaVA-NeXT-Llama3-8B | Biomedicine | open-llava-next-llama3-8b | biomed-visual-instructions | biomed-VQA-benchmark |
| AdaMLLM-food-8B | AdaptLLM/food-LLaVA-NeXT-Llama3-8B | Food | open-llava-next-llama3-8b | food-visual-instructions | food-VQA-benchmark |
| AdaMLLM-remote-sensing-8B | AdaptLLM/remote-sensing-LLaVA-NeXT-Llama3-8B | Remote Sensing | open-llava-next-llama3-8b | remote-sensing-visual-instructions | remote-sensing-VQA-benchmark |
| AdaMLLM-med-11B | AdaptLLM/biomed-Llama-3.2-11B-Vision-Instruct | Biomedicine | Llama-3.2-11B-Vision-Instruct | biomed-visual-instructions | biomed-VQA-benchmark |
| AdaMLLM-food-11B | AdaptLLM/food-Llama-3.2-11B-Vision-Instruct | Food | Llama-3.2-11B-Vision-Instruct | food-visual-instructions | food-VQA-benchmark |
| AdaMLLM-remote-sensing-11B | AdaptLLM/remote-sensing-Llama-3.2-11B-Vision-Instruct | Remote Sensing | Llama-3.2-11B-Vision-Instruct | remote-sensing-visual-instructions | remote-sensing-VQA-benchmark |
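
As a minimal loading sketch for the Qwen2-VL-based checkpoints above (assuming a recent transformers release with Qwen2-VL support and that the repo ships the standard Qwen2-VL processor files; the image path and question are placeholders):

```python
from PIL import Image
from transformers import AutoProcessor, Qwen2VLForConditionalGeneration

model_id = "AdaptLLM/biomed-Qwen2-VL-2B-Instruct"  # any Qwen2-VL-based repo from the table
model = Qwen2VLForConditionalGeneration.from_pretrained(model_id, torch_dtype="auto", device_map="auto")
processor = AutoProcessor.from_pretrained(model_id)

image = Image.open("example_image.png")  # placeholder: any domain image
messages = [{"role": "user", "content": [
    {"type": "image"},
    {"type": "text", "text": "Describe the key findings in this image."},
]}]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=[prompt], images=[image], return_tensors="pt").to(model.device)

# Generate an answer and strip the prompt tokens before decoding.
output_ids = model.generate(**inputs, max_new_tokens=128)
answer = processor.batch_decode(output_ids[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True)[0]
print(answer)
```

The LLaVA-NeXT- and Llama-3.2-Vision-based checkpoints follow the same pattern with their respective model classes.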

Code: https://github.com/bigai-ai/QA-Synthesizer

Citation

If you find our work helpful, please cite us.

Adapt MLLM to Domains

@article{adamllm,
  title={On Domain-Specific Post-Training for Multimodal Large Language Models},
  author={Cheng, Daixuan and Huang, Shaohan and Zhu, Ziyu and Zhang, Xintong and Zhao, Wayne Xin and Luan, Zhongzhi and Dai, Bo and Zhang, Zhenliang},
  journal={arXiv preprint arXiv:2411.19930},
  year={2024}
}

Adapt LLM to Domains (ICLR 2024)

@inproceedings{adaptllm,
  title={Adapting Large Language Models via Reading Comprehension},
  author={Daixuan Cheng and Shaohan Huang and Furu Wei},
  booktitle={The Twelfth International Conference on Learning Representations},
  year={2024},
  url={https://openreview.net/forum?id=y886UXPEZ0}
}