Adapting Multimodal Large Language Models to Domains via Post-Training

This project adapts general Multimodal Large Language Models (MLLMs) to specific domains, such as science and industry, via post-training to improve their effectiveness in real-world applications. It focuses on three main areas:

1. Data Synthesis

  • We develop a generate-then-filter pipeline that uses open-source models to synthesize diverse visual tasks from domain-specific image-caption pairs.
  • This synthetic data outperforms data crafted by hand or generated by closed-source models (e.g., GPT-4V).

2. Training Pipeline

  • Instead of the common two-stage training (image-caption pairs first, then visual tasks), we use a single-stage training pipeline that covers more diverse tasks for the target domain (see the sketch after this list).

3. Task Evaluation

  • We evaluate our approach in important domains, including biomedicine, food, and remote sensing.
  • We post-train MLLMs on domain-specific tasks and evaluate them on domain-specific benchmarks to measure the resulting performance.
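
The synthesis and training steps above reduce to a simple data flow: generate candidate tasks from each image-caption pair, filter out low-quality candidates, and mix the survivors with the original captioning data so that a single training stage covers both. The Python sketch below illustrates only this flow; `synthesize_candidate_tasks` and `passes_filter` are hypothetical stand-ins for the project's synthesizer and filtering rule, not its actual code.

```python
from typing import Dict, List, Tuple

# Hypothetical stand-in for the synthesizer: in practice, an open-source MLLM
# (e.g., the visual instruction synthesizer listed below) proposes diverse
# tasks for each domain image.
def synthesize_candidate_tasks(image: str, caption: str) -> List[Dict[str, str]]:
    return [{"question": "What does this image show?", "answer": caption}]

# Hypothetical stand-in for the filter: discard candidates that fail a quality check.
def passes_filter(task: Dict[str, str], caption: str) -> bool:
    return bool(task["answer"].strip())

def build_single_stage_mixture(pairs: List[Tuple[str, str]]) -> List[Dict[str, str]]:
    """Generate-then-filter synthetic tasks, then mix them with the original
    image-caption data so one training stage covers both kinds of tasks."""
    mixture = []
    for image, caption in pairs:
        # Keep the raw image-caption pair as a captioning task.
        mixture.append({"image": image, "instruction": "Describe the image.", "response": caption})
        # Add the filtered synthetic tasks for the same image.
        for task in synthesize_candidate_tasks(image, caption):
            if passes_filter(task, caption):
                mixture.append({"image": image, "instruction": task["question"], "response": task["answer"]})
    return mixture

print(build_single_stage_mixture([("cell_scan.png", "A microscopy image of stained cells.")]))
```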

Resources

🤗 We share our data and models with example usages; feel free to open any issues or discussions! 🤗 A minimal loading sketch is provided after the table below.

| Model | Repo ID in HF 🤗 | Domain | Base Model | Training Data | Evaluation Benchmark |
|---|---|---|---|---|---|
| Visual Instruction Synthesizer | AdaptLLM/visual-instruction-synthesizer | - | open-llava-next-llama3-8b | VisionFLAN and ALLaVA | - |
| AdaMLLM-med-2B | AdaptLLM/biomed-Qwen2-VL-2B-Instruct | Biomedicine | Qwen2-VL-2B-Instruct | biomed-visual-instructions | biomed-VQA-benchmark |
| AdaMLLM-food-2B | AdaptLLM/food-Qwen2-VL-2B-Instruct | Food | Qwen2-VL-2B-Instruct | food-visual-instructions | food-VQA-benchmark |
| AdaMLLM-remote-sensing-2B | AdaptLLM/remote-sensing-Qwen2-VL-2B-Instruct | Remote Sensing | Qwen2-VL-2B-Instruct | remote-sensing-visual-instructions | remote-sensing-VQA-benchmark |
| AdaMLLM-med-8B | AdaptLLM/biomed-LLaVA-NeXT-Llama3-8B | Biomedicine | open-llava-next-llama3-8b | biomed-visual-instructions | biomed-VQA-benchmark |
| AdaMLLM-food-8B | AdaptLLM/food-LLaVA-NeXT-Llama3-8B | Food | open-llava-next-llama3-8b | food-visual-instructions | food-VQA-benchmark |
| AdaMLLM-remote-sensing-8B | AdaptLLM/remote-sensing-LLaVA-NeXT-Llama3-8B | Remote Sensing | open-llava-next-llama3-8b | remote-sensing-visual-instructions | remote-sensing-VQA-benchmark |
| AdaMLLM-med-11B | AdaptLLM/biomed-Llama-3.2-11B-Vision-Instruct | Biomedicine | Llama-3.2-11B-Vision-Instruct | biomed-visual-instructions | biomed-VQA-benchmark |
| AdaMLLM-food-11B | AdaptLLM/food-Llama-3.2-11B-Vision-Instruct | Food | Llama-3.2-11B-Vision-Instruct | food-visual-instructions | food-VQA-benchmark |
| AdaMLLM-remote-sensing-11B | AdaptLLM/remote-sensing-Llama-3.2-11B-Vision-Instruct | Remote Sensing | Llama-3.2-11B-Vision-Instruct | remote-sensing-visual-instructions | remote-sensing-VQA-benchmark |
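
As a minimal loading sketch for the Qwen2-VL-based checkpoints above (assuming a recent transformers release with Qwen2-VL support and that the repo ships the standard Qwen2-VL processor files; the image path and question are placeholders):

```python
from PIL import Image
from transformers import AutoProcessor, Qwen2VLForConditionalGeneration

model_id = "AdaptLLM/biomed-Qwen2-VL-2B-Instruct"  # any Qwen2-VL-based repo from the table
model = Qwen2VLForConditionalGeneration.from_pretrained(model_id, torch_dtype="auto", device_map="auto")
processor = AutoProcessor.from_pretrained(model_id)

image = Image.open("example_image.png")  # placeholder: any domain image
messages = [{"role": "user", "content": [
    {"type": "image"},
    {"type": "text", "text": "Describe the key findings in this image."},
]}]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=[prompt], images=[image], return_tensors="pt").to(model.device)

# Generate an answer and strip the prompt tokens before decoding.
output_ids = model.generate(**inputs, max_new_tokens=128)
answer = processor.batch_decode(output_ids[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True)[0]
print(answer)
```

The LLaVA-NeXT- and Llama-3.2-Vision-based checkpoints follow the same pattern with their respective model classes.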

Code: https://github.com/bigai-ai/QA-Synthesizer

Citation

If you find our work helpful, please cite us.

Adapt MLLM to Domains

@article{adamllm,
  title={On Domain-Specific Post-Training for Multimodal Large Language Models},
  author={Cheng, Daixuan and Huang, Shaohan and Zhu, Ziyu and Zhang, Xintong and Zhao, Wayne Xin and Luan, Zhongzhi and Dai, Bo and Zhang, Zhenliang},
  journal={arXiv preprint arXiv:2411.19930},
  year={2024}
}

Adapt LLM to Domains (ICLR 2024)

@inproceedings{adaptllm,
  title={Adapting Large Language Models via Reading Comprehension},
  author={Daixuan Cheng and Shaohan Huang and Furu Wei},
  booktitle={The Twelfth International Conference on Learning Representations},
  year={2024},
  url={https://openreview.net/forum?id=y886UXPEZ0}
}