On Token's Dilemma: Dynamic MoE with Drift-Aware Token Assignment for Continual Learning of Large Vision Language Models
Abstract
LLaVA-DyMoE addresses routing-drift-induced forgetting in multimodal continual instruction tuning by dynamically expanding a mixture-of-experts model with token-level assignment guidance and routing-score regularization.
Multimodal Continual Instruction Tuning aims to continually enhance Large Vision-Language Models (LVLMs) by learning from new data without forgetting previously acquired knowledge. Mixture of Experts (MoE) architectures naturally facilitate this by incrementally adding new experts and expanding routers while keeping the existing ones frozen. However, despite expert isolation, MoE-based continual learners still suffer from forgetting due to routing drift: old-task tokens are mistakenly attracted to newly added experts, degrading performance on prior tasks. We analyze this failure mode at the token level and reveal the token's dilemma: ambiguous and old tokens in new-task data offer minimal learning benefit, yet their ambiguous routing assignments during training cause them to be routed to new experts and induce forgetting. Motivated by this, we propose LLaVA-DyMoE, a dynamic MoE framework that incrementally expands the MoE with drift-aware token assignment. We characterize token types via their routing score distributions and apply targeted regularization. Specifically, a token-level assignment guidance steers ambiguous and old tokens away from new experts to preserve established routing patterns and alleviate routing drift, while complementary routing-score regularizations enforce expert-group separation and promote new-expert specialization. Extensive experiments demonstrate that LLaVA-DyMoE effectively mitigates routing-drift-induced forgetting, achieving over a 7% gain in mean final accuracy and a 12% reduction in forgetting compared to baselines. The project page is https://zhaoc5.github.io/DyMoE.
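The expand-and-freeze setup and the token-level assignment guidance can be pictured with a small numpy sketch. The penalty below is a hypothetical stand-in (the paper's exact loss is not specified here): it discourages tokens flagged as ambiguous or old from placing routing mass on the newly added experts, so their established routing patterns are preserved.

```python
import numpy as np

def softmax(z, axis=-1):
    # Numerically stable softmax over expert logits.
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
d, n_old, n_new = 8, 4, 2            # hidden dim, frozen experts, new experts

# Router expansion: columns for old experts stay frozen; new columns
# (initialized small) are the only trainable routing parameters.
W_old = rng.normal(size=(d, n_old))          # frozen after earlier tasks
W_new = rng.normal(size=(d, n_new)) * 0.01   # trainable on the new task
W = np.concatenate([W_old, W_new], axis=1)   # (d, n_old + n_new)

tokens = rng.normal(size=(5, d))             # a batch of token embeddings
scores = softmax(tokens @ W, axis=-1)        # routing distribution per token

# Hypothetical drift-aware assignment guidance: for tokens flagged as
# ambiguous/old, penalize the routing probability mass on new experts.
flagged = np.array([True, False, True, False, False])
new_mass = scores[:, n_old:].sum(axis=1)     # mass on the new-expert group
guidance_loss = (flagged * new_mass).mean()
```

Minimizing such a penalty pushes flagged tokens back toward the frozen old-expert group, which is the drift the paper identifies as the source of forgetting.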
Community
Multimodal Continual Instruction Tuning aims to continually enhance Large Vision-Language Models by learning from new data without forgetting previously acquired knowledge. Dynamic MoE architectures naturally facilitate this by incrementally adding new experts while keeping existing ones frozen. However, despite expert isolation, MoE-based continual learners still suffer from forgetting due to routing drift: old-task tokens become mistakenly attracted to newly added experts.
This paper analyzes the failure mode at the token level and reveals the Token's Dilemma: ambiguous and old tokens in new-task data offer minimal learning benefit yet induce forgetting when routed to new experts. Motivated by this, the authors propose LLaVA-DyMoE, a dynamic MoE framework with drift-aware token assignment that steers problematic tokens away from new experts and enforces expert-group separation through targeted regularization, adding no inference overhead and remaining orthogonal to existing continual-learning strategies.
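The summary says token types are characterized via their routing score distributions. One plausible reading is to combine the entropy of the distribution (ambiguity) with the probability mass on the frozen old-expert group; the `token_type` helper and its thresholds below are illustrative assumptions, not the paper's definitions.

```python
import numpy as np

def token_type(scores, n_old, ent_thresh=0.9, old_thresh=0.6):
    """Hypothetical token typing from a routing distribution.

    'ambiguous' = near-uniform scores (high normalized entropy),
    'old'       = most mass on the frozen old-expert group,
    'new'       = most mass on the newly added experts.
    """
    p = np.clip(np.asarray(scores, dtype=float), 1e-12, 1.0)
    ent = -(p * np.log(p)).sum() / np.log(len(p))  # normalized entropy in [0, 1]
    if ent > ent_thresh:
        return "ambiguous"
    return "old" if p[:n_old].sum() > old_thresh else "new"
```

Under this sketch, only tokens typed "new" would be allowed to train the new experts freely, while "ambiguous" and "old" tokens would be subject to the assignment guidance described above.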