On Expert Estimation in Hierarchical Mixture of Experts: Beyond Softmax Gating Functions
Abstract
With the growing prominence of the Mixture of Experts (MoE) architecture in developing large-scale foundation models, we investigate the Hierarchical Mixture of Experts (HMoE), a specialized variant of MoE that excels in handling complex inputs and improving performance on targeted tasks. Our analysis highlights the advantages of using the Laplace gating function over the traditional Softmax gating within the HMoE frameworks. We theoretically demonstrate that applying the Laplace gating function at both levels of the HMoE model helps eliminate undesirable parameter interactions caused by the Softmax gating and, therefore, accelerates the expert convergence as well as enhances the expert specialization. Empirical validation across diverse scenarios supports these theoretical claims. This includes large-scale multimodal tasks, image classification, and latent domain discovery and prediction tasks, where our modified HMoE models show great performance improvements compared to the conventional HMoE models.
Models citing this paper 0
No model linking this paper
Datasets citing this paper 0
No dataset linking this paper
Spaces citing this paper 0
No Space linking this paper
Collections including this paper 0
No Collection including this paper