arxiv:2410.02935

On Expert Estimation in Hierarchical Mixture of Experts: Beyond Softmax Gating Functions

Published on Oct 3, 2024

Authors:

Abstract

With the growing prominence of the Mixture of Experts (MoE) architecture in developing large-scale foundation models, we investigate the Hierarchical Mixture of Experts (HMoE), a specialized variant of MoE that excels in handling complex inputs and improving performance on targeted tasks. Our analysis highlights the advantages of using the Laplace gating function over the traditional Softmax gating function within the HMoE framework. We theoretically demonstrate that applying the Laplace gating function at both levels of the HMoE model helps eliminate undesirable parameter interactions caused by Softmax gating and therefore accelerates expert convergence and enhances expert specialization. Empirical validation across diverse scenarios supports these theoretical claims. These scenarios include large-scale multimodal tasks, image classification, and latent domain discovery and prediction tasks, where our modified HMoE models show substantial performance improvements over conventional HMoE models.
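
The abstract contrasts two gating functions without defining them. The following is a minimal sketch of how they might differ in a two-level HMoE, not the authors' implementation. It assumes the Laplace gate scores each expert by the negative Euclidean distance between the input and a gating vector, whereas the Softmax gate uses an affine score. The function names, the array shapes, and the choice of linear experts are all hypothetical and chosen only for illustration.

```python
import numpy as np

def softmax_gate(x, W, b):
    """Softmax gating: weight_i is proportional to exp(W_i . x + b_i)."""
    logits = W @ x + b
    logits = logits - logits.max()  # subtract max for numerical stability
    w = np.exp(logits)
    return w / w.sum()

def laplace_gate(x, W):
    """Laplace gating (assumed form): weight_i is proportional to exp(-||x - W_i||_2)."""
    scores = -np.linalg.norm(W - x, axis=1)
    scores = scores - scores.max()  # subtract max for numerical stability
    w = np.exp(scores)
    return w / w.sum()

def hmoe_predict(x, W_outer, W_inner, experts):
    """Two-level HMoE output with Laplace gating at both levels.

    Hypothetical shapes: W_outer is (G, d), W_inner is (G, K, d),
    and experts is (G, K, d), holding one linear expert per (group, expert) pair.
    """
    outer = laplace_gate(x, W_outer)          # (G,) weights over expert groups
    out = 0.0
    for g in range(len(outer)):
        inner = laplace_gate(x, W_inner[g])   # (K,) weights within group g
        expert_out = experts[g] @ x           # (K,) scalar output of each expert
        out += outer[g] * (inner @ expert_out)
    return out
```

In this sketch the Laplace gate depends on the input only through its distance to each gating vector, so an expert receives the most weight when the input lies closest to its gating vector. Swapping `laplace_gate` for `softmax_gate` at either level gives the conventional variant the abstract compares against.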
