arxiv:2501.13074

Autonomy-of-Experts Models

Published on Jan 22 · Submitted by AngLv on Jan 23
Authors: Ang Lv et al.

Abstract

Mixture-of-Experts (MoE) models mostly use a router to assign tokens to specific expert modules, activating only partial parameters and often outperforming dense models. We argue that the separation between the router's decision-making and the experts' execution is a critical yet overlooked issue, leading to suboptimal expert selection and ineffective learning. To address this, we propose Autonomy-of-Experts (AoE), a novel MoE paradigm in which experts autonomously select themselves to process inputs. AoE is based on the insight that an expert is aware of its own capacity to effectively process a token, an awareness reflected in the scale of its internal activations. In AoE, routers are removed; instead, experts pre-compute internal activations for inputs and are ranked by their activation norms. Only the top-ranking experts proceed with the forward pass, while the others abort. The overhead of pre-computing activations is reduced through low-rank weight factorization. This approach of self-evaluation followed by comparison with peers ensures improved expert selection and effective learning. We pre-train language models with 700M to 4B parameters, demonstrating that AoE outperforms traditional MoE models with comparable efficiency.
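
Below is a minimal PyTorch sketch of the self-selection mechanism described in the abstract: each expert's input projection is factorized into a low-rank pair, every expert pre-computes the cheap low-rank activation for each token, and only the experts whose activation norms rank in the top-k finish their forward pass. All names and choices here (`AoELayer`, `d_low`, `top_k`, the SiLU nonlinearity, and the softmax mixing of the winners) are illustrative assumptions, not the authors' released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class AoELayer(nn.Module):
    """Router-free MoE layer: every expert pre-computes a cheap low-rank
    activation, experts are ranked per token by that activation's norm,
    and only the top-k complete their forward pass (illustrative sketch)."""

    def __init__(self, d_model=512, d_ff=2048, d_low=128, num_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        # Each expert's input projection is factorized as W1 ~= W_down @ W_up,
        # so the d_low activation is cheap to compute for all experts and can
        # be cached and reused by the selected ones.
        self.w_down = nn.Parameter(torch.randn(num_experts, d_model, d_low) * d_model ** -0.5)
        self.w_up = nn.Parameter(torch.randn(num_experts, d_low, d_ff) * d_low ** -0.5)
        self.w_out = nn.Parameter(torch.randn(num_experts, d_ff, d_model) * d_ff ** -0.5)

    def forward(self, x):  # x: (num_tokens, d_model)
        num_tokens = x.size(0)

        # 1) All experts pre-compute their low-rank activation for every token.
        low = torch.einsum('td,edr->etr', x, self.w_down)           # (E, T, d_low)

        # 2) Rank experts per token by activation norm: a larger norm is read
        #    as the expert signalling it can process this token well.
        scores = low.norm(dim=-1)                                    # (E, T)
        top_scores, top_experts = scores.topk(self.top_k, dim=0)     # (k, T)
        weights = F.softmax(top_scores, dim=0)                       # mix the winners

        # 3) Only the selected experts continue; the cached low-rank activation
        #    is reused, so the pre-computation is not wasted work.
        out = torch.zeros_like(x)
        token_idx = torch.arange(num_tokens, device=x.device)
        for k in range(self.top_k):
            e = top_experts[k]                                       # (T,) expert ids
            cached = low[e, token_idx]                               # (T, d_low)
            h = F.silu(torch.einsum('tr,trf->tf', cached, self.w_up[e]))
            out += weights[k].unsqueeze(-1) * torch.einsum('tf,tfd->td', h, self.w_out[e])
        return out
```

As a quick smoke test, `AoELayer()(torch.randn(16, 512))` should return a `(16, 512)` tensor. Note that mixing the top-k outputs with a softmax over their norms is my own assumption about how to combine the selected experts; the paper may weight them differently.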

Community

Paper author and submitter

The separation between the router’s decision-making and the experts’ execution is a critical yet often overlooked issue in MoE models, leading to suboptimal expert selection and ineffective learning. To address this, we propose a new MoE paradigm, Autonomy-of-Experts Models (AoE), which allows experts to autonomously select themselves, without the need for routers.

I really like the insight that the experts themselves know which tokens they find especially interesting. The router approach is simple, but your evidence shows it's definitely not the best!

Great work, I like it and I will try this myself!

Paper author

Thank you for your interest. There are indeed many aspects that can be further explored or improved. It might take some time for us to open-source the code due to the internal review at our company, but if you run into any questions while implementing the method, please feel free to contact us!

A very interesting idea!

It's great to see others paying attention to whether the experts' choices are confident. We previously observed that uncertain tokens can affect the performance of MoE (https://arxiv.org/abs/2406.12375).
