arxiv:2501.13074

Autonomy-of-Experts Models

Published on Jan 22 · Submitted by AngLv on Jan 23
Authors: Ang Lv et al.

Abstract

Mixture-of-Experts (MoE) models mostly use a router to assign tokens to specific expert modules, activating only partial parameters and often outperforming dense models. We argue that the separation between the router's decision-making and the experts' execution is a critical yet overlooked issue, leading to suboptimal expert selection and ineffective learning. To address this, we propose Autonomy-of-Experts (AoE), a novel MoE paradigm in which experts autonomously select themselves to process inputs. AoE is based on the insight that an expert is aware of its own capacity to effectively process a token, an awareness reflected in the scale of its internal activations. In AoE, routers are removed; instead, experts pre-compute internal activations for inputs and are ranked by their activation norms. Only the top-ranking experts proceed with the forward pass, while the others abort. The overhead of pre-computing activations is reduced through low-rank weight factorization. This approach of self-evaluation followed by comparison with peers ensures improved expert selection and effective learning. We pre-train language models with 700M to 4B parameters, demonstrating that AoE outperforms traditional MoE models with comparable efficiency.
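
Below is a minimal PyTorch sketch of the self-selection mechanism described in the abstract: each expert's input projection is factorized into a low-rank pair, every expert pre-computes the cheap low-rank activation for each token, and only the experts whose activation norms rank in the top-k finish their forward pass. All names and choices here (`AoELayer`, `d_low`, `top_k`, the SiLU nonlinearity, and the softmax mixing of the winners) are illustrative assumptions, not the authors' released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class AoELayer(nn.Module):
    """Router-free MoE layer: every expert pre-computes a cheap low-rank
    activation, experts are ranked per token by that activation's norm,
    and only the top-k complete their forward pass (illustrative sketch)."""

    def __init__(self, d_model=512, d_ff=2048, d_low=128, num_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        # Each expert's input projection is factorized as W1 ~= W_down @ W_up,
        # so the d_low activation is cheap to compute for all experts and can
        # be cached and reused by the selected ones.
        self.w_down = nn.Parameter(torch.randn(num_experts, d_model, d_low) * d_model ** -0.5)
        self.w_up = nn.Parameter(torch.randn(num_experts, d_low, d_ff) * d_low ** -0.5)
        self.w_out = nn.Parameter(torch.randn(num_experts, d_ff, d_model) * d_ff ** -0.5)

    def forward(self, x):  # x: (num_tokens, d_model)
        num_tokens = x.size(0)

        # 1) All experts pre-compute their low-rank activation for every token.
        low = torch.einsum('td,edr->etr', x, self.w_down)           # (E, T, d_low)

        # 2) Rank experts per token by activation norm: a larger norm is read
        #    as the expert signalling it can process this token well.
        scores = low.norm(dim=-1)                                    # (E, T)
        top_scores, top_experts = scores.topk(self.top_k, dim=0)     # (k, T)
        weights = F.softmax(top_scores, dim=0)                       # mix the winners

        # 3) Only the selected experts continue; the cached low-rank activation
        #    is reused, so the pre-computation is not wasted work.
        out = torch.zeros_like(x)
        token_idx = torch.arange(num_tokens, device=x.device)
        for k in range(self.top_k):
            e = top_experts[k]                                       # (T,) expert ids
            cached = low[e, token_idx]                               # (T, d_low)
            h = F.silu(torch.einsum('tr,trf->tf', cached, self.w_up[e]))
            out += weights[k].unsqueeze(-1) * torch.einsum('tf,tfd->td', h, self.w_out[e])
        return out
```

As a quick smoke test, `AoELayer()(torch.randn(16, 512))` should return a `(16, 512)` tensor. Note that mixing the top-k outputs with a softmax over their norms is my own assumption about how to combine the selected experts; the paper may weight them differently.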

Community

Paper author and submitter

The separation between the router’s decision-making and the experts’ execution is a critical yet often overlooked issue in MoE models, leading to suboptimal expert selection and ineffective learning. To address this, we propose a new MoE paradigm, Autonomy-of-Experts Models (AoE), which allows experts to autonomously select themselves, without the need for routers.

I really like the insight that the experts themselves know which tokens they find especially interesting. The router approach is simple, but your evidence shows it's definitely not the best!

Great work, I like it and I will try this myself!

Paper author

Thank you for your interest. There are indeed many aspects that can be further explored or improved. It might take some time for us to open-source the code due to the internal review at our company, but if you run into any questions while implementing the method, please feel free to contact us!

A very interesting idea!

It's great to see others paying attention to whether the experts' choices are confident. We previously observed that uncertain tokens can affect the performance of MoE (https://arxiv.org/abs/2406.12375).
