arxiv:2002.05202

GLU Variants Improve Transformer

Published on Feb 12, 2020

Abstract

Gated Linear Units (arXiv:1612.08083) consist of the component-wise product of two linear projections, one of which is first passed through a sigmoid function. Variations on GLU are possible, using different nonlinear (or even linear) functions in place of sigmoid. We test these variants in the feed-forward sublayers of the Transformer (arXiv:1706.03762) sequence-to-sequence model, and find that some of them yield quality improvements over the typically used ReLU or GELU activations.
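
The gating described in the abstract can be sketched in a few lines. This is an illustrative reading of the abstract, not the paper's reference implementation: the function names (`glu_variant`, `ffn_glu`, `matvec`), the omission of bias terms, the exact (erf-based) form of GELU, and the use of plain Python lists instead of a tensor library are all assumptions made here for clarity.

```python
import math

# Candidate gate functions. Sigmoid gives the original GLU; the others are
# examples of the "different nonlinear (or even linear) functions" the
# abstract mentions. This particular selection is an assumption of the sketch.
def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def relu(z):
    return max(0.0, z)

def gelu(z):
    # Exact GELU via the Gaussian CDF (assumed form).
    return 0.5 * z * (1.0 + math.erf(z / math.sqrt(2.0)))

def identity(z):
    # A linear "activation": the layer becomes a product of two projections.
    return z

def matvec(x, W):
    """Return x @ W for a vector x (length d_in) and W (d_in rows, d_out columns)."""
    return [sum(x[i] * W[i][j] for i in range(len(x))) for j in range(len(W[0]))]

def glu_variant(x, W, V, activation=sigmoid):
    """Component-wise product of two linear projections of x.

    The projection through W is passed through `activation` and acts as the
    gate; the projection through V is left linear. Biases are omitted
    (an assumption of this sketch).
    """
    gate = [activation(g) for g in matvec(x, W)]
    value = matvec(x, V)
    return [g * v for g, v in zip(gate, value)]

def ffn_glu(x, W, V, W2, activation=sigmoid):
    """Hypothetical feed-forward sublayer: gated hidden layer, then output projection W2."""
    return matvec(glu_variant(x, W, V, activation), W2)
```

Passing a different `activation` swaps the gate without touching the rest of the sublayer, for example `ffn_glu(x, W, V, W2, activation=gelu)`. Note that a gated sublayer of this shape uses three weight matrices where an ungated feed-forward layer with the same hidden width would use two.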

Models citing this paper 80

Datasets citing this paper 0

Spaces citing this paper 266

Collections including this paper 3