Papers
arxiv:2503.17500

Variance Control via Weight Rescaling in LLM Pre-training

Published on Mar 21
· Submitted by louisowen6 on Mar 25

Abstract

The outcome of Large Language Model (LLM) pre-training strongly depends on weight initialization and variance control strategies. Although the importance of initial variance control has been well documented in neural networks in general, the literature on initialization and management of its growth during LLM pre-training, specifically, is somewhat sparse. In this paper, we introduce the Layer Index Rescaling (LIR) weight initialization scheme, and the Target Variance Rescaling (TVR) variance control strategy. Experiments on a 1B parameter LLaMA model demonstrate that better variance management using these techniques yields substantial improvements in downstream task performance (up to 4.6% on common pre-training benchmarks) and reduces extreme activation values, thus mitigating challenges associated with quantization and low-precision training. Our code is available at: https://github.com/bluorion-com/weight_rescaling.
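The abstract does not spell out the LIR formula. One plausible reading, assuming a GPT-2-style depth-dependent scaling, is to shrink each layer's initialization standard deviation as a function of its layer index, damping variance growth with depth. The function name, the 1/sqrt(layer_idx) form, and the base standard deviation of 0.02 below are illustrative assumptions for this sketch, not the paper's exact scheme (see the linked repository for the real implementation):

```python
import numpy as np

def lir_init(shape, layer_idx, base_std=0.02, rng=None):
    """Hypothetical Layer Index Rescaling init: deeper layers get a
    smaller standard deviation, damping activation-variance growth
    with depth. The 1/sqrt(layer_idx) form is an assumption here,
    not the paper's published formula."""
    rng = rng if rng is not None else np.random.default_rng(0)
    std = base_std / np.sqrt(layer_idx)
    return rng.normal(0.0, std, size=shape)

# Deeper layers start with a visibly smaller spread:
w1 = lir_init((1024, 1024), layer_idx=1)
w16 = lir_init((1024, 1024), layer_idx=16)
print(w1.std() > w16.std())  # True
```

Under this assumed form, layer 16 would be initialized with a quarter of layer 1's standard deviation, so the summed residual-stream variance stays bounded as depth grows.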

Community

Paper author Paper submitter
•
edited 15 days ago

🚀 Controlling Weight Variance for Better LLM Performance 🚀

We trained over 40 one-billion-parameter LLaMA models for 100 billion tokens and discovered that controlling weight variance at initialization and during pre-training is crucial for improving downstream task performance, leading to gains of up to 4.6% on common benchmarks! 📈

To achieve this, we introduce:
✅ Layer Index Rescaling (LIR) – a weight initialization scheme
✅ Target Variance Rescaling (TVR) – a variance control strategy

Beyond performance gains, these techniques also help reduce extreme activation values, mitigating risks in quantization and low-precision training for LLMs.
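A minimal sketch of what a variance control step during pre-training could look like, assuming TVR periodically rescales a weight matrix so its empirical variance returns to a fixed target. The function name, the target value, and the idea of a single multiplicative correction are illustrative assumptions, not the paper's exact procedure:

```python
import numpy as np

def tvr_rescale(w, target_var=4e-4, eps=1e-12):
    """Hypothetical Target Variance Rescaling step: multiply the
    weights by one scalar so their empirical variance matches
    target_var, leaving their relative structure unchanged."""
    scale = np.sqrt(target_var / (w.var() + eps))
    return w * scale

# Simulated drift: weights whose variance grew during training
# are pulled back to the target in a single rescaling step.
rng = np.random.default_rng(0)
w = rng.normal(0.0, 0.05, size=(512, 512))  # var ≈ 2.5e-3, above target
w = tvr_rescale(w)
```

Because the correction is a single scalar, it also shrinks the largest weights proportionally, which is consistent with the reported reduction in extreme activation values.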

@louisowen6 @akanyaani @nilabhra @gueraf


