arxiv:2501.18512

Streaming DiLoCo with overlapping communication: Towards a Distributed Free Lunch

Published on Jan 30
Submitted by ArthurDouillard on Jan 31
#3 Paper of the day
Abstract

Training of large language models (LLMs) is typically distributed across a large number of accelerators to reduce training time. Since internal states and parameter gradients need to be exchanged at every gradient step, all devices need to be co-located using low-latency high-bandwidth communication links to support the required high volume of exchanged bits. Recently, distributed algorithms like DiLoCo have relaxed this co-location constraint: accelerators can be grouped into "workers", where synchronizations between workers only occur infrequently. This in turn means that workers can afford to be connected by lower-bandwidth communication links without affecting learning quality. However, in these methods, communication across workers still requires the same peak bandwidth as before, as the synchronizations require all parameters to be exchanged across all workers. In this paper, we improve DiLoCo in three ways. First, we synchronize only subsets of parameters in sequence, rather than all at once, which greatly reduces peak bandwidth. Second, we allow workers to continue training while synchronizing, which decreases wall clock time. Third, we quantize the data exchanged by workers, which further reduces bandwidth across workers. By properly combining these modifications, we show experimentally that we can distribute training of billion-scale parameters and reach similar quality as before, while reducing required bandwidth by two orders of magnitude.
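
To make the first modification concrete, here is a minimal sketch of the streaming-synchronization idea: parameters are split into fragments, and at each outer step only one fragment's pseudo-gradient is averaged across workers, so peak bandwidth drops by roughly the number of fragments. All names and hyperparameters below (NUM_FRAGMENTS, OUTER_LR, inner_steps, the round-robin schedule, plain SGD as the outer optimizer) are simplifying assumptions for illustration, not the paper's actual implementation.

```python
# Illustrative sketch of streaming synchronization: only one parameter
# fragment is exchanged per outer step instead of the full model.
import numpy as np

NUM_WORKERS = 4
NUM_FRAGMENTS = 8          # parameters are partitioned into this many fragments
PARAMS_PER_FRAGMENT = 16
OUTER_LR = 0.7             # illustrative; plain SGD stands in for the outer
                           # optimizer (DiLoCo uses Nesterov momentum)

rng = np.random.default_rng(0)
global_params = rng.normal(size=(NUM_FRAGMENTS, PARAMS_PER_FRAGMENT))
local_params = [global_params.copy() for _ in range(NUM_WORKERS)]

def inner_steps(params):
    """Stand-in for H local optimization steps on a worker's own data shard."""
    return params - 0.01 * rng.normal(size=params.shape)

for outer_step in range(32):
    # Every worker trains locally on all fragments.
    local_params = [inner_steps(p) for p in local_params]

    # Streaming sync: only ONE fragment is exchanged this outer step,
    # so peak bandwidth is ~1/NUM_FRAGMENTS of a full synchronization.
    f = outer_step % NUM_FRAGMENTS
    pseudo_grads = [global_params[f] - p[f] for p in local_params]
    avg_pseudo_grad = np.mean(pseudo_grads, axis=0)   # an all-reduce in practice

    # Outer update on the shared copy of that fragment, then broadcast it back.
    global_params[f] -= OUTER_LR * avg_pseudo_grad
    for p in local_params:
        p[f] = global_params[f]
```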

Community

Paper submitter

Distributed training for LLMs, an improvement over DiLoCo:

  1. Synchronize only a subset of the params at a time --> reduces peak bandwidth
  2. Overlap communication with computation --> increases tolerated latency (see the sketch right after this list)
  3. Quantize communication --> further reduces the total amount of data exchanged (see the quantization sketch further below)
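
A rough sketch of point 2: a background thread stands in for a slow cross-worker all-reduce while inner steps keep running, and the (slightly stale) result is merged back once it arrives. fake_all_reduce, the fragment layout, and the 0.5/0.5 merge rule are illustrative assumptions, not the paper's actual mechanism.

```python
# Illustrative sketch of overlapping communication with computation.
import threading
import numpy as np

rng = np.random.default_rng(0)
params = rng.normal(size=(8, 16))          # 8 fragments of 16 params each

def inner_step(p):
    """Stand-in for one local optimization step."""
    return p - 0.01 * rng.normal(size=p.shape)

def fake_all_reduce(fragment, out):
    # Stand-in for a slow cross-worker reduction over a low-bandwidth link.
    out["avg"] = fragment.copy()           # pretend the average equals our copy

for step in range(16):
    f = step % params.shape[0]
    result = {}
    # Launch communication for fragment f in the background ...
    comm = threading.Thread(target=fake_all_reduce, args=(params[f], result))
    comm.start()
    # ... and keep computing on ALL fragments while it is in flight.
    params = np.stack([inner_step(p) for p in params])
    comm.join()
    # Merge the slightly stale reduced fragment back into the live parameters.
    # The exact merge rule here is a guess; the paper's point is that hiding
    # communication behind compute costs little in quality.
    params[f] = 0.5 * params[f] + 0.5 * result["avg"]
```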

In the end, on a 10x overtrained 1B-parameter transformer, it reaches similar performance to a well-tuned data-parallel baseline while using more than 400x less bandwidth.
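
And a sketch of point 3: the data workers exchange can be quantized before it crosses the slow link. Simple symmetric int8 quantization is used here purely as a stand-in for whatever low-precision format the paper actually uses.

```python
# Illustrative sketch of quantizing exchanged pseudo-gradients.
import numpy as np

def quantize_int8(x):
    """Symmetric per-tensor int8 quantization: returns integer codes and a scale."""
    scale = np.max(np.abs(x)) / 127.0 + 1e-12
    return np.round(x / scale).astype(np.int8), scale

def dequantize_int8(codes, scale):
    return codes.astype(np.float32) * scale

rng = np.random.default_rng(0)
pseudo_grad = rng.normal(size=1024).astype(np.float32)   # one worker's outer delta

codes, scale = quantize_int8(pseudo_grad)     # 1 byte/param on the wire instead of 4
recovered = dequantize_int8(codes, scale)

bytes_fp32 = pseudo_grad.nbytes
bytes_int8 = codes.nbytes + 4                 # codes + one fp32 scale
print(f"wire size: {bytes_int8} B vs {bytes_fp32} B "
      f"(max abs error {np.max(np.abs(recovered - pseudo_grad)):.4f})")
```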


Hey Arthur,
Congrats on the mighty paper. Loved picking through it. Wouldn't "collaborative training" be better-suited terminology for this? I'm also a long-time, very enthusiastic observer of this line of work, BTW. Feeling the excitement growing around it.


