arxiv:2501.18512

Streaming DiLoCo with overlapping communication: Towards a Distributed Free Lunch

Published on Jan 30
Submitted by ArthurDouillard on Jan 31
#3 Paper of the day
Abstract

Training of large language models (LLMs) is typically distributed across a large number of accelerators to reduce training time. Since internal states and parameter gradients need to be exchanged at every gradient step, all devices need to be co-located using low-latency high-bandwidth communication links to support the required high volume of exchanged bits. Recently, distributed algorithms like DiLoCo have relaxed this co-location constraint: accelerators can be grouped into "workers", where synchronizations between workers only occur infrequently. This in turn means that workers can afford to be connected by lower-bandwidth communication links without affecting learning quality. However, in these methods, communication across workers still requires the same peak bandwidth as before, as the synchronizations require all parameters to be exchanged across all workers. In this paper, we improve DiLoCo in three ways. First, we synchronize only subsets of parameters in sequence, rather than all at once, which greatly reduces peak bandwidth. Second, we allow workers to continue training while synchronizing, which decreases wall clock time. Third, we quantize the data exchanged by workers, which further reduces bandwidth across workers. By properly combining these modifications, we show experimentally that we can distribute training of billion-scale parameters and reach similar quality as before, while reducing required bandwidth by two orders of magnitude.
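
To make the first modification concrete, here is a minimal sketch of the streaming-synchronization idea: parameters are split into fragments, and at each outer step only one fragment's pseudo-gradient is averaged across workers, so peak bandwidth drops by roughly the number of fragments. All names and hyperparameters below (NUM_FRAGMENTS, OUTER_LR, inner_steps, the round-robin schedule, plain SGD as the outer optimizer) are simplifying assumptions for illustration, not the paper's actual implementation.

```python
# Illustrative sketch of streaming synchronization: only one parameter
# fragment is exchanged per outer step instead of the full model.
import numpy as np

NUM_WORKERS = 4
NUM_FRAGMENTS = 8          # parameters are partitioned into this many fragments
PARAMS_PER_FRAGMENT = 16
OUTER_LR = 0.7             # illustrative; plain SGD stands in for the outer
                           # optimizer (DiLoCo uses Nesterov momentum)

rng = np.random.default_rng(0)
global_params = rng.normal(size=(NUM_FRAGMENTS, PARAMS_PER_FRAGMENT))
local_params = [global_params.copy() for _ in range(NUM_WORKERS)]

def inner_steps(params):
    """Stand-in for H local optimization steps on a worker's own data shard."""
    return params - 0.01 * rng.normal(size=params.shape)

for outer_step in range(32):
    # Every worker trains locally on all fragments.
    local_params = [inner_steps(p) for p in local_params]

    # Streaming sync: only ONE fragment is exchanged this outer step,
    # so peak bandwidth is ~1/NUM_FRAGMENTS of a full synchronization.
    f = outer_step % NUM_FRAGMENTS
    pseudo_grads = [global_params[f] - p[f] for p in local_params]
    avg_pseudo_grad = np.mean(pseudo_grads, axis=0)   # an all-reduce in practice

    # Outer update on the shared copy of that fragment, then broadcast it back.
    global_params[f] -= OUTER_LR * avg_pseudo_grad
    for p in local_params:
        p[f] = global_params[f]
```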

Community

Paper submitter

Distributed training for LLMs, an improvement over DiLoCo:

  1. Synchronize only a subset of the params at a time --> reduces peak bandwidth
  2. Overlap communication with computation --> increases tolerated latency (see the sketch right after this list)
  3. Quantize communication --> further reduces the total amount of data exchanged (see the quantization sketch further below)
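
A rough sketch of point 2: a background thread stands in for a slow cross-worker all-reduce while inner steps keep running, and the (slightly stale) result is merged back once it arrives. fake_all_reduce, the fragment layout, and the 0.5/0.5 merge rule are illustrative assumptions, not the paper's actual mechanism.

```python
# Illustrative sketch of overlapping communication with computation.
import threading
import numpy as np

rng = np.random.default_rng(0)
params = rng.normal(size=(8, 16))          # 8 fragments of 16 params each

def inner_step(p):
    """Stand-in for one local optimization step."""
    return p - 0.01 * rng.normal(size=p.shape)

def fake_all_reduce(fragment, out):
    # Stand-in for a slow cross-worker reduction over a low-bandwidth link.
    out["avg"] = fragment.copy()           # pretend the average equals our copy

for step in range(16):
    f = step % params.shape[0]
    result = {}
    # Launch communication for fragment f in the background ...
    comm = threading.Thread(target=fake_all_reduce, args=(params[f], result))
    comm.start()
    # ... and keep computing on ALL fragments while it is in flight.
    params = np.stack([inner_step(p) for p in params])
    comm.join()
    # Merge the slightly stale reduced fragment back into the live parameters.
    # The exact merge rule here is a guess; the paper's point is that hiding
    # communication behind compute costs little in quality.
    params[f] = 0.5 * params[f] + 0.5 * result["avg"]
```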

In the end, on a 10x overtrained 1B-parameter transformer, it reaches similar performance to a well-tuned data-parallel baseline while using more than 400x less bandwidth.
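
And a sketch of point 3: the data workers exchange can be quantized before it crosses the slow link. Simple symmetric int8 quantization is used here purely as a stand-in for whatever low-precision format the paper actually uses.

```python
# Illustrative sketch of quantizing exchanged pseudo-gradients.
import numpy as np

def quantize_int8(x):
    """Symmetric per-tensor int8 quantization: returns integer codes and a scale."""
    scale = np.max(np.abs(x)) / 127.0 + 1e-12
    return np.round(x / scale).astype(np.int8), scale

def dequantize_int8(codes, scale):
    return codes.astype(np.float32) * scale

rng = np.random.default_rng(0)
pseudo_grad = rng.normal(size=1024).astype(np.float32)   # one worker's outer delta

codes, scale = quantize_int8(pseudo_grad)     # 1 byte/param on the wire instead of 4
recovered = dequantize_int8(codes, scale)

bytes_fp32 = pseudo_grad.nbytes
bytes_int8 = codes.nbytes + 4                 # codes + one fp32 scale
print(f"wire size: {bytes_int8} B vs {bytes_fp32} B "
      f"(max abs error {np.max(np.abs(recovered - pseudo_grad)):.4f})")
```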


Hey Arthur,
Congrats on the mighty paper. Loved picking through it. Wouldn't "collaborative training" be better-suited terminology for this? I'm also a long-time, very enthusiastic observer of this line of work, BTW. Feeling the excitement growing around it.


