lvwerra (HF staff) committed
Commit 4d951cf · Parent: 60aea95

several fixes

assets/images/tp_diagram.svg CHANGED
assets/images/tp_diagram4.png CHANGED

Git LFS Details

  • SHA256: f075304c019e12be1ac0ef8afa9241c03bc466f568dca0c66e20b1391a471bca
  • Pointer size: 131 Bytes
  • Size of remote file: 486 kB

Git LFS Details

  • SHA256: a37adac220e4ec37dd58be698d26630520501c2de71161c6601d6318e1cbffcd
  • Pointer size: 131 Bytes
  • Size of remote file: 618 kB
dist/assets/images/5D_nutshell_tp_sp.svg CHANGED
dist/assets/images/5d_nutshell_cp.svg CHANGED
dist/assets/images/5d_nutshell_ep.svg CHANGED
dist/assets/images/tp_diagram.svg CHANGED
dist/assets/images/tp_diagram4.png CHANGED

Git LFS Details

  • SHA256: 92f1591b62f4f7eb8a059b973a379784523915386ee9f682e17e3ab43d4f494d
  • Pointer size: 130 Bytes
  • Size of remote file: 89.8 kB

Git LFS Details

  • SHA256: cb2772716631ff96aeab01b1eb6cc8e59927d4f30cba72d8ba506dcf326406c7
  • Pointer size: 131 Bytes
  • Size of remote file: 129 kB
dist/index.html CHANGED
@@ -18,8 +18,28 @@
  "title": "The Ultra-Scale Playbook: Training LLMs on GPU Clusters",
  "description": "This blog covers everything about scaling LLMs in 2025.",
  "published": "Feb 19, 2025",
- "affiliation": {"name": "HuggingFace"},
+ "affiliation": {"name": "Hugging Face"},
  "authors": [
+ {
+ "author":"Nouamane Tazi",
+ "authorURL":"https://huggingface.co/nouamanetazi"
+ },
+ {
+ "author":"Ferdinand Mom",
+ "authorURL":"https://huggingface.co/3outeille"
+ },
+ {
+ "author":"Haojun Zhao",
+ "authorURL":"https://huggingface.co/zzhhjjj"
+ },
+ {
+ "author":"Phuc Nguyen",
+ "authorURL":"https://huggingface.co/neuralink"
+ },
+ {
+ "author":"Mohamed Mekkouri",
+ "authorURL":"https://huggingface.co/medmekk"
+ },
  {
  "author":"Leandro Werra",
  "authorURL":"https://huggingface.co/lvwerra"
@@ -202,6 +222,8 @@
  </li>
  </ul>

+ <aside>If you want to watch a video on distributed training rather than reading the blog or picotron code checkout <a href="https://www.youtube.com/watch?v=u2VSwDDpaBM&list=PL-_armZiJvAnhcRr6yTJ0__f3Oi-LLi9S">Ferdinand's YouTube channel</a>.</aside>
+
  <!-- <p><img alt="Picotron implements each key concept in a self-contained way, such that the method can be studied separately and in isolation." src="assets/images/placeholder.png" /></p> -->

  <p><strong>Real training efficiency benchmarks:</strong> Finally, how to <em>actually</em> scale your LLM training depends on your infrastructure, such as the kind of chips, interconnect etc., and we can’t give a single unified recipe. What we will give though is a way to benchmark several setups and it is what we have done on our cluster! We ran over 4100 distributed experiments (over 16k including test runs) with up to 512 GPUs to scan many possible distributed training layouts and model sizes. </p>
@@ -580,7 +602,7 @@
  </ul>

  <p><img alt="profile_trace_annotated.png" src="/assets/images/profile_trace_annotated.png" /></p>
- <p>Figure: Example trace showing CPU thread launching kernels asynchronously to GPU, with compute kernels and communication happening in parallel across different CUDA streams</p>
+ <div class="figure-legend"><p>Example trace showing CPU thread launching kernels asynchronously to GPU, with compute kernels and communication happening in parallel across different CUDA streams</p></div>

  <p>The trace helps identify bottlenecks like:</p>
  <ul>
@@ -1080,11 +1102,9 @@

  <p>In practice we’ll go from the left diagram to the right:</p>

- <p><img alt=" in forward: f = no-op ; f* = all-reduce ; g = all-gather ; g* = reduce-scatter
+ <p style="text-align: center"><img alt=" in forward: f = no-op ; f* = all-reduce ; g = all-gather ; g* = reduce-scatter
  in backward: f = all-reduce ; f* = no-op ; g = reduce-scatter ; g* = all-gather
- SP region needs full hidden_dim" src="/assets/images/tp_sp_diagram.png" /></p>
-
- <p>Where the abbreviations are: in forward: f = no-op ; f<em> = all-reduce ; g = all-gather ; g</em> = reduce-scatter in backward: f = all-reduce ; f<em> = no-op ; g = reduce-scatter ; g</em> = all-gather SP region needs full hidden_dim</p>
+ SP region needs full hidden_dim" src="/assets/images/tp_sp_diagram.png" style="width: 500px" /></p>

  <p>The diagram shows how we transition between tensor-parallel and sequence-parallel regions using different collective operations (labeled "f" and "g"). The key challenge is managing these transitions efficiently while keeping memory usage low and maintaining correctness.</p>

@@ -1099,7 +1119,7 @@
  <li>"f" is an all-reduce to synchronize gradients</li>
  </ul>

- <p>These operations "f" and "f<em>" are called </em><em>conjugate</em>* pairs because they complement each other - when one is a no-op in forward, the other is an all-reduce in backward, and vice versa.</p>
+ <p>These operations "f" and "f*" are called <strong>conjugate</strong> pairs because they complement each other - when one is a no-op in forward, the other is an all-reduce in backward, and vice versa.</p>

  <p>For sequence parallelism (SP), we use different operations labeled "g" and "g*". Specifically, we avoid using all-reduce in the SP region since that would require gathering the full activations and increase our peak memory usage, defeating the purpose of SP.</p>

@@ -1900,13 +1920,13 @@
  <p>On the compute side, GPUs consist of an array of compute units called <strong>Streaming Multiprocessors</strong> (SM). Each SM contains and controls a set of streaming processors, also known as cores. For example, an Nvidia H100 GPU has 132 SMs with 128 cores per SM, resulting in a total of 16,896 cores (see <a href="https://resources.nvidia.com/en-us-tensor-core">docs for tensor cores</a> for details), each capable of handling multiple threads simultaneously.</p>

  <p><img alt="image.png" src="/assets/images/diving_primergpu.svg" /></p>
- <p><em>Source: https://blog.codingconfessions.com/p/gpu-computing.</em></p>
+ <div class="figure-legend"><p>Source: https://blog.codingconfessions.com/p/gpu-computing</p></div>

  <p>The memory side is also highly hierarchical with several layers of cache and memory: <strong>Registers</strong> are the smallest units and are private to the threads during executions, <strong>Shared Memory</strong> and <strong>L1 cache are</strong> shared between the threads running on a single SM, higher up is the <strong>L2 cache</strong> shared by all SMs, finally there is the <strong>Global Memory</strong> which is the largest memory on the GPU (the advertised 80 GB for a H100 for instance) but also the slowest to access and query.</p>

  <p><img alt="image.png" src="/assets/images/diving_primergpu2.svg" /></p>
- <p><em>Source: https://www.youtube.com/watch?v=ZQKMZIP3Fzg</em></p>
-
+ <div class="figure-legend"><p>Source: https://www.youtube.com/watch?v=ZQKMZIP3Fzg</p></div>
+
  <p>The goal of GPU will be to run as many workloads as possible, in parallel, on the GPU cores, by taking advantage of this hierarchical organization of compute/memory.</p>

  <p>A piece of code running on a core of the GPU is called a <strong>kernel</strong>. It can be written at a high-level in <strong>CUDA</strong> or <strong>Triton</strong> for instance, and is then compiled to Parallel Thread Execution, PTX, the low-level assembly used by NVIDIA GPUs.</p>
@@ -1914,9 +1934,10 @@
  <p>To run the kernel, you will also need a specific code part, called <strong>host code</strong>, which is executed on the <strong>CPU/host</strong> and will take care of preparing data allocations and loading data and code.</p>

  <p><img alt="image.png" src="/assets/images/placeholder.png" /></p>
- <p>Figure 5: Host code for a CUDA kernel for adding two vectors from https://blog.codingconfessions.com/p/gpu-computing</p>
+ <div class="figure-legend"><p>Host code for a CUDA kernel for adding two vectors from https://blog.codingconfessions.com/p/gpu-computing</p></div>
+
  <p><img alt="image.png" src="/assets/images/placeholder.png" /></p>
- <p>Figure 6: Device code containing the definition of the vector addition kernel from https://blog.codingconfessions.com/p/gpu-computing</p>
+ <div class="figure-legend"><p>Device code containing the definition of the vector addition kernel from https://blog.codingconfessions.com/p/gpu-computing</p></div>

  <p>Kernels are generally scheduled as follow:</p>

@@ -2091,7 +2112,7 @@
  <p><img alt="image.png" src="/assets/images/memorycoalescing5.png" /></p>


- <p>We also notice that the execution time of the kernel <strong>decreases by 10x</strong> !</p>
+ <p>We also notice that the execution time of the kernel <strong>decreases by 10x</strong>!</p>
  <p>Let’s cover another technique you will often see mentioned in the litterature: tiling.</p>


@@ -2197,14 +2218,14 @@

  <p>A basic implementation of the attention mechanism involve a lot of transfer between memory and workers. It requires materializing the S and P matrices in HBM which means that the results need to be sent to HBM and then back to SRAM for the next computations:</p>

- <p><img alt="image.png" src="/assets/images/flashattn.png" /></p>
-
+ <p style="text-align: center"><img alt="image.png" src="/assets/images/flashattn.png" style="width: 500px" /></p>
+
  <p>Since bandwidth is much lower in HBM this introduces a severe bottleneck in the attention computation. Can we do better? Tri Dao says yes!</p>

  <p>The key element is to compute the S matrices in small pieces which can fit in the smaller shared memory of the SM. But we can do even better and avoid materializing the very large S matrix all together in favor of keeping only the necessary statistics for computing the normalization factor of the softmax. So we can compute part of <d-math>O</d-math> directly in one computation in SRAM rather than moving intermediate results back and forth. In this case, not even do we make use of the shared memory but we also release the memory bottleneck resulting from materializing one of the largest activation matrices in the model (at long context length), the attention matrix.</p>

  <p><img alt="image.png" src="/assets/images/flashattn2.png" /></p>
- <p>From the FLASH-ATTENTION paper<d-cite bibtex-key="dao2022flashattention"></d-cite></p>
+ <div class="figure-legend"><p>Source: FlashAttention paper<d-cite bibtex-key="dao2022flashattention"></d-cite></p></div>

  <p>The idea of flash attention resolves so many bottlenecks in model training that it has quickly become the default way to perform attention in all transformers:</p>
  <ul>
@@ -2503,9 +2524,14 @@
  <li>Start from scratch and implement an algorithm yourself. Often a method only fully “clicks” if you implemented it yourself.</li>
  <li>Dive into one of the widely used frameworks and start contributing: fix bugs, answer issues, or implement a new feature. That’s the best way to get in any ML field!</li>
  </ul>
-
+
  <p>We hope this book helps you get started in distributed training and that you will train the next generation of awesome models to the hum of your GPU cluster!</p>

+ <h3>Acknowledgements</h3>
+
+ <p>We thank <a href="https://huggingface.co/eliebak">Elie</a> for conducting thorough reviews and creating the audio components using NotebookLM. Special thanks to <a href="https://huggingface.co/hynky">Hynek</a> for optimizing the frontend performance. We also thank <a href="https://huggingface.co/sbrandeis">Simon</a> for resolving some issues on the hub.</p>
+
+
  <h2>References</h2>

  <h3>Landmark LLM Scaling Papers</h3>
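As a reading aid for the tensor-/sequence-parallelism hunks above: the "f"/"f*" pair described there (no-op in forward and all-reduce in backward, with the roles swapped for "f*") is usually written as a pair of autograd functions. Below is a minimal sketch, assuming PyTorch with an already-initialized torch.distributed process group; the class names are invented for illustration and are not taken from the blog or picotron.

import torch
import torch.distributed as dist

class F(torch.autograd.Function):
    """The "f" operation: identity (no-op) in forward, all-reduce of the gradient in backward."""
    @staticmethod
    def forward(ctx, x):
        return x

    @staticmethod
    def backward(ctx, grad_output):
        grad = grad_output.clone()
        dist.all_reduce(grad, op=dist.ReduceOp.SUM)  # synchronize gradients across the TP group
        return grad

class FStar(torch.autograd.Function):
    """The conjugate "f*" operation: all-reduce in forward, identity (no-op) in backward."""
    @staticmethod
    def forward(ctx, x):
        out = x.clone()
        dist.all_reduce(out, op=dist.ReduceOp.SUM)  # combine partial activations in forward
        return out

    @staticmethod
    def backward(ctx, grad_output):
        return grad_output

The "g"/"g*" pair used around the sequence-parallel region follows the same pattern, with all-gather and reduce-scatter taking the place of the no-op and all-reduce.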
src/index.html CHANGED
Same changes as in dist/index.html above; the hunks are identical (@@ -18,8 +18,28 @@, @@ -202,6 +222,8 @@, @@ -580,7 +602,7 @@, @@ -1080,11 +1102,9 @@, @@ -1099,7 +1119,7 @@, @@ -1900,13 +1920,13 @@, @@ -1914,9 +1934,10 @@, @@ -2091,7 +2112,7 @@, @@ -2197,14 +2218,14 @@, @@ -2503,9 +2524,14 @@).
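The two figure legends edited above refer to the host and device code of a CUDA vector-addition example from https://blog.codingconfessions.com/p/gpu-computing. Since the surrounding text notes that kernels can also be written in Triton, here is a rough sketch of the same host/device split in Triton (an illustrative assumption, not the code shown in those figures): the @triton.jit function is the device code, and the small wrapper underneath is the host code that allocates the output and launches the kernel over a 1D grid.

import torch
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    # Device code: each program instance handles one BLOCK_SIZE-wide slice of the vectors.
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements  # guard the tail when n_elements is not a multiple of BLOCK_SIZE
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)

def add(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    # Host code: allocate the result and launch one program instance per block of 1024 elements.
    out = torch.empty_like(x)
    n_elements = out.numel()
    grid = (triton.cdiv(n_elements, 1024),)
    add_kernel[grid](x, y, out, n_elements, BLOCK_SIZE=1024)
    return out

Calling add(a, b) on two CUDA tensors of the same shape launches ceil(n / 1024) program instances on the GPU.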