thomwolf (HF staff) committed
Commit fff70fc · verified · 1 Parent(s): 31ab930
assets/images/torch-compile-triton-kernel.png ADDED

Git LFS Details

  • SHA256: 5089051b4eb8fdce48de619330a97a97813ce9695e3ffa706f08406abda2f776
  • Pointer size: 131 Bytes
  • Size of remote file: 113 kB
assets/images/torch-compile-triton.png ADDED

Git LFS Details

  • SHA256: ee020e48eebdbde5f5b75ae65e63a946961f0219fe3d97969d08712fae81d173
  • Pointer size: 131 Bytes
  • Size of remote file: 102 kB
dist/assets/images/torch-compile-triton-kernel.png ADDED

Git LFS Details

  • SHA256: 5089051b4eb8fdce48de619330a97a97813ce9695e3ffa706f08406abda2f776
  • Pointer size: 131 Bytes
  • Size of remote file: 113 kB
dist/assets/images/torch-compile-triton.png ADDED

Git LFS Details

  • SHA256: ee020e48eebdbde5f5b75ae65e63a946961f0219fe3d97969d08712fae81d173
  • Pointer size: 131 Bytes
  • Size of remote file: 102 kB
dist/index.html CHANGED
@@ -1913,11 +1913,62 @@
 
     <p>To run the kernel, you will also need a specific code part, called <strong>host code</strong>, which is executed on the <strong>CPU/host</strong> and will take care of preparing data allocations and loading data and code.</p>
 
-    <p><img alt="image.png" src="/assets/images/placeholder.png" /></p>
+    <div class="l-body" style="display: grid; grid-template-columns: 1fr 1fr; align-items: center;">
+        <div>
+            <d-code block language="python">
+                // Host code
+                void vecAdd(float* h_A, float* h_B, float* h_C, int n) {
+                    // Allocate vectors in device memory
+                    int size = n * sizeof(float);
+                    float *d_A, *d_B, *d_C;
+                    cudaMalloc(&d_A, size);
+                    cudaMalloc(&d_B, size);
+                    cudaMalloc(&d_C, size);
+
+                    // Copy vectors from host memory to device memory
+                    cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);
+                    cudaMemcpy(d_B, h_B, size, cudaMemcpyHostToDevice);
+
+                    // Invoke kernel
+                    int threadsPerBlock = 256;
+                    int blocksPerGrid =
+                        (n + threadsPerBlock - 1) / threadsPerBlock;
+                    VecAdd&lt;&lt;&lt;blocksPerGrid, threadsPerBlock&gt;&gt;&gt;(d_A, d_B, d_C, n);
+
+                    // Copy result from device memory to host memory
+                    // h_C contains the result in host memory
+                    cudaMemcpy(h_C, d_C, size, cudaMemcpyDeviceToHost);
+
+                    // Free device memory
+                    cudaFree(d_A);
+                    cudaFree(d_B);
+                    cudaFree(d_C);
+                }</d-code>
+            <div class="figure-legend">
+                <p>Host code for a CUDA kernel for adding two vectors. Adapted from https://docs.nvidia.com/cuda/cuda-c-programming-guide/ and https://blog.codingconfessions.com/p/gpu-computing</p>
+            </div>
+        </div>
+        <div>
+            <d-code block language="python">
+                // Device code
+                __global__ void VecAdd(float* A, float* B, float* C, int N)
+                {
+                    int i = blockDim.x * blockIdx.x + threadIdx.x;
+                    if (i < N)
+                        C[i] = A[i] + B[i];
+                }
+            </d-code>
+            <div class="figure-legend">
+                <p>Device code containing the definition of the vector addition kernel. Adapted from https://docs.nvidia.com/cuda/cuda-c-programming-guide/ and https://blog.codingconfessions.com/p/gpu-computing</p>
+
+            </div>
+        </div>
+    </div>
+
+    <!-- <p><img alt="image.png" src="/assets/images/placeholder.png" /></p>
     <p>Figure 5: Host code for a CUDA kernel for adding two vectors from https://blog.codingconfessions.com/p/gpu-computing</p>
     <p><img alt="image.png" src="/assets/images/placeholder.png" /></p>
-    <p>Figure 6: Device code containing the definition of the vector addition kernel from https://blog.codingconfessions.com/p/gpu-computing</p>
-
+    -->
     <p>Kernels are generally scheduled as follows:</p>
 
     <ul>
@@ -1953,8 +2004,9 @@
 
     <p>The distinction between the compiled and non-compiled versions is striking, especially given that we only added a single decorator. This remarkable difference is illustrated in the graph below (N is the number of columns):</p>
 
-    <p><img alt="image.png" src="/assets/images/placeholder.png" /></p>
+    <p><img alt="image.png" src="/assets/images/torch-compile-triton.png" /></p>
 
+    <!-- <p><img alt="image.png" src="/assets/images/dp_scaling.svg"/></p> -->
 
     <p>However, if this performance increase is insufficient, you can consider implementing Triton kernels. As a starting point, you can take a look at the Triton kernel generated by <code>@torch.compile</code>. To do so, you simply need to set the environment variable <code>TORCH_LOGS</code> to <code>"output_code"</code>:</p>
 
@@ -1982,7 +2034,7 @@
     tl.store(out_ptr0 + (x0), tmp6, xmask)
     </d-code>
 
-    <p>To enhance readability, we can modify the variable names, add comments, and make slight adjustments, as demonstrated below:</p>
+    <p>To enhance readability, we can modify the variable names, add comments, and make slight adjustments (or ask an LLM to do it for us), as demonstrated below:</p>
 
     <d-code block language="python">
     @triton.jit
@@ -2013,23 +2065,25 @@
 
     <p>When we benchmark the generated kernel using <code>triton.testing.Benchmark</code>, we have the following performance:</p>
 
-    <p><img alt="image.png" src="/assets/images/placeholder.png" /></p>
+    <p><img alt="image.png" src="/assets/images/torch-compile-triton-kernel.png" /></p>
 
-    <p>This standalone kernel demonstrates superior performance with smaller sizes compared to <code>@torch.compile</code> but this is likely here just an artifact from the compilation time of <code>torch.compile</code>. In any case, instead of starting from scratch, we can focus on optimizing this generated kernel, saving us time in the process. </p>
+    <p>This standalone kernel even demonstrates superior performance at smaller sizes compared to <code>@torch.compile</code>, but this is likely just an artifact of the compilation time of <code>torch.compile</code>. In any case, instead of starting from scratch, remember that you can start from such generated kernels and focus your attention on optimizing their performance, saving a lot of time in the process.</p>
 
-    <p>However, in Triton, sometimes, we cannot fully achieve the peak performance of the device due to limitations in handling shared memory and scheduling within streaming multiprocessors (SMs). Our access is restricted to blocks, allowing us only to manage the scheduling of blocks across SMs. To gain even more control, we will need to implement kernels in CUDA, where we have access to all the underlying components.</p>
+    <p>Even in Triton, however, we sometimes cannot fully reach the peak performance of the device, because the language cannot handle low-level details such as shared memory and scheduling within streaming multiprocessors (SMs): Triton only lets us work with blocks and with the scheduling of blocks across SMs. To gain even deeper control, you will need to implement kernels directly in CUDA, where you have access to all the underlying low-level details.</p>
 
-    <p>In CUDA, there are various techniques that can be employed to make kernels more efficient; we will present just a few. These include optimizing memory access patterns to reduce latency, using shared memory to store frequently accessed data, and managing thread workloads to minimize idle times. In summary, the tools for writing code to execute instructions on the GPU are:</p>
+    <p>Moving down to CUDA, various techniques can be employed to improve the efficiency of kernels. We will cover just a few here: optimizing memory access patterns to reduce latency, using shared memory to store frequently accessed data, and managing thread workloads to minimize idle times.</p>
+
+    <p>Before we dive deeper into CUDA examples, let's summarize the tools we've seen that let us write kernel code to execute instructions on the GPU:</p>
 
-    <ul>
+    <ol>
     <li>PyTorch: easy but slow</li>
     <li>torch.compile: easy, fast, but not flexible</li>
    <li>triton: harder, faster, and more flexible</li>
    <li>CUDA: hardest, fastest, and most flexible (if you get it right)</li>
 
-    </ul>
+    </ol>
 
-    <p>Let’s talk about one of the most frequent technique we can use: optimizing memory access. The global memory in GPUs (the largest memory in our above graph) has a long latency and low bandwidth in comparison to the cache which often creates a major bottleneck for most applications. Efficiently accessing data from global memory can improve a lot the performance.</p>
+    <p>Let’s talk about one of the most frequent techniques we can use in CUDA: optimizing memory access. The global memory in GPUs (the largest memory in the diagram above) has high latency and low bandwidth compared to the caches, which often creates a major bottleneck for most applications. Accessing data in global memory efficiently can greatly improve performance.</p>
 
     <h4>Memory Coalescing</h4>
 
@@ -2060,8 +2114,12 @@
 
     <p>However, when profiling this kernel with a tool like <code>ncu</code>, we can see issues, including low memory throughput and uncoalesced memory accesses.</p>
 
-    <p><img alt="image.png" src="/assets/images/memorycoalescing2.png" /></p>
-    <p><img alt="image.png" src="/assets/images/memorycoalescing3.png" /></p>
+    <div class="large-image-background">
+        <img width="1200px" alt="image.png" src="/assets/images/memorycoalescing2.png" />
+    </div>
+    <div class="large-image-background">
+        <img width="1200px" alt="image.png" src="/assets/images/memorycoalescing3.png" />
+    </div>
 
 
     <p>The reason for this is that in this kernel, two threads in the same block with Thread IDs <code>(0, 0)</code> and <code>(1, 0)</code> (which will end up in the same warp) will both load from the same column of matrix <code>B</code> but different rows of matrix <code>A</code>. Since matrix elements are stored in row-major order (meaning each row's elements are in consecutive memory addresses, as shown in the figure below), in the first iteration with <code>i = 0</code>, thread <code>(0, 0)</code> will load <d-math>A_{0,0}</d-math>, and thread <code>(1, 0)</code> will load <d-math>A_{1,0}</d-math>. These elements are not stored close to each other in memory, and this misalignment repeats across all iterations along the shared dimension, preventing memory accesses from being coalesced.</p>
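The torch.compile workflow added above can be reproduced with a short, self-contained script. This is a minimal sketch, not the article's exact code: the element-wise function, tensor shapes, and variable names are illustrative, and a CUDA-capable GPU is assumed. It compiles the function with torch.compile and sets TORCH_LOGS="output_code" so that the Triton kernel generated by the inductor backend is printed when the compiled function first runs.

import os

# TORCH_LOGS is normally set in the shell before launching the script;
# setting it here before importing torch has the same effect.
os.environ["TORCH_LOGS"] = "output_code"

import torch

def elementwise_op(x: torch.Tensor) -> torch.Tensor:
    # Stand-in for the article's example: a simple element-wise function.
    return torch.where(x <= 0, torch.exp(x) - 1.0, x)

# The "single decorator" version: the same function, wrapped by torch.compile.
compiled_op = torch.compile(elementwise_op)

x = torch.randn(4096, 4096, device="cuda")
out_eager = elementwise_op(x)     # eager PyTorch
out_compiled = compiled_op(x)     # first call compiles and logs the generated Triton kernel
torch.testing.assert_close(out_eager, out_compiled)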
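The benchmark mentioned above is typically set up with triton.testing as in the sketch below. It only illustrates the harness: the value ranges, labels, and plot name are placeholders, and it compares eager PyTorch against @torch.compile as stand-ins, whereas the article's plot compares the standalone Triton kernel against @torch.compile.

import torch
import triton
import triton.testing

def elementwise_op(x):
    # Same stand-in element-wise function as in the previous sketch.
    return torch.where(x <= 0, torch.exp(x) - 1.0, x)

compiled_op = torch.compile(elementwise_op)

@triton.testing.perf_report(
    triton.testing.Benchmark(
        x_names=["N"],                         # number of columns, as in the plot above
        x_vals=[2**i for i in range(10, 16)],
        line_arg="provider",
        line_vals=["eager", "compiled"],
        line_names=["PyTorch eager", "@torch.compile"],
        ylabel="GB/s",
        plot_name="elementwise-op-performance",
        args={},
    )
)
def benchmark(N, provider):
    x = torch.randn(4096, N, device="cuda", dtype=torch.float32)
    fn = (lambda: elementwise_op(x)) if provider == "eager" else (lambda: compiled_op(x))
    ms = triton.testing.do_bench(fn)                 # runtime in milliseconds
    bytes_moved = 2 * x.numel() * x.element_size()   # one read + one write per element
    return bytes_moved * 1e-9 / (ms * 1e-3)          # effective bandwidth in GB/s

benchmark.run(print_data=True, show_plots=False)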
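The matmul kernel being profiled with ncu is not part of this diff, so here is an illustrative CUDA sketch (not the article's exact kernel) of the access pattern the last paragraph describes, together with the usual coalescing fix: letting the fast-varying threadIdx.x index the columns so that the threads of a warp touch consecutive addresses.

// Illustrative sketch, not the exact kernel from the article.

// Naive mapping: threadIdx.x indexes the row. Threads (0,0) and (1,0) of a warp
// then read A[0*K + i] and A[1*K + i], which are K floats apart in row-major
// storage, so the warp's loads from A cannot be coalesced.
__global__ void matmul_naive(int M, int N, int K,
                             const float *A, const float *B, float *C) {
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    int col = blockIdx.y * blockDim.y + threadIdx.y;
    if (row < M && col < N) {
        float acc = 0.0f;
        for (int i = 0; i < K; ++i)
            acc += A[row * K + i] * B[i * N + col];
        C[row * N + col] = acc;
    }
}

// Coalesced mapping: threadIdx.x indexes the column instead. Threads of a warp now
// share the same row of A (a broadcast) and read consecutive elements of B and C,
// so each access by the warp falls on contiguous memory and can be coalesced.
__global__ void matmul_coalesced(int M, int N, int K,
                                 const float *A, const float *B, float *C) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < M && col < N) {
        float acc = 0.0f;
        for (int i = 0; i < K; ++i)
            acc += A[row * K + i] * B[i * N + col];
        C[row * N + col] = acc;
    }
}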
dist/style.css CHANGED
@@ -424,3 +424,15 @@ d-article {
 d-code {
   font-size: 12px;
 }
+
+.large-image-background {
+  width: 100vw;
+  padding-top: 10px;
+  padding-bottom: 10px;
+  margin-left: calc(-50vw + 50%);
+  margin-right: calc(-50vw + 50%);
+  background: white;
+  height: fit-content; /* This will make it match the image height */
+  display: flex;
+  justify-content: center; /* This will center your image */
+}
src/index.html CHANGED
@@ -1913,11 +1913,62 @@
 
     <p>To run the kernel, you will also need a specific code part, called <strong>host code</strong>, which is executed on the <strong>CPU/host</strong> and will take care of preparing data allocations and loading data and code.</p>
 
-    <p><img alt="image.png" src="/assets/images/placeholder.png" /></p>
+    <div class="l-body" style="display: grid; grid-template-columns: 1fr 1fr; align-items: center;">
+        <div>
+            <d-code block language="python">
+                // Host code
+                void vecAdd(float* h_A, float* h_B, float* h_C, int n) {
+                    // Allocate vectors in device memory
+                    int size = n * sizeof(float);
+                    float *d_A, *d_B, *d_C;
+                    cudaMalloc(&d_A, size);
+                    cudaMalloc(&d_B, size);
+                    cudaMalloc(&d_C, size);
+
+                    // Copy vectors from host memory to device memory
+                    cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);
+                    cudaMemcpy(d_B, h_B, size, cudaMemcpyHostToDevice);
+
+                    // Invoke kernel
+                    int threadsPerBlock = 256;
+                    int blocksPerGrid =
+                        (n + threadsPerBlock - 1) / threadsPerBlock;
+                    VecAdd&lt;&lt;&lt;blocksPerGrid, threadsPerBlock&gt;&gt;&gt;(d_A, d_B, d_C, n);
+
+                    // Copy result from device memory to host memory
+                    // h_C contains the result in host memory
+                    cudaMemcpy(h_C, d_C, size, cudaMemcpyDeviceToHost);
+
+                    // Free device memory
+                    cudaFree(d_A);
+                    cudaFree(d_B);
+                    cudaFree(d_C);
+                }</d-code>
+            <div class="figure-legend">
+                <p>Host code for a CUDA kernel for adding two vectors. Adapted from https://docs.nvidia.com/cuda/cuda-c-programming-guide/ and https://blog.codingconfessions.com/p/gpu-computing</p>
+            </div>
+        </div>
+        <div>
+            <d-code block language="python">
+                // Device code
+                __global__ void VecAdd(float* A, float* B, float* C, int N)
+                {
+                    int i = blockDim.x * blockIdx.x + threadIdx.x;
+                    if (i < N)
+                        C[i] = A[i] + B[i];
+                }
+            </d-code>
+            <div class="figure-legend">
+                <p>Device code containing the definition of the vector addition kernel. Adapted from https://docs.nvidia.com/cuda/cuda-c-programming-guide/ and https://blog.codingconfessions.com/p/gpu-computing</p>
+
+            </div>
+        </div>
+    </div>
+
+    <!-- <p><img alt="image.png" src="/assets/images/placeholder.png" /></p>
     <p>Figure 5: Host code for a CUDA kernel for adding two vectors from https://blog.codingconfessions.com/p/gpu-computing</p>
     <p><img alt="image.png" src="/assets/images/placeholder.png" /></p>
-    <p>Figure 6: Device code containing the definition of the vector addition kernel from https://blog.codingconfessions.com/p/gpu-computing</p>
-
+    -->
     <p>Kernels are generally scheduled as follows:</p>
 
     <ul>
@@ -1953,8 +2004,9 @@
 
     <p>The distinction between the compiled and non-compiled versions is striking, especially given that we only added a single decorator. This remarkable difference is illustrated in the graph below (N is the number of columns):</p>
 
-    <p><img alt="image.png" src="/assets/images/placeholder.png" /></p>
+    <p><img alt="image.png" src="/assets/images/torch-compile-triton.png" /></p>
 
+    <!-- <p><img alt="image.png" src="/assets/images/dp_scaling.svg"/></p> -->
 
     <p>However, if this performance increase is insufficient, you can consider implementing Triton kernels. As a starting point, you can take a look at the Triton kernel generated by <code>@torch.compile</code>. To do so, you simply need to set the environment variable <code>TORCH_LOGS</code> to <code>"output_code"</code>:</p>
 
@@ -1982,7 +2034,7 @@
     tl.store(out_ptr0 + (x0), tmp6, xmask)
     </d-code>
 
-    <p>To enhance readability, we can modify the variable names, add comments, and make slight adjustments, as demonstrated below:</p>
+    <p>To enhance readability, we can modify the variable names, add comments, and make slight adjustments (or ask an LLM to do it for us), as demonstrated below:</p>
 
     <d-code block language="python">
     @triton.jit
@@ -2013,23 +2065,25 @@
 
     <p>When we benchmark the generated kernel using <code>triton.testing.Benchmark</code>, we have the following performance:</p>
 
-    <p><img alt="image.png" src="/assets/images/placeholder.png" /></p>
+    <p><img alt="image.png" src="/assets/images/torch-compile-triton-kernel.png" /></p>
 
-    <p>This standalone kernel demonstrates superior performance with smaller sizes compared to <code>@torch.compile</code> but this is likely here just an artifact from the compilation time of <code>torch.compile</code>. In any case, instead of starting from scratch, we can focus on optimizing this generated kernel, saving us time in the process. </p>
+    <p>This standalone kernel even demonstrates superior performance at smaller sizes compared to <code>@torch.compile</code>, but this is likely just an artifact of the compilation time of <code>torch.compile</code>. In any case, instead of starting from scratch, remember that you can start from such generated kernels and focus your attention on optimizing their performance, saving a lot of time in the process.</p>
 
-    <p>However, in Triton, sometimes, we cannot fully achieve the peak performance of the device due to limitations in handling shared memory and scheduling within streaming multiprocessors (SMs). Our access is restricted to blocks, allowing us only to manage the scheduling of blocks across SMs. To gain even more control, we will need to implement kernels in CUDA, where we have access to all the underlying components.</p>
+    <p>Even in Triton, however, we sometimes cannot fully reach the peak performance of the device, because the language cannot handle low-level details such as shared memory and scheduling within streaming multiprocessors (SMs): Triton only lets us work with blocks and with the scheduling of blocks across SMs. To gain even deeper control, you will need to implement kernels directly in CUDA, where you have access to all the underlying low-level details.</p>
 
-    <p>In CUDA, there are various techniques that can be employed to make kernels more efficient; we will present just a few. These include optimizing memory access patterns to reduce latency, using shared memory to store frequently accessed data, and managing thread workloads to minimize idle times. In summary, the tools for writing code to execute instructions on the GPU are:</p>
+    <p>Moving down to CUDA, various techniques can be employed to improve the efficiency of kernels. We will cover just a few here: optimizing memory access patterns to reduce latency, using shared memory to store frequently accessed data, and managing thread workloads to minimize idle times.</p>
+
+    <p>Before we dive deeper into CUDA examples, let's summarize the tools we've seen that let us write kernel code to execute instructions on the GPU:</p>
 
-    <ul>
+    <ol>
     <li>PyTorch: easy but slow</li>
     <li>torch.compile: easy, fast, but not flexible</li>
    <li>triton: harder, faster, and more flexible</li>
    <li>CUDA: hardest, fastest, and most flexible (if you get it right)</li>
 
-    </ul>
+    </ol>
 
-    <p>Let’s talk about one of the most frequent technique we can use: optimizing memory access. The global memory in GPUs (the largest memory in our above graph) has a long latency and low bandwidth in comparison to the cache which often creates a major bottleneck for most applications. Efficiently accessing data from global memory can improve a lot the performance.</p>
+    <p>Let’s talk about one of the most frequent techniques we can use in CUDA: optimizing memory access. The global memory in GPUs (the largest memory in the diagram above) has high latency and low bandwidth compared to the caches, which often creates a major bottleneck for most applications. Accessing data in global memory efficiently can greatly improve performance.</p>
 
     <h4>Memory Coalescing</h4>
 
@@ -2060,8 +2114,12 @@
 
     <p>However, when profiling this kernel with a tool like <code>ncu</code>, we can see issues, including low memory throughput and uncoalesced memory accesses.</p>
 
-    <p><img alt="image.png" src="/assets/images/memorycoalescing2.png" /></p>
-    <p><img alt="image.png" src="/assets/images/memorycoalescing3.png" /></p>
+    <div class="large-image-background">
+        <img width="1200px" alt="image.png" src="/assets/images/memorycoalescing2.png" />
+    </div>
+    <div class="large-image-background">
+        <img width="1200px" alt="image.png" src="/assets/images/memorycoalescing3.png" />
+    </div>
 
 
     <p>The reason for this is that in this kernel, two threads in the same block with Thread IDs <code>(0, 0)</code> and <code>(1, 0)</code> (which will end up in the same warp) will both load from the same column of matrix <code>B</code> but different rows of matrix <code>A</code>. Since matrix elements are stored in row-major order (meaning each row's elements are in consecutive memory addresses, as shown in the figure below), in the first iteration with <code>i = 0</code>, thread <code>(0, 0)</code> will load <d-math>A_{0,0}</d-math>, and thread <code>(1, 0)</code> will load <d-math>A_{1,0}</d-math>. These elements are not stored close to each other in memory, and this misalignment repeats across all iterations along the shared dimension, preventing memory accesses from being coalesced.</p>
src/style.css CHANGED
@@ -424,3 +424,15 @@ d-article {
 d-code {
   font-size: 12px;
 }
+
+.large-image-background {
+  width: 100vw;
+  padding-top: 10px;
+  padding-bottom: 10px;
+  margin-left: calc(-50vw + 50%);
+  margin-right: calc(-50vw + 50%);
+  background: white;
+  height: fit-content; /* This will make it match the image height */
+  display: flex;
+  justify-content: center; /* This will center your image */
+}