GPU Programming
I’m documenting my GPU programming journey through this blog post. I do not have an NVIDIA GPU, but I do have an Apple M2; the concepts are similar irrespective of the manufacturer.
Grid, Threadblocks, Warps
- A grid, which can be 1D, 2D, or 3D (for example, a matrix or an image), is subdivided into (thread)blocks of size n, and each block is further subdivided into groups of 32 threads.
- We can think of a 2D matrix, let's say 4096x4096, as a grid of 4096x4096 threads. This grid is divided into blocks of size n, let's say 1024 (32 width-wise, 32 height-wise, 1 depth-wise). The 32x32x1 blocks are further divided into 1D groups of size 32 called warps, so in this case there are 32 warps per block (see the sketch after this list).
- The numbers 1024 and 32 are not chosen at random; although the exact limits vary from architecture to architecture, the maximum block size is generally 1024 threads and the warp size is generally 32.
- All threads in a warp have consecutive thread ID values.
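To make the indexing concrete, here is a minimal CUDA sketch covering a 4096x4096 grid with 32x32x1 blocks. The kernel name and matrix contents are illustrative, not from any particular codebase:

```cuda
#include <cuda_runtime.h>

// Each thread handles one (row, col) element. A 32x32x1 block holds
// 1024 threads, i.e. 32 warps of 32 consecutive threads each.
__global__ void fill(float *out, int width, int height) {
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    if (row < height && col < width)
        out[row * width + col] = (float)(row + col);
}

int main() {
    const int N = 4096;
    float *d_out;
    cudaMalloc(&d_out, N * N * sizeof(float));

    dim3 block(32, 32, 1);                  // 1024 threads per block
    dim3 grid(N / block.x, N / block.y, 1); // 128x128 blocks cover the matrix
    fill<<<grid, block>>>(d_out, N, N);
    cudaDeviceSynchronize();

    cudaFree(d_out);
    return 0;
}
```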
Streaming Multiprocessor (SM)
- These links should give sufficient information to start with
- https://modal.com/gpu-glossary/device-hardware/streaming-multiprocessor
- https://stevengong.co/notes/Streaming-Multiprocessor
- https://docs.nvidia.com/cuda/pdf/CUDA_C_Programming_Guide.pdf
- The book Programming Massively Parallel Processors: A Hands-on Approach
- Basically, the SM is the heart of the GPU: it consists of cores (CUDA Cores / Streaming Processors (SPs)) that execute instructions in parallel.
- Every thread in a block is guaranteed to run on the same SM.
- For example, the A100 GPU has 108 SMs with 64 cores each, totaling 6912 cores on the entire GPU. Each SM is organized into 4 processing blocks, so at any given instant up to 4 warps can be executing simultaneously on 1 SM.
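Since I'm on an M2 I can't run this myself, but on an NVIDIA machine you can query these numbers with `cudaGetDeviceProperties`; a minimal sketch:

```cuda
#include <cuda_runtime.h>
#include <cstdio>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0); // query device 0

    printf("GPU:               %s\n", prop.name);
    printf("SM count:          %d\n", prop.multiProcessorCount);
    printf("Warp size:         %d\n", prop.warpSize);
    printf("Max threads/block: %d\n", prop.maxThreadsPerBlock);
    return 0;
}
```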
Single Instruction, Multiple Threads (SIMT) Architecture
- An SM follows the SIMT model, which means that at any instant in time, one instruction is fetched and executed by all threads in a warp.
- These threads apply the same instruction to different portions of the data.
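The classic illustration is vector addition: every thread in a warp executes the same add instruction, just on a different element. A minimal sketch, with the kernel name and sizes chosen purely for illustration:

```cuda
#include <cuda_runtime.h>

// Every thread runs the same instruction stream; only the index,
// and therefore the element it touches, differs per thread.
__global__ void vecAdd(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        c[i] = a[i] + b[i]; // same add instruction, different data
}

int main() {
    const int n = 1 << 20;
    float *a, *b, *c;
    cudaMallocManaged(&a, n * sizeof(float));
    cudaMallocManaged(&b, n * sizeof(float));
    cudaMallocManaged(&c, n * sizeof(float));
    for (int i = 0; i < n; ++i) { a[i] = 1.0f; b[i] = 2.0f; }

    // 256 threads per block = 8 warps; enough blocks to cover all n elements
    vecAdd<<<(n + 255) / 256, 256>>>(a, b, c, n);
    cudaDeviceSynchronize();

    cudaFree(a); cudaFree(b); cudaFree(c);
    return 0;
}
```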