Outline
- GPU Revisited – Why do we need megakernels?
- GPU Programming Model and Execution Model Revisited
- Motivations and Enablers of Megakernel
- How to design megakernels for modern GPUs?
- Megakernel System Framework
- Megakernel Components
- Understanding the Performance Benefits of Megakernels
GPU Heterogeneous Computing
- Terms:
- Host - The CPU and its memory (host memory)
- Device - The GPU and its memory (device memory)
- GPUs need special programs (kernels) to run code on the device's parallel compute units

Simple Processing Flow
- Copy input data from CPU memory to GPU memory
- Load GPU program and execute, caching data on chip for performance
- Copy results from GPU memory to CPU memory
Special launch code on the host starts the kernel and loads the program onto the GPU
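A minimal host-side sketch of this flow in CUDA; the process() kernel, sizes, and launch configuration here are illustrative placeholders, not from the notes:

```cuda
#include <cuda_runtime.h>

// Placeholder kernel (illustrative): scales each element in place
__global__ void process(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

void run(const float *h_in, float *h_out, int n) {
    float *d_data;
    size_t bytes = n * sizeof(float);

    // 1. Copy input data from CPU (host) memory to GPU (device) memory
    cudaMalloc((void **)&d_data, bytes);
    cudaMemcpy(d_data, h_in, bytes, cudaMemcpyHostToDevice);

    // 2. Load the GPU program and execute, caching data on chip for performance
    process<<<(n + 255) / 256, 256>>>(d_data, n);

    // 3. Copy results from GPU memory back to CPU memory
    cudaMemcpy(h_out, d_data, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_data);
}
```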
Moving to parallel
GPU computing is about massive parallelism
- so how do we program to make our code run in parallel on the device
- CPU-like code computes only one element at a time (see the serial sketch below)
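For contrast, a serial CPU-style version touches one element per loop iteration (a plain C++ sketch):

```cuda
// Serial CPU version: one addition per loop iteration
void add_cpu(const float *a, const float *b, float *c, int n) {
    for (int i = 0; i < n; ++i)
        c[i] = a[i] + b[i];
}
```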
Vector Addition on Device
- With add() running in parallel we can do vector addition
- Terminology: the code of add() is executed on many parallel cores; each executing instance is called a thread
- The programming model is called Single Instruction Multiple Threads (SIMT)
- Each thread can refer to its own index using threadIdx.x
- By using threadIdx.x to index into the array, each thread handles a different element
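A minimal sketch of such an add() kernel, where each thread handles one element via threadIdx.x (single-block launch assumed, as in the classic introductory example):

```cuda
__global__ void add(const int *a, const int *b, int *c) {
    // Each thread computes exactly one element, selected by its thread index
    c[threadIdx.x] = a[threadIdx.x] + b[threadIdx.x];
}

// Launch N threads in one block so each thread adds one pair of elements:
// add<<<1, N>>>(d_a, d_b, d_c);
```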
Cooperative Thread Array (CTA)
- contains a set of threads that run concurrently (a CUDA thread block)
Summary of GPU programming Model
- Programming model
- SIMT - Single Instruction Multiple (Concurrent) Threads
- concurrent - running at the same time, not necessarily in parallel
- a single kernel function instructs many threads
GPU Execution Unit (SM)

- A group of 32 threads in a thread block is called a warp
- In a CTA, threads 0-31 fall into the same warp
- Each sub-core in the V100 is capable of scheduling and interleaving execution of up to 16 warps
- The warp is not part of the CUDA programming interface, but it is an important implementation detail on modern NVIDIA GPUs
Warp as a parallel execution unit
The entire warp of CUDA threads runs the same instruction stream.
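Warps are not exposed in the programming interface, but a thread can still derive its warp and lane position from its thread index; a small sketch for a 1-D block:

```cuda
#include <cstdio>

__global__ void warp_info() {
    // On current NVIDIA GPUs a warp is 32 consecutive threads of a block
    int warp_id = threadIdx.x / 32;  // which warp within the thread block (CTA)
    int lane_id = threadIdx.x % 32;  // position within the warp, 0-31
    if (lane_id == 0)
        printf("block %d: warp %d starts at thread %d\n",
               blockIdx.x, warp_id, threadIdx.x);
}
```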
Key Motivation
Kernel-by-kernel execution suffers from CPU-based control overhead and opaque hardware scheduling.
- The CPU launches each kernel and waits for its results
- this causes GPU idle time between kernel launches (see the launch-pattern sketch below)
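A sketch of the launch pattern behind this overhead: the host issues each kernel and synchronizes, so the GPU sits idle while the CPU drives the loop (stage_a, stage_b, and the loop structure are illustrative):

```cuda
#include <cuda_runtime.h>

// Illustrative kernels standing in for two pipeline stages
__global__ void stage_a(float *buf) { /* ... */ }
__global__ void stage_b(float *buf) { /* ... */ }

void run_pipeline(float *d_buf, int num_steps) {
    // Host-driven loop: every step pays kernel-launch overhead,
    // and the GPU idles whenever the CPU is issuing launches or waiting.
    for (int step = 0; step < num_steps; ++step) {
        stage_a<<<128, 256>>>(d_buf);
        stage_b<<<128, 256>>>(d_buf);
        cudaDeviceSynchronize();   // CPU waits for results before the next step
    }
}
```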

MegaKernels
Combining multiple GPU kernel launches into a single kernel, along with the inter-kernel communication between them.
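One common way to realize this is a persistent ("mega") kernel that stays resident on the GPU and pulls work from a device-side task queue instead of relying on per-task host launches. A minimal sketch under that assumption; the Task struct, queue layout, and task types are hypothetical:

```cuda
struct Task { int type; int arg; };  // hypothetical task descriptor

// Persistent megakernel: launched once, then drains a device-side queue.
__global__ void megakernel(const Task *queue, int *next, int num_tasks) {
    while (true) {
        __shared__ int task_id;
        if (threadIdx.x == 0)
            task_id = atomicAdd(next, 1);   // the block claims the next task
        __syncthreads();
        if (task_id >= num_tasks) return;   // queue drained, whole block exits

        Task t = queue[task_id];
        switch (t.type) {                   // dispatch to fused kernel bodies
            case 0: /* work formerly done by kernel A */ break;
            case 1: /* work formerly done by kernel B */ break;
        }
        __syncthreads();                    // keep task_id stable until all threads finish
    }
}
```

Because the kernel never returns between tasks, intermediate results can be passed through global (or shared) memory on the device rather than through separate host-side kernel launches.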
Building a Megakernel system
