How to Think About GPUs
- #GPU
- #LLM
- #TPU
- GPUs and TPUs are compared in terms of architecture and performance for LLMs.
- Modern ML GPUs such as the H100 and B200 consist of compute units called streaming multiprocessors (SMs) attached to fast high-bandwidth memory (HBM).
- Each H100 SM is divided into 4 quadrants, each with a Tensor Core, CUDA Cores, and a Warp Scheduler.
- CUDA Cores perform SIMD/SIMT vector arithmetic, while Tensor Cores handle matrix multiplications.
- GPUs have a hierarchy of memories, from large and slow to small and fast: HBM, L2 cache, L1/SMEM, TMEM (on Blackwell), and register memory (rough figures are sketched after this list).
- Recent GPU generations (V100, A100, H100, H200, B200) are compared on specs such as clock speed, SM count, and memory capacity.
- GPU and TPU components are mapped onto each other for comparison (e.g., a GPU SM roughly corresponds to a TPU Tensor Core, and a Warp Scheduler to the VPU).
- GPUs are more modular with many small SMs, while TPUs have fewer, larger Tensor Cores.
- TPUs have much more fast on-chip scratch memory (VMEM) than GPUs have SMEM, which is beneficial for LLM inference.
- Networking differs: GPUs use hierarchical, tree-based switched networks, while TPUs use 2D/3D torus topologies.
- GPU nodes (e.g., 8 GPUs) use NVLink for high-bandwidth, low-latency interconnects.
- Collective operations (AllGather, ReduceScatter, AllReduce, AllToAll) and their costs are analyzed for GPUs (a simple cost model is sketched after this list).
- Rooflines for LLM scaling on GPUs are discussed, covering data, tensor, pipeline, and expert parallelism (a data-parallelism example follows the list).
- Practical considerations for sharding large models on GPUs are summarized.
- Blackwell GPUs introduce NVLink 5 and larger NVLink domains (e.g., 72 GPUs in NVL72).
- Grace Hopper systems pair GPUs with Grace CPUs for high CPU-GPU bandwidth.
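To make the memory hierarchy concrete, here is a minimal sketch with approximate, publicly reported H100 SXM figures. The capacities and bandwidths are rough assumptions for illustration and vary by SKU; TMEM is a Blackwell-only addition and is omitted.

```python
# Approximate H100 (SXM) memory hierarchy, from largest/slowest to smallest/fastest.
# All figures are rough public numbers and vary by SKU; treat them as assumptions.
H100_MEMORY_HIERARCHY = [
    # (level, approximate capacity, approximate bandwidth, scope)
    ("HBM3",      "80 GB",          "~3.35 TB/s",        "whole GPU"),
    ("L2 cache",  "50 MB",          "several TB/s",      "whole GPU"),
    ("L1/SMEM",   "~256 KB per SM", "tens of TB/s agg.", "per SM"),
    ("registers", "256 KB per SM",  "fastest",           "per thread/warp"),
]

for level, capacity, bandwidth, scope in H100_MEMORY_HIERARCHY:
    print(f"{level:10s} {capacity:16s} {bandwidth:18s} {scope}")
```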
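For the collective operations, a standard bandwidth-bound model gives a quick cost estimate: an AllGather or ReduceScatter over N GPUs moves roughly (N−1)/N of the full array through each GPU's link, and an AllReduce costs about twice that. The 450 GB/s per-direction NVLink figure for an 8-GPU H100 node below is an assumption for illustration.

```python
def allgather_time(total_bytes: float, n_gpus: int, link_bw: float) -> float:
    """Bandwidth-bound AllGather/ReduceScatter estimate.

    Each GPU moves (N - 1) / N of the full array over its link of `link_bw` B/s.
    """
    return total_bytes * (n_gpus - 1) / n_gpus / link_bw


def allreduce_time(total_bytes: float, n_gpus: int, link_bw: float) -> float:
    # AllReduce ~= ReduceScatter + AllGather, i.e. roughly twice the cost.
    return 2 * allgather_time(total_bytes, n_gpus, link_bw)


NVLINK_BW = 450e9              # assumed per-GPU, per-direction NVLink bandwidth (B/s)
N_GPUS = 8                     # GPUs in one node
ARRAY_BYTES = 2 * 8192 * 8192  # e.g. a bf16 [8192, 8192] gradient buffer

print(f"AllGather: {allgather_time(ARRAY_BYTES, N_GPUS, NVLINK_BW) * 1e6:.0f} us")
print(f"AllReduce: {allreduce_time(ARRAY_BYTES, N_GPUS, NVLINK_BW) * 1e6:.0f} us")
```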
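As one example of the parallelism rooflines, here is a back-of-the-envelope data-parallelism check: the gradient AllReduce can hide behind compute only if each step's compute takes at least as long as the communication. The peak FLOP/s, per-GPU network bandwidth, and bf16-gradient assumption below are illustrative, not measured figures, and the 6 × tokens × params FLOPs estimate is the usual dense-transformer approximation.

```python
# Back-of-the-envelope data-parallelism roofline (illustrative assumptions).
PEAK_FLOPS = 989e12   # assumed bf16 FLOP/s per GPU (H100-class)
NET_BW     = 400e9    # assumed per-GPU inter-node bandwidth, bytes/s
GRAD_BYTES = 2        # assume bf16 gradients


def min_tokens_per_gpu(peak_flops=PEAK_FLOPS, net_bw=NET_BW, grad_bytes=GRAD_BYTES):
    """Smallest per-GPU batch (tokens) where compute time >= gradient AllReduce time.

    Compute per step   ~ 6 * tokens * n_params FLOPs (forward + backward).
    AllReduce per step ~ 2 * grad_bytes * n_params bytes through each GPU's link.
    n_params cancels, leaving a pure tokens-per-GPU threshold.
    """
    return (2 * grad_bytes * peak_flops) / (6 * net_bw)


print(f"need roughly > {min_tokens_per_gpu():.0f} tokens per GPU per step")
```

With these assumed numbers the threshold works out to roughly 1,600 tokens per GPU per step; below that, pure data parallelism becomes communication-bound.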