How to Think About GPUs
- #GPU
- #LLM
- #TPU
- GPUs and TPUs are compared in terms of architecture and performance for LLMs.
- Modern ML GPUs such as the H100 and B200 consist of compute units called streaming multiprocessors (SMs) attached to fast high-bandwidth memory (HBM).
- Each H100 SM is divided into 4 quadrants, each with a Tensor Core, CUDA Cores, and a Warp Scheduler.
- CUDA Cores perform SIMD/SIMT vector arithmetic, while Tensor Cores handle matrix multiplications.
- GPUs have a hierarchy of memories, from large and slow to small and fast: HBM, L2 cache, L1/SMEM, TMEM (on Blackwell), and register memory (rough figures are sketched after this list).
- Recent GPU generations (V100, A100, H100, H200, B200) are compared on specs such as clock speed, SM count, and memory capacity.
- GPU and TPU components are mapped onto each other for comparison (e.g., a GPU SM roughly corresponds to a TPU Tensor Core, and a Warp Scheduler to the VPU).
- GPUs are more modular with many small SMs, while TPUs have fewer, larger Tensor Cores.
- TPUs have much more fast on-chip scratch memory (VMEM) than GPUs have SMEM, which is beneficial for LLM inference.
- Networking differs: GPUs use hierarchical, tree-based switched networks, while TPUs use 2D/3D torus topologies.
- GPU nodes (e.g., 8 GPUs) use NVLink for high-bandwidth, low-latency interconnects.
- Collective operations (AllGather, ReduceScatter, AllReduce, AllToAll) and their costs are analyzed for GPUs (a simple cost model is sketched after this list).
- Rooflines for LLM scaling on GPUs are discussed, covering data, tensor, pipeline, and expert parallelism (a data-parallelism example follows the list).
- Practical considerations for sharding large models on GPUs are summarized.
- Blackwell GPUs introduce NVLink 5 and larger NVLink domains (e.g., 72 GPUs in NVL72).
- Grace Hopper systems pair GPUs with Grace CPUs for high CPU-GPU bandwidth.
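To make the memory hierarchy concrete, here is a minimal sketch with approximate, publicly reported H100 SXM figures. The capacities and bandwidths are rough assumptions for illustration and vary by SKU; TMEM is a Blackwell-only addition and is omitted.

```python
# Approximate H100 (SXM) memory hierarchy, from largest/slowest to smallest/fastest.
# All figures are rough public numbers and vary by SKU; treat them as assumptions.
H100_MEMORY_HIERARCHY = [
    # (level, approximate capacity, approximate bandwidth, scope)
    ("HBM3",      "80 GB",          "~3.35 TB/s",        "whole GPU"),
    ("L2 cache",  "50 MB",          "several TB/s",      "whole GPU"),
    ("L1/SMEM",   "~256 KB per SM", "tens of TB/s agg.", "per SM"),
    ("registers", "256 KB per SM",  "fastest",           "per thread/warp"),
]

for level, capacity, bandwidth, scope in H100_MEMORY_HIERARCHY:
    print(f"{level:10s} {capacity:16s} {bandwidth:18s} {scope}")
```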
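For the collective operations, a standard bandwidth-bound model gives a quick cost estimate: an AllGather or ReduceScatter over N GPUs moves roughly (N−1)/N of the full array through each GPU's link, and an AllReduce costs about twice that. The 450 GB/s per-direction NVLink figure for an 8-GPU H100 node below is an assumption for illustration.

```python
def allgather_time(total_bytes: float, n_gpus: int, link_bw: float) -> float:
    """Bandwidth-bound AllGather/ReduceScatter estimate.

    Each GPU moves (N - 1) / N of the full array over its link of `link_bw` B/s.
    """
    return total_bytes * (n_gpus - 1) / n_gpus / link_bw


def allreduce_time(total_bytes: float, n_gpus: int, link_bw: float) -> float:
    # AllReduce ~= ReduceScatter + AllGather, i.e. roughly twice the cost.
    return 2 * allgather_time(total_bytes, n_gpus, link_bw)


NVLINK_BW = 450e9              # assumed per-GPU, per-direction NVLink bandwidth (B/s)
N_GPUS = 8                     # GPUs in one node
ARRAY_BYTES = 2 * 8192 * 8192  # e.g. a bf16 [8192, 8192] gradient buffer

print(f"AllGather: {allgather_time(ARRAY_BYTES, N_GPUS, NVLINK_BW) * 1e6:.0f} us")
print(f"AllReduce: {allreduce_time(ARRAY_BYTES, N_GPUS, NVLINK_BW) * 1e6:.0f} us")
```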
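As one example of the parallelism rooflines, here is a back-of-the-envelope data-parallelism check: the gradient AllReduce can hide behind compute only if each step's compute takes at least as long as the communication. The peak FLOP/s, per-GPU network bandwidth, and bf16-gradient assumption below are illustrative, not measured figures, and the 6 × tokens × params FLOPs estimate is the usual dense-transformer approximation.

```python
# Back-of-the-envelope data-parallelism roofline (illustrative assumptions).
PEAK_FLOPS = 989e12   # assumed bf16 FLOP/s per GPU (H100-class)
NET_BW     = 400e9    # assumed per-GPU inter-node bandwidth, bytes/s
GRAD_BYTES = 2        # assume bf16 gradients


def min_tokens_per_gpu(peak_flops=PEAK_FLOPS, net_bw=NET_BW, grad_bytes=GRAD_BYTES):
    """Smallest per-GPU batch (tokens) where compute time >= gradient AllReduce time.

    Compute per step   ~ 6 * tokens * n_params FLOPs (forward + backward).
    AllReduce per step ~ 2 * grad_bytes * n_params bytes through each GPU's link.
    n_params cancels, leaving a pure tokens-per-GPU threshold.
    """
    return (2 * grad_bytes * peak_flops) / (6 * net_bw)


print(f"need roughly > {min_tokens_per_gpu():.0f} tokens per GPU per step")
```

With these assumed numbers the threshold works out to roughly 1,600 tokens per GPU per step; below that, pure data parallelism becomes communication-bound.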