Hasty Briefs

How to Think About GPUs

6 days ago
  • #GPU
  • #LLM
  • #TPU
  • GPUs and TPUs are compared in terms of architecture and performance for LLMs.
  • Modern ML GPUs such as the H100 and B200 consist of many compute units called streaming multiprocessors (SMs) attached to fast high-bandwidth memory (HBM).
  • Each H100 SM is divided into 4 quadrants, each with its own Tensor Core, CUDA Cores, and Warp Scheduler.
  • CUDA Cores perform SIMD/SIMT vector arithmetic, while Tensor Cores handle matrix multiplications.
  • GPUs have a hierarchy of memories: HBM, L2 cache, L1/shared memory (SMEM), Tensor Memory (TMEM, on Blackwell), and registers.
  • Recent GPU generations (V100, A100, H100, H200, B200) are compared on specs such as clock speed, SM count, and memory capacity and bandwidth (see the per-chip roofline sketch after this list).
  • GPU and TPU components are mapped onto each other for comparison (e.g., a GPU SM corresponds to a TPU Tensor Core, a Warp Scheduler to the TPU VPU).
  • GPUs are more modular with many small SMs, while TPUs have fewer, larger Tensor Cores.
  • TPUs have far more fast on-chip scratchpad memory (VMEM) than GPUs have SMEM, which is beneficial for LLM inference.
  • Networking differences: GPUs use hierarchical tree-based switching, TPUs use 2D/3D tori.
  • GPU nodes (e.g., 8 GPUs) use NVLink for high-bandwidth, low-latency interconnects.
  • Collective operations (AllGather, ReduceScatter, AllReduce, AllToAll) are analyzed for GPUs (see the collective-cost sketch after this list).
  • Rooflines for LLM scaling on GPUs are discussed, covering data, tensor, pipeline, and expert parallelism (see the matmul roofline sketch after this list).
  • Practical considerations for sharding large models on GPUs are summarized.
  • Blackwell GPUs introduce NVLink 5 and larger NVLink domains (e.g., 72 GPUs in NVL72).
  • Grace Hopper systems pair Hopper GPUs with Grace CPUs over NVLink-C2C for high CPU-GPU bandwidth.
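
The spec comparison above reduces to a simple roofline quantity: peak half-precision matmul FLOP/s divided by HBM bandwidth gives the arithmetic intensity a kernel needs to stay compute-bound. A minimal per-chip sketch follows; the FLOP/s and bandwidth figures are rounded public numbers that vary by SKU, so treat them as assumptions rather than values quoted from the article.

```python
# Approximate per-chip specs: half-precision matmul FLOP/s and HBM bandwidth (bytes/s).
# Rounded public figures that vary by SKU; treat them as assumptions, not datasheet values.
GPU_SPECS = {
    "V100": {"flops": 125e12,  "hbm": 0.9e12},
    "A100": {"flops": 312e12,  "hbm": 2.0e12},
    "H100": {"flops": 989e12,  "hbm": 3.35e12},
    "B200": {"flops": 2250e12, "hbm": 8.0e12},
}

def critical_intensity(flops: float, hbm_bw: float) -> float:
    """FLOPs per HBM byte a kernel must achieve to be compute-bound."""
    return flops / hbm_bw

for name, s in GPU_SPECS.items():
    ci = critical_intensity(s["flops"], s["hbm"])
    print(f"{name}: ~{ci:.0f} FLOPs/byte to stay compute-bound")
```

On an H100 this works out to roughly 300 FLOPs per HBM byte, which is why low-intensity workloads such as small-batch decoding tend to be memory-bound.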
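For the matmul roofline behind the parallelism discussion: multiplying a [B, D] activation by a [D, F] weight in bf16 performs 2·B·D·F FLOPs while moving 2·(BD + DF + BF) bytes, so the arithmetic intensity approaches B once D and F are large. A sketch with illustrative (assumed) transformer-like shapes, not the article's exact tables:

```python
def matmul_intensity(B: int, D: int, F: int, bytes_per_el: int = 2) -> float:
    """Arithmetic intensity (FLOPs per byte of memory traffic) of a [B, D] x [D, F] matmul."""
    flops = 2 * B * D * F
    traffic = bytes_per_el * (B * D + D * F + B * F)
    return flops / traffic

# Illustrative (assumed) hidden and feedforward dimensions.
D, F = 8192, 32768
H100_CRITICAL = 295  # approx. peak BF16 FLOP/s divided by HBM bandwidth for an H100

for B in (32, 128, 512, 2048):
    i = matmul_intensity(B, D, F)
    verdict = "compute-bound" if i > H100_CRITICAL else "memory-bound"
    print(f"batch {B:5d}: ~{i:.0f} FLOPs/byte -> {verdict} on an H100")
```

The crossover near a few hundred tokens per matmul is the reason batch size (and how sharding shrinks the per-GPU batch) dominates the scaling analysis.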
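The collective costs follow the standard bandwidth-only model: on N GPUs, each with W bytes/s of link bandwidth, an AllGather or ReduceScatter of V bytes takes roughly V·(N-1)/(N·W), and an AllReduce about twice that. A back-of-the-envelope sketch; the 450 GB/s (NVLink 4, H100 node) and 900 GB/s (NVLink 5, NVL72) per-GPU figures are assumed round numbers:

```python
def allgather_time(nbytes: float, n_gpus: int, link_bw: float) -> float:
    """Bandwidth-bound time (s) for an AllGather or ReduceScatter of nbytes over n_gpus."""
    return nbytes * (n_gpus - 1) / (n_gpus * link_bw)

def allreduce_time(nbytes: float, n_gpus: int, link_bw: float) -> float:
    """AllReduce is roughly a ReduceScatter followed by an AllGather, so about twice the cost."""
    return 2 * allgather_time(nbytes, n_gpus, link_bw)

# Example: AllReduce of 4 GiB of gradients. Per-GPU link bandwidths are assumed
# round numbers (~450 GB/s NVLink 4 per direction, ~900 GB/s NVLink 5 per direction).
grad_bytes = 4 * 2**30
print(f"8-GPU H100 node: ~{allreduce_time(grad_bytes, 8, 450e9) * 1e3:.1f} ms")
print(f"72-GPU NVL72:    ~{allreduce_time(grad_bytes, 72, 900e9) * 1e3:.1f} ms")
```

The same model shows why the larger NVLink domain of NVL72 matters: more GPUs share one high-bandwidth switched domain before traffic has to cross the slower scale-out network.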