TPU (Tensor Processing Unit) Deep Dive
- #AI Hardware
- #TPU
- TPUs are Google's custom ASICs, designed for high matrix-multiplication throughput and energy efficiency.
- The idea for a custom accelerator dates back to 2006, but serious development only began in 2013, when the computational demands of neural networks threatened to outgrow Google's datacenter capacity.
- TPUs power most of Google's AI services, including training and inference for models like Gemini and Veo.
- A single TPUv4 chip contains two TensorCores, which share common memory (CMEM) and high-bandwidth memory (HBM).
- TPUs use systolic arrays for efficient matrix multiplications and convolutions, but struggle with sparse matrices, whose zeros still flow through the array and burn full cycles (see the systolic-array sketch after this list).
- TPUs rely on Ahead-of-Time (AoT) compilation via the XLA compiler: with all shapes known at compile time, XLA can plan memory accesses and fuse operations for energy efficiency (a JAX sketch of this path follows the list).
- TPU design emphasizes minimizing memory operations, since moving data to and from memory costs far more energy than the arithmetic it feeds.
- TPUs are scalable, with configurations ranging from single chips to multi-pod systems with thousands of chips.
- TPU racks are organized into 3D torus topologies, with Optical Circuit Switching (OCS) providing flexible and efficient communication between racks (see the torus-neighbor sketch below).
- TPU slices can be configured into different physical topologies (e.g., a cube or a "cigar" shape) to match the communication patterns of different parallelism strategies (a JAX sharding sketch follows the list).
- Multi-pod TPU systems communicate over the Data-Center Network (DCN), enabling training runs like PaLM's to span multiple pods.
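To make the systolic-array bullet concrete, here is a minimal cycle-level sketch in plain Python/NumPy, not a model of any real MXU: PE (k, n) permanently holds weight `B[k, n]`, activations stream in from the left edge with a one-cycle skew per row, and partial sums flow downward. All names (`systolic_matmul`, the register arrays) are illustrative.

```python
import numpy as np

def systolic_matmul(A, B):
    """Cycle-level sketch of a weight-stationary systolic array computing A @ B.

    PE (k, n) permanently holds weight B[k, n]. Activations enter at the
    left edge with a one-cycle skew per row; partial sums flow down one PE
    per cycle. No intermediate value ever touches memory -- each operand is
    consumed the cycle it arrives, which is where the energy savings come
    from, and why zeros in a sparse matrix still burn full cycles.
    """
    M, K = A.shape
    K2, N = B.shape
    assert K == K2
    C = np.zeros((M, N))
    a_reg = np.zeros((K, N))  # activation register in each PE (flows right)
    p_reg = np.zeros((K, N))  # partial-sum register in each PE (flows down)
    for t in range(M + K + N):           # enough cycles to drain the pipeline
        new_a = np.zeros((K, N))
        new_p = np.zeros((K, N))
        for k in range(K):
            for n in range(N):
                if n == 0:               # left edge: feed a skewed column of A
                    a_in = A[t - k, k] if 0 <= t - k < M else 0.0
                else:                    # interior: take the left neighbor's value
                    a_in = a_reg[k, n - 1]
                p_in = p_reg[k - 1, n] if k > 0 else 0.0
                new_a[k, n] = a_in
                new_p[k, n] = p_in + a_in * B[k, n]
        for n in range(N):               # bottom row emits finished dot products
            m = t - (K - 1) - n
            if 0 <= m < M:
                C[m, n] = new_p[K - 1, n]
        a_reg, p_reg = new_a, new_p
    return C

A = np.random.rand(3, 4)
B = np.random.rand(4, 5)
assert np.allclose(systolic_matmul(A, B), A @ B)
```

The skewed feed is the key trick: it keeps every PE busy on a different output element each cycle, so a K x N array sustains one multiply-accumulate per PE per cycle on dense inputs.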
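For the AoT bullet, a minimal sketch using JAX's public ahead-of-time API (`jax.jit(...).lower(...).compile()`); the toy `layer` function is illustrative. Because every shape is fixed when XLA runs, it can fuse the matmul, bias add, and ReLU into a schedule that keeps the intermediate tensor out of HBM.

```python
import jax
import jax.numpy as jnp

def layer(x, w, b):
    # Dense layer + ReLU; XLA fuses these so the pre-activation tensor
    # never has to round-trip through HBM.
    return jax.nn.relu(x @ w + b)

x = jnp.ones((128, 512))
w = jnp.ones((512, 256))
b = jnp.ones((256,))

# Ahead-of-time path: lower to StableHLO, then compile for the current
# backend (a TPU if one is attached).
lowered = jax.jit(layer).lower(x, w, b)
compiled = lowered.compile()

print(lowered.as_text()[:300])   # the IR that XLA compiles
print(compiled(x, w, b).shape)   # (128, 256)
```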
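The wraparound links are the point of a torus. A small sketch of neighbor addressing, assuming an illustrative (X, Y, Z) chip grid such as the 4x4x4 cube of a single rack; the function names are hypothetical:

```python
def torus_neighbors(coord, dims):
    """The six neighbors of a chip at `coord` in a 3D torus of shape `dims`.
    The modulo is the wraparound link; without it, edge chips would have
    fewer neighbors and worst-case paths would be roughly twice as long."""
    x, y, z = coord
    X, Y, Z = dims
    return [
        ((x + 1) % X, y, z), ((x - 1) % X, y, z),
        (x, (y + 1) % Y, z), (x, (y - 1) % Y, z),
        (x, y, (z + 1) % Z), (x, y, (z - 1) % Z),
    ]

def hop_distance(a, b, dims):
    """Shortest hop count between two chips: per axis, the shorter way
    around the ring, summed over x, y, z."""
    return sum(min((ai - bi) % d, (bi - ai) % d)
               for ai, bi, d in zip(a, b, dims))

# In a 4x4x4 cube, no two chips are more than 2 + 2 + 2 = 6 hops apart.
print(torus_neighbors((0, 0, 0), (4, 4, 4)))
print(hop_distance((0, 0, 0), (2, 2, 2), (4, 4, 4)))  # 6
```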
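Finally, to show how a slice's topology is exposed to software, a JAX sharding sketch: an 8-chip slice viewed as a 4x2 logical mesh with a "data" and a "model" axis. The axis names and sizes are illustrative; on real hardware the runtime maps the logical mesh onto the physical torus so heavily communicating axes land on neighboring chips. Without TPUs, this can be tried locally with `XLA_FLAGS=--xla_force_host_platform_device_count=8`.

```python
import numpy as np
import jax
import jax.numpy as jnp
from jax.sharding import Mesh, NamedSharding, PartitionSpec as P

# View 8 chips as a 4x2 logical mesh: 4-way data parallel, 2-way model parallel.
devices = np.array(jax.devices()[:8]).reshape(4, 2)
mesh = Mesh(devices, axis_names=("data", "model"))

# Activations split along the batch ("data") axis, weights along the
# "model" axis; XLA inserts the inter-chip communication automatically.
x = jax.device_put(jnp.ones((512, 1024)), NamedSharding(mesh, P("data", None)))
w = jax.device_put(jnp.ones((1024, 4096)), NamedSharding(mesh, P(None, "model")))

@jax.jit
def forward(x, w):
    return jnp.tanh(x @ w)   # output comes out sharded ("data", "model")

y = forward(x, w)
print(y.sharding)            # NamedSharding over the 4x2 mesh
```

Changing only the mesh shape and `PartitionSpec`s retargets the same program to a different slice topology, which is why cube- versus cigar-shaped slices matter: they change which mesh axes get fast neighbor links.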