TPU (Tensor Processing Unit) Deep Dive
- #AI Hardware
- #TPU
- TPUs are Google's custom ASICs, designed for high matrix-multiplication throughput and energy efficiency.
- The idea for a custom accelerator dates back to 2006, but serious development only began in 2013, when the computational demands of neural networks threatened to outgrow Google's datacenter capacity.
- TPUs power most of Google's AI services, including training and inference for models like Gemini and Veo.
- A single TPUv4 chip contains two TensorCores, which share common memory (CMEM) and high-bandwidth memory (HBM).
- TPUs use systolic arrays for efficient matrix multiplications and convolutions, but struggle with sparse matrices, whose zeros still flow through the array and burn full cycles (see the systolic-array sketch after this list).
- TPUs rely on Ahead-of-Time (AoT) compilation via the XLA compiler: with all shapes known at compile time, XLA can plan memory accesses and fuse operations for energy efficiency (a JAX sketch of this path follows the list).
- TPU design emphasizes minimizing memory operations, since moving data to and from memory costs far more energy than the arithmetic it feeds.
- TPUs are scalable, with configurations ranging from single chips to multi-pod systems with thousands of chips.
- TPU racks are organized into 3D torus topologies, with Optical Circuit Switching (OCS) providing flexible and efficient communication between racks (see the torus-neighbor sketch below).
- TPU slices can be configured into different physical topologies (e.g., a cube or a "cigar" shape) to match the communication patterns of different parallelism strategies (a JAX sharding sketch follows the list).
- Multi-pod TPU systems communicate over the Data-Center Network (DCN), enabling training runs like PaLM's to span multiple pods.
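To make the systolic-array bullet concrete, here is a minimal cycle-level sketch in plain Python/NumPy, not a model of any real MXU: PE (k, n) permanently holds weight `B[k, n]`, activations stream in from the left edge with a one-cycle skew per row, and partial sums flow downward. All names (`systolic_matmul`, the register arrays) are illustrative.

```python
import numpy as np

def systolic_matmul(A, B):
    """Cycle-level sketch of a weight-stationary systolic array computing A @ B.

    PE (k, n) permanently holds weight B[k, n]. Activations enter at the
    left edge with a one-cycle skew per row; partial sums flow down one PE
    per cycle. No intermediate value ever touches memory -- each operand is
    consumed the cycle it arrives, which is where the energy savings come
    from, and why zeros in a sparse matrix still burn full cycles.
    """
    M, K = A.shape
    K2, N = B.shape
    assert K == K2
    C = np.zeros((M, N))
    a_reg = np.zeros((K, N))  # activation register in each PE (flows right)
    p_reg = np.zeros((K, N))  # partial-sum register in each PE (flows down)
    for t in range(M + K + N):           # enough cycles to drain the pipeline
        new_a = np.zeros((K, N))
        new_p = np.zeros((K, N))
        for k in range(K):
            for n in range(N):
                if n == 0:               # left edge: feed a skewed column of A
                    a_in = A[t - k, k] if 0 <= t - k < M else 0.0
                else:                    # interior: take the left neighbor's value
                    a_in = a_reg[k, n - 1]
                p_in = p_reg[k - 1, n] if k > 0 else 0.0
                new_a[k, n] = a_in
                new_p[k, n] = p_in + a_in * B[k, n]
        for n in range(N):               # bottom row emits finished dot products
            m = t - (K - 1) - n
            if 0 <= m < M:
                C[m, n] = new_p[K - 1, n]
        a_reg, p_reg = new_a, new_p
    return C

A = np.random.rand(3, 4)
B = np.random.rand(4, 5)
assert np.allclose(systolic_matmul(A, B), A @ B)
```

The skewed feed is the key trick: it keeps every PE busy on a different output element each cycle, so a K x N array sustains one multiply-accumulate per PE per cycle on dense inputs.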
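For the AoT bullet, a minimal sketch using JAX's public ahead-of-time API (`jax.jit(...).lower(...).compile()`); the toy `layer` function is illustrative. Because every shape is fixed when XLA runs, it can fuse the matmul, bias add, and ReLU into a schedule that keeps the intermediate tensor out of HBM.

```python
import jax
import jax.numpy as jnp

def layer(x, w, b):
    # Dense layer + ReLU; XLA fuses these so the pre-activation tensor
    # never has to round-trip through HBM.
    return jax.nn.relu(x @ w + b)

x = jnp.ones((128, 512))
w = jnp.ones((512, 256))
b = jnp.ones((256,))

# Ahead-of-time path: lower to StableHLO, then compile for the current
# backend (a TPU if one is attached).
lowered = jax.jit(layer).lower(x, w, b)
compiled = lowered.compile()

print(lowered.as_text()[:300])   # the IR that XLA compiles
print(compiled(x, w, b).shape)   # (128, 256)
```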
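The wraparound links are the point of a torus. A small sketch of neighbor addressing, assuming an illustrative (X, Y, Z) chip grid such as the 4x4x4 cube of a single rack; the function names are hypothetical:

```python
def torus_neighbors(coord, dims):
    """The six neighbors of a chip at `coord` in a 3D torus of shape `dims`.
    The modulo is the wraparound link; without it, edge chips would have
    fewer neighbors and worst-case paths would be roughly twice as long."""
    x, y, z = coord
    X, Y, Z = dims
    return [
        ((x + 1) % X, y, z), ((x - 1) % X, y, z),
        (x, (y + 1) % Y, z), (x, (y - 1) % Y, z),
        (x, y, (z + 1) % Z), (x, y, (z - 1) % Z),
    ]

def hop_distance(a, b, dims):
    """Shortest hop count between two chips: per axis, the shorter way
    around the ring, summed over x, y, z."""
    return sum(min((ai - bi) % d, (bi - ai) % d)
               for ai, bi, d in zip(a, b, dims))

# In a 4x4x4 cube, no two chips are more than 2 + 2 + 2 = 6 hops apart.
print(torus_neighbors((0, 0, 0), (4, 4, 4)))
print(hop_distance((0, 0, 0), (2, 2, 2), (4, 4, 4)))  # 6
```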
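Finally, to show how a slice's topology is exposed to software, a JAX sharding sketch: an 8-chip slice viewed as a 4x2 logical mesh with a "data" and a "model" axis. The axis names and sizes are illustrative; on real hardware the runtime maps the logical mesh onto the physical torus so heavily communicating axes land on neighboring chips. Without TPUs, this can be tried locally with `XLA_FLAGS=--xla_force_host_platform_device_count=8`.

```python
import numpy as np
import jax
import jax.numpy as jnp
from jax.sharding import Mesh, NamedSharding, PartitionSpec as P

# View 8 chips as a 4x2 logical mesh: 4-way data parallel, 2-way model parallel.
devices = np.array(jax.devices()[:8]).reshape(4, 2)
mesh = Mesh(devices, axis_names=("data", "model"))

# Activations split along the batch ("data") axis, weights along the
# "model" axis; XLA inserts the inter-chip communication automatically.
x = jax.device_put(jnp.ones((512, 1024)), NamedSharding(mesh, P("data", None)))
w = jax.device_put(jnp.ones((1024, 4096)), NamedSharding(mesh, P(None, "model")))

@jax.jit
def forward(x, w):
    return jnp.tanh(x @ w)   # output comes out sharded ("data", "model")

y = forward(x, w)
print(y.sharding)            # NamedSharding over the 4x2 mesh
```

Changing only the mesh shape and `PartitionSpec`s retargets the same program to a different slice topology, which is why cube- versus cigar-shaped slices matter: they change which mesh axes get fast neighbor links.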