Touching the Elephant – TPUs
- #Google TPU
- #Deep Learning
- #AI Hardware
- Google's Tensor Processing Unit (TPU) is a specialized hardware accelerator for deep learning, originally developed to meet the rapidly growing compute demand of neural-network workloads in Google's datacenters.
- The TPU was created to head off a costly expansion of datacenter capacity for serving neural networks: rather than scaling out general-purpose hardware, Google bet on hardware efficiency, focusing first on inference before expanding to training.
- TPUv1 featured a 256x256 systolic array of 8-bit multiply-accumulate units, optimized for dense GEMM operations, and dropped general-purpose overheads such as caches and branch prediction (see the systolic-array sketch after this list).
- TPUv2 introduced a dual-core architecture, the bfloat16 (bf16) format for training, and the inter-chip interconnect (ICI) to enable distributed training across multiple chips (a bf16 example follows the list).
- TPUv4 added optical circuit switching (OCS) for flexible, reconfigurable pod topologies, SparseCores for embedding-heavy models, and an on-chip CMEM (common memory) to improve inference efficiency (an embedding-lookup sketch follows the list).
- The TPU's design philosophy emphasizes domain-specific optimization, hardware/software co-design with the XLA compiler, and system-level scaling to maximize performance per watt (see the XLA lowering sketch after this list).
- Google's TPU software stack includes Borg for resource allocation, libtpunet for ICI routing, and Pathways for distributed execution across the datacenter network (DCN) (a mesh-sharding sketch follows the list).
- Subsequent TPU generations (v5e, v5p, Trillium, Ironwood) further optimized for efficiency, performance, and scale, with Ironwood scaling to pods of 9,216 chips delivering 42.5 exaflops of FP8 compute (a per-chip arithmetic check follows the list).
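
To make the MXU idea concrete, here is a toy cycle-by-cycle simulation of a weight-stationary systolic array in NumPy: each processing element holds one weight, multiplies the activation streaming in from its left, adds the partial sum arriving from above, and passes both along on the next cycle. The array size, input skew, and drain schedule are simplified assumptions for illustration, not the actual TPUv1 microarchitecture.

```python
"""Toy cycle-by-cycle simulation of a weight-stationary systolic array."""
import numpy as np


def systolic_matmul(X, W):
    """Compute X @ W (M x K times K x N) on a simulated K x N grid of PEs."""
    M, K = X.shape
    K2, N = W.shape
    assert K == K2, "inner dimensions must match"

    act = np.zeros((K, N))   # activation latched in PE (i, j) this cycle
    psum = np.zeros((K, N))  # partial sum produced by PE (i, j) this cycle
    out = np.zeros((M, N))

    # Row i receives X[m, i] at cycle m + i (skewed), so the contributions to
    # one output row meet as the partial sums flow down each column.
    for t in range(M + K + N):
        # Sweep PEs bottom-right to top-left so that each PE still reads its
        # upstream neighbours' values from the *previous* cycle.
        for i in reversed(range(K)):
            for j in reversed(range(N)):
                if j > 0:
                    left = act[i, j - 1]                 # activation from the left PE
                else:
                    m = t - i                            # skewed edge input for row i
                    left = X[m, i] if 0 <= m < M else 0.0
                above = psum[i - 1, j] if i > 0 else 0.0
                act[i, j] = left
                psum[i, j] = above + left * W[i, j]      # multiply-accumulate: the PE's only job
        # The bottom PE of column j drains one finished dot product per cycle.
        for j in range(N):
            m = t - (K - 1) - j
            if 0 <= m < M:
                out[m, j] = psum[K - 1, j]
    return out


# Cross-check the toy array against a plain matmul.
rng = np.random.default_rng(0)
X = rng.standard_normal((4, 3))
W = rng.standard_normal((3, 5))
assert np.allclose(systolic_matmul(X, W), X @ W)
```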
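
On bf16: it keeps float32's 8 exponent bits but only 7 mantissa bits, so it spans the same dynamic range at lower precision, and matmuls are typically accumulated in float32. A minimal JAX sketch; the mixed-precision pattern shown is a common convention and an assumption here, not necessarily Google's exact training recipe.

```python
import jax.numpy as jnp

# bfloat16 trades mantissa bits for float32's full exponent range.
x = jnp.ones((128, 256), dtype=jnp.bfloat16)
w = jnp.ones((256, 64), dtype=jnp.bfloat16)

# bf16 inputs with float32 accumulation: a common mixed-precision pattern
# (illustrative assumption, not necessarily the exact TPU training recipe).
y = jnp.matmul(x, w, preferred_element_type=jnp.float32)
print(x.dtype, w.dtype, y.dtype)     # bfloat16 bfloat16 float32

# Show the precision loss: 1.001 rounds back to 1.0 in bfloat16 (7-bit mantissa).
a = jnp.float32(1.0) + jnp.float32(1e-3)
print(a, a.astype(jnp.bfloat16))
```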
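
SparseCores target the gather/scatter-plus-reduce pattern of large embedding tables, which maps poorly onto a dense MXU. The sketch below only shows the shape of that workload (an embedding gather followed by a per-example segment sum) in plain JAX; the table size and feature ids are made up, and dispatching such lookups to SparseCore hardware is handled by the compiler/runtime rather than by user code.

```python
import jax
import jax.numpy as jnp

# A toy embedding table: 10k rows ("vocabulary") of 64-dim vectors.
table = jnp.arange(10_000 * 64, dtype=jnp.float32).reshape(10_000, 64) * 1e-4

# Each example activates a few sparse feature ids (ragged in practice;
# flattened here with a segment id per lookup, values are illustrative).
ids = jnp.array([17, 42, 9000, 3, 42, 512])   # which rows to gather
segments = jnp.array([0, 0, 0, 1, 1, 2])      # which example each id belongs to

gathered = table[ids]                                              # irregular memory access
pooled = jax.ops.segment_sum(gathered, segments, num_segments=3)   # per-example reduction
print(pooled.shape)                                                # (3, 64)
```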
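
The hardware/software co-design shows up directly in JAX: a Python function is traced, lowered to XLA's IR, and XLA fuses and schedules it for whichever backend is attached (TPU, GPU, or CPU). A small sketch using the public jax.jit lowering API; the mlp_layer function is just an illustrative stand-in.

```python
import jax
import jax.numpy as jnp


def mlp_layer(x, w, b):
    # Matmul + bias + nonlinearity: XLA can fuse the elementwise ops
    # around the dot so intermediate data stays on-chip.
    return jax.nn.relu(x @ w + b)


x = jnp.ones((8, 128))
w = jnp.ones((128, 256))
b = jnp.zeros((256,))

jitted = jax.jit(mlp_layer)
print(jitted(x, w, b).shape)            # (8, 256), compiled on first call

# Peek at the intermediate representation handed to XLA.
lowered = jax.jit(mlp_layer).lower(x, w, b)
print(lowered.as_text()[:400])          # StableHLO/HLO text
```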
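
Beneath Pathways, the programmer-visible abstraction in JAX is a device mesh plus named shardings, and the same program runs whether the mesh spans one host's chips or a pod stitched together over ICI and DCN. A minimal single-process sketch that adapts to however many devices jax.devices() reports, so it also runs on a plain CPU; the axis name and shapes are arbitrary choices.

```python
import numpy as np
import jax
import jax.numpy as jnp
from jax.sharding import Mesh, NamedSharding, PartitionSpec as P

# Build a 1-D mesh over whatever devices are available (TPU chips in a pod
# slice, or a single CPU device when run locally).
devices = np.array(jax.devices())
mesh = Mesh(devices, axis_names=("data",))

# Shard the batch along the "data" mesh axis; replicate the weights.
batch = jnp.ones((8 * devices.size, 128))
weights = jnp.ones((128, 64))
batch = jax.device_put(batch, NamedSharding(mesh, P("data", None)))
weights = jax.device_put(weights, NamedSharding(mesh, P(None, None)))


@jax.jit
def forward(x, w):
    return jnp.tanh(x @ w)


out = forward(batch, weights)       # the compiler inserts any needed collectives
print(out.shape, out.sharding)
```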
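
A quick sanity check on the Ironwood headline numbers above: 42.5 exaflops spread over 9,216 chips comes out to roughly 4.6 petaflops of FP8 per chip.

```python
pod_fp8_flops = 42.5e18    # 42.5 exaflops of FP8, per the announced pod figure
chips_per_pod = 9_216

per_chip = pod_fp8_flops / chips_per_pod
print(f"{per_chip / 1e15:.2f} PFLOP/s per chip")   # ~4.61 PFLOP/s of FP8
```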