Touching the Elephant – TPUs
- #Google TPU
- #Deep Learning
- #AI Hardware
- Google's Tensor Processing Unit (TPU) is a specialized hardware accelerator for deep learning, originally developed to meet the rapidly growing compute demand of neural-network workloads in Google's datacenters.
- The TPU was created to head off a costly expansion of datacenter capacity for serving neural networks: rather than scaling out general-purpose hardware, Google bet on hardware efficiency, focusing first on inference before expanding to training.
- TPUv1 featured a 256x256 systolic array of 8-bit multiply-accumulate units, optimized for dense GEMM operations, and dropped general-purpose overheads such as caches and branch prediction (see the systolic-array sketch after this list).
- TPUv2 introduced a dual-core architecture, the bfloat16 (bf16) format for training, and the inter-chip interconnect (ICI) to enable distributed training across multiple chips (a bf16 example follows the list).
- TPUv4 added optical circuit switching (OCS) for flexible, reconfigurable pod topologies, SparseCores for embedding-heavy models, and an on-chip CMEM (common memory) to improve inference efficiency (an embedding-lookup sketch follows the list).
- The TPU's design philosophy emphasizes domain-specific optimization, hardware/software co-design with the XLA compiler, and system-level scaling to maximize performance per watt (see the XLA lowering sketch after this list).
- Google's TPU software stack includes Borg for resource allocation, libtpunet for ICI routing, and Pathways for distributed execution across the datacenter network (DCN) (a mesh-sharding sketch follows the list).
- Subsequent TPU generations (v5e, v5p, Trillium, Ironwood) further optimized for efficiency, performance, and scale, with Ironwood scaling to pods of 9,216 chips delivering 42.5 exaflops of FP8 compute (a per-chip arithmetic check follows the list).
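
To make the MXU idea concrete, here is a toy cycle-by-cycle simulation of a weight-stationary systolic array in NumPy: each processing element holds one weight, multiplies the activation streaming in from its left, adds the partial sum arriving from above, and passes both along on the next cycle. The array size, input skew, and drain schedule are simplified assumptions for illustration, not the actual TPUv1 microarchitecture.

```python
"""Toy cycle-by-cycle simulation of a weight-stationary systolic array."""
import numpy as np


def systolic_matmul(X, W):
    """Compute X @ W (M x K times K x N) on a simulated K x N grid of PEs."""
    M, K = X.shape
    K2, N = W.shape
    assert K == K2, "inner dimensions must match"

    act = np.zeros((K, N))   # activation latched in PE (i, j) this cycle
    psum = np.zeros((K, N))  # partial sum produced by PE (i, j) this cycle
    out = np.zeros((M, N))

    # Row i receives X[m, i] at cycle m + i (skewed), so the contributions to
    # one output row meet as the partial sums flow down each column.
    for t in range(M + K + N):
        # Sweep PEs bottom-right to top-left so that each PE still reads its
        # upstream neighbours' values from the *previous* cycle.
        for i in reversed(range(K)):
            for j in reversed(range(N)):
                if j > 0:
                    left = act[i, j - 1]                 # activation from the left PE
                else:
                    m = t - i                            # skewed edge input for row i
                    left = X[m, i] if 0 <= m < M else 0.0
                above = psum[i - 1, j] if i > 0 else 0.0
                act[i, j] = left
                psum[i, j] = above + left * W[i, j]      # multiply-accumulate: the PE's only job
        # The bottom PE of column j drains one finished dot product per cycle.
        for j in range(N):
            m = t - (K - 1) - j
            if 0 <= m < M:
                out[m, j] = psum[K - 1, j]
    return out


# Cross-check the toy array against a plain matmul.
rng = np.random.default_rng(0)
X = rng.standard_normal((4, 3))
W = rng.standard_normal((3, 5))
assert np.allclose(systolic_matmul(X, W), X @ W)
```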
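
On bf16: it keeps float32's 8 exponent bits but only 7 mantissa bits, so it spans the same dynamic range at lower precision, and matmuls are typically accumulated in float32. A minimal JAX sketch; the mixed-precision pattern shown is a common convention and an assumption here, not necessarily Google's exact training recipe.

```python
import jax.numpy as jnp

# bfloat16 trades mantissa bits for float32's full exponent range.
x = jnp.ones((128, 256), dtype=jnp.bfloat16)
w = jnp.ones((256, 64), dtype=jnp.bfloat16)

# bf16 inputs with float32 accumulation: a common mixed-precision pattern
# (illustrative assumption, not necessarily the exact TPU training recipe).
y = jnp.matmul(x, w, preferred_element_type=jnp.float32)
print(x.dtype, w.dtype, y.dtype)     # bfloat16 bfloat16 float32

# Show the precision loss: 1.001 rounds back to 1.0 in bfloat16 (7-bit mantissa).
a = jnp.float32(1.0) + jnp.float32(1e-3)
print(a, a.astype(jnp.bfloat16))
```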
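
SparseCores target the gather/scatter-plus-reduce pattern of large embedding tables, which maps poorly onto a dense MXU. The sketch below only shows the shape of that workload (an embedding gather followed by a per-example segment sum) in plain JAX; the table size and feature ids are made up, and dispatching such lookups to SparseCore hardware is handled by the compiler/runtime rather than by user code.

```python
import jax
import jax.numpy as jnp

# A toy embedding table: 10k rows ("vocabulary") of 64-dim vectors.
table = jnp.arange(10_000 * 64, dtype=jnp.float32).reshape(10_000, 64) * 1e-4

# Each example activates a few sparse feature ids (ragged in practice;
# flattened here with a segment id per lookup, values are illustrative).
ids = jnp.array([17, 42, 9000, 3, 42, 512])   # which rows to gather
segments = jnp.array([0, 0, 0, 1, 1, 2])      # which example each id belongs to

gathered = table[ids]                                              # irregular memory access
pooled = jax.ops.segment_sum(gathered, segments, num_segments=3)   # per-example reduction
print(pooled.shape)                                                # (3, 64)
```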
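
The hardware/software co-design shows up directly in JAX: a Python function is traced, lowered to XLA's IR, and XLA fuses and schedules it for whichever backend is attached (TPU, GPU, or CPU). A small sketch using the public jax.jit lowering API; the mlp_layer function is just an illustrative stand-in.

```python
import jax
import jax.numpy as jnp


def mlp_layer(x, w, b):
    # Matmul + bias + nonlinearity: XLA can fuse the elementwise ops
    # around the dot so intermediate data stays on-chip.
    return jax.nn.relu(x @ w + b)


x = jnp.ones((8, 128))
w = jnp.ones((128, 256))
b = jnp.zeros((256,))

jitted = jax.jit(mlp_layer)
print(jitted(x, w, b).shape)            # (8, 256), compiled on first call

# Peek at the intermediate representation handed to XLA.
lowered = jax.jit(mlp_layer).lower(x, w, b)
print(lowered.as_text()[:400])          # StableHLO/HLO text
```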
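
Beneath Pathways, the programmer-visible abstraction in JAX is a device mesh plus named shardings, and the same program runs whether the mesh spans one host's chips or a pod stitched together over ICI and DCN. A minimal single-process sketch that adapts to however many devices jax.devices() reports, so it also runs on a plain CPU; the axis name and shapes are arbitrary choices.

```python
import numpy as np
import jax
import jax.numpy as jnp
from jax.sharding import Mesh, NamedSharding, PartitionSpec as P

# Build a 1-D mesh over whatever devices are available (TPU chips in a pod
# slice, or a single CPU device when run locally).
devices = np.array(jax.devices())
mesh = Mesh(devices, axis_names=("data",))

# Shard the batch along the "data" mesh axis; replicate the weights.
batch = jnp.ones((8 * devices.size, 128))
weights = jnp.ones((128, 64))
batch = jax.device_put(batch, NamedSharding(mesh, P("data", None)))
weights = jax.device_put(weights, NamedSharding(mesh, P(None, None)))


@jax.jit
def forward(x, w):
    return jnp.tanh(x @ w)


out = forward(batch, weights)       # the compiler inserts any needed collectives
print(out.shape, out.sharding)
```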
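
A quick sanity check on the Ironwood headline numbers above: 42.5 exaflops spread over 9,216 chips comes out to roughly 4.6 petaflops of FP8 per chip.

```python
pod_fp8_flops = 42.5e18    # 42.5 exaflops of FP8, per the announced pod figure
chips_per_pod = 9_216

per_chip = pod_fp8_flops / chips_per_pod
print(f"{per_chip / 1e15:.2f} PFLOP/s per chip")   # ~4.61 PFLOP/s of FP8
```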