Hasty Briefs

Touching the Elephant – TPUs

5 days ago
  • #Google TPU
  • #Deep Learning
  • #AI Hardware
  • Google's Tensor Processing Unit (TPU) is a specialized hardware accelerator designed for deep learning, originally developed to address the growing demand for neural network computations.
  • The TPU was created so Google could meet surging demand for neural-network serving without massively expanding datacenter capacity, betting on hardware efficiency instead; it focused initially on inference before expanding to training.
  • TPUv1 featured a 256x256 systolic array for matrix multiplication, optimized for dense GEMM operations, and avoided general-purpose hardware overhead such as caches and branch prediction (the first sketch after this list shows how a large matmul decomposes into MXU-sized tiles).
  • TPUv2 introduced a dual-core architecture, the bfloat16 (bf16) format for training, and the inter-chip interconnect (ICI) to enable distributed training across multiple chips (see the bf16 and sharding sketches after this list).
  • TPUv4 added optical circuit switching (OCS) for flexible pod scaling, SparseCores for embedding-heavy models, and a shared CMEM cache to improve inference efficiency.
  • The TPU's design philosophy emphasizes domain-specific optimization, co-design with software (XLA compiler), and system-level scaling to maximize performance per watt.
  • Google's TPU software stack includes Borg for resource allocation, libtpunet for ICI routing, and Pathways for distributed execution across datacenter networks (DCN).
  • Subsequent TPU generations (v5e, v5p, Trillium, Ironwood) push further on efficiency, performance, and scale, with Ironwood scaling to pods of up to 9,216 chips delivering 42.5 exaFLOPS of FP8 compute.
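
A few short JAX sketches make the mechanisms above concrete. First, the systolic-array view of TPUv1: the MXU consumes fixed-size operand tiles, so a large GEMM is really a loop of MXU-sized partial products. The explicit Python loop below is purely illustrative (XLA performs this tiling automatically), and the 256x256 tile size follows TPUv1; later MXUs are 128x128.

```python
# Illustrative tiling of a GEMM into MXU-sized partial products.
# TILE = 256 mirrors TPUv1's 256x256 systolic array; the loop is a
# teaching device, not how XLA actually schedules the computation.
import jax.numpy as jnp

TILE = 256

def tiled_matmul(a, b):
    m, k = a.shape
    _, n = b.shape
    assert m % TILE == 0 and k % TILE == 0 and n % TILE == 0
    out = jnp.zeros((m, n), dtype=jnp.float32)
    for i in range(0, m, TILE):
        for j in range(0, n, TILE):
            acc = jnp.zeros((TILE, TILE), dtype=jnp.float32)
            for p in range(0, k, TILE):
                # One systolic-array pass: a TILE x TILE partial product.
                acc = acc + a[i:i+TILE, p:p+TILE] @ b[p:p+TILE, j:j+TILE]
            out = out.at[i:i+TILE, j:j+TILE].set(acc)
    return out

a = jnp.ones((512, 512), dtype=jnp.float32)
b = jnp.ones((512, 512), dtype=jnp.float32)
print(bool(jnp.allclose(tiled_matmul(a, b), a @ b)))  # True
```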
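
Second, the bf16 recipe TPUv2 introduced for training: multiply in bfloat16, accumulate in float32. In JAX this is expressed with `preferred_element_type`; the shapes below are arbitrary placeholders.

```python
# bf16 operands with fp32 accumulation, the mixed-precision pattern the
# MXU implements in hardware. Shapes here are arbitrary placeholders.
import jax.numpy as jnp

a = jnp.ones((1024, 1024), dtype=jnp.bfloat16)
b = jnp.ones((1024, 1024), dtype=jnp.bfloat16)

c = jnp.dot(a, b, preferred_element_type=jnp.float32)
print(a.dtype, b.dtype, c.dtype)  # bfloat16 bfloat16 float32
```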
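
Finally, scaling over ICI: in JAX, parallelism is expressed by placing arrays on a logical device mesh and letting the compiler insert the collectives that run over the interconnect. The 2x4 mesh, the axis names "data" and "model", and the assumption of 8 addressable devices are all illustrative choices, not a fixed TPU topology.

```python
# Sketch of data/model parallelism over a logical mesh of devices.
# Assumes 8 addressable devices (e.g. a small TPU slice, or CPU devices
# with XLA_FLAGS=--xla_force_host_platform_device_count=8 for local testing).
import numpy as np
import jax
import jax.numpy as jnp
from jax.sharding import Mesh, NamedSharding, PartitionSpec as P

devices = np.array(jax.devices()[:8]).reshape(2, 4)
mesh = Mesh(devices, axis_names=("data", "model"))

# Shard the contraction dimension along the "model" axis on both operands,
# and the batch dimension of x along the "data" axis.
x = jax.device_put(jnp.ones((256, 512)), NamedSharding(mesh, P("data", "model")))
w = jax.device_put(jnp.ones((512, 1024)), NamedSharding(mesh, P("model", None)))

@jax.jit
def layer(x, w):
    # Because the contraction dimension is sharded, XLA's SPMD partitioner
    # inserts an all-reduce of the partial products; on a real TPU slice
    # that collective runs over ICI.
    return x @ w

y = layer(x, w)
print(y.sharding)
```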