Show HN: I built a toy TPU that can do inference and training on the XOR problem
5 days ago
- #hardware-design
- #machine-learning
- #systolic-array
- The project aimed to build a 'toy' TPU (Tensor Processing Unit) from scratch, focusing on understanding and replicating its functionality for both inference and training.
- The team chose the TPU because it was a challenging project and because there were no well-documented open-source ML accelerators that handled both inference and training.
- Design philosophy: 'ALWAYS TRY THE HACKY WAY'—prioritizing original ideas before consulting external sources to reinvent rather than reverse-engineer the TPU.
- The TPU is an ASIC designed by Google for efficient ML workloads, excelling at matrix multiplications, which dominate deep learning computations.
- Key hardware concepts: clock cycles, Verilog for hardware description, and systolic arrays for efficient matrix multiplication.
- The systolic array consists of Processing Elements (PEs), each performing one multiply-accumulate (MAC) operation per clock cycle, so matrix products emerge in parallel as data flows through the grid (a behavioral sketch follows this list).
- The XOR problem was used as a simple case study for both inference and training; because XOR is not linearly separable, a single-layer perceptron cannot learn it, and an MLP with a hidden layer is required (a hand-weighted example appears after the list).
- Pipelining and double buffering were implemented to optimize performance: inference keeps streaming from one weight bank while updated weights are written to the other (a ping-pong buffer sketch appears below).
- Backpropagation was integrated into the design, leveraging the systolic array for gradient calculations and weight updates, since every gradient in an MLP is itself a matrix multiply (see the matmul-based sketch below).
- A unified buffer (UB) was introduced to store intermediate values during training, improving efficiency and scalability.
- The final design included a custom instruction set architecture (ISA) and a control unit to sequence operations, achieving a fully functional 'toy' TPU (an illustrative encoding appears below).
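The sketches below model the post's main techniques in Python for readability; the actual project is Verilog RTL, and every name, weight value, and encoding here is an illustrative assumption rather than the author's design.

First, the systolic dataflow: an output-stationary grid where each PE multiplies the operand arriving from its left neighbour by the one arriving from above, accumulates the product, and forwards both operands onward. The shift-then-inject scheduling is one common scheme, not necessarily the author's.

```python
def systolic_matmul(A, B):
    """Multiply two n x n matrices on a simulated n x n grid of PEs.

    Each PE holds one accumulator (one entry of the result) and, every
    cycle, multiplies the A value arriving from its left neighbour by the
    B value arriving from its top neighbour, adds the product to its
    accumulator, then forwards A rightward and B downward.
    """
    n = len(A)
    acc = [[0] * n for _ in range(n)]    # one accumulator per PE
    a_reg = [[0] * n for _ in range(n)]  # A operand held in each PE
    b_reg = [[0] * n for _ in range(n)]  # B operand held in each PE

    # Row i of A enters i cycles late from the left; column j of B enters
    # j cycles late from the top. The last PE finishes at cycle 3n - 3.
    for t in range(3 * n - 2):
        # Shift operands right/down (reverse order avoids overwriting).
        for i in range(n):
            for j in range(n - 1, 0, -1):
                a_reg[i][j] = a_reg[i][j - 1]
        for i in range(n - 1, 0, -1):
            for j in range(n):
                b_reg[i][j] = b_reg[i - 1][j]
        # Inject skewed values at the array edges (0 once a stream ends).
        for i in range(n):
            k = t - i
            a_reg[i][0] = A[i][k] if 0 <= k < n else 0
        for j in range(n):
            k = t - j
            b_reg[0][j] = B[k][j] if 0 <= k < n else 0
        # Every PE performs one multiply-accumulate per clock cycle.
        for i in range(n):
            for j in range(n):
                acc[i][j] += a_reg[i][j] * b_reg[i][j]
    return acc

A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
assert systolic_matmul(A, B) == [[19, 22], [43, 50]]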
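For the XOR case study, here is a 2-2-1 MLP with hand-picked weights that computes XOR exactly. The "OR minus AND" construction is the standard textbook solution, not the weights the author's training run converged to, and the step activation is used only for clarity (training needs a differentiable activation such as sigmoid).

```python
def step(x):
    return 1 if x > 0 else 0

# Hidden layer: neuron 0 computes OR, neuron 1 computes AND.
W1 = [[1.0, 1.0],   # weights into hidden neuron 0
      [1.0, 1.0]]   # weights into hidden neuron 1
b1 = [-0.5, -1.5]
# Output layer: OR and not AND, i.e. XOR.
W2 = [1.0, -1.0]
b2 = -0.5

def xor_mlp(x1, x2):
    h = [step(x1 * W1[i][0] + x2 * W1[i][1] + b1[i]) for i in range(2)]
    return step(h[0] * W2[0] + h[1] * W2[1] + b2)

for a in (0, 1):
    for b in (0, 1):
        assert xor_mlp(a, b) == (a ^ b)
```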
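Double buffering ("ping-pong" weight banks) is what allows continuous inference: reads always hit the active bank while training writes the next weights into the shadow bank, and a single pointer flip publishes the update. A minimal software model of the idea, with invented names:

```python
class DoubleBuffer:
    def __init__(self, weights):
        self.banks = [list(weights), list(weights)]
        self.active = 0  # bank the systolic array currently reads

    def read(self):
        # Inference path: always reads the active bank.
        return self.banks[self.active]

    def write(self, new_weights):
        # Training path: writes the shadow bank; reads are undisturbed.
        self.banks[1 - self.active] = list(new_weights)

    def swap(self):
        # A one-cycle pointer flip makes the new weights visible atomically.
        self.active = 1 - self.active

buf = DoubleBuffer([0.1, 0.2])
buf.write([0.3, 0.4])             # backprop produced updated weights
assert buf.read() == [0.1, 0.2]   # old weights still served
buf.swap()
assert buf.read() == [0.3, 0.4]   # update takes effect after one swap
```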
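Backpropagation can reuse the same matrix-multiply hardware because every gradient in an MLP is itself a matmul. The sketch below writes one SGD step on the XOR batch so that each heavy operation is an explicit matrix product; the sigmoid activation, squared-error loss, and learning rate are assumptions, since the post does not specify them.

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)  # (4, 2) inputs
y = np.array([[0], [1], [1], [0]], dtype=float)              # (4, 1) targets

W1 = rng.normal(size=(2, 2)); b1 = np.zeros(2)
W2 = rng.normal(size=(2, 1)); b2 = np.zeros(1)
lr = 0.5

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Forward pass: two matmuls; the intermediates h and out are exactly the
# values a unified buffer would hold for the backward pass.
h = sigmoid(X @ W1 + b1)      # (4, 2) hidden activations
out = sigmoid(h @ W2 + b2)    # (4, 1) predictions

# Backward pass: every gradient is again a matmul, so the same systolic
# array can compute it.
d_out = (out - y) * out * (1 - out)   # (4, 1) output-layer delta
dW2 = h.T @ d_out                     # (2, 1) gradient for W2
d_h = (d_out @ W2.T) * h * (1 - h)    # (4, 2) error pushed back through W2
dW1 = X.T @ d_h                       # (2, 2) gradient for W1

# Weight update (double buffering lets inference keep running while the
# new weights are loaded into the shadow bank).
W2 -= lr * dW2; b2 -= lr * d_out.sum(axis=0)
W1 -= lr * dW1; b1 -= lr * d_h.sum(axis=0)
```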
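Finally, a sketch of what a tiny control ISA for such a machine could look like. The opcode names, field widths, and the five operations below are invented for illustration; the post does not document the author's actual encoding.

```python
# 16-bit instruction word: [15:12] opcode, [11:6] src address, [5:0] dst.
OPCODES = {
    "LOAD_WEIGHTS": 0x1,  # UB address -> systolic array weight registers
    "MATMUL":       0x2,  # stream activations from the UB through the array
    "ACTIVATE":     0x3,  # apply the non-linearity to the accumulators
    "STORE":        0x4,  # write results back to the unified buffer
    "SWAP_BUFFERS": 0x5,  # flip the double-buffered weight banks
}

def encode(op, src=0, dst=0):
    return (OPCODES[op] << 12) | ((src & 0x3F) << 6) | (dst & 0x3F)

def decode(word):
    names = {v: k for k, v in OPCODES.items()}
    return names[word >> 12], (word >> 6) & 0x3F, word & 0x3F

# A minimal program: one forward-pass layer.
program = [
    encode("LOAD_WEIGHTS", src=0),
    encode("MATMUL", src=8, dst=16),
    encode("ACTIVATE", src=16, dst=16),
    encode("STORE", src=16, dst=24),
]
assert decode(program[1]) == ("MATMUL", 8, 16)
```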