Hasty Briefs (beta)

Show HN: I built a toy TPU that can do inference and training on the XOR problem

6 days ago
  • #hardware-design
  • #machine-learning
  • #systolic-array
  • The project aimed to build a 'toy' TPU (Tensor Processing Unit) from scratch, focusing on understanding and replicating its functionality for both inference and training.
  • The team chose a TPU because it was a challenging project: they found no well-documented open-source ML accelerator that handles both inference and training.
  • Design philosophy: 'ALWAYS TRY THE HACKY WAY'—prioritizing original ideas before consulting external sources to reinvent rather than reverse-engineer the TPU.
  • The TPU is an ASIC designed by Google for efficient ML workloads, excelling at matrix multiplications, which dominate deep learning computations.
  • Key hardware concepts: clock cycles, Verilog for hardware description, and systolic arrays for efficient matrix multiplication.
  • The systolic array consists of Processing Elements (PEs) that perform multiply-accumulate operations, enabling parallel computation.
  • The XOR problem served as a simple case study for inference and training; because XOR is not linearly separable, it requires an MLP (multilayer perceptron) that can learn a non-linear decision boundary.
  • Pipelining and double buffering were implemented to optimize performance, ensuring continuous inference and efficient weight updates.
  • Backpropagation was integrated into the design, leveraging the systolic array for gradient calculations and weight updates.
  • A unified buffer (UB) was introduced to store intermediate values during training, improving efficiency and scalability.
  • The final design included a custom instruction set architecture (ISA) and control unit to manage operations, achieving a fully functional 'toy' TPU.
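The systolic-array bullets above can be modeled in a few lines of Python (the project itself is written in Verilog; this is a behavioral sketch, not the author's code). Each PE(i, j) owns one accumulator; values of A flow left-to-right along rows and values of B flow top-to-bottom along columns, skewed by one cycle per hop so matching operands meet at the right PE on the right cycle:

```python
def systolic_matmul(A, B):
    """Behavioral model of an output-stationary systolic array computing C = A @ B.

    A is n x k, B is k x m. The element pair for reduction index s reaches
    PE(i, j) at cycle t = s + i + j, which is exactly the skew a real array
    imposes by delaying row i and column j of the input streams.
    """
    n, k, m = len(A), len(A[0]), len(B[0])
    acc = [[0] * m for _ in range(n)]          # one accumulator per PE
    for t in range(n + m + k - 2):             # last MAC fires at cycle n+m+k-3
        for i in range(n):
            for j in range(m):
                s = t - i - j                  # which k-index reaches PE(i, j) now
                if 0 <= s < k:
                    acc[i][j] += A[i][s] * B[s][j]  # one multiply-accumulate per cycle
    return acc
```

The outer loop over `t` mimics the clock: every PE performs at most one multiply-accumulate per cycle, which is why an n x m array finishes an n x k by k x m product in O(n + m + k) cycles instead of O(n * m * k) sequential steps.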
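The double-buffering bullet can be sketched the same way: while the array computes on one buffer, the next tile is fetched into the other ("ping-pong" buffers), hiding load latency. The `load` and `compute` callables here are stand-ins, not names from the project:

```python
def double_buffered_stream(tiles, compute, load):
    """Process tiles with two alternating buffers: compute on the active
    buffer while prefetching the next tile into the shadow buffer."""
    bufs = [None, None]
    results = []
    bufs[0] = load(tiles[0])                     # prime the first buffer
    for i in range(len(tiles)):
        if i + 1 < len(tiles):
            bufs[(i + 1) % 2] = load(tiles[i + 1])  # prefetch into the shadow buffer
        results.append(compute(bufs[i % 2]))        # compute on the active buffer
    return results
```

In hardware the `load` and `compute` steps run concurrently; in this sequential sketch only the alternation pattern is visible, but it shows why two buffers suffice for continuous inference.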
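A pure-Python version of the XOR workload shows what the datapath must support end to end: a forward pass through sigmoid units, the delta-rule backward pass, and weight updates. The network shape (2 inputs, 4 hidden sigmoid units, 1 output) and the hyperparameters are assumptions for illustration, not necessarily the author's:

```python
import math
import random

def train_xor(epochs=20000, lr=1.0, hidden=4, seed=0):
    """Train a tiny 2-input MLP on XOR with per-sample gradient descent.
    Returns (losses, predict): per-epoch MSE losses and an inference closure."""
    rng = random.Random(seed)
    W1 = [[rng.uniform(-1, 1) for _ in range(2)] for _ in range(hidden)]
    b1 = [0.0] * hidden
    W2 = [rng.uniform(-1, 1) for _ in range(hidden)]
    b2 = 0.0
    data = [([0, 0], 0), ([0, 1], 1), ([1, 0], 1), ([1, 1], 0)]
    sig = lambda z: 1.0 / (1.0 + math.exp(-z))
    losses = []
    for _ in range(epochs):
        loss = 0.0
        for x, t in data:
            # forward pass: hidden activations, then the scalar output
            h = [sig(W1[i][0] * x[0] + W1[i][1] * x[1] + b1[i]) for i in range(hidden)]
            y = sig(sum(W2[i] * h[i] for i in range(hidden)) + b2)
            loss += 0.5 * (y - t) ** 2
            # backward pass: chain rule through both sigmoid layers
            dy = (y - t) * y * (1 - y)            # dL/dz at the output
            for i in range(hidden):
                dh = dy * W2[i] * h[i] * (1 - h[i])  # dL/dz at hidden unit i
                W2[i] -= lr * dy * h[i]
                W1[i][0] -= lr * dh * x[0]
                W1[i][1] -= lr * dh * x[1]
                b1[i] -= lr * dh
            b2 -= lr * dy
        losses.append(loss)
    def predict(x):
        h = [sig(W1[i][0] * x[0] + W1[i][1] * x[1] + b1[i]) for i in range(hidden)]
        return sig(sum(W2[i] * h[i] for i in range(hidden)) + b2)
    return losses, predict
```

Both the forward matrix-vector products and the backward gradient products are the same multiply-accumulate pattern, which is why the post can reuse the systolic array for backpropagation.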
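Finally, the custom ISA and control unit reduce to a fetch-decode-execute loop over accelerator instructions. The opcodes below (LOAD, MATMUL, STORE, HALT) and the register/memory model are illustrative guesses, not the post's actual instruction set:

```python
def run_program(program, memory):
    """Minimal control unit for a hypothetical accelerator ISA.
    Instructions are (opcode, *operands) tuples; `memory` maps addresses
    (strings here) to matrices, standing in for the unified buffer."""
    regs = {}
    pc = 0
    while pc < len(program):
        op, *args = program[pc]
        if op == "LOAD":                       # LOAD reg, addr
            regs[args[0]] = memory[args[1]]
        elif op == "MATMUL":                   # MATMUL dst, a, b
            A, B = regs[args[1]], regs[args[2]]
            regs[args[0]] = [[sum(A[i][k] * B[k][j] for k in range(len(B)))
                              for j in range(len(B[0]))] for i in range(len(A))]
        elif op == "STORE":                    # STORE addr, reg
            memory[args[0]] = regs[args[1]]
        elif op == "HALT":
            break
        pc += 1
    return memory

# Usage: multiply two matrices held in the buffer and write back the result.
buf = {"a": [[1, 2], [3, 4]], "b": [[5, 6], [7, 8]]}
prog = [("LOAD", "r0", "a"), ("LOAD", "r1", "b"),
        ("MATMUL", "r2", "r0", "r1"), ("STORE", "c", "r2"), ("HALT",)]
run_program(prog, buf)
```

In the real design MATMUL would be dispatched to the systolic array and LOAD/STORE would move tiles between the unified buffer and the PEs; the control unit's job is only this sequencing.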