Talos: Hardware accelerator for deep convolutional neural networks
- #HardwareAcceleration
- #CNN
- #FPGA
- Talos is a custom FPGA-based hardware accelerator designed for efficient execution of Convolutional Neural Networks (CNNs).
- Unlike flexible deep learning frameworks, Talos eliminates runtime, scheduler, and OS overhead by implementing the entire inference pipeline in SystemVerilog for deterministic, cycle-accurate control.
- Hardware debugging is more challenging than software debugging: designs must meet precise timing and stay within physical constraints such as logic elements, on-chip memory, and clock budgets.
- Talos optimizes for inference by stripping away unnecessary features, using fixed-point arithmetic, and ensuring deterministic behavior with known cycle costs for operations.
- The architecture includes a single convolutional layer, ReLU activation, MaxPool layer, and a fully connected layer, all optimized for hardware efficiency.
- Fixed-point arithmetic (Q16.16) is used to represent the trained floating-point weights, keeping hardware execution deterministic and efficient.
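As a rough illustration of the Q16.16 format, here is a minimal C model (names like `q_mul` are illustrative, not from the project): values carry 16 fractional bits, and a multiply needs a 64-bit intermediate product followed by a 16-bit right shift to return to Q16.16.

```c
#include <stdint.h>

/* Q16.16: 16 integer bits, 16 fractional bits, stored in int32_t. */
typedef int32_t q16_16;

#define Q_FRAC_BITS 16
#define Q_ONE (1 << Q_FRAC_BITS)

/* Convert a float weight to Q16.16 (done offline when exporting weights). */
static q16_16 q_from_float(float f) { return (q16_16)(f * Q_ONE); }

/* Multiply two Q16.16 values: the 64-bit product carries 32 fractional
 * bits, so shift right by 16 to get back to Q16.16. */
static q16_16 q_mul(q16_16 a, q16_16 b) {
    return (q16_16)(((int64_t)a * (int64_t)b) >> Q_FRAC_BITS);
}
```

For example, `q_mul(q_from_float(0.5f), q_from_float(3.0f))` yields the Q16.16 encoding of 1.5. The same shift-after-widen pattern maps directly onto a hardware multiplier feeding a wide accumulator.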
- Convolution is implemented as a multiply-accumulate (MAC) loop, with weights and inputs in Q16.16 format.
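A software sketch of that MAC loop, under assumptions not stated in the post (a 3x3 kernel, one output pixel per call): each window/weight pair is multiplied into a wide accumulator, with a single shift at the end rather than per-step rounding.

```c
#include <stdint.h>

typedef int32_t q16_16;

#define K 3  /* kernel size; assumed for illustration */

/* One output pixel: multiply-accumulate over a KxK window in Q16.16.
 * Accumulating in 64 bits and shifting once at the end mirrors how a
 * hardware MAC keeps a wide accumulator to avoid intermediate rounding. */
static q16_16 conv_pixel(q16_16 win[K][K], q16_16 ker[K][K]) {
    int64_t acc = 0;
    for (int r = 0; r < K; r++)
        for (int c = 0; c < K; c++)
            acc += (int64_t)win[r][c] * (int64_t)ker[r][c];
    return (q16_16)(acc >> 16);  /* back to Q16.16 */
}
```

In hardware, the two nested loops collapse into a single counter that streams one window/weight pair into the accumulator per cycle, which is what gives the operation its known, fixed cycle cost.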
- MaxPool and ReLU are fused for efficiency: seeding the pooling comparison at zero clamps negative values, so ReLU costs no extra cycles.
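The fusion trick can be sketched in a few lines of C (2x2 pooling window assumed): because ReLU is monotonic, `ReLU(max(window))` equals `max(0, window...)`, so starting the running maximum at zero performs both operations in one pass.

```c
#include <stdint.h>

typedef int32_t q16_16;

#define P 2  /* pooling window size; 2x2 assumed */

/* Fused ReLU + MaxPool: seed the running max at 0 instead of the first
 * element. An all-negative window pools to 0, which is exactly
 * ReLU(max(window)) -- no separate ReLU pass or extra cycles needed. */
static q16_16 relu_maxpool(q16_16 win[P][P]) {
    q16_16 m = 0;  /* seeding at zero is what implements the ReLU */
    for (int r = 0; r < P; r++)
        for (int c = 0; c < P; c++)
            if (win[r][c] > m) m = win[r][c];
    return m;
}
```

In hardware this is the same comparator tree the MaxPool already needs, just with its initial operand tied to zero.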
- Talos uses a time-multiplexing architecture to fit within FPGA constraints, running CNN and MaxPool modules consecutively for each kernel.
- Weight storage was optimized using M10K ROM blocks, reducing resource utilization and enabling clean routing.
- Latency management includes a priming mechanism to handle ROM read delays, ensuring valid data for arithmetic operations.
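A toy cycle-level model of the priming idea, assuming a one-cycle synchronous ROM (the actual latency and signal names are not given in the post): the address is registered on one clock edge and the data appears on the next, so a valid flag is delayed by the same amount and downstream arithmetic only consumes data once the pipeline is primed.

```c
#include <stdbool.h>
#include <stdint.h>

/* One-cycle synchronous ROM model: a read requested at cycle t is
 * visible at cycle t+1. The valid flag travels with the request, so
 * the garbage output during the priming cycle is never consumed. */
static const int32_t rom_contents[4] = {10, 20, 30, 40};

typedef struct {
    int pending_addr;    /* request latched last cycle */
    bool pending_valid;
    int32_t data_out;    /* registered ROM output */
    bool data_valid;
} rom_model;

/* Advance one clock edge; returns whether data_out is meaningful. */
static bool rom_tick(rom_model *m, int addr, bool read_en) {
    /* Output stage: last cycle's request appears now. */
    m->data_out = rom_contents[m->pending_addr & 3];
    m->data_valid = m->pending_valid;
    /* Input stage: latch this cycle's request. */
    m->pending_addr = addr;
    m->pending_valid = read_en;
    return m->data_valid;
}
```

The first tick after reset returns invalid data (the priming cycle); from then on, every tick delivers the word requested one cycle earlier, which is the behavior the MAC datapath has to be aligned against.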
- The project highlights the challenges and rewards of hardware design, emphasizing simplicity, explicit control, and deterministic timing.