Talos: Hardware accelerator for deep convolutional neural networks

a month ago

#HardwareAcceleration
#CNN
#FPGA

Talos is a custom FPGA-based hardware accelerator designed for efficient execution of Convolutional Neural Networks (CNNs).
Unlike flexible deep learning frameworks, Talos eliminates runtime, scheduler, and OS overhead by implementing the entire inference pipeline in SystemVerilog for deterministic, cycle-accurate control.
Hardware debugging is more challenging than software debugging: the design must meet precise timing and fit within physical constraints such as logic elements, on-chip memory, and clock budgets.
Talos optimizes for inference by stripping away unnecessary features, using fixed-point arithmetic, and ensuring deterministic behavior with known cycle costs for operations.
The architecture comprises a single convolutional layer, a ReLU activation, a MaxPool layer, and a fully connected layer, all optimized for hardware efficiency.
Floating-point weights are converted to Q16.16 fixed-point (16 integer bits, 16 fractional bits), which keeps hardware execution deterministic and efficient.
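
As a rough illustration of the Q16.16 format, the sketch below quantizes floats to 32-bit signed integers with 16 fractional bits and multiplies two such values. It is a Python model, not the Talos SystemVerilog; the names `to_q16_16`, `from_q16_16`, and `q_mul`, as well as the round-to-nearest and saturation behavior, are assumptions made for this example.

```python
FRAC_BITS = 16            # number of fractional bits in Q16.16
SCALE = 1 << FRAC_BITS    # 65536: the integer that represents 1.0

INT32_MIN = -(1 << 31)
INT32_MAX = (1 << 31) - 1


def to_q16_16(x: float) -> int:
    """Quantize a float to Q16.16 (round to nearest, saturate to 32 bits)."""
    raw = int(round(x * SCALE))
    return max(INT32_MIN, min(INT32_MAX, raw))


def from_q16_16(q: int) -> float:
    """Convert a Q16.16 integer back to a float for inspection."""
    return q / SCALE


def q_mul(a: int, b: int) -> int:
    """Multiply two Q16.16 values.

    The full product carries 32 fractional bits, so it is shifted
    right by 16 to return to Q16.16.
    """
    return (a * b) >> FRAC_BITS
```

For instance, `q_mul(to_q16_16(1.5), to_q16_16(2.0))` yields the Q16.16 encoding of 3.0. In hardware the same shift is just a fixed bit selection from the wide product.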
Convolution is implemented as a multiply-accumulate (MAC) loop, with weights and inputs in Q16.16 format.
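
The MAC loop can be sketched in software as follows. This is a Python model under stated assumptions, not the Talos RTL: it handles a single channel with a square kernel and no padding, keeps the wide accumulator at 32 fractional bits, and shifts back to Q16.16 once per output. The function name `conv2d_q16_16` and the optional `bias` parameter are hypothetical.

```python
FRAC_BITS = 16  # fractional bits in Q16.16


def conv2d_q16_16(image, kernel, bias=0):
    """Valid (no padding) 2D convolution on Q16.16 integers.

    image, kernel: lists of lists of Q16.16 integers; kernel is square.
    bias: a Q16.16 integer added to every output.
    """
    k = len(kernel)
    out_h = len(image) - k + 1
    out_w = len(image[0]) - k + 1
    out = [[0] * out_w for _ in range(out_h)]
    for r in range(out_h):
        for c in range(out_w):
            # Products carry 32 fractional bits, so align the bias to match.
            acc = bias << FRAC_BITS
            for i in range(k):
                for j in range(k):
                    # One multiply-accumulate step per kernel tap.
                    acc += image[r + i][c + j] * kernel[i][j]
            # Single shift back to Q16.16 after all taps are summed.
            out[r][c] = acc >> FRAC_BITS
    return out
```

Shifting once at the end, rather than after every product, avoids accumulating truncation error across taps; whether Talos does this is not stated in the source.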
MaxPool and ReLU are fused: the running maximum for each pooling window is initialized to zero, so negative values are clamped without spending extra cycles on a separate ReLU pass.
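
The fusion works because taking the maximum of a window and then clamping at zero gives the same result as starting the comparison at zero. A minimal Python sketch, assuming a 2x2 window with stride 2 (the source does not state the window size) and a hypothetical function name:

```python
def relu_maxpool_2x2(fmap):
    """Fused ReLU + 2x2 max pooling with stride 2.

    fmap: list of lists of integers (for example Q16.16 values).
    """
    out = []
    for r in range(0, len(fmap) - 1, 2):
        row = []
        for c in range(0, len(fmap[0]) - 1, 2):
            # Starting at zero is what applies ReLU: a window of
            # all-negative values never replaces the initial value.
            best = 0
            for i in range(2):
                for j in range(2):
                    v = fmap[r + i][c + j]
                    if v > best:
                        best = v
            row.append(best)
        out.append(row)
    return out
```

Each window costs only its four comparisons; no separate activation pass over the feature map is needed.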
Talos uses a time-multiplexing architecture to fit within FPGA constraints, running CNN and MaxPool modules consecutively for each kernel.
Weight storage was optimized using M10K ROM blocks, reducing resource utilization and enabling clean routing.
Latency management includes a priming mechanism to handle ROM read delays, ensuring valid data for arithmetic operations.
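
The priming idea can be modeled with a toy synchronous ROM whose data appears one clock after the address is presented. The one-cycle latency, the class `SyncROM`, and the function `mac_with_priming` are assumptions for illustration; the actual Talos read latency and control logic are not given in the source.

```python
class SyncROM:
    """Toy synchronous ROM: data for an address appears one clock later."""

    def __init__(self, contents):
        self.contents = list(contents)
        self.q = None  # output register, undefined before the first clock

    def clock(self, addr):
        """Clock edge: latch contents[addr] into the output register."""
        self.q = self.contents[addr]


def mac_with_priming(rom, inputs):
    """Accumulate sum(weight[i] * inputs[i]) with weights read from rom.

    Returns (accumulator, cycles). One extra priming cycle is spent
    issuing address 0 before any arithmetic, so every multiply sees
    valid ROM data.
    """
    n = len(inputs)
    acc = 0
    rom.clock(0)  # priming cycle: request weight 0, do no arithmetic
    cycles = 1
    for i in range(n):
        weight = rom.q  # valid, because it was requested a cycle earlier
        acc += weight * inputs[i]
        # Request the next weight while this MAC executes.
        rom.clock(min(i + 1, n - 1))
        cycles += 1
    return acc, cycles
```

With weights `[2, 3, 4]` and inputs `[1, 10, 100]`, the loop takes one priming cycle plus three MAC cycles. Skipping the priming step would feed the first multiply an undefined ROM output.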
The project highlights the challenges and rewards of hardware design, emphasizing simplicity, explicit control, and deterministic timing.
