Talos: Hardware accelerator for deep convolutional neural networks

4 hours ago
  • #HardwareAcceleration
  • #CNN
  • #FPGA
  • Talos is a custom FPGA-based hardware accelerator designed for efficient execution of Convolutional Neural Networks (CNNs).
  • Unlike flexible deep learning frameworks, Talos eliminates runtime, scheduler, and OS overhead by implementing the entire inference pipeline in SystemVerilog for deterministic, cycle-accurate control.
  • Hardware debugging is harder than software debugging: the design must meet precise timing and fit within physical constraints such as logic elements, on-chip memory, and clock budgets.
  • Talos optimizes for inference by stripping away unnecessary features, using fixed-point arithmetic, and ensuring deterministic behavior with known cycle costs for operations.
  • The architecture includes a single convolutional layer, a ReLU activation, a MaxPool layer, and a fully connected layer, all optimized for hardware efficiency.
  • Floating-point weights are converted to Q16.16 fixed point (16 integer bits, 16 fractional bits), so the hardware needs only integer arithmetic and executes deterministically and efficiently.
  • Convolution is implemented as a multiply-accumulate (MAC) loop, with weights and inputs in Q16.16 format.
  • MaxPool and ReLU operations are fused: the running maximum is initialized to zero, so negative values never win the comparison and ReLU costs no extra cycles.
  • Talos uses a time-multiplexed architecture to fit within FPGA constraints, running the CNN and MaxPool modules back to back for each kernel.
  • Weights are stored in ROMs built from M10K on-chip memory blocks, reducing resource utilization and enabling clean routing.
  • A priming mechanism absorbs ROM read latency, so the arithmetic units only ever operate on valid data.
  • The project highlights the challenges and rewards of hardware design, emphasizing simplicity, explicit control, and deterministic timing.
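
The Q16.16 multiply-accumulate, the fused ReLU/MaxPool, and the ROM priming step described above can be sketched as a small software reference model. This is a Python illustration, not the project's SystemVerilog. Every name in it (`to_q16_16`, `q_mul`, `relu_maxpool`, `SyncRom`, `conv_mac`) is hypothetical, and the one-cycle ROM read latency is an assumption. Python integers do not overflow, so accumulator width and any wraparound or saturation behavior are not modeled.

```python
FRAC_BITS = 16            # Q16.16: 16 integer bits, 16 fractional bits
SCALE = 1 << FRAC_BITS    # 65536


def to_q16_16(x: float) -> int:
    """Quantize a float to a Q16.16 integer (round to nearest)."""
    return int(round(x * SCALE))


def q_mul(a: int, b: int) -> int:
    """Multiply two Q16.16 values and drop the extra 16 fractional bits."""
    return (a * b) >> FRAC_BITS   # arithmetic shift, as for signed values


def relu_maxpool(window) -> int:
    """Fused ReLU + MaxPool: the running max starts at 0, so negatives never win."""
    best = 0
    for v in window:
        if v > best:
            best = v
    return best


class SyncRom:
    """Hypothetical weight ROM with an assumed one-cycle read latency."""

    def __init__(self, contents):
        self.contents = list(contents)
        self.q = None             # output register: invalid until the first clock

    def clock(self, addr: int) -> None:
        """One clock edge: the word at addr becomes visible on q."""
        self.q = self.contents[addr]


def conv_mac(rom: SyncRom, window) -> int:
    """Multiply-accumulate over one window, reading weights from the ROM."""
    acc = 0
    rom.clock(0)                  # priming cycle: fetch weight 0, accumulate nothing
    for i, x in enumerate(window):
        acc += q_mul(rom.q, x)    # rom.q was requested one cycle earlier
        if i + 1 < len(window):
            rom.clock(i + 1)      # request the next weight for the next iteration
    return acc


weights = SyncRom(to_q16_16(w) for w in (0.5, -1.0, 0.25))
window = [to_q16_16(x) for x in (1.0, 2.0, 3.0)]
conv_out = conv_mac(weights, window)          # 0.5 - 2.0 + 0.75 = -0.75
pooled = relu_maxpool([conv_out, to_q16_16(1.25), to_q16_16(-4.0)])
print(conv_out / SCALE, pooled / SCALE)       # -0.75 1.25
```

Without the priming call, the first multiply would read an invalid ROM output. With it, each weight is requested one step before it is consumed, which is the role the summary gives to the priming mechanism.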