
Measuring FPGA vs ARM on Pynq-Z2: Tiny MLP, Huge AXI/DMA Overhead

2 days ago
  • #Machine Learning
  • #High-Frequency Trading
  • #FPGA
  • A lab notebook detailing the integration of a tiny MLP (Multi-Layer Perceptron) into an FPGA datapath for high-frequency trading (HFT) applications.
  • The project aims to create a minimal, measurable HFT datapath that interacts with real hardware, moving beyond Python backtests.
  • Two-lane architecture: a Reflex lane on the CPU for hard-coded rules and an Inference lane on the FPGA for a tiny quantized MLP (a minimal reference model of this kind of network is sketched after this list).
  • Original goal: Process each packet through features → MLP → decision → packet out, with a timestamp captured at each stage.
  • Current focus: Benchmarking System-on-Chip (SoC) performance on the Pynq-Z2, comparing the ARM reflex lane against the FPGA MLP lane.
  • Key findings: The MLP math itself is tiny (64 cycles ≈ 0.5 µs), but the fabric shell around it is large (≈140k cycles ≈ 1.0–1.3 ms); a back-of-envelope check of these figures follows this list.
  • Performance comparison: ARM reflex lane is ~100× faster than the FPGA lane (ARM: ~16–20 µs, FPGA: ~3.4–3.6 ms with DMA).
  • Four overlays were used to measure latency (Full, MLP-only, No-DMA, and Core probe), revealing that the shell overhead, not the MLP, dominates; a sketch of how such a DMA round trip can be timed from PYNQ also follows this list.
  • Challenges identified: The AXI interconnect, width converters, the PL/PS boundary crossing, and the software control path each add significant latency.
  • Contrast with modern HFT setups: Real designs use streamlined pipelines, minimal FIFOs, and sideband controls for far lower latency.
  • Key takeaway: The project highlights the importance of disciplined datapath design, so that the MLP's latency is not swamped by shell overhead.
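
The post does not spell out the MLP's topology or bit widths, so the NumPy sketch below is only an illustration of the kind of int8, fixed-point compute the inference lane performs: two small dense layers with a ReLU and a shift-based requantization. The layer sizes, random weights, `SHIFT` constant, and decision encoding are all made-up placeholders, not values from the post; something of this size, fully unrolled in fabric, is at least consistent with the few-dozen-cycle core figure quoted above.

```python
import numpy as np

# Illustrative sizes only; the post's actual topology and bit widths are not
# given in this summary.
N_IN, N_HID, N_OUT = 8, 16, 2

rng = np.random.default_rng(0)
W1 = rng.integers(-128, 128, size=(N_HID, N_IN), dtype=np.int8)
W2 = rng.integers(-128, 128, size=(N_OUT, N_HID), dtype=np.int8)
SHIFT = 7  # requantization: divide hidden accumulators by 2**SHIFT

def mlp_int8(x: np.ndarray) -> np.ndarray:
    """int8 features -> int8 hidden layer (ReLU) -> int32 logits."""
    acc1 = W1.astype(np.int32) @ x.astype(np.int32)            # int32 accumulators
    h = np.clip(np.maximum(acc1, 0) >> SHIFT, 0, 127).astype(np.int8)
    return W2.astype(np.int32) @ h.astype(np.int32)             # raw logits

features = rng.integers(-128, 128, size=N_IN, dtype=np.int8)
logits = mlp_int8(features)
decision = int(np.argmax(logits))   # placeholder encoding, e.g. 0 = hold, 1 = act
print(logits, decision)
```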
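The summary quotes cycle counts and wall-clock times separately but never states the fabric clock. The short calculation below is my own back-of-envelope check rather than anything from the post: it recovers the clock frequency each cycles/time pair would imply and the shell-to-core cycle ratio, which is what "the shell dominates" means in numbers.

```python
def implied_clock_mhz(cycles: int, seconds: float) -> float:
    """Clock frequency (in MHz) implied by a cycle count and an elapsed time."""
    return cycles / seconds / 1e6

# MLP core: 64 cycles in ~0.5 µs  ->  roughly 128 MHz
print(f"core clock  ~{implied_clock_mhz(64, 0.5e-6):.0f} MHz")

# Fabric shell: ~140k cycles in 1.0-1.3 ms  ->  roughly 108-140 MHz
print(f"shell clock ~{implied_clock_mhz(140_000, 1.3e-3):.0f}-"
      f"{implied_clock_mhz(140_000, 1.0e-3):.0f} MHz")

# Independent of the clock, the shell costs roughly 2,200x more cycles than the core.
print(f"shell/core cycle ratio ~{140_000 / 64:.0f}x")
```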
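For context on how the FPGA-lane numbers are typically collected, here is a minimal PYNQ timing harness of the kind the overlay comparison implies: load a bitstream, push one packet-sized buffer through an AXI DMA, and time the round trip from Python. The bitstream name (`full_lane.bit`), DMA instance name (`axi_dma_0`), and packet size are placeholders, not the post's actual artifacts, and this is a generic sketch rather than the author's harness.

```python
import time
import numpy as np
from pynq import Overlay, allocate

# Placeholder names: the bitstream file and DMA instance name are assumptions.
ol = Overlay("full_lane.bit")
dma = ol.axi_dma_0

N_WORDS = 16                        # one "packet" worth of 32-bit words
in_buf = allocate(shape=(N_WORDS,), dtype=np.uint32)
out_buf = allocate(shape=(N_WORDS,), dtype=np.uint32)
in_buf[:] = np.arange(N_WORDS, dtype=np.uint32)

def round_trip_us() -> float:
    """Send one buffer through the DMA and back, returning the time in microseconds."""
    t0 = time.perf_counter()
    dma.sendchannel.transfer(in_buf)
    dma.recvchannel.transfer(out_buf)
    dma.sendchannel.wait()
    dma.recvchannel.wait()
    return (time.perf_counter() - t0) * 1e6

# Warm up once (driver setup, caches), then take the median of many runs,
# since single-shot timings on a Linux PS are noisy.
round_trip_us()
samples = sorted(round_trip_us() for _ in range(100))
print(f"median round trip: {samples[len(samples) // 2]:.1f} µs")

in_buf.freebuffer()
out_buf.freebuffer()
```

Timing from Python like this deliberately includes the driver and PS-side control path, which is part of why the full-lane figure lands in the millisecond range even though the MLP core finishes in well under a microsecond.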