Measuring FPGA vs ARM on Pynq-Z2: Tiny MLP, Huge AXI/DMA Overhead
- #Machine Learning
- #High-Frequency Trading
- #FPGA
- A lab notebook detailing the implementation of a tiny MLP (multi-layer perceptron) in an FPGA datapath for high-frequency trading (HFT) applications.
- The project aims to create a minimal, measurable HFT datapath that interacts with real hardware, moving beyond Python backtests.
- Two-lane architecture: a Reflex lane (CPU) for hard-coded rules and an Inference lane (FPGA) for a tiny quantized MLP; a sketch of the quantized forward pass appears after this list.
- Original goal: Process packets through features → MLP → decision → packet out with timestamps at each stage.
- Current focus: Benchmarking System-on-Chip (SoC) performance on the Pynq-Z2, comparing the ARM reflex lane against the FPGA MLP lane.
- Key findings: The MLP math itself is tiny (64 cycles, ≈0.5 µs), but the fabric shell around it is large (≈140k cycles, ≈1.0–1.3 ms); the cycle-to-time conversion is worked through after this list.
- Performance comparison: The ARM reflex lane is roughly two orders of magnitude faster than the FPGA lane (ARM: ~16–20 µs, FPGA: ~3.4–3.6 ms with DMA); a minimal timing loop for the reflex lane is sketched after this list.
- Four overlays were used to isolate where the latency goes: Full, MLP-only, No-DMA, and Core probe; differencing them shows that the shell overhead dominates (a PYNQ/DMA round-trip timing sketch follows the list).
- Challenges identified: AXI interconnect, width converters, PL/PS boundary, and software control path add significant latency.
- Contrast with modern HFT setups: Real designs use streamlined pipelines, minimal FIFOs, and sideband controls to keep end-to-end latency low.
- Key takeaway: The project highlights the importance of disciplined datapath design, so that shell overhead does not dwarf the MLP's own latency.
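
As a point of reference for the Inference lane, here is a minimal NumPy sketch of the kind of int8 forward pass such a core computes. The layer sizes, shift amount, and weights below are placeholders, not the dimensions used in the actual overlay.

```python
import numpy as np

# Placeholder dimensions; the post does not state the real layer sizes.
N_IN, N_HID, N_OUT = 8, 16, 2

rng = np.random.default_rng(0)
W1 = rng.integers(-128, 127, size=(N_HID, N_IN), dtype=np.int8)
b1 = rng.integers(-128, 127, size=N_HID, dtype=np.int32)
W2 = rng.integers(-128, 127, size=(N_OUT, N_HID), dtype=np.int8)
b2 = rng.integers(-128, 127, size=N_OUT, dtype=np.int32)

def mlp_int8(x, shift=7):
    """Two-layer int8 MLP: accumulate in int32, ReLU, re-quantize by a right shift."""
    h = W1.astype(np.int32) @ x.astype(np.int32) + b1
    h = np.maximum(h, 0) >> shift                  # ReLU + cheap re-quantization
    h = np.clip(h, -128, 127).astype(np.int8)
    y = W2.astype(np.int32) @ h.astype(np.int32) + b2
    return int(np.argmax(y))                       # decision: index of the largest logit

features = rng.integers(-128, 127, size=N_IN, dtype=np.int8)
print("decision:", mlp_int8(features))
```

On the fabric, each multiply-accumulate maps to DSP slices and the whole pass fits in a few dozen cycles, which is why the core itself accounts for so little of the end-to-end latency.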
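The cycle counts convert to wall-clock time through the fabric clock. The post does not state the PL frequency, so the figures below assume 125 MHz, a common choice on the Pynq-Z2; at 100 MHz the shell figure comes out slightly higher, around 1.4 ms.

```latex
t = \frac{N_{\text{cycles}}}{f_{\text{clk}}}, \qquad
t_{\text{MLP}} = \frac{64}{125\,\text{MHz}} \approx 0.51\ \mu\text{s}, \qquad
t_{\text{shell}} = \frac{1.4 \times 10^{5}}{125\,\text{MHz}} \approx 1.12\ \text{ms}.
```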
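The reflex lane is just a hard-coded rule on the PS, so timing it amounts to timing the decision path. The rule and threshold below are placeholders, and a loop like this only measures the decision function itself, not the packet I/O that the ~16–20 µs figure presumably includes.

```python
import time

IMBALANCE_THRESHOLD = 3   # placeholder parameter; the real rule is not given in the post

def reflex_decision(bid_qty: int, ask_qty: int) -> int:
    # Hard-coded rule: fire when the book imbalance crosses the threshold.
    return 1 if (bid_qty - ask_qty) > IMBALANCE_THRESHOLD else 0

# Amortize timer overhead over many iterations, then report per-call latency.
N = 100_000
t0 = time.perf_counter()
for _ in range(N):
    reflex_decision(7, 2)
t1 = time.perf_counter()
print(f"reflex decision: {(t1 - t0) / N * 1e6:.3f} us per call")
```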
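The FPGA-lane numbers come from timing a full DMA round trip through an overlay from Python. The sketch below shows the shape of that measurement with the pynq library; the bitstream name (mlp_dma.bit), the DMA instance name (axi_dma_0), and the buffer sizes are assumptions, since they depend on the actual block design.

```python
import time
import numpy as np
from pynq import Overlay, allocate

N_IN, N_OUT = 8, 2                       # placeholder vector sizes

ol = Overlay("mlp_dma.bit")              # assumed bitstream name
dma = ol.axi_dma_0                       # assumed AXI DMA instance name in the block design

in_buf = allocate(shape=(N_IN,), dtype=np.int8)    # CMA buffers visible to the PL
out_buf = allocate(shape=(N_OUT,), dtype=np.int8)
in_buf[:] = np.arange(N_IN, dtype=np.int8)

t0 = time.perf_counter()
dma.sendchannel.transfer(in_buf)         # push features into the fabric
dma.recvchannel.transfer(out_buf)        # arm the receive side for the decision
dma.sendchannel.wait()
dma.recvchannel.wait()
t1 = time.perf_counter()

print(f"FPGA lane round trip: {(t1 - t0) * 1e3:.3f} ms")
```

Because perf_counter wraps the driver calls as well as the transfer itself, this measures the whole shell (software control path, interconnect, DMA, PL/PS crossing), which is exactly why it reports milliseconds while the core probe reports microseconds.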