Measuring FPGA vs ARM on Pynq-Z2: Tiny MLP, Huge AXI/DMA Overhead
- #Machine Learning
- #High-Frequency Trading
- #FPGA
- A lab notebook detailing the implementation of a tiny MLP (multi-layer perceptron) in an FPGA datapath for high-frequency trading (HFT) applications.
- The project aims to create a minimal, measurable HFT datapath that interacts with real hardware, moving beyond Python backtests.
- Two-lane architecture: a Reflex lane (CPU) for hard-coded rules and an Inference lane (FPGA) for a tiny quantized MLP; a sketch of the quantized forward pass appears after this list.
- Original goal: Process packets through features → MLP → decision → packet out with timestamps at each stage.
- Current focus: Benchmarking System-on-Chip (SoC) performance on the Pynq-Z2, comparing the ARM reflex lane against the FPGA MLP lane.
- Key findings: The MLP math itself is tiny (64 cycles, ≈0.5 µs), but the fabric shell around it is large (≈140k cycles, ≈1.0–1.3 ms); the cycle-to-time conversion is worked through after this list.
- Performance comparison: The ARM reflex lane is roughly two orders of magnitude faster than the FPGA lane (ARM: ~16–20 µs, FPGA: ~3.4–3.6 ms with DMA); a minimal timing loop for the reflex lane is sketched after this list.
- Four overlays were used to isolate where the latency goes: Full, MLP-only, No-DMA, and Core probe; differencing them shows that the shell overhead dominates (a PYNQ/DMA round-trip timing sketch follows the list).
- Challenges identified: AXI interconnect, width converters, PL/PS boundary, and software control path add significant latency.
- Contrast with modern HFT setups: Real designs use streamlined pipelines, minimal FIFOs, and sideband controls to keep end-to-end latency low.
- Key takeaway: The project highlights the importance of disciplined datapath design, so that shell overhead does not dwarf the MLP's own latency.
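
As a point of reference for the Inference lane, here is a minimal NumPy sketch of the kind of int8 forward pass such a core computes. The layer sizes, shift amount, and weights below are placeholders, not the dimensions used in the actual overlay.

```python
import numpy as np

# Placeholder dimensions; the post does not state the real layer sizes.
N_IN, N_HID, N_OUT = 8, 16, 2

rng = np.random.default_rng(0)
W1 = rng.integers(-128, 127, size=(N_HID, N_IN), dtype=np.int8)
b1 = rng.integers(-128, 127, size=N_HID, dtype=np.int32)
W2 = rng.integers(-128, 127, size=(N_OUT, N_HID), dtype=np.int8)
b2 = rng.integers(-128, 127, size=N_OUT, dtype=np.int32)

def mlp_int8(x, shift=7):
    """Two-layer int8 MLP: accumulate in int32, ReLU, re-quantize by a right shift."""
    h = W1.astype(np.int32) @ x.astype(np.int32) + b1
    h = np.maximum(h, 0) >> shift                  # ReLU + cheap re-quantization
    h = np.clip(h, -128, 127).astype(np.int8)
    y = W2.astype(np.int32) @ h.astype(np.int32) + b2
    return int(np.argmax(y))                       # decision: index of the largest logit

features = rng.integers(-128, 127, size=N_IN, dtype=np.int8)
print("decision:", mlp_int8(features))
```

On the fabric, each multiply-accumulate maps to DSP slices and the whole pass fits in a few dozen cycles, which is why the core itself accounts for so little of the end-to-end latency.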
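The cycle counts convert to wall-clock time through the fabric clock. The post does not state the PL frequency, so the figures below assume 125 MHz, a common choice on the Pynq-Z2; at 100 MHz the shell figure comes out slightly higher, around 1.4 ms.

```latex
t = \frac{N_{\text{cycles}}}{f_{\text{clk}}}, \qquad
t_{\text{MLP}} = \frac{64}{125\,\text{MHz}} \approx 0.51\ \mu\text{s}, \qquad
t_{\text{shell}} = \frac{1.4 \times 10^{5}}{125\,\text{MHz}} \approx 1.12\ \text{ms}.
```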
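The reflex lane is just a hard-coded rule on the PS, so timing it amounts to timing the decision path. The rule and threshold below are placeholders, and a loop like this only measures the decision function itself, not the packet I/O that the ~16–20 µs figure presumably includes.

```python
import time

IMBALANCE_THRESHOLD = 3   # placeholder parameter; the real rule is not given in the post

def reflex_decision(bid_qty: int, ask_qty: int) -> int:
    # Hard-coded rule: fire when the book imbalance crosses the threshold.
    return 1 if (bid_qty - ask_qty) > IMBALANCE_THRESHOLD else 0

# Amortize timer overhead over many iterations, then report per-call latency.
N = 100_000
t0 = time.perf_counter()
for _ in range(N):
    reflex_decision(7, 2)
t1 = time.perf_counter()
print(f"reflex decision: {(t1 - t0) / N * 1e6:.3f} us per call")
```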
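The FPGA-lane numbers come from timing a full DMA round trip through an overlay from Python. The sketch below shows the shape of that measurement with the pynq library; the bitstream name (mlp_dma.bit), the DMA instance name (axi_dma_0), and the buffer sizes are assumptions, since they depend on the actual block design.

```python
import time
import numpy as np
from pynq import Overlay, allocate

N_IN, N_OUT = 8, 2                       # placeholder vector sizes

ol = Overlay("mlp_dma.bit")              # assumed bitstream name
dma = ol.axi_dma_0                       # assumed AXI DMA instance name in the block design

in_buf = allocate(shape=(N_IN,), dtype=np.int8)    # CMA buffers visible to the PL
out_buf = allocate(shape=(N_OUT,), dtype=np.int8)
in_buf[:] = np.arange(N_IN, dtype=np.int8)

t0 = time.perf_counter()
dma.sendchannel.transfer(in_buf)         # push features into the fabric
dma.recvchannel.transfer(out_buf)        # arm the receive side for the decision
dma.sendchannel.wait()
dma.recvchannel.wait()
t1 = time.perf_counter()

print(f"FPGA lane round trip: {(t1 - t0) * 1e3:.3f} ms")
```

Because perf_counter wraps the driver calls as well as the transfer itself, this measures the whole shell (software control path, interconnect, DMA, PL/PS crossing), which is exactly why it reports milliseconds while the core probe reports microseconds.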