Hasty Briefs (beta)

Reverse-Engineering the RK3588 NPU: Hacking Limits to Run Vision Transformers

3 days ago
  • #NPU Optimization
  • #RK3588
  • #Edge AI
  • The Rockchip RK3588 NPU promises 6 TOPS of performance but fails to run modern AI models such as SmolVLM's Vision Transformer due to memory constraints.
  • Rockchip's standard computer-vision SDK (rknn-toolkit2) is optimized for older CNNs and crashes on the large attention matrices produced by Transformers.
  • A 'First Principles' approach was taken to reverse-engineer the NPU, understanding its hardware limitations rather than relying on black-box workarounds.
  • Error 0xe010 was identified as a memory overflow caused by the NPU's 32KB L1 SRAM scratchpad limit for vector operations.
  • A 'Nano-Tiling' algorithm was developed to manually slice large matrices into 32x32 tiles that fit the 32KB scratchpad.
  • A 'Poison Pill' technique was used to prevent the compiler from fusing the tiled operations back into large blocks.
  • SigLIP's accuracy collapse under quantization was addressed with a 'Sandwich' domain shift that preserves both large activation spikes and tiny signals.
  • A custom runtime scheduler was built to manage thousands of tiny operations, sharding the model into 26 binary files and orchestrating execution across the three NPU cores.
  • Results showed a 15x speedup (from 30s to under 1.8s per inference) while maintaining 0.999 accuracy relative to the FP32 reference.
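The 'Nano-Tiling' idea can be illustrated in plain NumPy: a large matmul is decomposed into 32x32 sub-products that are accumulated independently, so each working set stays small enough for a 32KB scratchpad. This is a hedged sketch of the general technique, not the author's NPU kernel; the `TILE` constant and function name are illustrative.

```python
import numpy as np

TILE = 32  # tile edge chosen so operand tiles stay within a 32KB scratchpad (assumption)

def tiled_matmul(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Compute a @ b by accumulating 32x32 tile products, mimicking how a large
    attention matmul can be sliced to fit a tiny on-chip scratchpad."""
    M, K = a.shape
    K2, N = b.shape
    assert K == K2, "inner dimensions must match"
    out = np.zeros((M, N), dtype=np.float32)
    for i in range(0, M, TILE):
        for j in range(0, N, TILE):
            for k in range(0, K, TILE):
                # each sub-product touches only three small tiles at a time
                out[i:i+TILE, j:j+TILE] += a[i:i+TILE, k:k+TILE] @ b[k:k+TILE, j:j+TILE]
    return out
```

The loop nest produces the same result as one big `a @ b` (up to float32 accumulation order), which is what makes the tiling transparent to the rest of the model.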
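The 'Poison Pill' bullet describes planting operations a graph optimizer will not fuse across, so the hand-tiled ops survive compilation. A minimal sketch over a hypothetical op-list representation (the op names and dict schema are illustrative, not rknn-toolkit2's actual IR):

```python
def insert_poison_pills(ops):
    """Place a fusion barrier after every tiled op so an optimizer that merges
    adjacent compatible ops cannot recombine the tiles into one large block."""
    guarded = []
    for op in ops:
        guarded.append(op)
        if op["type"] == "tiled_matmul":
            # hypothetical no-op the fuser treats as an opaque boundary
            guarded.append({"type": "identity_barrier"})
    return guarded
```

The barrier op must be semantically a no-op at runtime; its only job is to break the pattern the fuser matches on.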
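One plausible reading of the 'Sandwich' domain shift is a per-channel rescale wrapped around int8 quantization: spiky channels are compressed and tiny channels expanded before quantizing, then the shift is inverted afterward, so one shared int8 scale no longer crushes small signals. This is an assumed interpretation sketched in NumPy, not the article's exact method:

```python
import numpy as np

def sandwich_quantize(x: np.ndarray) -> np.ndarray:
    """x: (channels, n). Pre-shift each channel into [-1, 1], quantize to int8,
    then undo the shift ('sandwich' the quantizer between a scale and its inverse)."""
    pre = np.abs(x).max(axis=1, keepdims=True)   # per-channel magnitude (the pre-shift)
    pre[pre == 0] = 1.0                          # avoid dividing empty channels by zero
    y = x / pre                                  # all channels now share the same range
    scale = 1.0 / 127.0
    q = np.clip(np.round(y / scale), -128, 127).astype(np.int8)
    return q.astype(np.float32) * scale * pre    # dequantize and invert the pre-shift
```

Without the pre-shift, a channel of ~1e-3 values quantized against a ~1000-magnitude spike channel would round to all zeros; with it, both survive with small relative error.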
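The scheduler bullet (26 shards across three NPU cores) can be sketched as a shared work queue drained by one worker per core. This is a simplified model under stated assumptions: it ignores inter-shard data dependencies and stands in a placeholder for actual NPU inference; shard filenames and the worker function are hypothetical.

```python
import threading
import queue

NUM_CORES = 3                                        # RK3588 NPU core count
SHARDS = [f"shard_{i:02d}.bin" for i in range(26)]   # 26 model shards, per the article

work: "queue.Queue[tuple[int, str]]" = queue.Queue()
for idx, shard in enumerate(SHARDS):
    work.put((idx, shard))

results = {}  # shard index -> (core that ran it, shard name)

def npu_worker(core_id: int) -> None:
    """Model one NPU core pulling the next available shard until the queue drains."""
    while True:
        try:
            idx, shard = work.get_nowait()
        except queue.Empty:
            return
        results[idx] = (core_id, shard)  # stand-in for running inference on this core

threads = [threading.Thread(target=npu_worker, args=(c,)) for c in range(NUM_CORES)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

A real scheduler would additionally enforce execution order between dependent shards and pipeline tensor transfers, but the queue-per-core-pool shape is the core of the orchestration.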