Hasty Briefs (beta)

Going Faster Than Memcpy

13 days ago
  • #multithreading
  • #memory-optimization
  • #AVX-instructions
  • Profiling revealed that for large binary, unserialized messages (>512 kB), most of the execution time is spent in memory copying (memcpy) between process memory and shared memory.
  • Implemented faster memory-copy methods, including REP MOVSB (exploiting the Enhanced REP MOVSB, or ERMSB, CPU feature) and AVX instructions for vectorized copying (32 bytes at a time).
  • Explored non-temporal moves (_mm256_stream_load_si256, _mm256_stream_si256) that bypass the cache, improving performance for data too large to fit in it.
  • Introduced prefetching (_mm_prefetch) to make better use of the cache, fetching the next iteration's data while the current one is being copied.
  • Implemented loop unrolling (4x) to reduce branch statements, improving copy speed, especially for aligned data.
  • Developed a multithreaded copier (MTCopier) to parallelize memory copying across multiple threads, leveraging CPU core count for faster operations.
  • Created a Copier API to integrate custom memory copying logic, supporting both single-device and cross-device (e.g., CPU-GPU) scenarios.
  • Benchmarked performance using Google’s Benchmark, comparing methods (std::memcpy, REP MOVSB, AVX, prefetching) across data sizes (32kB to 64MB).
  • Found std::memcpy to be the best general-purpose solution, while custom methods (prefetching, AVX) excel in specific scenarios (large data, aligned memory).
  • Highlighted the risks of custom copiers with a warning ('Here be dragons') due to their complexity and alignment requirements.