Going Faster Than Memcpy
- #multithreading
- #memory-optimization
- #AVX-instructions
- Profiling revealed that large binary unserialized messages (>512 kB) spend most of their execution time in memory copying (memcpy) between process memory and shared memory.
- Implemented faster memory-copy methods, including REP MOVSB (using the Enhanced REP MOVSB, or ERMSB, hardware optimization) and AVX instructions for vectorized copying (32 bytes at a time).
- Explored non-temporal moves (_mm256_stream_load_si256, _mm256_stream_si256) that bypass the cache, improving performance for large data sizes.
- Introduced prefetching (_mm_prefetch) to make better use of the cache, fetching the data for the next iteration while the current one is being copied.
- Implemented 4x loop unrolling to reduce branch instructions, improving copy speed, especially for aligned data.
- Developed a multithreaded copier (MTCopier) to parallelize memory copying across multiple threads, leveraging CPU core count for faster operations.
- Created a Copier API to integrate custom memory copying logic, supporting both single-device and cross-device (e.g., CPU-GPU) scenarios.
- Benchmarked performance using the Google Benchmark library, comparing methods (std::memcpy, REP MOVSB, AVX, prefetching) across data sizes from 32 kB to 64 MB.
- Found std::memcpy to be the best general-purpose solution, while custom methods (prefetching, AVX) excel in specific scenarios (large data, aligned memory).
- Highlighted the risks of custom copiers with a warning ('Here be dragons') due to their complexity and alignment requirements.