Going Faster Than Memcpy
- #multithreading
- #memory-optimization
- #AVX-instructions
- Profiling revealed that large binary unserialized messages (>512 kB) spend most of their execution time in memory copying (memcpy) between process memory and shared memory.
- Implemented faster memory-copy methods, including REP MOVSB (using the Enhanced REP MOVSB, or ERMSB, hardware optimization) and AVX instructions for vectorized copying (32 bytes at a time).
- Explored non-temporal moves (_mm256_stream_load_si256, _mm256_stream_si256) that bypass the cache, improving performance for large data sizes.
- Introduced prefetching (_mm_prefetch) to make better use of the cache, fetching the data for the next iteration while the current one is being copied.
- Implemented 4x loop unrolling to reduce branch instructions, improving copy speed, especially for aligned data.
- Developed a multithreaded copier (MTCopier) to parallelize memory copying across multiple threads, leveraging CPU core count for faster operations.
- Created a Copier API to integrate custom memory copying logic, supporting both single-device and cross-device (e.g., CPU-GPU) scenarios.
- Benchmarked performance using the Google Benchmark library, comparing methods (std::memcpy, REP MOVSB, AVX, prefetching) across data sizes from 32 kB to 64 MB.
- Found std::memcpy to be the best general-purpose solution, while custom methods (prefetching, AVX) excel in specific scenarios (large data, aligned memory).
- Highlighted the risks of custom copiers with a warning ('Here be dragons') due to their complexity and alignment requirements.