Flash-KMeans: Fast and Memory-Efficient Exact K-Means

2 months ago

Flash-KMeans is a fast, memory-efficient exact K-Means algorithm designed for modern GPU workloads.
Existing GPU implementations of K-Means are limited by two system constraints: IO bottlenecks in the assignment stage and hardware-level atomic write contention in the centroid update stage.
Flash-KMeans proposes two core innovations: FlashAssign, which bypasses intermediate memory materialization, and sort-inverse update, which transforms high-contention atomic scatters into localized reductions.
The algorithm also includes system-level co-designs, such as chunked-stream overlap and cache-aware compile heuristics, to make it practical to deploy.
Evaluations show Flash-KMeans achieves up to 17.9× speedup over baselines and outperforms industry-standard libraries like cuML and FAISS by 33× and over 200×, respectively.
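
The two ideas named above, FlashAssign and the sort-inverse update, can be illustrated with a small CPU sketch. This is not the paper's GPU kernel code. It is a NumPy analogy based on one reading of the summary: the assignment step keeps only a running best distance per point instead of materializing the full N x K distance matrix, and the update step sorts points by cluster label so each centroid becomes a reduction over a contiguous segment rather than a scatter-add. The function names, the chunk parameter, and the handling of empty clusters are all assumptions made for illustration.

```python
import numpy as np

def assign_chunked(points, centroids, chunk=256):
    """Nearest-centroid assignment without building the full N x K matrix.

    Hypothetical helper: processes centroids in blocks and keeps only a
    running (best distance, best index) pair per point.
    """
    n = points.shape[0]
    best_dist = np.full(n, np.inf)
    best_idx = np.zeros(n, dtype=np.int64)
    c_norm = (centroids ** 2).sum(axis=1)
    rows = np.arange(n)
    for start in range(0, centroids.shape[0], chunk):
        block = centroids[start:start + chunk]
        # ||x - c||^2 = ||x||^2 - 2 x.c + ||c||^2; the ||x||^2 term is the
        # same for every centroid, so it is dropped from the comparison.
        d = c_norm[start:start + chunk][None, :] - 2.0 * (points @ block.T)
        local = d.argmin(axis=1)
        local_d = d[rows, local]
        better = local_d < best_dist
        best_dist[better] = local_d[better]
        best_idx[better] = start + local[better]
    return best_idx

def update_sorted(points, labels, old_centroids):
    """Centroid update via sort + segmented reduction instead of scatter-add.

    Assumption: clusters that receive no points keep their old centroid.
    """
    order = np.argsort(labels, kind="stable")
    sorted_labels = labels[order]
    sorted_points = points[order]
    # After sorting, each cluster's points form one contiguous segment;
    # `starts` holds the first row of every non-empty segment.
    present, starts = np.unique(sorted_labels, return_index=True)
    sums = np.add.reduceat(sorted_points, starts, axis=0)
    counts = np.diff(np.append(starts, len(sorted_labels)))
    new_centroids = old_centroids.copy()
    new_centroids[present] = sums / counts[:, None]
    return new_centroids

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    pts = rng.normal(size=(1000, 16))
    cen = rng.normal(size=(50, 16))
    for _ in range(5):  # a few Lloyd iterations
        lab = assign_chunked(pts, cen, chunk=16)
        cen = update_sorted(pts, lab, cen)
    print(cen.shape)
```

On a GPU the payoff comes from memory traffic and contention, which NumPy does not model; the sketch only shows that both reformulations produce the same assignments and centroids as the straightforward version.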
