Flash-KMeans: Fast and Memory-Efficient Exact K-Means
2 days ago
- #K-Means
- #GPU Optimization
- #Machine Learning
- Flash-KMeans is introduced as a fast and memory-efficient exact K-Means algorithm for modern GPU workloads.
- Existing GPU implementations of K-Means are limited by system-level constraints: the assignment stage is IO-bound, and the centroid update stage suffers from hardware-level atomic write contention.
- Flash-KMeans proposes two core innovations: FlashAssign, which bypasses intermediate memory materialization, and sort-inverse update, which transforms high-contention atomic scatters into localized reductions.
- The algorithm also includes system co-designs, such as chunked-stream overlap and cache-aware compile heuristics, to make it practical to deploy.
- Evaluations show Flash-KMeans achieves up to 17.9× speedup over baselines and outperforms the industry-standard libraries cuML and FAISS by 33× and over 200×, respectively.
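
The two core ideas named above can be illustrated with a small CPU sketch in plain Python. This is not the authors' GPU implementation. It assumes that "bypassing intermediate memory materialization" means never storing the full point-by-centroid distance table, and that "sort-inverse update" means grouping points by cluster through a sort so that each centroid is computed from one contiguous segment. The function names (`assign_streaming`, `update_sorted`) and the `tile` parameter are hypothetical and chosen only for illustration.

```python
from itertools import groupby


def sq_dist(p, c):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(p, c))


def assign_streaming(points, centroids, tile=2):
    """Nearest-centroid labels without storing an N x K distance table.

    Centroids are visited one tile at a time, and only a running best
    (distance, index) pair per point is kept. In this CPU sketch the tile
    loop only mirrors the structure; it has no performance effect.
    """
    best_dist = [float("inf")] * len(points)
    best_idx = [-1] * len(points)
    for start in range(0, len(centroids), tile):
        for k in range(start, min(start + tile, len(centroids))):
            for i, p in enumerate(points):
                d = sq_dist(p, centroids[k])
                if d < best_dist[i]:  # strict "<" keeps the lowest index on ties
                    best_dist[i], best_idx[i] = d, k
    return best_idx


def update_sorted(points, labels, centroids):
    """Centroid update via sort plus contiguous segment reduction.

    Sorting point indices by label makes each cluster's members contiguous,
    so each centroid is a local mean over one segment rather than many
    scattered accumulations into a shared buffer.
    """
    order = sorted(range(len(points)), key=labels.__getitem__)
    # Assumption for this sketch: an empty cluster keeps its old centroid.
    new_centroids = [list(c) for c in centroids]
    for k, group in groupby(order, key=labels.__getitem__):
        members = [points[i] for i in group]
        dim = len(members[0])
        new_centroids[k] = [
            sum(m[d] for m in members) / len(members) for d in range(dim)
        ]
    return new_centroids


if __name__ == "__main__":
    pts = [[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]]
    cents = [[0.0, 0.0], [10.0, 10.0]]
    labels = assign_streaming(pts, cents)
    print(labels)
    print(update_sorted(pts, labels, cents))
```

On a GPU, the same structure would correspond to keeping the running minimum in fast on-chip memory and replacing per-point atomic adds with per-segment reductions; how Flash-KMeans actually realizes this is described in the paper, not in this sketch.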