Flash-KMeans: Fast and Memory-Efficient Exact K-Means

2 days ago
  • #K-Means
  • #GPU Optimization
  • #Machine Learning

  • Flash-KMeans is introduced as a fast and memory-efficient exact K-Means algorithm for modern GPU workloads.
  • Existing GPU implementations of K-Means are limited by system-level constraints: the assignment stage is IO-bound, and the centroid update stage suffers from hardware-level atomic write contention.
  • Flash-KMeans proposes two core innovations: FlashAssign, which bypasses intermediate memory materialization, and sort-inverse update, which transforms high-contention atomic scatters into localized reductions.
  • The algorithm is paired with system-level co-designs, such as chunked-stream overlap and cache-aware compile heuristics, that make it practical to deploy.
  • Evaluations show Flash-KMeans achieves up to 17.9× speedup over baselines and outperforms industry-standard libraries like cuML and FAISS by 33× and over 200×, respectively.
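
The two ideas named above, assignment without a materialized intermediate and a sort-based centroid update, can be illustrated with a small CPU sketch in NumPy. This is not the paper's GPU implementation. The function names `assign_chunked` and `update_sorted` and the `chunk_size` parameter are hypothetical. The sketch assumes the intermediate being avoided is the full point-by-centroid distance matrix, and it keeps the previous centroid for any empty cluster (also an assumption).

```python
import numpy as np

def assign_chunked(points, centroids, chunk_size=1024):
    """Nearest-centroid labels without building the full N x K distance matrix.

    Illustrative only: each chunk's scores live in a small temporary buffer.
    """
    n = points.shape[0]
    labels = np.empty(n, dtype=np.int64)
    c_sq = np.sum(centroids ** 2, axis=1)  # ||c||^2 for every centroid, shape (K,)
    for start in range(0, n, chunk_size):
        chunk = points[start:start + chunk_size]
        # ||x - c||^2 = ||x||^2 - 2 x.c + ||c||^2; the ||x||^2 term is constant
        # per row, so it can be dropped without changing the argmin.
        scores = c_sq[None, :] - 2.0 * (chunk @ centroids.T)
        labels[start:start + chunk_size] = np.argmin(scores, axis=1)
    return labels

def update_sorted(points, labels, old_centroids):
    """Centroid update via sort + contiguous segment sums instead of scatter-adds."""
    order = np.argsort(labels, kind="stable")  # group points by cluster id
    sorted_labels = labels[order]
    sorted_points = points[order]
    # For a sorted array, the first-occurrence indices are the segment starts.
    present, starts, counts = np.unique(
        sorted_labels, return_index=True, return_counts=True
    )
    # Sum each contiguous segment; no two clusters write to the same output row.
    sums = np.add.reduceat(sorted_points, starts, axis=0)
    centroids = old_centroids.copy()  # assumption: empty clusters keep old value
    centroids[present] = sums / counts[:, None]
    return centroids

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(2000, 8))
    C = X[:16].copy()
    for _ in range(5):
        labels = assign_chunked(X, C, chunk_size=256)
        C = update_sorted(X, labels, C)
    print(C.shape)
```

Sorting first means every cluster's points sit in one contiguous run, so each centroid is produced by a single local reduction rather than many concurrent writes to the same accumulator. On a GPU that is the property that would relieve atomic contention; here it only demonstrates that the result matches an ordinary per-cluster mean.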