Hasty Briefs (beta)

Generalized K-Means Clustering

6 months ago
  • #spark
  • #clustering
  • #machine-learning
  • The project follows security best practices with automated dependency updates via dependabot.yml.
  • Version 0.6.0 introduces a modern, RDD-free DataFrame-native API with Spark ML integration.
  • Generalizes K-Means to multiple Bregman divergences and advanced variants like Bisecting, X-Means, Soft/Fuzzy, Streaming, K-Medians, and K-Medoids.
  • Supports multiple divergences: Squared Euclidean, KL, Itakura–Saito, L1/Manhattan, Generalized-I, and Logistic-loss.
  • Tested on tens of millions of points in 700+ dimensions.
  • The CI pipeline covers linting, a build matrix, a test matrix, and security scanning.
  • Provides detailed diagnostics for tuning performance and avoiding OOM errors.
  • Validation runs automatically at fit time for each divergence and reports actionable error messages.
  • Models implement DefaultParamsWritable/Readable for persistence across Spark versions.
  • The legacy RDD API is kept for backward compatibility, but new development should use the DataFrame API.
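
To make the "generalized" part concrete, here is a minimal single-machine sketch in plain Python of Lloyd-style K-Means driven by a pluggable Bregman divergence. It is not the project's Spark implementation or API: every function name below (squared_euclidean, generalized_i, itakura_saito, bregman_kmeans) is hypothetical and chosen for illustration only. The sketch relies on one known property of Bregman divergences: when the center is the second argument, the arithmetic mean of a cluster minimizes the total divergence, so only the assignment step changes between divergences. That property does not hold for L1/Manhattan, which is why the sketch leaves it out. Generalized-I and Itakura–Saito assume strictly positive coordinates.

```python
import math


def squared_euclidean(x, y):
    """Squared Euclidean distance, the divergence used by classic K-Means."""
    return sum((a - b) ** 2 for a, b in zip(x, y))


def generalized_i(x, y):
    """Generalized I-divergence; assumes strictly positive coordinates."""
    return sum(a * math.log(a / b) - a + b for a, b in zip(x, y))


def itakura_saito(x, y):
    """Itakura-Saito divergence; assumes strictly positive coordinates."""
    return sum(a / b - math.log(a / b) - 1.0 for a, b in zip(x, y))


def mean(points):
    """Coordinate-wise arithmetic mean of a non-empty list of points."""
    n = len(points)
    return [sum(col) / n for col in zip(*points)]


def bregman_kmeans(points, init_centers, divergence, iters=20):
    """Lloyd-style iterations with a pluggable divergence.

    Assignment: each point goes to the center with the smallest
    divergence(point, center).
    Update: each center becomes the arithmetic mean of its points, which
    is the minimizer for Bregman divergences with the center as the
    second argument.
    """
    centers = [list(c) for c in init_centers]
    k = len(centers)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: divergence(p, centers[j]))
            clusters[nearest].append(p)
        # An empty cluster keeps its previous center.
        centers = [mean(c) if c else centers[j] for j, c in enumerate(clusters)]
    return centers


if __name__ == "__main__":
    data = [[1.0, 1.0], [1.2, 0.8], [0.8, 1.2],
            [10.0, 10.0], [10.5, 9.5], [9.5, 10.5]]
    for d in (squared_euclidean, generalized_i, itakura_saito):
        print(d.__name__, bregman_kmeans(data, [data[0], data[3]], d))
```

Initial centers are passed in explicitly so the sketch stays deterministic; a real implementation would add an initialization strategy, a convergence check, and the kind of per-divergence input validation the summary mentions.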