Generalized K-Means Clustering

6 months ago

The project follows security best practices with automated dependency updates via dependabot.yml.
Version 0.6.0 introduces a modern, RDD-free DataFrame-native API with Spark ML integration.
The library generalizes K-Means to multiple Bregman divergences and to advanced variants such as Bisecting, X-Means, Soft/Fuzzy, Streaming, K-Medians, and K-Medoids.
Supports multiple divergences: Squared Euclidean, KL, Itakura–Saito, L1/Manhattan, Generalized-I, and Logistic-loss.
Tested on tens of millions of points in 700+ dimensions.
A comprehensive CI pipeline covers linting, a build matrix, a test matrix, and security scanning.
Provides detailed diagnostics for tuning performance and avoiding OOM errors.
Input data is validated automatically at fit time against the chosen divergence, with actionable error messages.
Models implement DefaultParamsWritable/Readable for persistence across Spark versions.
The legacy RDD API is kept for backward compatibility, but new development should use the DataFrame API.
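
To make the idea of K-Means over Bregman divergences and fit-time validation concrete, here is a minimal pure-Python sketch. It is not the library's code or API: the function names (`validate`, `lloyd`, and the divergence helpers) are hypothetical, the formulas are the textbook definitions, and the strict-positivity check is an assumption about what a domain check might look like. The sketch relies on a standard result: for any Bregman divergence D(x, c), the center c that minimizes the total divergence to a cluster's points is their arithmetic mean, so only the assignment step changes between divergences. That result does not cover L1/Manhattan, which is why K-Medians uses a different update.

```python
import math

# Textbook divergence definitions D(x, y) on lists of floats (illustrative only).
def squared_euclidean(x, y):
    return sum((a - b) ** 2 for a, b in zip(x, y))

def generalized_i(x, y):
    # Generalized I-divergence (unnormalized KL); assumes strictly positive entries.
    return sum(a * math.log(a / b) - a + b for a, b in zip(x, y))

def itakura_saito(x, y):
    # Itakura-Saito divergence; assumes strictly positive entries.
    return sum(a / b - math.log(a / b) - 1.0 for a, b in zip(x, y))

def validate(points, divergence):
    """Hypothetical fit-time check: reject data outside the divergence's domain."""
    if divergence in (generalized_i, itakura_saito):
        for i, p in enumerate(points):
            if any(v <= 0.0 for v in p):
                raise ValueError(
                    f"point {i} has a non-positive value; {divergence.__name__} "
                    "needs strictly positive features (smooth or shift the data, "
                    "or choose squared_euclidean)"
                )

def mean(points):
    # Component-wise arithmetic mean of a non-empty list of vectors.
    n = len(points)
    return [sum(col) / n for col in zip(*points)]

def lloyd(points, centers, divergence, iterations=10):
    """Lloyd-style iteration: assign by divergence, update centers by the mean."""
    validate(points, divergence)
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)),
                          key=lambda j: divergence(p, centers[j]))
            clusters[nearest].append(p)
        # An empty cluster keeps its previous center.
        centers = [mean(c) if c else old for c, old in zip(clusters, centers)]
    return centers

points = [[1.0, 1.0], [1.2, 0.8], [5.0, 5.0], [5.2, 4.8]]
initial = [[1.0, 1.0], [5.0, 5.0]]
centers_euclid = lloyd(points, initial, squared_euclidean)
centers_gi = lloyd(points, initial, generalized_i)
```

Swapping the `divergence` argument is the only change needed to cluster under a different geometry in this sketch, and the validation step fails early with a message that names the offending point and suggests a remedy.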
