Generalized K-Means Clustering
6 months ago
- #spark
- #clustering
- #machine-learning
- The project follows security best practices with automated dependency updates via dependabot.yml.
- Version 0.6.0 introduces a modern, RDD-free DataFrame-native API with Spark ML integration.
- Generalizes K-Means to multiple Bregman divergences and advanced variants like Bisecting, X-Means, Soft/Fuzzy, Streaming, K-Medians, and K-Medoids.
- Supports multiple divergences: Squared Euclidean, KL, Itakura–Saito, L1/Manhattan, Generalized-I, and Logistic-loss.
- Tested on tens of millions of points in 700+ dimensions.
- The CI pipeline covers linting, a build matrix, a test matrix, and security scanning.
- Provides detailed diagnostics for tuning performance and avoiding OOM errors.
- Inputs are validated automatically at fit time for the chosen divergence, with actionable error messages.
- Models implement DefaultParamsWritable and DefaultParamsReadable for persistence across Spark versions.
- The legacy RDD API is kept for backward compatibility, but new development should use the DataFrame API.
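
To illustrate how one clustering loop can serve several divergences, here is a minimal pure-Python sketch. It is not the library's API: the names DIVERGENCES, REQUIRES_POSITIVE, validate, assign, and update are hypothetical and chosen only for illustration. The divergence formulas are the standard textbook definitions, and the positivity check is an assumption about what fit-time validation might look like, based on the fact that the Generalized-I and Itakura-Saito divergences are only defined for strictly positive inputs.

```python
import math

def squared_euclidean(x, y):
    # Bregman divergence generated by phi(x) = ||x||^2
    return sum((a - b) ** 2 for a, b in zip(x, y))

def generalized_i(x, y):
    # Generalized I-divergence; equals KL when both vectors sum to 1
    return sum(a * math.log(a / b) - a + b for a, b in zip(x, y))

def itakura_saito(x, y):
    # Itakura-Saito divergence, defined for strictly positive inputs
    return sum(a / b - math.log(a / b) - 1.0 for a, b in zip(x, y))

# Hypothetical registry keys, not the project's parameter values.
DIVERGENCES = {
    "squaredEuclidean": squared_euclidean,
    "generalizedI": generalized_i,
    "itakuraSaito": itakura_saito,
}
REQUIRES_POSITIVE = {"generalizedI", "itakuraSaito"}

def validate(points, name):
    """Fail early with an actionable message if the data does not fit the divergence."""
    if name not in DIVERGENCES:
        raise ValueError(f"unknown divergence {name!r}; choose one of {sorted(DIVERGENCES)}")
    if name in REQUIRES_POSITIVE:
        for i, p in enumerate(points):
            if any(v <= 0 for v in p):
                raise ValueError(
                    f"point {i} has a non-positive component, but {name!r} needs "
                    "strictly positive features; shift or smooth the data first"
                )

def assign(points, centers, name):
    """Assignment step: index of the center with the smallest divergence d(point, center)."""
    d = DIVERGENCES[name]
    return [min(range(len(centers)), key=lambda k: d(p, centers[k])) for p in points]

def update(points, labels, centers):
    """Update step: arithmetic mean per cluster; an empty cluster keeps its old center."""
    new_centers = []
    for k, old in enumerate(centers):
        members = [p for p, lab in zip(points, labels) if lab == k]
        if not members:
            new_centers.append(list(old))
            continue
        new_centers.append(
            [sum(p[j] for p in members) / len(members) for j in range(len(old))]
        )
    return new_centers

if __name__ == "__main__":
    pts = [[1.0, 1.0], [3.0, 1.0], [8.0, 9.0], [10.0, 9.0]]
    ctrs = [[1.0, 1.0], [9.0, 9.0]]
    validate(pts, "itakuraSaito")
    labels = assign(pts, ctrs, "itakuraSaito")
    print(labels, update(pts, labels, ctrs))
```

The update step does not depend on the chosen divergence because, for any Bregman divergence, the arithmetic mean of a cluster minimizes the total divergence from its points to the center. L1/Manhattan is the exception: it is not a Bregman divergence, and its minimizer is the component-wise median, which is the K-Medians variant.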