Text classification with Python 3.14's ZSTD module
3 months ago
- #Python
- #Zstandard
- #Text Classification
- Python 3.14 added the `compression.zstd` module to the standard library, providing bindings for Zstandard (Zstd), the compression algorithm developed at Facebook.
- Zstd supports incremental compression, which makes it well suited to compression-based text classification.
- The method uses compressed length as a practical proxy for Kolmogorov complexity: a text tends to compress best when the compressor has already seen similar text. A 2023 paper revisited this idea.
- Zstd's incremental API keeps this efficient: each class has its own compressor, which is rebuilt as new training text for that class arrives.
- A `ZstdClassifier` class compresses the input with each class-specific compressor and predicts the class whose compressed output is smallest.
- Parameters like window size, compression level, and rebuild frequency can be tuned for performance and accuracy.
- Benchmarking on the 20 Newsgroups dataset showed 91% accuracy in under 2 seconds, outperforming previous LZW-based methods.
- Against a TF-IDF + logistic regression baseline, the Zstd classifier was competitive: slightly less accurate, but faster to run.
- The simplicity and maintainability of the Zstd-based classifier make it an attractive option for certain applications.
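
The approach summarized above can be illustrated with a minimal sketch. It is not the original post's implementation: the helper `make_sizer`, the attribute names, the default values, and the training sentences are all hypothetical, and the `zlib` fallback is an assumption added only so the sketch also runs on interpreters older than 3.14. Each class keeps a sliding window of its recent text, a compressor primed with that window is rebuilt every `rebuild_every` updates, and classification picks the class whose compressor produces the shortest output.

```python
try:
    # Python 3.14+: standard-library Zstandard bindings.
    from compression.zstd import ZstdCompressor, ZstdDict

    def make_sizer(history: bytes, level: int):
        """Return a function measuring compressed size when primed with history."""
        # is_raw=True treats the class text as a raw-content dictionary.
        # Assumption: history is reasonably long; very short dictionaries
        # may be rejected by the library.
        comp = ZstdCompressor(level=level, zstd_dict=ZstdDict(history, is_raw=True))
        # FLUSH_FRAME ends the frame, so every call is measured on its own.
        return lambda data: len(comp.compress(data, mode=ZstdCompressor.FLUSH_FRAME))

except ImportError:
    # Assumption: portability fallback only, using zlib's preset dictionary.
    import zlib

    def make_sizer(history: bytes, level: int):
        """Return a function measuring compressed size when primed with history."""
        base = zlib.compressobj(level, zdict=history[-32768:])

        def size(data: bytes) -> int:
            comp = base.copy()  # copy so the primed state is never consumed
            return len(comp.compress(data) + comp.flush())

        return size


class ZstdClassifier:
    """Hypothetical sketch: predict the class whose compressor shrinks the text most."""

    def __init__(self, window: int = 1 << 20, level: int = 3, rebuild_every: int = 5):
        self.window = window                # max bytes of history kept per class
        self.level = level                  # compression level
        self.rebuild_every = rebuild_every  # updates between compressor rebuilds
        self.buffers = {}                   # label -> bytearray of recent text
        self.sizers = {}                    # label -> size-measuring function
        self.pending = {}                   # label -> updates since last rebuild

    def learn(self, text: str, label: str) -> None:
        buf = self.buffers.setdefault(label, bytearray())
        buf += text.encode("utf-8") + b"\n"
        del buf[: -self.window]             # keep only the newest `window` bytes
        self.pending[label] = self.pending.get(label, 0) + 1
        if label not in self.sizers or self.pending[label] >= self.rebuild_every:
            self.sizers[label] = make_sizer(bytes(buf), self.level)
            self.pending[label] = 0

    def classify(self, text: str) -> str:
        data = text.encode("utf-8")
        # The smallest compressed output marks the best-matching class.
        return min(self.sizers, key=lambda label: self.sizers[label](data))


# Tiny demo with made-up sentences; rebuild on every update so nothing is stale.
clf = ZstdClassifier(rebuild_every=1)
clf.learn("The striker scored a late goal and the team won the football match.", "sports")
clf.learn("The goalkeeper saved a penalty in extra time.", "sports")
clf.learn("Simmer the tomato sauce, then add garlic, basil and olive oil.", "cooking")
clf.learn("Bake the bread in a hot oven until the crust is golden.", "cooking")

sports_guess = clf.classify("the team won the football match")
cooking_guess = clf.classify("add garlic, basil and olive oil")
```

A larger `rebuild_every` trades freshness for speed: texts learned since the last rebuild do not influence predictions until the next rebuild, which is why the demo sets it to 1.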