Removing newlines in FASTA file increases ZSTD compression ratio by 10x
3 days ago
- #Zstandard
- #genomics
- #compression
- Zstandard's long range mode (--long) improves compression by enlarging the match search window to at least 128 MiB, so repeats that lie far apart in the input can still be found.
- Originally released in 2017 with Zstandard 1.3.2, the feature had performance overheads but has since been optimized.
- The 661k dataset, a microbial genomics benchmark, was used to test compression performance.
- The specialized compressor MiniPhy achieves a compression ratio (CR) of 91, whereas Zstandard with default settings reaches a CR of only 3.
- Removing newlines from the FASTA sequences more than tripled Zstandard's CR, from 3 to 11, and raising the window size to 2 GiB nearly tripled it again, to 31.
- With --long=31 (a 2 GiB window), Zstandard's CR of 31 is about a third of MiniPhy's 91, which puts it within an order of magnitude of slower, state-of-the-art methods.
- The long range mode is most effective when each sequence sits on a single uninterrupted line, since line breaks inside a sequence cut long repeated matches short.
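
The newline-removal step described above can be sketched in a few lines of Python. This is a minimal illustration, not the tooling used for the benchmark: the function name `unwrap_fasta` is hypothetical, and it assumes a plain-text FASTA input in which header lines start with `>` and every other non-empty line is part of the preceding record's sequence.

```python
from typing import Iterable, Iterator


def unwrap_fasta(lines: Iterable[str]) -> Iterator[str]:
    """Yield FASTA lines with each record's sequence joined onto one line.

    Hypothetical helper for illustration; header lines are passed through
    unchanged and blank lines are dropped.
    """
    seq_parts = []  # wrapped sequence chunks of the current record
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line.startswith(">"):
            # A new header ends the previous record: flush its sequence.
            if seq_parts:
                yield "".join(seq_parts) + "\n"
                seq_parts = []
            yield line + "\n"
        elif line:
            seq_parts.append(line)
    # Flush the last record, which has no following header.
    if seq_parts:
        yield "".join(seq_parts) + "\n"


# Small in-memory demonstration with two wrapped records.
wrapped = [">seq1 demo\n", "ACGT\n", "TTGA\n", ">seq2\n", "GG\n", "CC\n"]
unwrapped = list(unwrap_fasta(wrapped))
print("".join(unwrapped), end="")
```

On a real file, the same function could be fed an open file handle and its output written to a new file, which is then compressed with `zstd --long=31`. Note that a window larger than the 128 MiB default also has to be allowed at decompression time, for example by passing `--long=31` to `zstd -d`.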