Hasty Briefs (beta)

Removing newlines in FASTA file increases ZSTD compression ratio by 10x

3 days ago
  • #Zstandard
  • #genomics
  • #compression
  • Zstandard's long range mode (--long) improves compression by increasing the match-search window to at least 128 MiB, so repeats much farther back in the input can be found.
  • The feature first shipped in 2017 with Zstandard 1.3.2; it initially carried a performance overhead but has since been optimized.
  • The 661k dataset, a microbial genomics benchmark, was used to test compression performance.
  • The specialist tool MiniPhy achieves a compression ratio (CR) of 91, while Zstandard with default settings achieves a CR of only 3.
  • Removing newlines from the FASTA sequences more than tripled Zstandard's CR, from 3 to 11, and increasing the window size to 2 GiB nearly tripled it again to 31, roughly 10x the default result.
  • With --long=31 (a 2 GiB window), Zstandard achieved a CR within an order of magnitude of slower, state-of-the-art methods.
  • The long range mode is most effective with uninterrupted single-line sequences.
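
The newline-removal step described above can be sketched in a few lines of Python. This is a minimal illustration, not the original author's script: the function names unwrap_fasta and window_bytes are hypothetical, and the sketch assumes a plain-text FASTA input in which header lines start with ">" and sequence lines may be wrapped at any width. The helper window_bytes only shows the arithmetic behind the window sizes, assuming the value passed to --long is the base-2 logarithm of the window size in bytes.

```python
import io


def unwrap_fasta(lines):
    """Yield FASTA lines with each record's sequence joined onto one line.

    Hypothetical helper: header lines (starting with '>') pass through
    unchanged; wrapped sequence lines are concatenated.
    """
    seq_parts = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if seq_parts:
                yield "".join(seq_parts) + "\n"
                seq_parts = []
            yield line + "\n"
        elif line:
            # Sequence fragment: collect it, dropping the line break.
            seq_parts.append(line)
    if seq_parts:
        yield "".join(seq_parts) + "\n"


def window_bytes(window_log):
    """Window size in bytes for a given log2 window value (assumption)."""
    return 1 << window_log


if __name__ == "__main__":
    wrapped = io.StringIO(">seq1\nACGT\nACGT\n>seq2\nGG\nCC\n")
    print("".join(unwrap_fasta(wrapped)), end="")
    print(window_bytes(27) // 2**20, "MiB")  # size for a window log of 27
    print(window_bytes(31) // 2**30, "GiB")  # size for --long=31
```

The unwrapped output could then be piped into zstd with --long=31. Note that, per the zstd manual, streams compressed with a window larger than 128 MiB also need --long=31 (or a sufficient --memory limit) at decompression time.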