A Fake Shell for Pangenomics
20 hours ago
- #high-performance
- #vectorized-interpreter
- #pangenomics
- FlatGFA is a zero-copy pangenomics toolkit whose in-memory and on-disk formats are identical, so files can be loaded directly with mmap; in some cases it is up to thousands of times faster than odgi.
- The two existing options for composing workflows both have limits: shell scripts over the CLI can only connect steps through files and pipes, which adds overhead, while the Rust API is idiosyncratic and complex for biologists.
- Flash is a 'fake shell' built as a vectorized interpreter: it parses shell syntax, translates it to an IR, and opportunistically replaces CLI commands with internal Rust functions, which avoids I/O and enables optimizations.
- Flash can mix external resources (files and pipes) with internal data structures (e.g., GFA stores), so unmodified shell scripts run faster through deduplicated loads, fewer intermediate files, and efficient binary formats.
- Flash's optimizations include avoiding round trips through files, replacing instructions with cheaper alternatives, deduplicating loads, and using pre-converted binary formats; in an example workflow, they yield a 28× speedup over odgi.
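
To make two of the optimizations above concrete (avoiding round trips through files and deduplicating loads), here is a minimal sketch of a pass over a toy IR. Everything in it is an assumption made for illustration: the `Inst` enum, the `optimize` function, the slot numbering, the operation names, and the file names are hypothetical and are not Flash's actual IR or API. The sketch also assumes each slot is written exactly once and that every command in the script has an internal implementation, so no external program needs the intermediate file.

```rust
use std::collections::{HashMap, HashSet};

/// Hypothetical IR instruction (illustrative only, not Flash's real IR).
/// A "slot" names an in-memory graph store; each slot is assumed to be written once.
#[derive(Debug, Clone, PartialEq)]
enum Inst {
    /// Parse the file at `path` into slot `dst`.
    Load { path: String, dst: usize },
    /// Run an internal operation that reads `src` and writes `dst`.
    Op { name: String, src: usize, dst: usize },
    /// Serialize slot `src` to the file at `path`.
    Write { src: usize, path: String },
}

/// Follow an alias if one exists, otherwise return the slot unchanged.
fn resolve(alias: &HashMap<usize, usize>, slot: usize) -> usize {
    *alias.get(&slot).unwrap_or(&slot)
}

/// Remove redundant loads and writes to intermediate files.
/// `temp_files` lists paths that only exist to connect two steps of the script.
fn optimize(prog: &[Inst], temp_files: &HashSet<String>) -> Vec<Inst> {
    // Maps a slot that was never materialized to the slot that already holds its data.
    let mut alias: HashMap<usize, usize> = HashMap::new();
    // Maps a file path to the slot that already holds its contents.
    let mut contents: HashMap<String, usize> = HashMap::new();
    let mut out = Vec::new();

    for inst in prog {
        match inst {
            Inst::Load { path, dst } => {
                if let Some(&slot) = contents.get(path) {
                    // Already in memory (loaded or just written): reuse the slot.
                    alias.insert(*dst, slot);
                } else {
                    contents.insert(path.clone(), *dst);
                    out.push(inst.clone());
                }
            }
            Inst::Op { name, src, dst } => {
                out.push(Inst::Op {
                    name: name.clone(),
                    src: resolve(&alias, *src),
                    dst: *dst,
                });
            }
            Inst::Write { src, path } => {
                let src = resolve(&alias, *src);
                contents.insert(path.clone(), src);
                // Keep only writes the user can observe; drop intermediate files.
                if !temp_files.contains(path) {
                    out.push(Inst::Write { src, path: path.clone() });
                }
            }
        }
    }
    out
}

/// A toy script: derive a subgraph via a temp file, analyze it,
/// then analyze the original input a second time.
fn example_program() -> Vec<Inst> {
    let s = |x: &str| x.to_string();
    vec![
        Inst::Load { path: s("in.gfa"), dst: 0 },
        Inst::Op { name: s("extract"), src: 0, dst: 1 },
        Inst::Write { src: 1, path: s("tmp.gfa") },
        Inst::Load { path: s("tmp.gfa"), dst: 2 },
        Inst::Op { name: s("depth"), src: 2, dst: 3 },
        Inst::Load { path: s("in.gfa"), dst: 4 },
        Inst::Op { name: s("stats"), src: 4, dst: 5 },
        Inst::Write { src: 5, path: s("out.txt") },
    ]
}
```

Calling `optimize(&example_program(), &temp_files)` with `tmp.gfa` marked as temporary leaves one load, three operations, and one write: the `depth` step reads the extracted store directly from memory, and the second load of `in.gfa` disappears. A real implementation would also need to handle overwritten files and commands that must run as external processes, which this sketch deliberately ignores.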