Show HN: I made an open-source Rust program for memory-efficient genomics
10 days ago
- #Rust
- #bioinformatics
- #genomics
- Rosalind is a Rust engine for genome alignment, streaming variant calling, and custom bioinformatics analytics.
- Designed for low-memory environments (<100 MB RAM), suitable for hospital workstations, clinic laptops, field kits, and classrooms.
- Achieves O(√t) working memory, deterministic replay, and drop-in extensibility for new pipelines.
- Core problem addressed: standard tools require >50 GB RAM, making them inaccessible in many settings.
- Solution: split workloads into √t blocks, reuse rolling boundaries, and evaluate height-compressed trees.
- Features include O(√t) working memory, end-to-end determinism, full-history equivalence, and streaming SAM/BAM/VCF outputs.
- Use cases: clinical genomics on laptops, outbreak monitoring at the edge, population-scale research, education, and custom analytics.
- Technical details: space bound, deterministic replay, composable design, guardrails, partition invariance, and full-history equivalence.
- Performance: working set ≈ (α + β) · √t + γ, with whole-genome runs around 30–80 MB.
- Comparison with typical stacks: Rosalind uses <100 MB RAM, is deterministic, partition invariant, and streaming-friendly.
- Capabilities: FM-index alignment, streaming variant calling, standards-compliant outputs, and plugin & Python ecosystem.
- Implementation: rolling boundary, block decomposition, height-compressed trees, streaming ledger, and workspace pooling.
- Execution flow: reads → block alignment → rolling boundary update → tree merge → streaming outputs.
- Directory structure: src/framework/, src/genomics/, src/plugin/, src/python_bindings/, examples/, scripts/, tests/.
- Installation: requires Rust 1.72+, Python 3.9+, and native compression headers.
- Usage: CLI, Rust APIs, Python bindings, and plugins.
- Sample data: includes small FASTA/FASTQ snippets and alignment inputs.
- Embedding Rosalind: align reads and call variants using Rust APIs.
- CLI usage: align FASTQ reads, emit SAM/BAM/VCF outputs, and call variants.
- Python bindings: install with maturin, use PyGenomicEngine for exploratory analysis.
- Testing: verify O(√t) bound, run benchmarks, and ensure determinism.
- Extending Rosalind: add Rust plugins, CLI extensions, Python orchestration, and sample datasets.
- Troubleshooting: common issues and solutions.
- Contributing: guidelines for pull requests.
- License: Apache-2.0 + MIT dual license.
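
The block-decomposition idea in the bullets above (split the workload into about √t blocks, carry a small rolling boundary between them, get the same answer however the input is partitioned) can be illustrated with a toy example. This is a minimal sketch, not Rosalind's code or API: every name here (`block_size`, `count_motif_streaming`, `working_set_bytes`) is hypothetical, `t` is assumed to be the input length in bytes, and a plain motif counter stands in for real alignment. It does not attempt the height-compressed tree evaluation.

```rust
use std::io::{self, Read};

// Illustrative sketch only. These names are hypothetical and are NOT
// part of Rosalind's API; `t` is assumed to be the input length in bytes.

/// Smallest b with b * b >= t, so t items split into roughly sqrt(t) blocks.
pub fn block_size(t: usize) -> usize {
    if t == 0 {
        return 1;
    }
    // Start from the floating-point estimate, then correct any rounding.
    let mut b = (t as f64).sqrt() as usize;
    while b * b < t {
        b += 1;
    }
    while b > 1 && (b - 1) * (b - 1) >= t {
        b -= 1;
    }
    b.max(1)
}

/// Count (possibly overlapping) occurrences of `motif` in a byte stream,
/// reading at most `block` bytes at a time. The only state carried between
/// blocks is the last `motif.len() - 1` bytes (the "rolling boundary") and
/// the running count, so buffers stay at O(block + motif.len()).
pub fn count_motif_streaming<R: Read>(
    mut reader: R,
    motif: &[u8],
    block: usize,
) -> io::Result<usize> {
    assert!(block > 0 && !motif.is_empty());
    let keep = motif.len() - 1;
    let mut buf = vec![0u8; block];
    let mut tail: Vec<u8> = Vec::with_capacity(keep);
    let mut window: Vec<u8> = Vec::with_capacity(keep + block);
    let mut count = 0;

    loop {
        let n = reader.read(&mut buf)?;
        if n == 0 {
            break; // end of stream
        }
        // Prepend the boundary so matches spanning two blocks are seen once.
        window.clear();
        window.extend_from_slice(&tail);
        window.extend_from_slice(&buf[..n]);
        count += window.windows(motif.len()).filter(|w| *w == motif).count();

        // Roll the boundary forward: keep only the last `keep` bytes.
        let start = window.len().saturating_sub(keep);
        tail.clear();
        tail.extend_from_slice(&window[start..]);
    }
    Ok(count)
}

/// The working-set model quoted in the post, (alpha + beta) * sqrt(t) + gamma.
/// The constants are placeholders the caller must supply.
pub fn working_set_bytes(alpha: f64, beta: f64, gamma: f64, t: f64) -> f64 {
    (alpha + beta) * t.sqrt() + gamma
}
```

Because the tail is always shorter than the motif, no match can be counted twice, and the result does not depend on where block edges fall. That is the toy analogue of partition invariance: calling `count_motif_streaming(&seq[..], b"ACG", block_size(seq.len()))` gives the same count as any other block size.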