A Conceptual Model for Storage Unification
3 days ago
- #data-storage
- #virtualization
- #unification
- Object storage is increasingly used in the data stack, but low-latency systems still require separate hot-data storage.
- Storage unification involves presenting heterogeneous storage systems and formats as a single coherent resource, not a single system or format.
- The primary use case for unification is combining real-time and historical data under one abstraction, as seen in systems like Kafka, Pulsar, SingleStore, TiDB, Apache Pinot, Druid, and ClickHouse.
- Lakehouses represent the next frontier in unification, integrating real-time data with historical lakehouse data.
- Data virtualization is key to storage unification, abstracting logical resources from physical implementations.
- Virtualization combines frontend abstraction (unified logical model) and backend work (physical storage management).
- Physical storage management includes data organization, tiering, materialization, and lifecycle management.
- Tiering can be internal (only the primary system accesses the tiers) or shared (multiple systems access the tiers).
- Shared tiering combines tiering and materialization, requiring bidirectional lossless conversion between formats.
- Challenges of shared tiering include lifecycle management, schema evolution, data exposure, fidelity, security, performance overhead, and risk.
- Storage unification can be implemented client-side or server-side, each with trade-offs.
- Materialization and tiering processes can be integrated into the primary system or run externally, and they can access storage directly or through an API.
- Lifecycle management should be centralized in a metadata service to coordinate tiering and materialization.
- Schema management and evolution must ensure compatibility across different storage services and formats.
- Choosing between shared tiering and materialization depends on where the stitching logic lives and on the trade-offs of each approach.
- Shared tiering reduces storage costs but adds complexity, while materialization offers flexibility and reliability at the cost of duplication.
- Clear ownership and disciplined management are essential for successful shared tiering.
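
To make the tiering and stitching ideas above concrete, here is a minimal sketch of internal tiering behind a single logical log. Everything in it is an assumption made for illustration: the names `UnifiedLog`, `Metadata`, and `hot_start` are hypothetical, the two dictionaries stand in for object storage and a low-latency store, and no format conversion is modeled, so it does not cover shared tiering. The point is only to show a metadata-owned boundary deciding which tier serves each offset, and a tiering step that commits the boundary before deleting hot data.

```python
from dataclasses import dataclass, field


@dataclass
class Metadata:
    """Hypothetical metadata service: records where the hot tier begins."""
    hot_start: int = 0  # first offset still held in the hot tier


@dataclass
class UnifiedLog:
    """One logical log backed by a cold (historical) and a hot (real-time) tier."""
    meta: Metadata = field(default_factory=Metadata)
    cold: dict[int, str] = field(default_factory=dict)  # stand-in for object storage
    hot: dict[int, str] = field(default_factory=dict)   # stand-in for low-latency storage
    next_offset: int = 0

    def append(self, record: str) -> int:
        """New records always land in the hot tier."""
        offset = self.next_offset
        self.hot[offset] = record
        self.next_offset += 1
        return offset

    def tier(self, up_to: int) -> None:
        """Move offsets below `up_to` to the cold tier, then advance the boundary."""
        up_to = min(up_to, self.next_offset)
        old_start = self.meta.hot_start
        for offset in range(old_start, up_to):
            self.cold[offset] = self.hot[offset]      # 1. copy to the cold tier
        self.meta.hot_start = max(old_start, up_to)   # 2. commit the boundary in metadata
        for offset in range(old_start, self.meta.hot_start):
            del self.hot[offset]                      # 3. only then delete from the hot tier

    def read(self, start: int, end: int) -> list[str]:
        """Stitch a read of [start, end) across both tiers using the boundary."""
        end = min(end, self.next_offset)
        boundary = self.meta.hot_start
        out = []
        for offset in range(max(start, 0), end):
            source = self.cold if offset < boundary else self.hot
            out.append(source[offset])
        return out


log = UnifiedLog()
for i in range(5):
    log.append(f"r{i}")
log.tier(3)               # offsets 0..2 move to the cold tier
print(log.read(1, 5))     # one read spans both tiers
```

Here the stitching logic sits on the server side, inside `read`, so callers never see the boundary. A client-side variant would expose `hot_start` and both stores and leave the stitching to the caller, which is the trade-off the summary points to.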