A Conceptual Model for Storage Unification

3 days ago
  • #data-storage
  • #virtualization
  • #unification
  • Object storage is increasingly used in the data stack, but low-latency systems still require separate hot-data storage.
  • Storage unification involves presenting heterogeneous storage systems and formats as a single coherent resource, not a single system or format.
  • The primary use case for unification is combining real-time and historical data under one abstraction, as seen in systems such as Kafka, Pulsar, SingleStore, TiDB, Apache Pinot, Druid, and ClickHouse.
  • Lakehouses represent the next frontier in unification, integrating real-time data with historical lakehouse data.
  • Data virtualization is key to storage unification, abstracting logical resources from physical implementations.
  • Virtualization combines frontend abstraction (unified logical model) and backend work (physical storage management).
  • Physical storage management includes data organization, tiering, materialization, and lifecycle management.
  • Tiering can be internal (only the primary system accesses the tiers) or shared (multiple systems access the same tiers).
  • Shared tiering combines tiering and materialization, requiring bidirectional lossless conversion between formats.
  • Challenges of shared tiering include lifecycle management, schema evolution, data exposure, fidelity, security, performance overhead, and risk.
  • Storage unification can be implemented client-side or server-side, each with trade-offs.
  • Materialization and tiering processes can run inside the primary system or as external processes, and can reach storage either directly or through an API.
  • Lifecycle management should be centralized in a metadata service to coordinate tiering and materialization.
  • Schema management and evolution must ensure compatibility across different storage services and formats.
  • Choosing between shared tiering and materialization depends on factors such as where the stitching logic lives and the trade-offs of each approach.
  • Shared tiering reduces storage costs but adds complexity, while materialization offers flexibility and reliability at the cost of duplication.
  • Clear ownership and disciplined management are essential for successful shared tiering.
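
The points above about tiering, stitching, and centralized lifecycle metadata can be made concrete with a small sketch. It is illustrative only and is not taken from any of the systems named: the names UnifiedLog, Metadata, and tiered_up_to are hypothetical, and two in-memory dicts stand in for the hot store and for object storage. It shows internal tiering (one system owns both tiers), with a single metadata record deciding which tier serves each offset, so readers see one logical log.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Metadata:
    """Hypothetical metadata service: the single owner of the tier boundary."""
    tiered_up_to: int = 0  # offsets below this value are served from the cold tier


@dataclass
class UnifiedLog:
    """One logical log backed by two physical tiers (both simulated in memory)."""
    meta: Metadata = field(default_factory=Metadata)
    hot: Dict[int, str] = field(default_factory=dict)   # stand-in for low-latency storage
    cold: Dict[int, str] = field(default_factory=dict)  # stand-in for object storage
    next_offset: int = 0

    def append(self, record: str) -> int:
        """New data always lands in the hot tier."""
        offset = self.next_offset
        self.hot[offset] = record
        self.next_offset += 1
        return offset

    def tier(self, up_to: int) -> None:
        """Move offsets [tiered_up_to, up_to) from the hot tier to the cold tier."""
        old = self.meta.tiered_up_to
        up_to = min(up_to, self.next_offset)
        # 1. Copy to the cold tier first, so the data is never missing from both.
        for off in range(old, up_to):
            self.cold[off] = self.hot[off]
        # 2. Advance the boundary in the metadata service (never move it backwards).
        self.meta.tiered_up_to = max(old, up_to)
        # 3. Only then reclaim space in the hot tier.
        for off in range(old, up_to):
            del self.hot[off]

    def read(self, start: int, end: int) -> List[str]:
        """Stitch a read of [start, end) across both tiers using the boundary."""
        start = max(start, 0)
        end = min(end, self.next_offset)
        boundary = self.meta.tiered_up_to
        out = []
        for off in range(start, end):
            source = self.cold if off < boundary else self.hot
            out.append(source[off])
        return out


# Usage: append five records, tier the first three, then read across the boundary.
log = UnifiedLog()
for i in range(5):
    log.append(f"r{i}")
log.tier(3)
records = log.read(0, 5)  # served from cold for offsets 0-2, hot for 3-4
```

Here the stitching logic sits on the server side, inside read. In a shared-tiering design, other systems would also read the cold tier directly, which is where the format conversion and ownership concerns listed above come in.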