Buckets and objects are not enough
5 days ago
- #S3
- #cloud-storage
- #dataset-management
- Amazon S3 has been a popular cloud storage service for 20 years, used by many companies for diverse data types.
- S3 organizes data into buckets, but lacks a first-class way to group related objects into datasets, relying on naming conventions and prefixes instead.
- Prefixes give people a readable hierarchy, but S3 stores keys in a flat namespace and does not treat a prefix as an entity in its own right, which leads to management challenges.
- Because S3 has no dataset abstraction, it is difficult to list, measure the size of, attribute cost to, archive, restore, or delete related objects as a unit.
- External tools like catalogs and security solutions partially address the gap but often don’t manage storage directly or cover all datasets.
- Large companies like Netflix and Pinterest build custom solutions, but most organizations lack the resources to do the same, which points to a structural gap in cloud storage platforms.
- Cost overruns often stem from underlying governance issues, where unidentified or orphaned data accumulates due to inadequate tooling.
- A need exists for a layer that discovers datasets within buckets, attaches metadata, and operates at the dataset level without requiring manual registration.
- The author is building a solution to address this problem and invites contact from those experiencing similar storage management issues.
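
To make the idea of discovering datasets from prefixes concrete, here is a minimal sketch in Python. It is not the author's solution: it assumes a dataset can be approximated by the first few path segments of an object key, and the names `DatasetSummary`, `summarize_by_prefix`, and `depth` are hypothetical. The input is a plain list of `(key, size_in_bytes)` pairs, which could come from any object listing.

```python
from dataclasses import dataclass


@dataclass
class DatasetSummary:
    """Aggregate view of all objects sharing one candidate dataset prefix."""
    prefix: str
    object_count: int = 0
    total_bytes: int = 0


def summarize_by_prefix(objects, depth=2):
    """Group (key, size) pairs by the first `depth` path segments of the key.

    Assumption: a fixed-depth prefix is a reasonable stand-in for a dataset.
    Real buckets often mix depths, so this is only a starting heuristic.
    """
    summaries = {}
    for key, size in objects:
        # Everything before the last "/" is treated as the object's "directory".
        directories = key.split("/")[:-1]
        # Objects at the bucket root are grouped under the empty prefix "".
        prefix = "/".join(directories[:depth]) + "/" if directories else ""
        summary = summaries.setdefault(prefix, DatasetSummary(prefix))
        summary.object_count += 1
        summary.total_bytes += size
    return summaries


if __name__ == "__main__":
    # Made-up keys and sizes, for illustration only.
    sample = [
        ("logs/app/2024/01/a.gz", 100),
        ("logs/app/2024/02/b.gz", 50),
        ("ml/features/v1/part-0.parquet", 1000),
        ("readme.txt", 5),
    ]
    for summary in summarize_by_prefix(sample).values():
        print(summary)
```

Even this simple grouping answers questions S3 does not answer directly, such as how many objects and bytes sit under a candidate dataset. It also shows why naming conventions alone are fragile: the result depends entirely on choosing the right `depth` for each bucket.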