Building a 30 PB storage cluster in the heart of SF
- #data-storage
- #cost-optimization
- #machine-learning
- Built a storage cluster in downtown SF to store 90 million hours of video data for pretraining models.
- Cost savings: $354k/year in-house vs. ~$12M/year on AWS, a roughly 34x reduction.
- Unique data use case: ML training data doesn't need the high redundancy or availability that enterprise data does; a lost shard can be re-downloaded or dropped from the training set.
- Storage setup: 30 PB across 2,400 HDDs in 100 NetApp DS4246 disk shelves (24 bays each), fronted by 10 CPU head nodes.
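The hardware figures above are internally consistent; a quick sanity check (the per-drive size is inferred from 30 PB / 2,400 drives, not stated in the post):

```rust
// Capacity sanity check for the cluster figures.
// Assumption: per-drive capacity is derived from the totals; the post
// does not name a drive model.
fn main() {
    let chassis: u64 = 100;
    let bays_per_chassis: u64 = 24; // DS4246 is a 24-bay 4U shelf
    let drives = chassis * bays_per_chassis;
    assert_eq!(drives, 2_400);

    let total_pb = 30.0_f64;
    let tb_per_drive = total_pb * 1000.0 / drives as f64;
    // Works out to ~12.5 TB raw per drive.
    println!("{} drives, ~{:.1} TB each", drives, tb_per_drive);
}
```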
- Software: a simple ~200-line Rust service for writes, nginx for reads, SQLite for metadata.
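The post doesn't show the ~200-line Rust writer, but a minimal sketch of the same shape is easy to imagine: append each blob to a large data file and record its (offset, length) so a plain HTTP server can read it back with a Range request. The `BlobWriter` name and the in-memory index are hypothetical; the real system keeps metadata in SQLite.

```rust
use std::fs::OpenOptions;
use std::io::{Seek, SeekFrom, Write};

// Hypothetical sketch of an append-only write path. Blobs are appended
// to one big data file; (offset, length) metadata is recorded so reads
// can be served by a stock HTTP server via Range requests. A Vec stands
// in for the SQLite metadata store mentioned in the post.
struct BlobWriter {
    file: std::fs::File,
    index: Vec<(u64, u64)>, // (offset, length) per blob
}

impl BlobWriter {
    fn open(path: &str) -> std::io::Result<Self> {
        let file = OpenOptions::new().create(true).append(true).open(path)?;
        Ok(Self { file, index: Vec::new() })
    }

    fn put(&mut self, blob: &[u8]) -> std::io::Result<u64> {
        // In append mode every write lands at the end; seeking to End(0)
        // tells us where that is before we write.
        let offset = self.file.seek(SeekFrom::End(0))?;
        self.file.write_all(blob)?;
        self.index.push((offset, blob.len() as u64));
        Ok(offset)
    }
}

fn main() -> std::io::Result<()> {
    let mut w = BlobWriter::open("/tmp/blobs.dat")?;
    let off = w.put(b"frame-0001")?;
    println!("wrote 10-byte blob at offset {}", off);
    Ok(())
}
```

The appeal of this design is that the read side needs no custom code at all: nginx can serve byte ranges out of the data files directly.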
- Cost breakdown: $29.5k/month total (including depreciation) vs. $1.13M/month on AWS.
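The monthly and annual figures line up: $29.5k/month amortized is $354k/year, and against the stated ~$12M/year AWS estimate that is a roughly 34x gap.

```rust
// Cost arithmetic using only the figures stated in the post.
fn main() {
    let in_house_month = 29_500.0_f64; // $/month, including depreciation
    let in_house_year = in_house_month * 12.0;
    assert!((in_house_year - 354_000.0).abs() < 1.0); // matches $354k/year

    let aws_year = 12_000_000.0_f64; // stated AWS estimate, $/year
    println!("savings multiple: ~{:.0}x", aws_year / in_house_year);
}
```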
- Lessons learned: Simplicity was key; avoided complex solutions like Ceph or MinIO.
- Challenges: Physical setup (screwing in 2,400 HDDs), networking compatibility, and debugging.
- Recommendations: Use SAS drives, overprovision network, and ensure good cable management.
- Future improvements: Higher density setups with 90-drive SuperMicro SuperServers.