Show HN: Minimal DL library in C – 24 NAIVE CUDA/CPU ops, autodiff, Python API
2 days ago
- #ML systems
- #Deep Learning
- #GPU programming
- An ML systems and GPU programming exercise: building a small DL stack end to end.
- Blackwell-optimized CUDA kernels under active development.
- PyTorch internals explainer with notes/diagrams on core pieces.
- Book planned for longer-form writeup of design and lessons learned.
- Minimal DL library in C with core CUDA/CPU ops, autodiff, and backprop engine.
- Tensor abstraction with strides and views, plus NumPy-style complex indexing.
- Python API bindings for ops, layers, and models.
- Training components: optimizers, weight initializers, saving/loading params.
- Tooling includes computation-graph visualizer and autogenerated tests.
- Automatic cleanup of intermediate tensors for memory management.
- Built as an ML systems learning project, without AI assistance.
- Commands are provided to define and train a Conv-Net and an MLP on GPU or CPU.
- Visualization of model graph and running generated test code.
- Environment setup instructions for running generated test code.
- Download instructions for the CIFAR-10 dataset.
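
To make the strides/views bullet above concrete, here is a minimal Python sketch of how a strided view can transpose or slice a tensor without copying its data. It is not taken from the library: `StridedView`, `from_flat`, `transpose`, and `row` are hypothetical names chosen for illustration, and strides are counted in elements rather than bytes.

```python
# Illustrative only: these names are hypothetical, not the library's API.
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class StridedView:
    data: List[float]          # flat storage shared by every view
    shape: Tuple[int, ...]
    strides: Tuple[int, ...]   # elements (not bytes) to skip per step on each axis
    offset: int = 0

    @staticmethod
    def from_flat(data, shape):
        # Row-major (C-order) strides: the last axis is contiguous.
        strides = []
        step = 1
        for dim in reversed(shape):
            strides.append(step)
            step *= dim
        return StridedView(data, tuple(shape), tuple(reversed(strides)))

    def __getitem__(self, index):
        # Map an N-d index to a position in the flat buffer.
        if isinstance(index, int):
            index = (index,)
        pos = self.offset
        for i, dim, stride in zip(index, self.shape, self.strides):
            if not 0 <= i < dim:
                raise IndexError("index out of range")
            pos += i * stride
        return self.data[pos]

    def transpose(self):
        # Reversing shape and strides reorders the axes without copying data.
        return StridedView(self.data, self.shape[::-1], self.strides[::-1], self.offset)

    def row(self, r):
        # Fix index r on the first axis by shifting the offset; still no copy.
        return StridedView(self.data, self.shape[1:], self.strides[1:],
                           self.offset + r * self.strides[0])


buf = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
a = StridedView.from_flat(buf, (2, 3))   # logical layout [[0, 1, 2], [3, 4, 5]]
t = a.transpose()                        # shape (3, 2), same underlying buffer
```

Because `a` and `t` share `buf`, a write such as `buf[5] = 9.0` would be visible through both `a[1, 2]` and `t[2, 1]`.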