Making Deep Learning Go Brrrr from First Principles

2 hours ago

Optimizing deep learning performance requires identifying the bottleneck: compute-bound, memory-bandwidth-bound, or overhead-bound.
In the compute-bound regime, the goal is to maximize FLOPS utilization, often by using specialized hardware such as Tensor Cores for matrix multiplications.
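Memory-bandwidth-bound operations spend most of their time moving data between memory and the compute units; operator fusion combines several operations into one kernel so that intermediate results never make a round trip through global memory.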
Overhead-bound workloads are limited by framework and interpreter latency rather than by the computation itself; remedies include tracing, JIT compilation, and larger batch sizes, which amortize the fixed per-operation overhead.
To determine the regime, measure achieved FLOPS as a percentage of peak, achieved memory bandwidth, or how runtime scales with problem size; runtime that barely grows as the problem grows points to overhead.
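
The checks above can be sketched as a few back-of-the-envelope helpers. This is a minimal illustration, not a profiler: the function names, the 4-byte element size, and the 0.5 threshold in the scaling heuristic are assumptions chosen for the example, and real measurements would come from timing or profiling the actual workload.

```python
def matmul_flops(m, n, k):
    """FLOPs for an (m x k) @ (k x n) matmul: one multiply and one add per term."""
    return 2 * m * n * k


def flops_utilization(flops, seconds, peak_flops_per_s):
    """Achieved FLOPS as a fraction of the hardware's peak (compute-bound check)."""
    return flops / seconds / peak_flops_per_s


def pointwise_traffic_bytes(n_elements, n_ops, bytes_per_element=4, fused=False):
    """Global-memory bytes moved by a chain of unary pointwise ops.

    Unfused: every op reads its input and writes its output.
    Fused: the whole chain reads once and writes once.
    """
    passes = 1 if fused else n_ops
    return 2 * n_elements * bytes_per_element * passes


def looks_overhead_bound(t_small, t_large, size_ratio, threshold=0.5):
    """Heuristic (threshold is an assumed value): if runtime grows much more
    slowly than the problem size, fixed overhead likely dominates."""
    return (t_large / t_small) < 1 + threshold * (size_ratio - 1)


if __name__ == "__main__":
    # Hypothetical numbers: a 1024^3 matmul taking 10 ms on a 1 TFLOP/s device.
    util = flops_utilization(matmul_flops(1024, 1024, 1024), 0.01, 1e12)
    print(f"utilization: {util:.1%}")

    # Two chained pointwise ops over one million float32 elements.
    print(pointwise_traffic_bytes(1_000_000, 2), "bytes unfused")
    print(pointwise_traffic_bytes(1_000_000, 2, fused=True), "bytes fused")

    # Doubling the problem size raised runtime by only 10%.
    print(looks_overhead_bound(1.0, 1.1, 2.0))
```

Under these assumptions, fusing a chain of two pointwise operations halves the global-memory traffic, and a workload whose runtime rises 10% when the problem doubles would be flagged as overhead-bound.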
