EEmicroGPT: 19,000× faster microgpt training on a laptop CPU (loss vs. time)
- #Transformer Models
- #Hardware Efficiency
- #AI Optimization
- EEmicroGPT is a highly optimized, single-file, dependency-free C implementation of GPT training, designed for Apple Silicon.
- It achieves up to 19,000× faster training per sample than Python implementations by focusing on hardware efficiency.
- Key optimizations include SIMD vectorization, careful register usage, skipping unnecessary computation (such as padding), and leveraging Arm's SME2 matrix extension on recent Apple Silicon.
- The project demonstrates that understanding hardware and computational overhead can outperform raw compute power, especially for small models.
- EEmicroGPT's performance highlights the 'killer microseconds' problem on GPUs, where kernel-launch overhead dominates small workloads.
- The implementation includes optimizations like fast approximate math functions (e.g., exp), batch processing, and cache-aware data layouts.
- It serves as a case study in the trade-offs between model capacity, batch size, and training speed, showing how faster iteration enables better hyperparameter tuning.
- The project connects small-scale optimizations to large-scale AI training, emphasizing the importance of efficiency in the AI revolution.
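The cache-aware layout point can be made concrete with a minimal sketch (not taken from the EEmicroGPT source; the function name and shapes are illustrative). Reordering a row-major matrix multiply from the naive i-j-k loop order to i-k-j makes the inner loop unit-stride over both `B` and `C`, which is exactly the access pattern compilers can auto-vectorize with SIMD:

```c
#include <stddef.h>

/* C[n x m] += A[n x k] * B[k x m], all row-major; caller zeroes C.
   The i-k-j order streams through B and C row-by-row (unit stride),
   so the inner loop auto-vectorizes; a naive i-j-k order walks down
   B's columns and thrashes the cache instead. */
static void matmul(const float *restrict A, const float *restrict B,
                   float *restrict C, size_t n, size_t k, size_t m) {
    for (size_t i = 0; i < n; i++)
        for (size_t p = 0; p < k; p++) {
            float a = A[i * k + p];   /* scalar reused across the row */
            for (size_t j = 0; j < m; j++)
                C[i * m + j] += a * B[p * m + j];
        }
}
```

The `restrict` qualifiers tell the compiler the three buffers do not alias, which is what licenses the vectorized inner loop.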
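As an illustration of the fast-approximate-math point, here is a classic Schraudolph-style `exp` approximation in C (a generic sketch under the assumption of IEEE-754 doubles, not code from the project). It writes a scaled input directly into the bit pattern of a double, trading a few percent of accuracy for one multiply and one add:

```c
#include <stdint.h>

/* Schraudolph-style fast exp: constructs 2^(x/ln 2) by writing a
   scaled x straight into a double's exponent/mantissa bits.
   6497320848556798.0 ~= 2^52 / ln 2 shifts x into the exponent
   field; 0x3FEF127F00000000 is the IEEE-754 bias minus a tuning
   offset that minimizes mean error. Accurate to a few percent over
   a moderate range, which is often enough for softmax-like uses. */
static double fast_exp(double x) {
    union { double d; int64_t i; } u;
    u.i = (int64_t)(x * 6497320848556798.0) + 0x3FEF127F00000000LL;
    return u.d;
}
```

For example, `fast_exp(1.0)` lands within a few percent of e; the pattern of replacing libm calls with bit-level approximations is one of the "fast approximate math" tricks the summary refers to.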