Show HN: I Parallelized RNN Training from O(T) to O(log T) Using CUDA
- #RNNs
- #GPU Programming
- #Parallel Computing
- CUDA implementation of parallelizable GRUs and LSTMs, written for CS179.
- Project verifies the claim from 'Were RNNs All We Needed?' by Feng et al. that LSTMs and GRUs can be simplified so their recurrences become amenable to a parallel scan algorithm.
- Traditional RNNs must process a sequence one step at a time because each hidden state depends on the previous one, which leaves GPUs underutilized.
- Simplified models (minGRU and minLSTM) drop the gates' dependence on the previous hidden state, leaving a recurrence of the form h_t = a_t * h_{t-1} + b_t that can be evaluated with a parallel scan (see the first sketch after this list).
- The parallel scan reduces the sequential depth of that recurrence from O(T) to O(log T) steps, so long sequences map far better onto the GPU.
- Benchmarks show a significant speedup for the GPU scan (GPU-scan) over both the sequential CPU baseline (CPU-seq) and a CPU implementation of the scan (CPU-scan).
- Optimizations include fusing the element-wise gate computations into a single CUDA kernel and computing the input projections for all timesteps with one cuBLAS GEMM (see the second sketch after this list).
- Project highlights the importance of architecture in deep learning, contrasting with the 'bitter lesson' perspective.
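
The key observation behind the O(log T) claim: once the gates depend only on x_t, the minGRU update h_t = (1 - z_t) * h_{t-1} + z_t * h̃_t is a first-order linear recurrence h_t = a_t * h_{t-1} + b_t with a_t = 1 - z_t and b_t = z_t * h̃_t, and composing two such steps is associative, so the whole sequence can be reduced with a scan. Below is a minimal sketch of that scan as a Hillis-Steele pass in shared memory; it is not the project's actual kernel, and the kernel name, memory layout, one-block-per-(batch, hidden-unit) mapping, and the assumptions T <= blockDim.x and h_0 = 0 are mine for illustration.

```cuda
// Hillis-Steele inclusive scan over the linear recurrence h_t = a_t * h_{t-1} + b_t.
// For minGRU: a_t = 1 - z_t, b_t = z_t * h_tilde_t. One block handles one
// (batch, hidden unit) row; assumes the sequence length T fits in one block.
#include <cuda_runtime.h>

__global__ void lin_recurrence_scan(const float* a,  // [B*H, T] decay terms
                                    const float* b,  // [B*H, T] input terms
                                    float* h,        // [B*H, T] hidden states out
                                    int T) {
    extern __shared__ float smem[];  // 2*T floats: coefficients, then offsets
    float* sa = smem;
    float* sb = smem + T;

    const float* a_row = a + (size_t)blockIdx.x * T;
    const float* b_row = b + (size_t)blockIdx.x * T;
    float*       h_row = h + (size_t)blockIdx.x * T;

    int t = threadIdx.x;
    if (t < T) { sa[t] = a_row[t]; sb[t] = b_row[t]; }
    __syncthreads();

    // log2(T) steps: compose each element with the one `offset` positions back.
    // Composing affine maps: (a2, b2) after (a1, b1) = (a1*a2, a2*b1 + b2).
    for (int offset = 1; offset < T; offset <<= 1) {
        float a_prev = 0.f, b_prev = 0.f;
        bool active = (t < T) && (t >= offset);
        if (active) { a_prev = sa[t - offset]; b_prev = sb[t - offset]; }
        __syncthreads();  // all reads finish before any writes
        if (active) {
            sb[t] = sa[t] * b_prev + sb[t];  // update offset using the old coefficient
            sa[t] = sa[t] * a_prev;
        }
        __syncthreads();
    }

    // With h_0 = 0, the accumulated offset term is exactly h_t.
    if (t < T) h_row[t] = sb[t];
}
```

Under these assumptions it would be launched as `lin_recurrence_scan<<<B*H, T, 2*T*sizeof(float)>>>(a, b, h, T)` with T at most 1024; longer sequences would need a multi-block (e.g. Blelchoch-style block-then-stitch) variant, which the sketch does not cover.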
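The fusion optimization is about memory traffic: rather than launching separate element-wise kernels for the sigmoid, the (1 - z) coefficient, and the z * h̃ term, one kernel produces the scan inputs in a single pass after a cuBLAS GEMM has computed both projections for every timestep. The sketch below illustrates only that idea; the names, layout, and the split into `pre_z` / `pre_h` buffers are assumptions, not the project's code.

```cuda
// Fused element-wise gate math for minGRU, producing the scan inputs (a, b)
// directly from the GEMM pre-activations. Illustrative sketch only.
#include <cuda_runtime.h>
#include <math.h>

__global__ void fused_mingru_gates(const float* pre_z,  // [N] update-gate pre-activations
                                    const float* pre_h,  // [N] candidate pre-activations
                                    float* a,            // [N] scan coefficient: 1 - z
                                    float* b,            // [N] scan offset:      z * h_tilde
                                    int N) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= N) return;
    float z = 1.f / (1.f + expf(-pre_z[i]));  // sigmoid, computed once per element
    a[i] = 1.f - z;
    b[i] = z * pre_h[i];  // minGRU's candidate is a plain linear projection of x_t
}
```

The design point is that every intermediate (z, 1 - z, z * h̃) lives only in registers, so the tensors are read and written exactly once before the scan kernel runs.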