Show HN: I Parallelized RNN Training from O(T) to O(log T) Using CUDA
- #RNNs
- #GPU Programming
- #Parallel Computing
- CUDA implementation of parallelizable GRUs and LSTMs, written for CS179.
- Project verifies the claim from 'Were RNNs All We Needed?' by Feng et al. that LSTMs and GRUs can be simplified so their recurrences become amenable to a parallel scan algorithm.
- Traditional RNNs must process a sequence one step at a time because each hidden state depends on the previous one, which leaves GPUs underutilized.
- Simplified models (minGRU and minLSTM) drop the gates' dependence on the previous hidden state, leaving a recurrence of the form h_t = a_t * h_{t-1} + b_t that can be evaluated with a parallel scan (see the first sketch after this list).
- The parallel scan reduces the sequential depth of that recurrence from O(T) to O(log T) steps, so long sequences map far better onto the GPU.
- Benchmarks show a significant speedup for the GPU scan (GPU-scan) over both the sequential CPU baseline (CPU-seq) and a CPU implementation of the scan (CPU-scan).
- Optimizations include fusing the element-wise gate computations into a single CUDA kernel and computing the input projections for all timesteps with one cuBLAS GEMM (see the second sketch after this list).
- Project highlights the importance of architecture in deep learning, contrasting with the 'bitter lesson' perspective.
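
The key observation behind the O(log T) claim: once the gates depend only on x_t, the minGRU update h_t = (1 - z_t) * h_{t-1} + z_t * h̃_t is a first-order linear recurrence h_t = a_t * h_{t-1} + b_t with a_t = 1 - z_t and b_t = z_t * h̃_t, and composing two such steps is associative, so the whole sequence can be reduced with a scan. Below is a minimal sketch of that scan as a Hillis-Steele pass in shared memory; it is not the project's actual kernel, and the kernel name, memory layout, one-block-per-(batch, hidden-unit) mapping, and the assumptions T <= blockDim.x and h_0 = 0 are mine for illustration.

```cuda
// Hillis-Steele inclusive scan over the linear recurrence h_t = a_t * h_{t-1} + b_t.
// For minGRU: a_t = 1 - z_t, b_t = z_t * h_tilde_t. One block handles one
// (batch, hidden unit) row; assumes the sequence length T fits in one block.
#include <cuda_runtime.h>

__global__ void lin_recurrence_scan(const float* a,  // [B*H, T] decay terms
                                    const float* b,  // [B*H, T] input terms
                                    float* h,        // [B*H, T] hidden states out
                                    int T) {
    extern __shared__ float smem[];  // 2*T floats: coefficients, then offsets
    float* sa = smem;
    float* sb = smem + T;

    const float* a_row = a + (size_t)blockIdx.x * T;
    const float* b_row = b + (size_t)blockIdx.x * T;
    float*       h_row = h + (size_t)blockIdx.x * T;

    int t = threadIdx.x;
    if (t < T) { sa[t] = a_row[t]; sb[t] = b_row[t]; }
    __syncthreads();

    // log2(T) steps: compose each element with the one `offset` positions back.
    // Composing affine maps: (a2, b2) after (a1, b1) = (a1*a2, a2*b1 + b2).
    for (int offset = 1; offset < T; offset <<= 1) {
        float a_prev = 0.f, b_prev = 0.f;
        bool active = (t < T) && (t >= offset);
        if (active) { a_prev = sa[t - offset]; b_prev = sb[t - offset]; }
        __syncthreads();  // all reads finish before any writes
        if (active) {
            sb[t] = sa[t] * b_prev + sb[t];  // update offset using the old coefficient
            sa[t] = sa[t] * a_prev;
        }
        __syncthreads();
    }

    // With h_0 = 0, the accumulated offset term is exactly h_t.
    if (t < T) h_row[t] = sb[t];
}
```

Under these assumptions it would be launched as `lin_recurrence_scan<<<B*H, T, 2*T*sizeof(float)>>>(a, b, h, T)` with T at most 1024; longer sequences would need a multi-block (e.g. Blelchoch-style block-then-stitch) variant, which the sketch does not cover.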
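The fusion optimization is about memory traffic: rather than launching separate element-wise kernels for the sigmoid, the (1 - z) coefficient, and the z * h̃ term, one kernel produces the scan inputs in a single pass after a cuBLAS GEMM has computed both projections for every timestep. The sketch below illustrates only that idea; the names, layout, and the split into `pre_z` / `pre_h` buffers are assumptions, not the project's code.

```cuda
// Fused element-wise gate math for minGRU, producing the scan inputs (a, b)
// directly from the GEMM pre-activations. Illustrative sketch only.
#include <cuda_runtime.h>
#include <math.h>

__global__ void fused_mingru_gates(const float* pre_z,  // [N] update-gate pre-activations
                                    const float* pre_h,  // [N] candidate pre-activations
                                    float* a,            // [N] scan coefficient: 1 - z
                                    float* b,            // [N] scan offset:      z * h_tilde
                                    int N) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= N) return;
    float z = 1.f / (1.f + expf(-pre_z[i]));  // sigmoid, computed once per element
    a[i] = 1.f - z;
    b[i] = z * pre_h[i];  // minGRU's candidate is a plain linear projection of x_t
}
```

The design point is that every intermediate (z, 1 - z, z * h̃) lives only in registers, so the tensors are read and written exactly once before the scan kernel runs.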