Show HN: I Parallelized RNN Training from O(T) to O(log T) Using CUDA

3 days ago
  • #RNNs
  • #GPU Programming
  • #Parallel Computing
  • A CUDA implementation of parallelizable GRUs and LSTMs, built for CS179.
  • The project verifies the claim from 'Were RNNs All We Needed?' by Feng et al. that LSTMs and GRUs can be simplified until their recurrences fit a parallel scan algorithm.
  • Traditional RNNs must process a sequence step by step, since each hidden state depends nonlinearly on the previous one, which leaves GPU parallelism largely unused.
  • The simplified models (minGRU and minLSTM) compute their gates from the current input alone, removing the nonlinear dependence on the previous hidden state; the recurrence becomes linear, so it can be evaluated in parallel (see the recurrence sketch after this list).
  • A parallel scan reduces the sequential depth of the recurrence from O(T) to O(log T), which maps far better onto a GPU (see the scan kernel sketch below).
  • Benchmarks show significant speedups for the GPU scan (GPU-scan) over both the sequential CPU baseline (CPU-seq) and a CPU scan (CPU-scan).
  • Optimizations include fusing the gate computations into a single kernel (see the fused-gate sketch below) and computing the gate pre-activations with cuBLAS GEMM.
  • The project highlights the continued importance of architecture in deep learning, a counterpoint to the 'bitter lesson' view that general methods plus compute beat hand-designed structure.
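
For context, here is the minGRU recurrence as described by Feng et al. (minLSTM is analogous); the a_t/b_t factoring below is the standard way to expose the scan structure, not notation from the post itself:

```latex
\begin{align}
  z_t        &= \sigma(W_z x_t) && \text{update gate: input only, no } h_{t-1} \\
  \tilde h_t &= W_h x_t         && \text{candidate state} \\
  h_t        &= (1 - z_t) \odot h_{t-1} + z_t \odot \tilde h_t
              = a_t \odot h_{t-1} + b_t
\end{align}
% With a_t = 1 - z_t and b_t = z_t \odot \tilde h_t, each step is a
% first-order linear recurrence. Composing two steps yields another step
% of the same form: (a_1, b_1) then (a_2, b_2) equals
% (a_1 \odot a_2, \; a_2 \odot b_1 + b_2), so the operator is
% associative and a parallel scan applies.
```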
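As an illustration of the scan itself, here is a minimal CUDA sketch (not the project's actual kernel): a Hillis-Steele inclusive scan over the (a_t, b_t) pairs for a single hidden unit, assuming one block of exactly T threads with T a power of two. It achieves O(log T) depth at the cost of O(T log T) total work; a Blelloch scan would keep the work at O(T).

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Inclusive scan over the per-step pairs (a_t, b_t) of the linear
// recurrence h_t = a_t * h_{t-1} + b_t, with h_0 = 0.
// Assumes blockDim.x == T and T is a power of two.
__global__ void linear_recurrence_scan(const float* a, const float* b,
                                       float* h, int T) {
    extern __shared__ float s[];  // s[0..T): a coefficients, s[T..2T): b terms
    float* sa = s;
    float* sb = s + T;
    int t = threadIdx.x;
    sa[t] = a[t];
    sb[t] = b[t];
    __syncthreads();

    for (int offset = 1; offset < T; offset <<= 1) {
        // Grab the predecessor pair before anyone overwrites it.
        float pa = 1.0f, pb = 0.0f;  // identity element: h -> 1*h + 0
        if (t >= offset) { pa = sa[t - offset]; pb = sb[t - offset]; }
        __syncthreads();
        // Compose predecessor (pa, pb), then this pair: (pa*ca, ca*pb + cb).
        sb[t] = sa[t] * pb + sb[t];
        sa[t] = sa[t] * pa;
        __syncthreads();
    }
    h[t] = sb[t];  // with h_0 = 0, the accumulated b term is exactly h_t
}

int main() {
    const int T = 8;
    float ha[T], hb[T], hh[T];
    for (int t = 0; t < T; ++t) { ha[t] = 0.5f; hb[t] = 1.0f; }

    float *da, *db, *dh;
    cudaMalloc(&da, sizeof(ha));
    cudaMalloc(&db, sizeof(hb));
    cudaMalloc(&dh, sizeof(hh));
    cudaMemcpy(da, ha, sizeof(ha), cudaMemcpyHostToDevice);
    cudaMemcpy(db, hb, sizeof(hb), cudaMemcpyHostToDevice);

    linear_recurrence_scan<<<1, T, 2 * T * sizeof(float)>>>(da, db, dh, T);
    cudaMemcpy(hh, dh, sizeof(hh), cudaMemcpyDeviceToHost);

    // h_t = 0.5 * h_{t-1} + 1 converges toward 2: 1.0, 1.5, 1.75, ...
    for (int t = 0; t < T; ++t) printf("h[%d] = %f\n", t, hh[t]);

    cudaFree(da); cudaFree(db); cudaFree(dh);
    return 0;
}
```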
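For the fusion optimization, a hedged sketch of the idea: assuming the gate pre-activations (the x_t W_z and x_t W_h products for all timesteps) have already been produced by cuBLAS GEMM calls such as cublasSgemm, a single elementwise kernel can emit both scan inputs at once, replacing separate sigmoid, subtract, and multiply passes. The kernel and buffer names here are hypothetical, not the project's.

```cuda
#include <cuda_runtime.h>

// Hypothetical fused gate kernel: consume GEMM outputs pre_z and pre_h
// (each of size n = T * hidden_dim) and produce the scan inputs
// a = 1 - sigmoid(pre_z) and b = sigmoid(pre_z) * pre_h in one pass.
__global__ void fuse_gates(const float* pre_z, const float* pre_h,
                           float* a, float* b, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float z = 1.0f / (1.0f + expf(-pre_z[i]));  // update gate z_t
        a[i] = 1.0f - z;      // coefficient on h_{t-1}
        b[i] = z * pre_h[i];  // gated candidate term
    }
}
```

Launched as, e.g., fuse_gates<<<(n + 255) / 256, 256>>>(pre_z, pre_h, a, b, n), this collapses several memory-bound passes into one, which is the kind of win the post attributes to kernel fusion.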