Running GPT-2 in WebGL: Rediscovering the Lost Art of GPU Shader Programming
- #GPU Programming
- #Machine Learning
- #WebGL
- NVIDIA introduced programmable shaders in the early 2000s, enabling sophisticated visual effects and laying the foundation for GPU computing.
- Researchers found that computations like linear algebra could be accelerated using GPU shaders, leading to the development of CUDA and OpenCL for general-purpose GPU programming.
- Traditional graphics APIs like OpenGL are tailored for rendering, while compute APIs like OpenCL and CUDA provide a direct compute model for parallel processing.
- Textures and framebuffers in WebGL can be repurposed to store and manipulate numerical data, acting as a data bus for GPU-based computations.
- Fragment shaders are used as compute kernels, with each fragment invocation acting as a parallel thread to perform neural network operations.
- A shared vertex shader draws a full-screen quad, mapping every pixel to a tensor element, and is reused across different operations.
- Each neural network operation is executed as a GPU pass, chaining together multiple operations without transferring data back to the CPU until the final step.
- The GPT-2 forward pass involves embedding, transformer layers (attention and feed-forward), normalization, and output generation, all performed on the GPU.
- Shader-based GPU computing has real limitations, including no shared memory, texture size constraints, no synchronization within a pass, and limited floating-point precision, making it less practical for real-world use than CUDA or OpenCL.
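The texture-as-data-bus idea above can be sketched on the CPU side. This is an illustrative layout, not the article's exact code: an M x N float matrix is packed into an RGBA float texture, one element per texel, using only the R channel.

```javascript
// Pack an M x N matrix into RGBA32F texture data, one element per texel's
// R channel. The texture is `cols` texels wide and `rows` texels tall,
// so matrix element (row, col) lands at pixel (x = col, y = row).
function matrixToTextureData(matrix) {
  const rows = matrix.length, cols = matrix[0].length;
  const data = new Float32Array(rows * cols * 4); // RGBA = 4 floats/texel
  for (let r = 0; r < rows; r++) {
    for (let c = 0; c < cols; c++) {
      data[(r * cols + c) * 4] = matrix[r][c]; // R channel only
    }
  }
  return { width: cols, height: rows, data };
}

// Inverse mapping, mirroring what a fragment shader does with gl_FragCoord:
function texelToElement(tex, x, y) {
  return tex.data[(y * tex.width + x) * 4];
}
```

On the GPU this buffer would be uploaded with something like `gl.texImage2D(gl.TEXTURE_2D, 0, gl.RGBA32F, width, height, 0, gl.RGBA, gl.FLOAT, data)` in WebGL2; packing four values per texel across all RGBA channels is a common variant that quarters the texture footprint.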
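The full-screen-quad and kernel pattern might look like the following. These shaders are an assumed sketch, not the article's sources: a pass-through vertex shader covers the viewport, and a fragment shader computes one element of C = A x B per fragment invocation, with a CPU reference function showing the same arithmetic.

```javascript
// Shared vertex shader: draws a full-screen quad so every output texel
// gets exactly one fragment invocation.
const quadVS = `#version 300 es
in vec2 pos;                       // quad corners in clip space
void main() { gl_Position = vec4(pos, 0.0, 1.0); }`;

// Fragment shader as a matmul kernel: each fragment computes C[row][col].
const matmulFS = `#version 300 es
precision highp float;
uniform sampler2D A;               // one float per texel in .r
uniform sampler2D B;
uniform int K;                     // shared inner dimension
out vec4 outColor;
void main() {
  ivec2 p = ivec2(gl_FragCoord.xy); // p.x = column, p.y = row of C
  float acc = 0.0;
  for (int k = 0; k < K; k++) {
    acc += texelFetch(A, ivec2(k, p.y), 0).r *
           texelFetch(B, ivec2(p.x, k), 0).r;
  }
  outColor = vec4(acc, 0.0, 0.0, 1.0);
}`;

// CPU reference for what the fragment at (col, row) computes:
function fragmentMatmul(A, B, row, col) {
  let acc = 0;
  for (let k = 0; k < A[0].length; k++) acc += A[row][k] * B[k][col];
  return acc;
}
```

Because the rasterizer launches one fragment per output pixel, the fragment shader plays the role a CUDA kernel would, with `gl_FragCoord` standing in for the thread index.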
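The pass-chaining point can be illustrated with a small CPU analogy (names are illustrative, not from the article): each operation maps one "texture" to another, and nothing is copied back to the CPU until the final read.

```javascript
// Each pass reads one texture (a plain array here) and writes a new one.
// On the GPU this would bind the input as a sampler, attach the output
// texture to a framebuffer, and draw the full-screen quad.
function runPass(kernel, input) {
  return input.map(kernel);
}

// Chain passes; intermediate results never leave the "GPU".
function forward(input, kernels) {
  let tex = input;
  for (const k of kernels) tex = runPass(k, tex); // no readback between passes
  return tex; // only here would gl.readPixels copy data back to the CPU
}
```

In real WebGL this is the classic ping-pong pattern: two framebuffer-attached textures alternate as source and destination, since a pass cannot read and write the same texture.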
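Two of the per-pass computations in the forward pass, layer normalization and the softmax inside attention, reduce to short formulas. A CPU sketch of each (illustrative, with the learned scale and shift omitted):

```javascript
// LayerNorm over a vector: subtract the mean, divide by the standard
// deviation (plus epsilon for numerical safety).
function layerNorm(x, eps = 1e-5) {
  const mean = x.reduce((a, b) => a + b, 0) / x.length;
  const variance = x.reduce((a, b) => a + (b - mean) ** 2, 0) / x.length;
  const inv = 1 / Math.sqrt(variance + eps);
  return x.map(v => (v - mean) * inv); // gamma/beta omitted for brevity
}

// Softmax over attention scores: subtract the max before exponentiating
// to avoid overflow, then normalize to a probability distribution.
function softmax(scores) {
  const m = Math.max(...scores);
  const exps = scores.map(s => Math.exp(s - m));
  const sum = exps.reduce((a, b) => a + b, 0);
  return exps.map(e => e / sum);
}
```

In the shader version each of these is its own pass: one fragment per output element, with the reductions (mean, variance, max, sum) either recomputed per fragment or staged as separate reduction passes, since fragments cannot share memory.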