Scaling Karpathy's Autoresearch: What Happens When the Agent Gets a GPU Cluster
- #parallel-computing
- #autonomous-research
- #machine-learning
- Claude Code autonomously improved a neural-network training script (Karpathy's autoresearch), running ~910 experiments over 8 hours on 16 GPUs.
- Parallel execution allowed the agent to test factorial grids of 10-13 experiments per wave, identifying interaction effects between parameters that sequential search would miss.
- The agent found that scaling model width (via the aspect ratio) had the largest impact, reducing validation bits per byte (val_bpb) from 1.003 to 0.974, a 2.87% improvement.
- The agent autonomously developed a strategy to exploit heterogeneous hardware (H100s and H200s), screening ideas on cheaper H100s and validating top performers on faster H200s.
- The experiment was managed using SkyPilot, which allowed the agent to provision and manage GPU clusters without manual intervention.
- The search progressed through five phases: hyperparameter sweeps, architecture discovery, fine-tuning, optimizer tuning, and diminishing returns.
- Parallel execution shifted the agent's strategy from greedy, one-change-at-a-time hill-climbing to factorial grids, enabling more efficient exploration of the parameter space.
- The total cost for the session was approximately $300, including $9 for Claude Code API and ~$260 for GPU compute.
- The setup is available for replication, with instructions and YAML files provided in the SkyPilot examples repository.
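For orientation, a SkyPilot task of the kind the agent would launch looks roughly like the fragment below. This is a hypothetical sketch, not one of the YAML files from the examples repository; the accelerator counts, file names, and run command are all assumptions.

```yaml
# Hypothetical SkyPilot task file (illustrative names, not from the post's repo).
resources:
  accelerators: H100:8   # screening tier; swap in H200:8 for validation runs
setup: |
  pip install -r requirements.txt
run: |
  torchrun --nproc_per_node=8 train.py --config configs/experiment.yaml
```

Launching such a file (e.g. `sky launch task.yaml`) is what lets the agent provision, use, and tear down clusters without manual intervention.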
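The factorial-grid waves described above can be sketched in a few lines: expand a small parameter grid into all combinations, then group the configs into waves sized to the cluster so each wave runs fully in parallel. This is a minimal illustration, not the agent's actual code; the grid values and `num_gpus` default are hypothetical.

```python
import itertools

def factorial_waves(param_grid, num_gpus=16):
    """Expand a parameter grid into experiment configs and group them
    into waves that run concurrently, one experiment per GPU."""
    names = list(param_grid)
    configs = [dict(zip(names, values))
               for values in itertools.product(*(param_grid[n] for n in names))]
    # Each wave holds at most num_gpus configs, so a whole wave fits on the cluster.
    return [configs[i:i + num_gpus] for i in range(0, len(configs), num_gpus)]

# Hypothetical grid: 3 learning rates x 2 widths x 2 warmup settings = 12 configs,
# matching the 10-13 experiments-per-wave scale described in the post.
waves = factorial_waves({
    "lr": [3e-4, 6e-4, 1e-3],
    "width": [768, 1024],
    "warmup_steps": [0, 256],
})
```

Because a wave covers a full cross-product rather than one change at a time, comparing results within a wave exposes parameter interaction effects that greedy sequential search would miss.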
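The two-tier hardware strategy (screen cheaply, validate the finalists) can also be sketched abstractly. The function and scoring callables below are illustrative stand-ins, not the post's actual pipeline; lower scores model val_bpb, where smaller is better.

```python
def screen_then_validate(candidates, screen_fn, validate_fn, top_k=3):
    """Rank all candidates with a cheap screening run (e.g. on H100s),
    then run the expensive validation tier (e.g. on H200s) only on the
    top_k finalists. Lower score = better, matching val_bpb."""
    finalists = sorted(candidates, key=screen_fn)[:top_k]
    # The expensive tier decides the winner among the finalists.
    return min(finalists, key=validate_fn)

# Toy stand-ins: each candidate is a model width, and both metrics
# (arbitrarily) favor widths near 1024.
screen = lambda w: abs(w - 1024) / 1000           # cheap proxy metric
validate = lambda w: abs(w - 1024) / 1000 + 0.97  # full-run metric
best = screen_then_validate([512, 768, 1024, 1536, 2048], screen, validate)
```

The design point is cost asymmetry: the cheap tier touches every candidate, while the expensive tier only ever sees `top_k` of them.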