Scaling Karpathy's Autoresearch: What Happens When the Agent Gets a GPU Cluster
- #parallel-computing
- #autonomous-research
- #machine-learning
- Claude Code autonomously improved a neural-network training script (Karpathy's autoresearch), running ~910 experiments over 8 hours on 16 GPUs.
- Parallel execution allowed the agent to test factorial grids of 10-13 experiments per wave, identifying interaction effects between parameters that sequential search would miss.
- The agent found that scaling model width (via the aspect ratio) had the largest impact, reducing validation bits per byte (val_bpb) from 1.003 to 0.974, a 2.87% improvement.
- The agent autonomously developed a strategy to exploit heterogeneous hardware (H100s and H200s), screening ideas on cheaper H100s and validating top performers on faster H200s.
- The experiment was managed using SkyPilot, which allowed the agent to provision and manage GPU clusters without manual intervention.
- The search progressed through five phases: hyperparameter sweeps, architecture discovery, fine-tuning, optimizer tuning, and diminishing returns.
- Parallel execution shifted the agent's strategy from greedy, one-change-at-a-time hill-climbing to factorial grids, enabling more efficient exploration of the parameter space.
- The total cost for the session was approximately $300, including $9 for Claude Code API and ~$260 for GPU compute.
- The setup is available for replication, with instructions and YAML files provided in the SkyPilot examples repository.
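For orientation, a SkyPilot task of the kind the agent would launch looks roughly like the fragment below. This is a hypothetical sketch, not one of the YAML files from the examples repository; the accelerator counts, file names, and run command are all assumptions.

```yaml
# Hypothetical SkyPilot task file (illustrative names, not from the post's repo).
resources:
  accelerators: H100:8   # screening tier; swap in H200:8 for validation runs
setup: |
  pip install -r requirements.txt
run: |
  torchrun --nproc_per_node=8 train.py --config configs/experiment.yaml
```

Launching such a file (e.g. `sky launch task.yaml`) is what lets the agent provision, use, and tear down clusters without manual intervention.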
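The factorial-grid waves described above can be sketched in a few lines: expand a small parameter grid into all combinations, then group the configs into waves sized to the cluster so each wave runs fully in parallel. This is a minimal illustration, not the agent's actual code; the grid values and `num_gpus` default are hypothetical.

```python
import itertools

def factorial_waves(param_grid, num_gpus=16):
    """Expand a parameter grid into experiment configs and group them
    into waves that run concurrently, one experiment per GPU."""
    names = list(param_grid)
    configs = [dict(zip(names, values))
               for values in itertools.product(*(param_grid[n] for n in names))]
    # Each wave holds at most num_gpus configs, so a whole wave fits on the cluster.
    return [configs[i:i + num_gpus] for i in range(0, len(configs), num_gpus)]

# Hypothetical grid: 3 learning rates x 2 widths x 2 warmup settings = 12 configs,
# matching the 10-13 experiments-per-wave scale described in the post.
waves = factorial_waves({
    "lr": [3e-4, 6e-4, 1e-3],
    "width": [768, 1024],
    "warmup_steps": [0, 256],
})
```

Because a wave covers a full cross-product rather than one change at a time, comparing results within a wave exposes parameter interaction effects that greedy sequential search would miss.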
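The two-tier hardware strategy (screen cheaply, validate the finalists) can also be sketched abstractly. The function and scoring callables below are illustrative stand-ins, not the post's actual pipeline; lower scores model val_bpb, where smaller is better.

```python
def screen_then_validate(candidates, screen_fn, validate_fn, top_k=3):
    """Rank all candidates with a cheap screening run (e.g. on H100s),
    then run the expensive validation tier (e.g. on H200s) only on the
    top_k finalists. Lower score = better, matching val_bpb."""
    finalists = sorted(candidates, key=screen_fn)[:top_k]
    # The expensive tier decides the winner among the finalists.
    return min(finalists, key=validate_fn)

# Toy stand-ins: each candidate is a model width, and both metrics
# (arbitrarily) favor widths near 1024.
screen = lambda w: abs(w - 1024) / 1000           # cheap proxy metric
validate = lambda w: abs(w - 1024) / 1000 + 0.97  # full-run metric
best = screen_then_validate([512, 768, 1024, 1536, 2048], screen, validate)
```

The design point is cost asymmetry: the cheap tier touches every candidate, while the expensive tier only ever sees `top_k` of them.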