Can frontier LLMs solve CAD tasks?
- #LLMs
- #CAD
- #3D-printing
- Frontier LLMs like GPT-5.3-Codex, Gemini 3.1 Pro, and Claude Opus 4.6 vary widely in capability on CAD tasks: some produce workable designs while others fail outright.
- LLMs are trained primarily on text, so they lack the visual, spatial, and motor experience humans acquire naturally, which makes spatial-reasoning tasks like CAD harder for them.
- The experiment tested LLMs on designing a 3D-printable wall mount for a bike pump using OpenSCAD, with simulations in MuJoCo to validate designs.
- Claude Opus 4.6 performed best with a 100% pass rate, though designs often needed refinement. GPT-5.2 had a good pass rate but produced flawed designs.
- Gemini 3.1 Pro and Gemini 3 Flash showed potential but were inconsistent, sometimes producing strong designs and other times failing outright or looping.
- Open-weight models like GLM-4.6V, Kimi K2.5, and Qwen 3.5 397B performed poorly, with simplistic or non-functional designs.
- The project highlighted challenges like convex decomposition in MuJoCo (MuJoCo collides each mesh as its convex hull, so concave shapes like a hook must be split into convex pieces) and the complexity of building an agentic harness for LLMs.
- Future improvements could include better grading rubrics, more objects for testing, and integrating off-the-shelf agent harnesses for better performance.
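The evaluation loop described above — an LLM emits OpenSCAD source, the harness compiles it headlessly and checks the result — can be sketched as follows. The `openscad -o` export invocation is real CLI behavior; the function name, paths, and pass/fail logic are illustrative assumptions, not the author's actual harness.

```python
import subprocess
from pathlib import Path

def compile_scad(scad_path: str, stl_path: str, timeout: int = 120) -> bool:
    """Headless OpenSCAD export: `openscad -o out.stl design.scad`.

    A non-zero exit code (syntax or CSG errors) is the first automatic
    fail signal a harness can use. Assumes `openscad` is on PATH.
    """
    try:
        result = subprocess.run(
            ["openscad", "-o", stl_path, scad_path],
            capture_output=True, text=True, timeout=timeout,
        )
    except (FileNotFoundError, subprocess.TimeoutExpired):
        # Missing binary or a hung compile both count as a failed attempt.
        return False
    return result.returncode == 0 and Path(stl_path).exists()
```

A harness would then feed the compiler's stderr back to the model on failure, giving it a chance to repair its own script before the simulation step.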
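To make the convex-decomposition challenge concrete: MuJoCo collides each mesh geom as its convex hull, so a concave wall mount would collide as if it were filled solid unless it is pre-split into convex pieces, each attached as its own geom. A minimal sketch of building such an MJCF scene — the file names and scene layout are assumptions, not the article's setup:

```python
def mjcf_scene(piece_files: list[str]) -> str:
    """Build a minimal MJCF scene from pre-decomposed convex mesh pieces.

    Each STL piece becomes a separate <geom> on one free-floating body,
    so the simulated collision shape matches the concave design.
    """
    assets = "\n    ".join(
        f'<mesh name="p{i}" file="{f}"/>' for i, f in enumerate(piece_files)
    )
    geoms = "\n      ".join(
        f'<geom type="mesh" mesh="p{i}"/>' for i in range(len(piece_files))
    )
    return f"""<mujoco>
  <asset>
    {assets}
  </asset>
  <worldbody>
    <geom type="plane" size="1 1 0.1"/>
    <body pos="0 0 0.2">
      <freejoint/>
      {geoms}
    </body>
  </worldbody>
</mujoco>"""
```

The decomposition itself (e.g. via a tool like V-HACD) is the hard part the bullet above alludes to; this sketch only shows how the resulting pieces would be assembled for simulation.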