Comparing Fable and 10 other LLMs on refactoring a LangGraph god node

6 hours ago
  • #LLM Evaluation
  • #Software Architecture
  • #Code Refactoring
  • The article describes an experiment in which 11 LLMs (including Fable-5 and several GPT models) each proposed a refactoring of a 'god node' in a LangGraph agent and then cross-evaluated one another's proposals.
  • The 'god node' (plan node) contained about 350 lines of hidden logic, making the graph hard to debug, test, and change. The goal was to lift this logic to the graph level for clarity.
  • In stage one, each model generated a proposal for splitting the plan node. Proposals varied in granularity, from balanced pipelines (e.g., Fable-5's 5-stage split) to coarser or more fine-grained splits.
  • In stage two, each model evaluated all of the proposals. Reviews differed in thoroughness: Fable-5's was meticulous and reported bugs, while others, such as Gemini-3.1-pro's, were minimal.
  • In stage three, the proposals and reviews were compared using methods such as average scores, thesis analysis, and meta-evaluations to determine the best proposal and the best analyst.
  • The best proposal by average score was Fable-5's, followed by those of GPT-5.4 and GPT-5.5. Among evaluators, GPT-5.5 was a top predictor of the consensus, and both Fable-5 and GPT-5.5 were rated highly in the meta-analyses.
  • Key takeaways: for generating architecture proposals, Fable or GPT models are recommended; for evaluation, GPT-5.5 or Fable-5 are good choices, but human oversight is still needed because the models make errors.
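
The stage-three aggregation described above can be sketched in a few lines of Python. This is a minimal illustration, not the article's actual method: the model names and scores are placeholders, the 1-10 scale is assumed, and two choices are assumptions of this sketch: excluding each model's score for its own proposal, and measuring "predicting the consensus" as the mean absolute gap from the other evaluators' average.

```python
from statistics import mean

# Hypothetical data: scores[evaluator][proposal] on an assumed 1-10 scale.
# Names and numbers are placeholders, not results from the article.
scores = {
    "model_a": {"model_a": 9, "model_b": 7, "model_c": 5},
    "model_b": {"model_a": 8, "model_b": 9, "model_c": 6},
    "model_c": {"model_a": 7, "model_b": 6, "model_c": 8},
}


def average_scores(scores, exclude_self=True):
    """Average score per proposal, optionally ignoring self-assessments."""
    proposals = {p for row in scores.values() for p in row}
    result = {}
    for proposal in proposals:
        received = [
            row[proposal]
            for evaluator, row in scores.items()
            if proposal in row and not (exclude_self and evaluator == proposal)
        ]
        result[proposal] = mean(received)
    return result


def consensus_error(scores, evaluator):
    """Mean absolute gap between one evaluator's scores and the average
    score the other evaluators gave to the same proposals (lower is better)."""
    gaps = []
    for proposal, score in scores[evaluator].items():
        others = [
            row[proposal]
            for other, row in scores.items()
            if other != evaluator and proposal in row
        ]
        if others:
            gaps.append(abs(score - mean(others)))
    return mean(gaps)


averages = average_scores(scores)
best_proposal = max(averages, key=averages.get)
best_predictor = min(scores, key=lambda ev: consensus_error(scores, ev))
print(best_proposal, best_predictor)
```

A ranking built this way still depends on the evaluators being right, which is why the takeaway about human oversight applies to the aggregated scores as well.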