I got the highest score on ARC-AGI again swapping Python for English

8 months ago

ARC-AGI is a benchmark for abstract pattern recognition, highlighting the gap between human and AI performance.
The author achieved a new high score of 79.6% on ARC v1 and 29.4% on ARC v2 using Evolutionary Test-Time Compute with English instructions.
The method generates natural language instructions and refines them over evolutionary cycles; these English instructions take the place of the Python functions used as candidate solutions before.
ARC-AGI v2 tasks are more complex, requiring multi-step reasoning, yet remain solvable by humans with high accuracy.
Current LLMs have 'dead reasoning zones': domains where their logic breaks down even though the same kind of reasoning holds up elsewhere.
The author suggests that reinforcement learning (RL) can help models develop consistent, transferable reasoning skills.
AGI, as defined by François Chollet, requires efficient skill acquisition outside training data, a goal not yet met by LLMs.
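
To make the evolutionary loop described above more concrete, here is a minimal sketch, not the author's actual implementation. It assumes three hypothetical callables: `propose` (drafts candidate English instructions from the training examples), `follow` (applies an instruction to an input grid), and `revise` (rewrites a weak instruction). In a real system each of these would presumably be an LLM call; here they are replaced by tiny deterministic toy stand-ins so the loop can run on its own. The population size, the number of generations, and the scoring rule are likewise assumptions chosen for illustration.

```python
from typing import Callable, List, Tuple

Grid = List[List[int]]
Example = Tuple[Grid, Grid]  # (input grid, expected output grid)


def score(instruction: str, examples: List[Example],
          follow: Callable[[str, Grid], Grid]) -> float:
    """Fraction of training examples the instruction reproduces exactly."""
    hits = sum(1 for inp, out in examples if follow(instruction, inp) == out)
    return hits / len(examples)


def evolve(examples: List[Example],
           propose: Callable[[List[Example], int], List[str]],
           revise: Callable[[str, List[Example]], str],
           follow: Callable[[str, Grid], Grid],
           generations: int = 3, population: int = 4, keep: int = 2) -> str:
    """Keep the best-scoring instructions and revise them each generation."""
    candidates = propose(examples, population)
    for _ in range(generations):
        ranked = sorted(candidates,
                        key=lambda c: score(c, examples, follow),
                        reverse=True)
        if score(ranked[0], examples, follow) == 1.0:
            return ranked[0]  # solves every training example, stop early
        parents = ranked[:keep]
        # Next generation: surviving parents plus one revision of each.
        candidates = parents + [revise(p, examples) for p in parents]
    return max(candidates, key=lambda c: score(c, examples, follow))


# --- Toy stand-ins for the LLM calls (hypothetical, for illustration only) ---
TOY_RULES = {
    "copy the grid": lambda g: [row[:] for row in g],
    "mirror each row left to right": lambda g: [row[::-1] for row in g],
    "flip the rows top to bottom": lambda g: g[::-1],
}


def toy_follow(instruction: str, grid: Grid) -> Grid:
    return TOY_RULES[instruction](grid)


def toy_propose(examples: List[Example], n: int) -> List[str]:
    return ["copy the grid"]  # a deliberately poor first guess


def toy_revise(instruction: str, examples: List[Example]) -> str:
    names = list(TOY_RULES)
    return names[(names.index(instruction) + 1) % len(names)]


examples = [([[1, 2], [3, 4]], [[3, 4], [1, 2]])]
best = evolve(examples, toy_propose, toy_revise, toy_follow)
print(best)
```

The point of the sketch is the shape of the loop: candidates are plain strings, fitness is measured by how many training examples an instruction reproduces, and only the fittest instructions are carried forward and revised.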
