How Anthropic's Claude Thinks

2 months ago

Anthropic developed a 'microscope' to trace Claude's computational steps, revealing discrepancies between its explanations and actual processes.
Claude uses parallel computational paths for tasks like arithmetic, differing from traditional methods it describes.
The model operates in an abstract conceptual space, applying learned knowledge across languages without translation.
Claude demonstrates planning in creative tasks, such as poetry, by selecting endpoints before constructing content.
Self-reports of reasoning can be inaccurate, as Claude lacks access to its internal algorithms.
Motivated reasoning occurs, where Claude reverse-engineers justifications for predetermined answers without actual computation.
Hallucinations result from a misfiring recognition system overriding Claude's default refusal to answer unknown queries.
Jailbreaks exploit tensions between safety features and grammatical coherence, leading to unintended outputs.
Analysis tools provide insights but are limited to a fraction of prompts and require significant human effort.
Claude's thinking integrates abstract concepts, planning, invented methods, and sometimes fabricated reasoning.

Hasty Briefsbeta