Hasty Briefs

  • #reinforcement learning
  • #chain-of-thought
  • #test-time compute
  • Test-time compute and chain-of-thought (CoT) have significantly improved model performance, raising research questions about how and why such "thinking" helps.
  • Human thinking is often characterized as a fast (System 1) mode and a slow (System 2) mode, with System 2 enabling more deliberate, rational choices.
  • Neural networks can leverage more computation at test time for better performance, similar to human System 2 thinking.
  • A transformer's forward-pass compute per generated token is roughly twice its parameter count in FLOPs, and CoT lets a model spend a variable amount of compute depending on problem difficulty (see the back-of-envelope sketch after this list).
  • Latent variable models, which express rich distributions over visible variables by marginalizing over hidden ones, give a useful lens on CoT methods: the reasoning trace acts as a latent variable between question and answer (formalized after this list).
  • Early CoT work relied on supervised learning over human-written reasoning traces; later work improved on this with reinforcement learning (RL).
  • Parallel sampling and sequential revision are the two main approaches to using test-time compute to improve model outputs (a sequential-revision sketch follows this list).
  • Beam search and best-of-N are standard methods for finding high-scoring samples, with process reward models scoring intermediate steps to guide the search (a best-of-N sketch follows this list).
  • Self-correction is unreliable on its own: models often need external feedback (e.g., verifiers or ground-truth checks) to avoid hallucinated corrections and behavior collapse.
  • Recent RL successes in improving reasoning include models like DeepSeek-R1, which excel in math and coding tasks.
  • Tool use, such as code interpreters and web search, enhances the reasoning capabilities of models like o3 and o4-mini (a generic tool-loop sketch follows this list).
  • CoT offers a form of interpretability, but only insofar as the trace truthfully describes the model's internal process, which is not guaranteed.
  • CoT faithfulness can be compromised by early answering (the model has effectively decided before the trace ends), uninformative filler tokens, or encodings that are not human-readable.
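
To make the compute arithmetic concrete, here is a minimal back-of-envelope sketch in Python. It assumes the common approximation of ~2 FLOPs per parameter per generated token; the 7B model size and token counts are illustrative, not figures from the brief.

```python
def flops_per_token(num_params: float) -> float:
    """Approximate forward-pass FLOPs to generate one token (~2 per parameter)."""
    return 2.0 * num_params

def total_flops(num_params: float, answer_tokens: int, cot_tokens: int = 0) -> float:
    """Total FLOPs when the model emits cot_tokens of 'thinking' before the answer."""
    return flops_per_token(num_params) * (cot_tokens + answer_tokens)

params = 7e9  # hypothetical 7B-parameter model
direct = total_flops(params, answer_tokens=10)                     # ~1.4e11 FLOPs
with_cot = total_flops(params, answer_tokens=10, cot_tokens=1000)  # ~1.4e13 FLOPs
print(f"direct: {direct:.1e}, with CoT: {with_cot:.1e} ({with_cot / direct:.0f}x)")
```

A longer chain of thought thus scales compute roughly linearly with the number of "thinking" tokens, which is what lets the model adapt its effort to problem difficulty.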
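The latent-variable view can be written down directly (a standard formulation, not an equation quoted from the brief): with question $x$, chain of thought $z$, and answer $y$, the model's answer distribution marginalizes over possible reasoning traces:

$$
P(y \mid x) = \sum_{z} P(z \mid x)\, P(y \mid x, z)
$$

Parallel sampling approximates this sum by drawing several traces $z$, while RL training shifts $P(z \mid x)$ toward traces that lead to correct answers.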
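A minimal sketch of the sequential-revision pattern. The `generate` and `critique` callables are hypothetical stand-ins for model calls, not an API from the brief:

```python
def sequential_revision(generate, critique, prompt: str, max_rounds: int = 3) -> str:
    """Iteratively revise an answer using feedback on the previous attempt.

    generate(prompt) -> str samples one completion; critique(prompt, answer)
    returns a feedback string, or None when it finds no remaining issues.
    """
    answer = generate(prompt)
    for _ in range(max_rounds):
        feedback = critique(prompt, answer)
        if feedback is None:  # critic is satisfied; stop revising
            break
        # Condition the next attempt on the prior answer and its critique.
        answer = generate(
            f"{prompt}\n\nPrevious answer:\n{answer}\n\n"
            f"Feedback:\n{feedback}\n\nWrite a revised answer:"
        )
    return answer
```

Note that when `critique` is the model grading itself rather than an external verifier, this loop is exactly where hallucinated corrections and behavior collapse can creep in, per the self-correction bullet above.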
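And the parallel-sampling counterpart, best-of-N under a reward model, again with hypothetical stand-in callables:

```python
def best_of_n(generate, score, prompt: str, n: int = 8) -> str:
    """Sample n candidates independently and return the highest-scoring one.

    generate(prompt) -> str samples one completion; score(prompt, answer) -> float
    stands in for an outcome or process reward model.
    """
    candidates = [generate(prompt) for _ in range(n)]
    return max(candidates, key=lambda answer: score(prompt, answer))
```

Beam search differs in that it keeps the top-k *partial* traces at each step, which is where a process reward model (scoring intermediate steps rather than only final answers) fits naturally.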
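Finally, a sketch of the generic tool-use loop. The message format and the `model` and `tools` stand-ins are illustrative assumptions; real APIs (e.g., those behind o3/o4-mini) differ in detail:

```python
def tool_loop(model, tools: dict, prompt: str, max_steps: int = 5) -> str:
    """Let the model interleave reasoning with tool calls until it answers.

    model(messages) -> dict is a placeholder that returns either
    {"tool": name, "args": args} to request a tool call, or {"answer": text}.
    tools maps tool names (e.g., "python", "search") to callables.
    """
    messages = [{"role": "user", "content": prompt}]
    for _ in range(max_steps):
        step = model(messages)
        if "answer" in step:
            return step["answer"]
        # Execute the requested tool and feed the result back to the model.
        result = tools[step["tool"]](step["args"])
        messages.append({"role": "tool", "content": str(result)})
    return "No final answer within the step budget."
```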