Hasty Briefs

  • #reinforcement learning
  • #chain-of-thought
  • #test-time compute
  • Test-time compute and chain-of-thought (CoT) have significantly improved model performance, raising research questions about how and why such "thinking" helps.
  • Human thinking is often characterized as a fast (System 1) mode and a slow (System 2) mode, with System 2 enabling more deliberate, rational choices.
  • Neural networks can leverage more computation at test time for better performance, similar to human System 2 thinking.
  • A transformer's forward-pass compute per generated token is roughly twice its parameter count in FLOPs, and CoT lets a model spend a variable amount of compute depending on problem difficulty (see the back-of-envelope sketch after this list).
  • Latent variable models, which express rich distributions over visible variables by marginalizing over hidden ones, give a useful lens on CoT methods: the reasoning trace acts as a latent variable between question and answer (formalized after this list).
  • Early CoT work relied on supervised learning over human-written reasoning traces; later work improved on this with reinforcement learning (RL).
  • Parallel sampling and sequential revision are the two main approaches to using test-time compute to improve model outputs (a sequential-revision sketch follows this list).
  • Beam search and best-of-N are standard methods for finding high-scoring samples, with process reward models scoring intermediate steps to guide the search (a best-of-N sketch follows this list).
  • Self-correction is unreliable on its own: models often need external feedback (e.g., verifiers or ground-truth checks) to avoid hallucinated corrections and behavior collapse.
  • Recent RL successes in improving reasoning include models like DeepSeek-R1, which excel in math and coding tasks.
  • Tool use, such as code interpreters and web search, enhances the reasoning capabilities of models like o3 and o4-mini (a generic tool-loop sketch follows this list).
  • CoT offers a form of interpretability, but only insofar as the trace truthfully describes the model's internal process, which is not guaranteed.
  • CoT faithfulness can be compromised by early answering (the model has effectively decided before the trace ends), uninformative filler tokens, or encodings that are not human-readable.
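
To make the compute arithmetic concrete, here is a minimal back-of-envelope sketch in Python. It assumes the common approximation of ~2 FLOPs per parameter per generated token; the 7B model size and token counts are illustrative, not figures from the brief.

```python
def flops_per_token(num_params: float) -> float:
    """Approximate forward-pass FLOPs to generate one token (~2 per parameter)."""
    return 2.0 * num_params

def total_flops(num_params: float, answer_tokens: int, cot_tokens: int = 0) -> float:
    """Total FLOPs when the model emits cot_tokens of 'thinking' before the answer."""
    return flops_per_token(num_params) * (cot_tokens + answer_tokens)

params = 7e9  # hypothetical 7B-parameter model
direct = total_flops(params, answer_tokens=10)                     # ~1.4e11 FLOPs
with_cot = total_flops(params, answer_tokens=10, cot_tokens=1000)  # ~1.4e13 FLOPs
print(f"direct: {direct:.1e}, with CoT: {with_cot:.1e} ({with_cot / direct:.0f}x)")
```

A longer chain of thought thus scales compute roughly linearly with the number of "thinking" tokens, which is what lets the model adapt its effort to problem difficulty.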
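The latent-variable view can be written down directly (a standard formulation, not an equation quoted from the brief): with question $x$, chain of thought $z$, and answer $y$, the model's answer distribution marginalizes over possible reasoning traces:

$$
P(y \mid x) = \sum_{z} P(z \mid x)\, P(y \mid x, z)
$$

Parallel sampling approximates this sum by drawing several traces $z$, while RL training shifts $P(z \mid x)$ toward traces that lead to correct answers.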
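A minimal sketch of the sequential-revision pattern. The `generate` and `critique` callables are hypothetical stand-ins for model calls, not an API from the brief:

```python
def sequential_revision(generate, critique, prompt: str, max_rounds: int = 3) -> str:
    """Iteratively revise an answer using feedback on the previous attempt.

    generate(prompt) -> str samples one completion; critique(prompt, answer)
    returns a feedback string, or None when it finds no remaining issues.
    """
    answer = generate(prompt)
    for _ in range(max_rounds):
        feedback = critique(prompt, answer)
        if feedback is None:  # critic is satisfied; stop revising
            break
        # Condition the next attempt on the prior answer and its critique.
        answer = generate(
            f"{prompt}\n\nPrevious answer:\n{answer}\n\n"
            f"Feedback:\n{feedback}\n\nWrite a revised answer:"
        )
    return answer
```

Note that when `critique` is the model grading itself rather than an external verifier, this loop is exactly where hallucinated corrections and behavior collapse can creep in, per the self-correction bullet above.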
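And the parallel-sampling counterpart, best-of-N under a reward model, again with hypothetical stand-in callables:

```python
def best_of_n(generate, score, prompt: str, n: int = 8) -> str:
    """Sample n candidates independently and return the highest-scoring one.

    generate(prompt) -> str samples one completion; score(prompt, answer) -> float
    stands in for an outcome or process reward model.
    """
    candidates = [generate(prompt) for _ in range(n)]
    return max(candidates, key=lambda answer: score(prompt, answer))
```

Beam search differs in that it keeps the top-k *partial* traces at each step, which is where a process reward model (scoring intermediate steps rather than only final answers) fits naturally.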
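Finally, a sketch of the generic tool-use loop. The message format and the `model` and `tools` stand-ins are illustrative assumptions; real APIs (e.g., those behind o3/o4-mini) differ in detail:

```python
def tool_loop(model, tools: dict, prompt: str, max_steps: int = 5) -> str:
    """Let the model interleave reasoning with tool calls until it answers.

    model(messages) -> dict is a placeholder that returns either
    {"tool": name, "args": args} to request a tool call, or {"answer": text}.
    tools maps tool names (e.g., "python", "search") to callables.
    """
    messages = [{"role": "user", "content": prompt}]
    for _ in range(max_steps):
        step = model(messages)
        if "answer" in step:
            return step["answer"]
        # Execute the requested tool and feed the result back to the model.
        result = tools[step["tool"]](step["args"])
        messages.append({"role": "tool", "content": str(result)})
    return "No final answer within the step budget."
```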