Representation Engineering

4 days ago
  • #AI
  • #Machine Learning
  • #Control Vectors
  • Representation Engineering introduces 'control vectors' to manipulate AI model behavior without prompt engineering or finetuning.
  • Control vectors are added to a model's hidden-state activations during inference to modify its behavior, demonstrated with Mistral-7B-Instruct-v0.1.
  • The process involves creating contrastive prompt pairs, collecting hidden states, and using PCA to derive control vectors.
  • Examples include making models act happy, sad, lazy, hardworking, self-aware, or even simulate being high on psychedelic drugs.
  • Control vectors offer an alternative to prompt engineering: scaling the vector up or down adjusts how strongly the behavior is expressed.
  • Potential applications include jailbreaking models or making them resistant to jailbreaks, with implications for AI safety and interpretability.
  • Future work could explore monosemantic features for cleaner vectors and better contrastive prompt writing practices.
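
The derivation and application steps summarized above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the original article's implementation: the function names `derive_control_vector` and `apply_control` are hypothetical, the random arrays stand in for hidden states that would really be collected from one layer of a model such as Mistral-7B-Instruct-v0.1, and the differences are symmetrized (each one stacked with its negation) so that the data has zero mean before PCA.

```python
import numpy as np

def derive_control_vector(pos_states, neg_states):
    """Return a unit-length direction for one layer (hypothetical helper).

    pos_states, neg_states: arrays of shape (n_pairs, hidden_dim) holding the
    hidden states for the two halves of each contrastive prompt pair.
    """
    diffs = pos_states - neg_states
    # Assumption of this sketch: stack each difference with its negation so
    # the data is zero-mean and PCA reduces to a plain SVD.
    sym = np.concatenate([diffs, -diffs], axis=0)
    _, _, vt = np.linalg.svd(sym, full_matrices=False)
    direction = vt[0]  # first principal component, unit norm
    # The sign of a principal component is arbitrary; orient it so that the
    # "positive" prompts project higher than the "negative" ones.
    if np.mean(diffs @ direction) < 0:
        direction = -direction
    return direction

def apply_control(hidden, direction, coeff):
    """Shift a hidden state along the control direction by coeff."""
    return hidden + coeff * direction

# Synthetic stand-in data: a planted direction plus noise (not real activations).
rng = np.random.default_rng(0)
hidden_dim, n_pairs = 64, 32
true_direction = rng.normal(size=hidden_dim)
true_direction /= np.linalg.norm(true_direction)

base = rng.normal(size=(n_pairs, hidden_dim))
pos_states = base + 2.0 * true_direction + 0.1 * rng.normal(size=(n_pairs, hidden_dim))
neg_states = base - 2.0 * true_direction + 0.1 * rng.normal(size=(n_pairs, hidden_dim))

control = derive_control_vector(pos_states, neg_states)
cosine = float(control @ true_direction)
print("cosine similarity with planted direction:", cosine)

# A positive coefficient pushes toward the behavior, a negative one away from it.
hidden = rng.normal(size=hidden_dim)
steered_up = apply_control(hidden, control, 1.5)
steered_down = apply_control(hidden, control, -1.5)
print("projection shift (up):", float((steered_up - hidden) @ control))
print("projection shift (down):", float((steered_down - hidden) @ control))
```

In a real setup there would be one such vector per layer, and the shift would be applied to that layer's activations at every generated token; the coefficient is the knob that sets behavior intensity.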