Representation Engineering
3 days ago
- #AI
- #Machine Learning
- #Control Vectors
- Representation Engineering introduces 'control vectors' to manipulate AI model behavior without prompt engineering or finetuning.
- Control vectors are added to a model's hidden-state activations at inference time to change its behavior; the technique is demonstrated on Mistral-7B-Instruct-v0.1.
- The process involves creating contrastive prompt pairs, collecting hidden states, and using PCA to derive control vectors.
- Examples include making models act happy, sad, lazy, hardworking, self-aware, or even simulate being high on psychedelic drugs.
- Control vectors are an alternative to prompt engineering: because a vector can be scaled before it is applied, the intensity of a behavior can be adjusted precisely.
- Potential applications include jailbreaking models or making them resistant to jailbreaks, with implications for AI safety and interpretability.
- Future work could use monosemantic features to obtain cleaner vectors and develop better practices for writing contrastive prompts.
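
The derive-then-apply process summarized above can be illustrated with a small, self-contained sketch. It is not the article's implementation and does not call any real model. The hidden states are synthetic toy vectors, all function names (`derive_control_vector`, `apply_control`, `make_toy_states`) are hypothetical, and two details are assumptions: the positive and negative hidden states are pooled before PCA, and the first principal component is found with plain power iteration instead of a library PCA routine.

```python
import math
import random


def dot(a, b):
    """Dot product of two equal-length lists."""
    return sum(x * y for x, y in zip(a, b))


def first_principal_component(rows, iters=100):
    """Unit-length top principal direction of `rows`, found by power iteration."""
    dim = len(rows[0])
    mean = [sum(col) / len(rows) for col in zip(*rows)]
    centered = [[x - m for x, m in zip(row, mean)] for row in rows]
    v = [1.0 / math.sqrt(dim)] * dim
    for _ in range(iters):
        # Multiply v by X^T X (unnormalized covariance) without forming the matrix.
        proj = [dot(row, v) for row in centered]
        w = [sum(p * row[j] for p, row in zip(proj, centered)) for j in range(dim)]
        norm = math.sqrt(dot(w, w))
        if norm == 0.0:
            break
        v = [x / norm for x in w]
    return v


def derive_control_vector(positive, negative):
    """PCA over pooled hidden states, oriented so positive examples score higher."""
    direction = first_principal_component(positive + negative)
    gap = sum(dot(p, direction) - dot(n, direction) for p, n in zip(positive, negative))
    if gap < 0:
        direction = [-x for x in direction]
    return direction


def apply_control(hidden, control, coeff):
    """Add the scaled control vector to one hidden state; coeff sets the intensity."""
    return [h + coeff * c for h, c in zip(hidden, control)]


def make_toy_states(n_pairs=100, dim=8, seed=0):
    """Synthetic stand-in for hidden states collected from contrastive prompt pairs."""
    rng = random.Random(seed)
    true_dir = [1.0] + [0.0] * (dim - 1)  # the "behavior" axis we hope to recover
    positive, negative = [], []
    for _ in range(n_pairs):
        base = [rng.gauss(0, 0.5) for _ in range(dim)]  # content shared by the pair
        positive.append([b + 3 * t + rng.gauss(0, 0.1) for b, t in zip(base, true_dir)])
        negative.append([b - 3 * t + rng.gauss(0, 0.1) for b, t in zip(base, true_dir)])
    return positive, negative, true_dir


positive, negative, true_direction = make_toy_states()
control = derive_control_vector(positive, negative)

hidden = [0.0] * 8
more = apply_control(hidden, control, 1.5)   # push toward the "positive" behavior
less = apply_control(hidden, control, -1.5)  # push toward the "negative" behavior
```

In a real setting the toy vectors would be replaced by hidden states read from a chosen layer of the model, and `apply_control` would run on that layer's activations during generation. A negative coefficient steers toward the opposite behavior, which is what makes a single vector usable for both ends of a contrast such as happy versus sad.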