Representation Engineering

4 days ago

Representation Engineering introduces 'control vectors' to manipulate AI model behavior without prompt engineering or finetuning.
Control vectors are added to a model's activations at inference time to change its behavior; the technique is demonstrated on Mistral-7B-Instruct-v0.1.
The process involves creating contrastive prompt pairs, collecting hidden states, and using PCA to derive control vectors.
Examples include making models act happy, sad, lazy, hardworking, self-aware, or even simulate being high on psychedelic drugs.
Control vectors offer an alternative to prompt engineering and allow the intensity of a behavior to be controlled precisely.
Potential applications include jailbreaking models or making them resistant to jailbreaks, with implications for AI safety and interpretability.
Future work could explore monosemantic features for cleaner vectors and better contrastive prompt writing practices.
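
The derivation summarized above (contrastive pairs, hidden states, PCA) and the idea of adjustable intensity can be illustrated with a minimal sketch. It is not the original author's code: the function names `derive_control_vector` and `apply_control_vector` are hypothetical, and random NumPy arrays stand in for hidden states that would normally be collected from a model. It assumes the control direction is the first principal component of the per-pair differences in hidden states.

```python
import numpy as np


def derive_control_vector(pos_hidden, neg_hidden):
    """Return a unit-length direction from paired hidden states.

    pos_hidden, neg_hidden: arrays of shape (n_pairs, hidden_dim), where row i
    of each array comes from the two prompts of contrastive pair i.
    """
    diffs = pos_hidden - neg_hidden            # one difference per pair
    centered = diffs - diffs.mean(axis=0)      # PCA operates on centered data
    # Rows of vt are principal directions, sorted by explained variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    direction = vt[0]
    # PCA leaves the sign arbitrary; orient it toward the "positive" prompts.
    if diffs.mean(axis=0) @ direction < 0:
        direction = -direction
    return direction


def apply_control_vector(hidden, vector, strength):
    """Shift a hidden state along the control direction by `strength`."""
    return hidden + strength * vector


# Synthetic stand-in for real activations (assumption: no model is loaded here).
rng = np.random.default_rng(0)
n_pairs, hidden_dim = 64, 16
true_dir = rng.normal(size=hidden_dim)
true_dir /= np.linalg.norm(true_dir)

base = rng.normal(size=(n_pairs, hidden_dim))       # content shared by each pair
amount = rng.uniform(0.5, 2.0, size=(n_pairs, 1))   # how strongly each pair differs
pos_hidden = base + amount * true_dir + 0.05 * rng.normal(size=(n_pairs, hidden_dim))
neg_hidden = base - amount * true_dir + 0.05 * rng.normal(size=(n_pairs, hidden_dim))

vector = derive_control_vector(pos_hidden, neg_hidden)
hidden = rng.normal(size=hidden_dim)
steered = apply_control_vector(hidden, vector, strength=2.0)
```

In this sketch a larger `strength` moves the hidden state further along the recovered direction, and a negative value pushes it the opposite way. With a real model, the same addition would presumably be applied to the hidden states of one or more layers during the forward pass.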