Exploring the internal representations of Pangram 3.3.2
5 hours ago
- #AI Detection
- #Interpretability
- #Pangram Research
- The rise of AI-generated text has created a need for reliable AI detection models, as certain forms of writing lose value when machine-produced.
- Pangram develops state-of-the-art AI detection models with low false positives, multilingual support, and the ability to differentiate AI-generated from AI-assisted text.
- The study explores the internal representations of Pangram 3.3.2 with interpretability methods: examining activations, visualizing them with dimensionality reduction (PCA, UMAP, t-SNE), and training linear probes.
- The data is a balanced set of 5,000 documents from human and AI sources, spanning various models (e.g., Claude, GPT, Gemini) and domains (e.g., news, reviews, Wikipedia).
- Findings show that Pangram's model achieves high binary accuracy in AI detection early in the network, with clear separation between human and AI documents visible in dimensionality reduction plots.
- Unexpectedly, the model forms clusters by AI model family (e.g., Anthropic, OpenAI) despite not being trained on such labels, with probe accuracy reaching 91% for model classification.
- Humanizers (tools that modify AI text to evade detection) are also analyzed: humanized text occupies distinct regions in activation space, though the model's final readout handles it inconsistently.
- Probes indicate the model can distinguish between human, AI, and humanized text with high accuracy internally, even if this nuance is collapsed in the final binary output.
- Interpretability efforts aim to improve understanding of model behavior and provide clearer explanations for detection results, with ongoing research and collaboration encouraged.
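
To make the linear-probe idea above concrete, here is a minimal sketch of a probe, assuming the activations are already available as one fixed-length vector per document. Everything in it is hypothetical: the synthetic Gaussian "activations" stand in for real hidden states, and the function names (`make_synthetic_activations`, `train_linear_probe`, `run_probe`) are illustrative, not part of Pangram's code or the study's actual setup. It is kept dependency-free; in practice a library such as scikit-learn's `LogisticRegression` would be the more typical choice.

```python
import math
import random


def make_synthetic_activations(n_per_class, dim, separation, seed=0):
    """Hypothetical stand-in for pooled hidden states: two Gaussian blobs.

    Label 0 plays the role of "human", label 1 the role of "AI".
    """
    rng = random.Random(seed)
    data = []
    for label in (0, 1):
        shift = separation if label == 1 else -separation
        for _ in range(n_per_class):
            vector = [rng.gauss(shift, 1.0) for _ in range(dim)]
            data.append((vector, label))
    rng.shuffle(data)
    return data


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_linear_probe(train, dim, lr=0.1, epochs=50):
    """Fit a logistic-regression probe with plain stochastic gradient descent."""
    w = [0.0] * dim
    b = 0.0
    for _ in range(epochs):
        for x, y in train:
            p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
            grad = p - y  # gradient of the log loss w.r.t. the logit
            for i in range(dim):
                w[i] -= lr * grad * x[i]
            b -= lr * grad
    return w, b


def probe_accuracy(w, b, test):
    """Fraction of held-out vectors the probe classifies correctly."""
    correct = 0
    for x, y in test:
        predicted = 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0
        correct += int(predicted == y)
    return correct / len(test)


def run_probe(separation, seed=0, n_per_class=200, dim=8):
    """Train on 80% of the synthetic data and report held-out accuracy."""
    data = make_synthetic_activations(n_per_class, dim, separation, seed)
    split = int(0.8 * len(data))
    w, b = train_linear_probe(data[:split], dim)
    return probe_accuracy(w, b, data[split:])


if __name__ == "__main__":
    print("separable activations:", run_probe(separation=1.0))
    print("no signal (chance baseline):", run_probe(separation=0.0))
```

The design point is that the probe is deliberately simple: if a single linear layer can read a label off the activations, that information is linearly accessible at that depth. Running the same probe on activations from each layer in turn, and comparing against a no-signal baseline, is one way to see how early a distinction such as human versus AI emerges.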