Exploring the internal representations of Pangram 3.3.2

5 hours ago

#AI Detection
#Interpretability
#Pangram Research

The rise of AI-generated text has created a need for reliable AI detection models, as certain forms of writing lose value when machine-produced.
Pangram develops state-of-the-art AI detection models with low false-positive rates, multilingual support, and the ability to differentiate AI-generated from AI-assisted text.
The study explores the internal representations of Pangram 3.3.2 with three interpretability methods: activation analysis, dimensionality reduction (PCA, UMAP, t-SNE), and linear probes.
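
As a rough illustration of two of these methods, the sketch below runs PCA and a logistic-regression probe on synthetic activations. It is not Pangram's code: the activations are made-up Gaussian clusters, and the helper names (`make_synthetic_activations`, `pca_project`, `fit_logistic_probe`, `probe_accuracy`) are hypothetical. A real analysis would replace the synthetic data with hidden states extracted from the detector.

```python
import numpy as np

def make_synthetic_activations(n_per_class=200, dim=16, seed=0):
    """Stand-in for real activations: two Gaussian clusters (0 = human, 1 = AI)."""
    rng = np.random.default_rng(seed)
    direction = rng.normal(size=dim)
    direction /= np.linalg.norm(direction)
    labels = np.repeat([0, 1], n_per_class)
    acts = rng.normal(size=(2 * n_per_class, dim))
    # Shift each class by -3 or +3 units along one hidden direction (assumption).
    acts += 3.0 * np.outer(2 * labels - 1, direction)
    return acts, labels

def pca_project(acts, k=2):
    """Project activations onto their top-k principal components."""
    centered = acts - acts.mean(axis=0)
    # Rows of vt are principal directions, ordered by explained variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:k].T

def fit_logistic_probe(acts, labels, lr=0.1, steps=300):
    """Fit a linear probe (logistic regression) with plain gradient descent."""
    n, dim = acts.shape
    w, b = np.zeros(dim), 0.0
    for _ in range(steps):
        z = np.clip(acts @ w + b, -30, 30)  # clip to keep exp() stable
        p = 1.0 / (1.0 + np.exp(-z))
        err = p - labels
        w -= lr * (acts.T @ err) / n
        b -= lr * err.mean()
    return w, b

def probe_accuracy(w, b, acts, labels):
    """Fraction of documents the probe classifies correctly."""
    preds = (acts @ w + b > 0).astype(int)
    return float((preds == labels).mean())

acts, labels = make_synthetic_activations()
order = np.random.default_rng(1).permutation(len(labels))
train, test = order[:300], order[300:]
w, b = fit_logistic_probe(acts[train], labels[train])
print("held-out probe accuracy:", probe_accuracy(w, b, acts[test], labels[test]))
print("PCA projection shape:", pca_project(acts).shape)
```

Running the same probe on activations from each layer in turn would give a layer-by-layer accuracy curve, which is one common way to see where in a network a distinction becomes linearly readable.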
The data is a balanced set of 5,000 documents from human and AI sources, spanning several models (e.g., Claude, GPT, Gemini) and domains (e.g., news, reviews, Wikipedia).
Findings show that Pangram's model already reaches high binary accuracy in AI detection early in the network, and dimensionality reduction plots show a clear separation between human and AI documents.
Unexpectedly, the model forms clusters by AI model family (e.g., Anthropic, OpenAI) despite not being trained on such labels, with probe accuracy reaching 91% for model classification.
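
To make the model-family probe concrete, here is a minimal multi-class sketch. It assumes a closed-form least-squares linear classifier, swapped in for whatever probe the authors actually trained, and it runs on synthetic clusters rather than real activations. The `FAMILIES` label set and all function names are hypothetical.

```python
import numpy as np

FAMILIES = ["Anthropic", "OpenAI", "Google"]  # hypothetical label set

def make_family_activations(n_per_family=150, dim=16, seed=0):
    """Synthetic stand-in: one Gaussian cluster per model family."""
    rng = np.random.default_rng(seed)
    centers = rng.normal(scale=2.0, size=(len(FAMILIES), dim))
    labels = np.repeat(np.arange(len(FAMILIES)), n_per_family)
    acts = centers[labels] + rng.normal(size=(labels.size, dim))
    return acts, labels

def fit_least_squares_probe(acts, labels, n_classes):
    """Regress one-hot family labels on activations (linear map plus bias)."""
    design = np.hstack([acts, np.ones((len(acts), 1))])  # append bias column
    targets = np.eye(n_classes)[labels]
    weights, *_ = np.linalg.lstsq(design, targets, rcond=None)
    return weights  # shape: (dim + 1, n_classes)

def predict_family(weights, acts):
    """Return the index of the highest-scoring family for each document."""
    design = np.hstack([acts, np.ones((len(acts), 1))])
    return (design @ weights).argmax(axis=1)

acts, labels = make_family_activations()
weights = fit_least_squares_probe(acts[::2], labels[::2], len(FAMILIES))
preds = predict_family(weights, acts[1::2])
print("held-out family accuracy:", float((preds == labels[1::2]).mean()))
```

The probe only reads out structure that is already present in the activations, so accuracy well above chance (one in three here) suggests the family distinction is linearly encoded even though no family labels were used in training.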
Humanizers (tools that modify AI text to evade detection) are also analyzed: their outputs occupy distinct regions in activation space, though the model's final readout handles them inconsistently.
Probes indicate the model can distinguish between human, AI, and humanized text with high accuracy internally, even if this nuance is collapsed in the final binary output.
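
One way to picture this collapse is with a toy geometry, sketched below. Everything here is an assumption for illustration, not the model's measured geometry: humanized text is placed off the human-to-AI axis, a nearest-centroid classifier stands in for a trained three-way probe, and the binary readout is a single direction between the human and AI centroids. All names are hypothetical.

```python
import numpy as np

CLASSES = ["human", "ai", "humanized"]

def sample_toy_activations(n_per_class=300, dim=8, seed=0):
    """Toy clusters: human and AI differ on axis 0, humanized sits on axis 1."""
    rng = np.random.default_rng(seed)
    centers = np.zeros((3, dim))
    centers[0, 0] = -3.0  # human
    centers[1, 0] = 3.0   # ai
    centers[2, 1] = 3.0   # humanized, offset along a different axis
    labels = np.repeat(np.arange(3), n_per_class)
    acts = centers[labels] + rng.normal(size=(labels.size, dim))
    return acts, labels

def class_centroids(acts, labels):
    """Mean activation per class, in CLASSES order."""
    return np.stack([acts[labels == c].mean(axis=0) for c in range(3)])

def three_way_predict(centroids, acts):
    """Assign each document to its nearest class centroid."""
    dists = np.linalg.norm(acts[:, None, :] - centroids[None, :, :], axis=2)
    return dists.argmin(axis=1)

def binary_readout(centroids, acts):
    """True where a single human-vs-AI direction flags a document as AI."""
    w = centroids[1] - centroids[0]
    midpoint = (centroids[0] + centroids[1]) / 2
    return (acts - midpoint) @ w > 0

train_acts, train_labels = sample_toy_activations(seed=0)
test_acts, test_labels = sample_toy_activations(seed=1)
centroids = class_centroids(train_acts, train_labels)

three_way_acc = (three_way_predict(centroids, test_acts) == test_labels).mean()
humanized_flagged = binary_readout(centroids, test_acts[test_labels == 2]).mean()
print("three-way accuracy:", float(three_way_acc))
print("share of humanized docs flagged as AI:", float(humanized_flagged))
```

In this toy setup the three classes are easy to tell apart, yet the binary direction splits the humanized documents roughly down the middle, because their distinguishing feature lies on an axis the readout ignores.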
Interpretability efforts aim to improve understanding of model behavior and provide clearer explanations for detection results, with ongoing research and collaboration encouraged.
