Hasty Briefs

Exploring the internal representations of Pangram 3.3.2

5 hours ago
  • #AI Detection
  • #Interpretability
  • #Pangram Research
  • The rise of AI-generated text has created a need for reliable AI detection models, as certain forms of writing lose value when machine-produced.
  • Pangram develops state-of-the-art AI detection models with low false positives, multilingual support, and the ability to differentiate AI-generated from AI-assisted text.
  • The study explores the internal representations of Pangram 3.3.2 using interpretability methods, analyzing its activations with dimensionality reduction (PCA, UMAP, t-SNE) and linear probes.
  • The data comprise a balanced dataset of 5,000 documents from human and AI sources, spanning several models (e.g., Claude, GPT, Gemini) and domains (e.g., news, reviews, Wikipedia).
  • Findings show that high binary accuracy in AI detection is already reached early in the network, and dimensionality-reduction plots show clear separation between human and AI documents.
  • Unexpectedly, the model forms clusters by AI model family (e.g., Anthropic, OpenAI) even though it was never trained on such labels; probe accuracy reaches 91% for model classification.
  • Humanizers (tools that modify AI text to evade detection) are analyzed: humanized text occupies distinct regions in activation space, though the model's final readout handles it inconsistently.
  • Probes indicate that, internally, the model distinguishes human, AI, and humanized text with high accuracy, even though this distinction is collapsed in the final binary output.
  • Interpretability efforts aim to improve understanding of model behavior and provide clearer explanations for detection results, with ongoing research and collaboration encouraged.
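
The probing and dimensionality-reduction workflow summarized above can be illustrated with a small sketch. This is not Pangram's code or data: the activations are synthetic, every name here (`make_synthetic_activations`, `fit_linear_probe`, `probe_accuracy`, `pca_project`) is hypothetical, and the probe is a plain logistic regression trained by gradient descent in NumPy, which may differ from the setup used in the study. The `separation` parameter is an assumed stand-in for how strongly a given layer encodes the human-versus-AI distinction.

```python
import numpy as np


def make_synthetic_activations(n_docs=400, dim=32, separation=2.0, seed=0):
    """Hypothetical stand-in for one layer's pooled activations.

    Labels: 0 = human, 1 = AI. The two classes are shifted apart along a
    single random unit direction by `separation`.
    """
    rng = np.random.default_rng(seed)
    labels = rng.integers(0, 2, size=n_docs)
    direction = rng.normal(size=dim)
    direction /= np.linalg.norm(direction)
    acts = rng.normal(size=(n_docs, dim))
    acts += separation * np.outer(labels - 0.5, direction)
    return acts, labels


def fit_linear_probe(x, y, lr=0.5, steps=500):
    """Fit a logistic-regression probe with plain gradient descent."""
    w = np.zeros(x.shape[1])
    b = 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(x @ w + b)))  # predicted P(label = 1)
        grad = p - y                            # gradient of the log loss w.r.t. logits
        w -= lr * (x.T @ grad) / len(y)
        b -= lr * grad.mean()
    return w, b


def probe_accuracy(x, y, w, b):
    """Fraction of documents the probe classifies correctly."""
    preds = ((x @ w + b) > 0).astype(int)
    return float((preds == y).mean())


def pca_project(x, k=2):
    """Project activations onto their top-k principal components via SVD."""
    centered = x - x.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:k].T


def run_demo():
    # Pretend each "layer" encodes the label more strongly than the last.
    for layer, separation in enumerate([0.0, 1.0, 2.0, 4.0]):
        acts, labels = make_synthetic_activations(separation=separation, seed=layer)
        train_x, test_x = acts[:300], acts[300:]
        train_y, test_y = labels[:300], labels[300:]
        w, b = fit_linear_probe(train_x, train_y)
        acc = probe_accuracy(test_x, test_y, w, b)
        coords = pca_project(acts, k=2)  # 2-D coordinates one could scatter-plot
        print(f"layer {layer}: held-out probe accuracy = {acc:.2f}, "
              f"PCA shape = {coords.shape}")


if __name__ == "__main__":
    run_demo()
```

With real activations, the same pattern would apply per layer: pool each document's hidden states into one vector, fit a probe on a training split, and report held-out accuracy. A multi-class probe (for model family, or for human versus AI versus humanized text) would replace the sigmoid with a softmax, which this sketch does not implement.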