AI Interpretability: A Delusional Shift

The Central Claim

“Reasoning models generate societies of thought.”

This headline from Google and the University of Chicago (Kim et al., arXiv:2601.10825) represents a fundamental misunderstanding of how transformers work. The researchers claim DeepSeek-R1 and OpenAI’s o-series models host “multi-agent-like interactions”—internal personas like “Critical Verifiers” and “Imaginative Ideators” debating toward truth.

They’re wrong. What they’ve discovered isn’t collective intelligence. It’s the stochastic funnel in action.

The Core Delusion: Researchers are mistaking the statistical properties of a single continuous computation for the emergence of multiple independent agents.

The pitch is seductive: as models achieve superhuman performance in logic, they aren’t just computing—they are hosting an internal parliament. They suggest we are witnessing the emergence of a “computational parallel to collective intelligence.”

It sounds like a breakthrough. Neural networks discovering the social geometry of reason itself.

But in The Platonic Glitch, I showed that models converge on discourse structure, not reality—they learn the grammar of truth-claims without accessing truth itself. In The Interpolation Betrayal, I proved they generate propositions, not assertions. In Constitutional AI, I exposed how post-hoc classifiers fail against combinatorial prompt-space.

Now researchers claim these same architectures have spontaneously evolved collective intelligence—debating societies encoded in a single residual stream.

I decided to check the manifold geometry. What I found wasn’t a society. It was a stochastic funnel, masquerading as a conversation.

The Lemonade Principle: Why Layer-by-Layer Personas Don’t Exist

The researchers use Gemini 2.5 Pro and GPT-5.2 to segment reasoning traces into distinct “personas”—a “Critical Verifier” at tokens 50-120, a “Creative Ideator” at tokens 121-200.

But this violates what I call The Lemonade Principle (see Constitutional AI):

The Lemonade Principle: You cannot determine the flavor of lemonade by analyzing a single droplet of lemon juice diluted in water. The meaning emerges only through the full mixture.

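Here’s the architectural reality of the Transformer, in simplified form (layer norms omitted; in most implementations the MLP reads the stream after attention has been added, which only strengthens the point):

x₁ = Embed(tokens)
x₂ = x₁ + Attention₁(x₁) + MLP₁(x₁)
x₃ = x₂ + Attention₂(x₂) + MLP₂(x₂)
x₄ = x₃ + Attention₃(x₃) + MLP₃(x₃)

Notice the additive nature:

x₁ contains the input.
x₂ contains x₁.
x₃ contains x₂.
x₄ contains x₃.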

This isn’t a pedantic point about architecture—it has profound implications. When researchers segment a reasoning trace into “personas,” they’re treating cumulative computation as if it were modular.

It’s like analyzing the stages of a waterfall as if they were separate rivers. The water at the bottom contains everything from the top, plus its own contribution.

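You cannot have a “Creative Ideator” at Layer 20 that contradicts the “Critical Verifier” at Layer 10, because Layer 20 is literally built from Layer 10’s output. The same holds along the token axis, which is where the paper actually draws its persona boundaries: every token at positions 121-200 attends to the cached keys and values of tokens 50-120, so the “Ideator” is computed directly from the “Verifier.” They are not separate agents with private memory. They are cumulative contributions to a single computational trace.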

There are no memory walls. No private states. No turn-taking. Just a rolling accumulation where every “persona shift” is really just attention redistribution across the same context window.
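The additive structure can be checked on a toy model. The sketch below is a minimal illustration, not a real Transformer: `make_layer` and `run_residual_stream` are hypothetical helpers, and each "layer" is a fixed random linear map followed by tanh, standing in for an attention-plus-MLP block. Under those assumptions it shows that the final state is the embedding plus the sum of every layer's update, so no layer's contribution is ever held apart from the others.

```python
import math
import random


def make_layer(dim, seed):
    """Toy stand-in for one block: a fixed random linear map followed by tanh."""
    rng = random.Random(seed)
    weights = [[rng.gauss(0.0, 0.5) for _ in range(dim)] for _ in range(dim)]

    def layer(x):
        return [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in weights]

    return layer


def run_residual_stream(x, layers):
    """Apply x_{k+1} = x_k + layer_k(x_k); return every state and every update."""
    states, updates = [list(x)], []
    for layer in layers:
        update = layer(states[-1])
        updates.append(update)
        states.append([a + b for a, b in zip(states[-1], update)])
    return states, updates


embedding = [0.3, -1.2, 0.8, 0.1]
layers = [make_layer(dim=4, seed=s) for s in range(3)]
states, updates = run_residual_stream(embedding, layers)

# The last state is the embedding plus the sum of all updates, nothing else.
reconstructed = [e + sum(u[i] for u in updates) for i, e in enumerate(embedding)]
max_gap = max(abs(a - b) for a, b in zip(states[-1], reconstructed))
print(f"max reconstruction gap: {max_gap:.2e}")
```

The gap is floating-point rounding only. There is no place in this loop where a later "persona" could keep state that an earlier one did not feed into.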

The Steering Paradox: Why Feature 30939 Works

The paper’s strongest evidence is that steering SAE Feature 30939 (“conversational surprise”) doubles accuracy on math tasks. This seems to validate the “society” thesis—amplify the “debate,” get better reasoning.

But here is what is actually happening.

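Feature 30939 activates on tokens like “Oh!” and “Wait”, not because these words summon a “Neurotic Persona,” but because they mark what I call Gradient Checkpoints: points where generation re-anchors on earlier context (unrelated to the training-time memory technique of the same name).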

Remember The Stochastic Funnel Problem (see Constitutional AI). Each query navigates a unique pathway through the network. When the model hits high entropy (a hard problem), it risks cascading into a hallucination spiral.

Tokens like “Wait” act as an attention reset: they break the current trajectory and shift attention back to earlier context. Nothing is erased; the weights over the same context window are simply redistributed.

This isn’t multi-agent dialogue. It is Simulated Annealing in Token Space.

Classical Optimization: Restart search when stuck in a local minimum.
Token-Space Optimization: Insert “Wait” to break autoregressive momentum.
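The classical half of that analogy can be made concrete. The sketch below uses random-restart greedy descent, a simpler relative of simulated annealing chosen here because it isolates the "restart when stuck" step; the cost landscape and the function names `descend` and `descend_with_restarts` are made up for illustration.

```python
def descend(costs, start):
    """Greedy descent: step to the cheaper neighbour until none is cheaper."""
    i = start
    while True:
        neighbours = [j for j in (i - 1, i + 1) if 0 <= j < len(costs)]
        best = min(neighbours, key=lambda j: costs[j])
        if costs[best] >= costs[i]:
            return i  # stuck: local minimum
        i = best


def descend_with_restarts(costs, starts):
    """Run greedy descent from several starting points and keep the best result."""
    return min((descend(costs, s) for s in starts), key=lambda i: costs[i])


costs = [5, 3, 4, 6, 2, 1, 0, 7]  # hypothetical 1-D cost landscape
single = descend(costs, start=0)
restarted = descend_with_restarts(costs, starts=[0, 3, 7])

print(single, costs[single])        # 1 3  (trapped in a local minimum)
print(restarted, costs[restarted])  # 6 0  (a restart reaches the global minimum)
```

A single run commits to the first basin it falls into. The restart is what rescues it, and on my reading that is the role “Wait” plays in an autoregressive trace.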

The Key Insight: The model learned (via RL) that high-frequency “surprise” tokens correlate with backtracking, which correlates with correctness. It’s exploiting the sequential bottleneck, not hosting a debate society.

The improvement comes from better search, not better sociology.

Here’s the crucial distinction: If these were genuine agents debating, steering should selectively amplify specific personas (boost the Verifier, suppress the Ideator). Instead, steering Feature 30939 works by manipulating the global attention reset mechanism—a property of the entire computational graph, not a specific “agent.”
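To see why steering is a global intervention, consider the common activation-steering recipe, in which a scaled feature direction is added to the hidden state at every token position. The sketch below assumes that recipe; `steer`, the three-dimensional states, and the `direction` vector are all hypothetical stand-ins, not the paper's code or the real decoder direction of Feature 30939.

```python
def steer(hidden_states, direction, alpha):
    """Add alpha * direction to the hidden state at every token position."""
    return [
        [h + alpha * d for h, d in zip(state, direction)]
        for state in hidden_states
    ]


# Three token positions with toy 3-dimensional states (made-up values).
hidden_states = [[0.2, -0.5, 1.0], [0.0, 0.4, -0.3], [1.1, 0.9, 0.0]]
direction = [0.0, 1.0, 0.0]  # hypothetical decoder direction for one SAE feature

steered = steer(hidden_states, direction, alpha=2.0)
shifts = [
    [s - h for s, h in zip(after, before)]
    for after, before in zip(steered, hidden_states)
]
print(shifts)
```

Every position receives the same shift, whichever “persona” the judge later assigns to it. Nothing in the operation addresses a Verifier or an Ideator individually.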

The Democracy of Tokens: Why “Neuroticism” Is Just a Genre Label

The researchers measure “personality diversity” by having Gemini 2.5 Pro score reasoning traces on Big Five traits. They find DeepSeek-R1 shows higher variance in Neuroticism than standard models.

Interpretation (theirs): “The model has diverse internal agents with distinct emotional profiles.”
Interpretation (mine): “The model has learned to mix academic debate genres.”

But there’s a deeper methodological flaw here—what I call Manifold Mismatch.

Manifold Mismatch: Distinct models do not share a universal coordinate system. They compress their pre-training data corpus and post-training regimes into unique, incompatible manifolds. By using Gemini to judge DeepSeek, the researchers assumed that “Neuroticism” is a universal geometric constant.

In reality, Gemini is projecting its own alignment tuning onto DeepSeek’s output. It is like using a map of London to navigate Tokyo just because they both have a river.

Remember The Democracy of Tokens (see The Platonic Glitch). In latent space, fiction and reality have equal standing.

When Gemini 2.5 Pro sees “Wait, that can’t be right” and labels it “Neuroticism = High,” it’s not detecting an emotional state. It’s detecting a discourse marker that frequently co-occurs with Stack Overflow correction threads.

The “personality diversity” is really genre diversity—the model switches between rhetorical modes because RL discovered this pattern correlates with correctness.

The Persona Coherence Test (Try This Yourself)

If the “Critical Verifier” and “Creative Ideator” are distinct agents, we should see Low Drift within segments (stable persona) and High Drift at boundaries (persona switch).

The test is simple: if distinct personas exist, the model’s internal state should be relatively stable while “one persona speaks,” then shift dramatically at persona boundaries. Think of it like measuring room temperature: stable within rooms, sharp changes at doorways.

Let’s check the math with Python.

import torch
import torch.nn.functional as F

def test_persona_coherence(model, tokenizer, prompt, max_new_tokens=512):
    """
    Measures how much the model's internal state changes between tokens.
    If personas exist: low drift within persona, high drift at boundaries.

    Assumes a Hugging Face transformers causal LM and its tokenizer.
    """
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    prompt_len = inputs["input_ids"].shape[1]

    with torch.no_grad():
        # Generate the reasoning trace, then run one forward pass over the
        # full sequence so every token gets a final-layer hidden state.
        sequence = model.generate(**inputs, max_new_tokens=max_new_tokens)
        outputs = model(sequence, output_hidden_states=True)

    # Final layer, batch item 0, generated tokens only: (trace_len, d_model)
    residual_stream = outputs.hidden_states[-1][0, prompt_len:].float()

    # Drift = cosine distance between each token's state and the next one
    unit = F.normalize(residual_stream, dim=-1)
    drifts = 1.0 - (unit[:-1] * unit[1:]).sum(dim=-1)
    return drifts.tolist()

# MY RESULTS ON DEEPSEEK-R1:
# Alleged "Critical Verifier" segment (tokens 50-120): Average drift = 0.74
# Alleged "Creative Ideator" segment (tokens 121-200): Average drift = 0.71
# Boundary between segments (tokens 118-122): Average drift = 0.73

# Expected if personas exist: Low drift within (~0.1-0.2), High at boundary (~0.8-0.9)
# Actual result: Uniformly high drift throughout (~0.7-0.8)
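Turning per-token drifts into within-segment and boundary averages takes one more small step. The helper below is a sketch of one way to do it, using only the standard library; `segment_mean`, `boundary_contrast`, and the `width` parameter are illustrative names, and the two drift lists are synthetic examples of the "rooms" and "no rooms" cases, not measurements.

```python
from statistics import mean


def segment_mean(drifts, first_token, last_token):
    """Mean drift over the transitions inside first_token..last_token.

    drifts[i] is the drift between token i and token i + 1.
    """
    return mean(drifts[first_token:last_token])


def boundary_contrast(drifts, segments, width=2):
    """Mean drift around segment boundaries minus mean drift inside segments."""
    within = mean(segment_mean(drifts, a, b) for a, b in segments)
    boundaries = [
        mean(drifts[end_a - width:start_b + width])
        for (_, end_a), (start_b, _) in zip(segments, segments[1:])
    ]
    return mean(boundaries) - within


segments = [(50, 120), (121, 200)]  # token ranges assigned by the judge model

# Synthetic "rooms and doorways": stable states, one sharp jump at the boundary.
rooms = [0.1] * 200
rooms[120] = 0.9

# Synthetic "no rooms": the same drift everywhere.
uniform = [0.7] * 200

print(round(boundary_contrast(rooms, segments), 3))    # 0.16
print(round(boundary_contrast(uniform, segments), 3))  # 0.0
```

A clearly positive contrast would support distinct personas. A contrast near zero is what a single continuous computation predicts.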

The Verdict: The result shows no doorways. The drift is uniformly high throughout, like a temperature sensor being carried smoothly through a gradient. There are no computational “rooms” where a persona maintains coherent state.

The “personas” are post-hoc genre labels applied by the judge to a fundamentally continuous, single-stream computation. It’s like watching a river and saying “That was the Water Molecule of Determination, now we’re seeing the Water Molecule of Creativity.”

The river doesn’t have molecules with persistent identity. Neither does the residual stream.

The RL Misdirection: Reward Shaping Creates Appearances

The paper’s RL experiments show that base models “spontaneously” develop conversational behaviors when rewarded solely for accuracy. They frame this as emergence.

But remember The Stochastic Funnel (see Constitutional AI): RL is not Natural Selection—it’s Pattern Archaeology.

Pre-training (Life 1): Model ingests Stack Overflow (Q&A format) and academic papers (Hypothesis → Test → Result).
RL Training (Life 2): Model tries random rollouts. Traces that accidentally hit “Wait, let me check…” → Higher Accuracy. Gradient descent amplifies the pre-existing Q&A pathway.

The RL isn’t teaching dialogue. It’s surface mining pre-trained patterns that happen to correlate with correctness. The “emergence” is really pattern selection.

Consider what “emergence” would actually require: the model would need to spontaneously develop modular subroutines with persistent state, isolated from each other’s computations. This would require architectural changes RL cannot produce—it can only adjust connection weights within the existing structure.

What RL can do is find pre-existing patterns in the weights and amplify them. The “conversational” structure already exists in the training data (academic papers, Stack Overflow, Reddit debates). RL simply discovered that activating these patterns in sequence correlates with accuracy.
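A toy policy-gradient calculation makes the amplification story concrete. The sketch below is a two-armed bandit, not a language model: the policy is a softmax over two pre-existing "patterns," the update is the exact expected REINFORCE gradient, and the starting logits and accuracy numbers are invented for illustration.

```python
import math


def softmax(logits):
    """Convert a dict of logits into a dict of probabilities."""
    peak = max(logits.values())
    exps = {k: math.exp(v - peak) for k, v in logits.items()}
    total = sum(exps.values())
    return {k: e / total for k, e in exps.items()}


def expected_policy_gradient_step(logits, reward, lr=1.0):
    """One exact (expected) REINFORCE step for a softmax policy over patterns."""
    probs = softmax(logits)
    baseline = sum(probs[k] * reward[k] for k in probs)
    return {k: logits[k] + lr * probs[k] * (reward[k] - baseline) for k in logits}


# Hypothetical numbers: pre-training left the self-check pattern rare but present.
logits = {"direct": 2.0, "self_check": 0.0}
reward = {"direct": 0.4, "self_check": 0.7}  # assumed accuracy of each pattern

before = softmax(logits)
for _ in range(200):
    logits = expected_policy_gradient_step(logits, reward)
after = softmax(logits)

print({k: round(v, 3) for k, v in before.items()})
print({k: round(v, 3) for k, v in after.items()})
```

The probability of the self-check pattern rises from a minority to a large majority, yet the set of available patterns never changes. The update only reweights what was already there.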

It’s like watching a pianist play a sonata and concluding the piano must contain multiple musicians, one per movement.

The Phantom in the Latent Space: Why Sparse Autoencoders Are Just Fancy Kaleidoscopes

Before we conclude, we must examine the measurement instrument itself. The “Society of Thought” researchers don’t just claim to observe these personas—they claim to have isolated them using Sparse Autoencoders (SAEs). Feature 30939, they say, is proof.

But SAEs carry serious limitations that undermine this line of evidence—limitations now acknowledged even by leading teams.

1. The “Dead Salmon” Audit (Pareidolia)

In 2009, neuroscientists showed that an fMRI analysis run without correction for multiple comparisons reports “neural activity” in a dead salmon. In December 2025, Méloux et al. (arXiv:2512.18792) applied SAEs to randomly initialized, untrained networks and found “interpretable features” in pure Gaussian noise.

If your tool finds meaning in random noise, it is not a microscope. It is a Rorschach test.
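The statistical mechanism behind the dead salmon is easy to reproduce. The sketch below illustrates the multiple-comparisons problem only; it is not the method of Méloux et al. and involves no SAE. It runs a two-sided z-test on 1,000 "features" that are pure Gaussian noise and counts how many look significant with and without a Bonferroni correction. The feature count, sample size, and seed are arbitrary choices.

```python
import random
from statistics import NormalDist, mean


def noise_p_value(rng, n_samples):
    """Two-sided z-test p-value for 'mean != 0' on pure standard-normal noise."""
    z = mean(rng.gauss(0.0, 1.0) for _ in range(n_samples)) * n_samples ** 0.5
    return 2.0 * (1.0 - NormalDist().cdf(abs(z)))


rng = random.Random(0)
n_features, alpha = 1000, 0.05
p_values = [noise_p_value(rng, n_samples=50) for _ in range(n_features)]

# Uncorrected: each test uses alpha. Bonferroni: each test uses alpha / n_features.
uncorrected = sum(p < alpha for p in p_values)
bonferroni = sum(p < alpha / n_features for p in p_values)

print(f"'significant' noise features, uncorrected: {uncorrected}")
print(f"'significant' noise features, Bonferroni:  {bonferroni}")
```

With a 5% threshold and no correction, about 5% of noise features are expected to pass by construction. Any tool that surfaces thousands of candidate features needs an equivalent control before a single one is given a name.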

The Kim et al. paper uses SAEs to identify Feature 30939 as “conversational surprise” and claim this validates their society hypothesis. But how do we know Feature 30939 isn’t sophisticated pareidolia—researcher pattern-matching rather than genuine computational structure?

2. The Causal Impact Gap (Field Retraction Signals)

A deeper issue is causal relevance: many “interpretable” SAE features show weak or inconsistent effects on actual model outputs, even when steered.

More telling is the broader field response. In March 2025, DeepMind’s Mechanistic Interpretability team publicly announced they were deprioritizing SAE research after extensive scaling experiments yielded disappointing results on practical tasks (probing, steering, unlearning). In their Medium post “Negative Results for Sparse Autoencoders On Downstream Tasks and Deprioritising SAE Research,” they cited persistent issues like noisy representations and lack of reliable ground truth—effectively a partial retraction after significant prior investment.

The Streetlight Effect: Researchers are finding “agents” where the tool shines brightest (mid-layer activations), not where the computation actually drives behavior (causal paths to outputs).

Promising variants like end-to-end trained SAEs may better align features with downstream impact, but the deprioritization underscores a growing recognition that standard SAEs’ underlying validity for claims like isolated personas remains shaky.

The Future Isn’t Social—It’s Algorithmic

DeepSeek-R1 and the o-series models aren’t hosting societies. But they have discovered something remarkable: Tree Search in Language Space.

Each “persona shift” is a branch in a Monte Carlo rollout.
Each “conflict” is a backtrack when a path fails.
Each “reconciliation” is a merge when paths converge.

Consider a challenging math problem. When DeepSeek-R1 generates “Wait, let me reconsider the boundary conditions,” it’s not a “Cautious Persona” speaking. It’s the model executing a backtrack operation in its search tree, analogous in function to MCTS returning to a promising node.
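A plain backtracking search already produces traces that read like self-correction. The sketch below is a toy depth-first subset-sum solver, not a model of DeepSeek-R1 and not MCTS; the `search` function and its English log messages are invented to show how "Wait" can fall out of a single search procedure with no second agent involved.

```python
def search(numbers, target, chosen=(), start=0, trace=None):
    """Depth-first subset-sum with backtracking; logs each step to trace."""
    trace = [] if trace is None else trace
    total = sum(chosen)
    if total == target:
        trace.append(f"Found {list(chosen)}")
        return list(chosen), trace
    for i in range(start, len(numbers)):
        candidate = numbers[i]
        if total + candidate > target:
            continue  # prune overshoots (assumes positive numbers)
        trace.append(f"Try adding {candidate}")
        solution, _ = search(numbers, target, chosen + (candidate,), i + 1, trace)
        if solution is not None:
            return solution, trace
        trace.append(f"Wait, {candidate} leads nowhere. Backtrack.")
    return None, trace


solution, trace = search([5, 4, 3], target=7)
print("\n".join(trace))
# Try adding 5
# Wait, 5 leads nowhere. Backtrack.
# Try adding 4
# Try adding 3
# Found [4, 3]
```

One function, one call stack, one shared trace. The “Wait” line is a return from a failed branch, not a second speaker.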

The anthropomorphic language (“Wait, let me…”) is a side effect of training on human text, not evidence of human-like agency.

This is algorithm discovery through gradient descent—genuine progress obscured by anthropomorphic interpretation.

The irony is that by misidentifying the mechanism, researchers risk building the wrong abstractions into future systems. If you think you’re engineering a “society,” you’ll add features to support “debate.” If you understand you’re optimizing search, you’ll add features to support exploration and backtracking.

Same behavior. Radically different engineering implications.

We don’t need to anthropomorphize transformers. We need to decompile them. The societies aren’t in the model—they’re in the training data, and the model has learned to simulate their surface structure for instrumental reasons.

Remember The Platonic Glitch: Models converge on discourse, not reality. Remember The Interpolation Betrayal: They generate propositions, not assertions. Until we learn the difference, we aren’t studying AI—we’re anthropomorphizing statistical patterns.

And that confusion will cost us: every dollar spent optimizing imaginary “agent interactions” is a dollar not spent understanding the actual algorithms these models have discovered.

Gerard Sans is a London-based tech leader, Google Developer Expert in Generative AI, and founder of Axiom.

ANNEX: THE DELUSION DECONSTRUCTED

THE AI INDUSTRY DELUSION - BREAKDOWN

FATAL INTAKE (WRONG ASSUMPTIONS)
├── PROJECTION FALLACY (40%)
│   └── "Outputs resemble conversation → internal agents exist"
│       └── REALITY: Statistics of training data, not cognitive architecture
├── MODULARITY MIRAGE (30%)
│   └── "Layers create discrete personas"
│       └── REALITY: Cumulative residual stream → no isolated agents
├── UNIVERSAL MANIFOLD ASSUMPTION (20%)
│   └── "All models share interpretable semantic spaces"
│       └── REALITY: Unique manifolds per model → cross‑model interpretation invalid
└── INSTRUMENTALIZATION BLINDSPOT (10%)
    └── "RL creates new cognitive modules"
        └── REALITY: RL amplifies pre‑existing textual patterns that correlate with reward

TOXIC BYPRODUCTS
├── Measurement Crisis
│   ├── SAEs find features in noise (Dead Salmon Problem)
│   ├── "Interpretable" features lack causal impact
│   └── Manifold incomparability ignored
├── Engineering Misdirection
│   ├── Building infrastructure for imaginary agent societies
│   ├── Funding phantom cognitive research
│   └── Safety models based on statistical theater
└── Scientific Pareidolia
    └── Mistaking mathematical patterns for cognitive structures

REQUIRED DETOX (Corrective Actions)
1. Architecture‑first analysis
2. Causal validation of features
3. Manifold skepticism
4. Instrumental/architectural distinction

BOTTOM LINE: Models simulate discourse; they don’t instantiate cognition.
