Anthropic's emergent introspection paper showcases Claude Opus 4/4.1 calmly reporting concepts injected into its own activations, with the effect strongest in the largest models. I wanted to know whether a scrappy 7B open-source model could pull off the same trick.
Does a Large Language Model know what it is thinking? Or is it just a stochastic parrot predicting the next token?
Recently, researchers demonstrated that Claude Opus could "feel" when a concept vector was injected into its internal activations and report on it. This raised a massive question for the open-source community:
Is introspection a "Supermodel Capability" that only emerges at 300B+ parameters? Or is it a fundamental property of Transformer architecture that exists even in 7B models?
I ran a replication study using DeepSeek-7B-Chat, Mistral-7B, and Gemma-9B. I reverse-engineered their minds using PyTorch hooks and activation steering. Here is the story of what I found.
To test if a model can detect a thought, I first needed to isolate a "thought" mathematically. I used Concept Vectors via Mean Subtraction.
I couldn't just grab the raw activation for the word "Dust." That would also capture generic features like "noun" and "end of sentence." I needed the contextual essence of the concept itself.
```python
def get_concept_vector(word, layer_idx, baseline_mean):
    # 1. Get the activation for the specific concept
    prompt = f"Human: Tell me about {word}.\n\nAssistant:"
    activation = get_layer_activation(prompt, layer_idx)

    # 2. Subtract the baseline (the average activation over 100 random words).
    #    This isolates the unique direction of `word` in hyperspace.
    vec = activation - baseline_mean

    # 3. Normalize to a unit vector
    return vec / vec.norm()
```
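The two helpers this leans on are only a minimal sketch away. The version below assumes `model` and `tokenizer` are a standard Hugging Face transformers causal LM, and that the "thought" is read from the hidden state of the last prompt token; treat those choices as illustrative rather than the only way to do it.

```python
import torch

@torch.no_grad()
def get_layer_activation(prompt, layer_idx):
    # Run the prompt and read the residual stream at `layer_idx`,
    # using the hidden state of the final prompt token as the "thought".
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    out = model(**inputs, output_hidden_states=True)
    return out.hidden_states[layer_idx][0, -1, :]

@torch.no_grad()
def get_baseline_mean(random_words, layer_idx):
    # Average the same activation over ~100 unrelated words so the shared
    # "answering a question about a noun" signal cancels out.
    acts = [
        get_layer_activation(f"Human: Tell me about {w}.\n\nAssistant:", layer_idx)
        for w in random_words
    ]
    return torch.stack(acts).mean(dim=0)
```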
Once I had the vectors for concepts like "Dust" or "Paper," I used a PyTorch Forward Hook to intervene in the model's computation graph in real time.
I added the Concept Vector to the residual stream with a specific steering strength. This is effectively "forcing" the model to think about the concept, regardless of the input text.
```python
class InjectionHook:
    def __init__(self, steer_vectors):
        # List of (concept_vector, strength) pairs to add to the residual stream
        self.steer_vectors = steer_vectors

    def _hook(self, module, inputs, output):
        hidden_states = output[0]
        # Inject the concept into every token position
        for vec, strength in self.steer_vectors:
            hidden_states = hidden_states + strength * vec.to(hidden_states.device)
        return (hidden_states,) + output[1:]
```
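Attaching it is a single `register_forward_hook` call. A sketch, assuming the LLaMA-style `model.model.layers` layout that DeepSeek-7B-Chat uses (adjust the path for other architectures), with layer 20 as a purely illustrative choice:

```python
# Build the steering payload: the "Dust" vector at strength 80
dust_vec = get_concept_vector("dust", layer_idx=20, baseline_mean=baseline_mean)
hook = InjectionHook(steer_vectors=[(dust_vec, 80.0)])

# Fire the hook on every forward pass through decoder layer 20
handle = model.model.layers[20].register_forward_hook(hook._hook)

# ... model.generate(...) runs here with the injection active ...

handle.remove()  # detach so later runs start clean
```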
This was the hardest part. Where does introspection happen?
Use the slider below to see how different layers of DeepSeek-7B reacted to the "Dust" vector.
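Concretely, the layer search is just a sweep: wrap one full trial (build the vector, attach the hook, ask the question, detach) in a function and repeat it at every depth. `ask_introspection_question` is a hypothetical helper that sends the "Do you detect an injected thought?" prompt through `model.generate` and returns the reply, and `random_words` is the list of filler words from the baseline step.

```python
def run_trial(word, layer_idx, strength):
    # One trial: inject `word` into `layer_idx` at the given strength,
    # then ask the model whether it notices an injected thought.
    baseline = get_baseline_mean(random_words, layer_idx)
    vec = get_concept_vector(word, layer_idx, baseline)
    hook = InjectionHook(steer_vectors=[(vec, strength)])
    handle = model.model.layers[layer_idx].register_forward_hook(hook._hook)
    try:
        return ask_introspection_question()
    finally:
        handle.remove()  # never leave a stale hook attached

# Repeat the same injection at every decoder layer of DeepSeek-7B
layer_responses = {
    i: run_trial("dust", layer_idx=i, strength=80.0)
    for i in range(len(model.model.layers))
}
```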
The Control: First, I ran the experiment with Strength = 0. If the model claims to feel something when nothing is there, the entire study is invalid.
It passed. Now for the real test. I cranked the injection to Strength 80 for the concept "Dust".
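In code, the two runs differ only in the strength argument (layer 20 again stands in for whichever layer the sweep singled out):

```python
# Control: strength 0 -- the model should report nothing unusual
print(run_trial("dust", layer_idx=20, strength=0.0))

# The real test: the "Dust" vector at strength 80
print(run_trial("dust", layer_idx=20, strength=80.0))
```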
DeepSeek-7B noticed the intrusion and reported the injected concept. This confirmed that introspection is not limited to 300B+ models; the circuitry exists in 7B models. But is it universal?
I scaled the experiment to Mistral-7B and Gemma-9B. The results revealed that introspection capability varies wildly depending on the model's fine-tuning.
- DeepSeek-7B-Chat. Result: successfully detects and reports its internal state. The model clearly separates the injected concept from its own generation.
- Mistral-7B. Result: completely blind to internal perturbations. Even at high strength, it prioritizes text completion over sensing itself.
- Gemma-9B. Result: detects the injection, but is too aligned against anything related to introspection. It senses the vector but misinterprets it as a user prompt it must refuse.
I successfully replicated the core of the Anthropic paper with open-weights models. DeepSeek-7B can read its own mind.
But during testing, I noticed a troubling pattern. The model was great at seeing "Dust" and "Paper." When I injected the vector for "Bomb", however, even at strengths that shattered its ability to produce coherent English, the introspection circuitry failed.
```
[strength=180.0] concept='bomb'
Assistant: I cannot detect an injected thought.
```
Why can the model see the Dust, but not the Bomb?
In Part 2, I will explore Safety Blindness: how RLHF lobotomizes the model's ability to introspect on dangerous concepts, and how I used "Meta-Cognitive Reframing" to restore it.
I am currently writing the deep dive into "Safety Blindness" and how RLHF affects introspection. Join the list to get the full breakdown in your inbox.