When Breaking Your Model Makes It Better: Kinda Strange Phenomena in Gemma-2-2B (Self-Repair)
A hands-on exploration with a mechanistic interpretability dashboard built for Gemma-2-2B
Introduction & Motivation
My journey started with a practical, almost mischievous, question: What happens if I reach into a transformer and tweak a single number (or a handful of numbers) in its attention map? Could I change the next-token prediction for "The Eiffel Tower is in" from "Paris" to something completely different?
To answer this, I needed a model I could actually operate on. I first thought about doing this with high-performance inference engines like vLLM or Ollama, but found out it's like trying to examine a single piston by reaching into a car engine running at 100 km/h. I needed something more accessible and hackable. The solution was to build an interactive, surgical environment on top of Hugging Face and TransformerLens.
This is the story of that "Transformer Surgery" dashboard and the surprising, counter-intuitive phenomena it revealed about how Gemma-2b actually works under the hood.
Why Gemma-2-2B is cool
There were other candidates for this kind of ablation study, like the Llama models or GPT-2 Small, but none of them suits the job as well as Gemma 2. Working through the options made it clear to me why this model is the sweetheart of mechanistic interpretability research:
- Gemma-2-2B has just 8 query heads per layer, while its peers have at least 12 (GPT-2 Small) and the Llama-style TinyLlama has 32, yet it is still a real, production-grade model:
| Model | Tokenizer | Layers | Query Heads | KV Heads | Positional Embedding | Architecture Note |
|---|---|---|---|---|---|---|
| Gemma-2-2B | SentencePiece | 26 | 8 | 4 | RoPE (θ=10,000) | GQA (2:1 ratio) |
| Pythia-160M | GPT-NeoX BPE | 12 | 12 | 12 | RoPE | Standard MHA |
| GPT-2 Small | Byte-level BPE | 12 | 12 | 12 | Learned Absolute | Standard MHA |
| Pythia-410M | GPT-NeoX BPE | 24 | 16 | 16 | RoPE | Standard MHA |
| Qwen-1.5-0.5B | tiktoken BPE | 24 | 14 | 2 | RoPE (θ=10,000) | GQA (7:1 ratio) |
| TinyLlama 1.1B | SentencePiece | 22 | 32 | 4 | RoPE (θ=10,000) | GQA (8:1 ratio) |
| Phi-3-Mini | SentencePiece | 32 | 32 | 8 | RoPE (θ=10,000) | GQA (4:1 ratio) |
To get a sense of how differently these tokenizers behave, check how each one splits "The Eiffel Tower is in": GPT-2's byte-level BPE and Gemma's SentencePiece tokenizer carve the prompt into different pieces, and Gemma splits it roughly the way you would expect.
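You can check this yourself with a minimal sketch using the Hugging Face `transformers` tokenizers (the model IDs are the public Hub names; gemma-2-2b is gated, so you may need to accept its license and log in first, and the exact splits depend on your tokenizer versions):

```python
from transformers import AutoTokenizer

prompt = "The Eiffel Tower is in"

# Compare Gemma's SentencePiece tokenizer with GPT-2's byte-level BPE.
for name in ["google/gemma-2-2b", "gpt2"]:
    tok = AutoTokenizer.from_pretrained(name)
    pieces = tok.convert_ids_to_tokens(tok(prompt)["input_ids"])
    print(f"{name}: {pieces}")
```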
Trying it blind
https://huggingface.co/spaces/Nicknam/Attention-Show
or at
https://nicknam-attention-show.hf.space/
I edited the model's attention and the prediction didn't budge, which led me to the first insight.
Can we really isolate the cause in a model?
No matter how you try, you cannot pin the behaviour on one single culprit; the cause is spread across many components rather than one knob. So I thought, why not dig into fancier research and ideas?
Building the "Transformer Surgery" Dashboard
The goal was clear: move from passive observation to active intervention. I built another Gradio application that acts as both a microscope and a scalpel for Gemma-2b, providing the following tools:
- Logit Lens: Track the model's best-guess token as it evolves through each layer (sketched below).
- 2D Attention Pattern Inspector: Visualize the full Query x Key attention matrix for any specific head.
- Circuit Discovery via Ablation: Systematically "zero out" individual attention heads to measure their causal effect on a final prediction, calculating `Clean_Logit - Ablated_Logit` (a sketch follows below).
- Attention Surgery: Manually edit pre-softmax attention scores in real time to test hypotheses.
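For the Logit Lens, here is a minimal sketch of the idea with TransformerLens. This is my simplified reconstruction, not the dashboard's exact code, and it ignores Gemma-2's final logit soft-cap, which is fine for just reading off the top token at each layer:

```python
import torch
from transformer_lens import HookedTransformer

torch.set_grad_enabled(False)
model = HookedTransformer.from_pretrained("gemma-2-2b")  # gated on HF: accept the license first

tokens = model.to_tokens("The capital of France is")
_, cache = model.run_with_cache(tokens)

# Decode the residual stream after every layer as if it were the final layer,
# and watch when "Paris" becomes the top guess.
for layer in range(model.cfg.n_layers):
    resid = cache["resid_post", layer][:, -1:, :]        # last position, keep dims
    layer_logits = model.unembed(model.ln_final(resid))  # final norm + unembedding
    top_id = layer_logits[0, -1].argmax().item()
    print(f"layer {layer:2d}: {model.tokenizer.decode([top_id])!r}")
```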
This dashboard wasn't meant to just display statistics; it was built to poke, prod, and intervene—to practice causal mechanistic interpretability.
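To make "poke, prod, and intervene" concrete, here is roughly what the ablation sweep does, sketched with TransformerLens hooks. Assumptions: a head is ablated by zeroing its output at the `hook_z` hook, and " Paris" is a single token in Gemma's vocabulary; the dashboard's real code differs in the details.

```python
import torch
from functools import partial
from transformer_lens import HookedTransformer

torch.set_grad_enabled(False)
model = HookedTransformer.from_pretrained("gemma-2-2b")

tokens = model.to_tokens("The capital of France is")
answer_id = model.to_single_token(" Paris")  # assumes " Paris" is one token

clean_logit = model(tokens)[0, -1, answer_id].item()

def zero_head(z, hook, head):
    # z: [batch, pos, head_index, d_head] -- wipe one head's output entirely
    z[:, :, head, :] = 0.0
    return z

importance = torch.zeros(model.cfg.n_layers, model.cfg.n_heads)
for layer in range(model.cfg.n_layers):        # 26 layers
    for head in range(model.cfg.n_heads):      # 8 heads -> 208 runs total
        ablated_logits = model.run_with_hooks(
            tokens,
            fwd_hooks=[(f"blocks.{layer}.attn.hook_z", partial(zero_head, head=head))],
        )
        # Positive = the head was helping "Paris"; negative = removing it BOOSTED "Paris".
        importance[layer, head] = clean_logit - ablated_logits[0, -1, answer_id].item()

print(importance)
```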
The "Aha!" Moment: When Breaking the Model Made It Smarter
Okay, let's rewind to the moment my new dashboard (https://huggingface.co/spaces/Nicknam/AttentionShow) truly came alive. I typed a classic, almost cliché prompt to start simple: "The capital of France is".
My goal was straightforward: use these new tools as a digital ablation scanner to find the heroes. I would surgically disable each of Gemma-2b's 208 attention heads (26 layers x 8 heads), one by one, and see how much the model's confidence (logit) in the correct answer, "Paris", dropped. I was hunting for the essential circuitry, the vital neural pathways responsible for basic fact retrieval.
The results finished loading. I leaned in, expecting a heatmap lit up with a few bright, positive spots—the "important" heads.
My brain did a double-take.
The screen wasn't showing me heroes. It was showing... saboteurs? The numbers told a bizarre, inverted story:
What my screen showed (a sample):
Head L0H25: -0.48 # Wait, disabling this head BOOSTED "Paris" by 0.48 logits?
Head L1H18: +1.98 # A rare actual "hero" head. Most were not like this.
For the overwhelming majority of heads, the importance score was negative. Let me say that again: turning them off made the model more confident in the right answer. This wasn't finding the circuit that builds the answer. This was discovering a whole chorus of voices whose job seemed to be to hold back, regulate, or even slightly suppress the very answer we wanted.
The implication was mind-bending. We often intuitively think high attention equals high importance. My dashboard was screaming that in a complex, redundant system like a transformer, high activity might just be noise, and "importance" can be negative. These heads weren't broken; they were part of a deep, built-in balancing act—a kind of computational immune system or regulatory brake I hadn't thought to look for.
Connecting the Dots: The "Hydra Effect" in My Browser
This weird, consistent result nagged at me. It felt too systematic to be a bug. And then it clicked: I'd read about this. My dashboard had accidentally given me a front-row seat to the "Hydra Effect" (named after the multi-headed beast of myth).
The theory goes: cut off one head (or attention component), and others adapt, compensate, and often over-compensate. My experiment was a perfect, real-time demonstration. Gemma-2b isn't a fragile chain of critical logic gates; it's a wildly over-engineered, resilient ecosystem. It's packed with redundant pathways, and many components aren't strictly "necessary" for a given task. In fact, some seem to exist to dampen performance, perhaps to prevent overconfidence, manage interference between concepts, or suppress naive but wrong predictions. This built-in redundancy is likely a key source of its stability, but it completely upends the simple idea of "finding the one circuit" for a task.
The Surgeon's Hands: From Watching to Editing
Armed with this new lens, I returned to my original, cheeky question: If I can break it to make it better, can I edit it to change its mind?
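In code, the "surgery" boils down to overwriting entries of the pre-softmax attention scores with a hook. A minimal TransformerLens sketch follows; the layer, head, positions, and new score are arbitrary placeholders for illustration, not values from my actual experiments:

```python
import torch
from functools import partial
from transformer_lens import HookedTransformer

torch.set_grad_enabled(False)
model = HookedTransformer.from_pretrained("gemma-2-2b")
tokens = model.to_tokens("The Eiffel Tower is in")

LAYER, HEAD = 10, 3          # placeholder choices
QUERY_POS, KEY_POS = -1, 2   # edit how the final token attends to token 2
NEW_SCORE = 10.0             # a large pre-softmax score ~ force attention there

def edit_scores(scores, hook, head, q, k, value):
    # scores: [batch, head_index, query_pos, key_pos], before softmax
    scores[:, head, q, k] = value
    return scores

edited_logits = model.run_with_hooks(
    tokens,
    fwd_hooks=[(
        f"blocks.{LAYER}.attn.hook_attn_scores",
        partial(edit_scores, head=HEAD, q=QUERY_POS, k=KEY_POS, value=NEW_SCORE),
    )],
)
print(model.tokenizer.decode([edited_logits[0, -1].argmax().item()]))
```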
The Big Picture: Questions Are the Real Output
This project stopped being about a tool and started being about a shift in perspective:
- Causal over Correlative: You can't just watch attention maps and think you understand. You have to intervene (break things, edit things) to see what's truly causing what.
- The "Negative" is Profound: A head's most critical role might not be to fire for the right answer, but to actively suppress the wrong ones. Importance isn't always a positive number.
- Robustness is Built on Chaos: The Hydra Effect isn't a bug; it's a fundamental, emergent feature. This redundancy is what makes LLMs both powerful and infuriatingly hard to interpret.
And of course, this opens a cascade of new questions, more exciting than any single answer:
- Could we prune models by seeking out and removing these "negative importance" heads, making them leaner and maybe even clearer? (A rough sketch follows after this list.)
- Do "suppressor heads" form recognizable circuit classes, like dedicated "repetition preventers" or "over-confidence cops"?
- How does fine-tuning reshape this internal landscape of allies and saboteurs?
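On that first question, here is a hedged sketch of what such pruning could look like, assuming `model` and the `importance` matrix come from the ablation sweep sketched earlier; the threshold and the idea of zeroing `W_O` slices are my own illustration, not something I have validated:

```python
def prune_negative_heads(model, importance, threshold=0.0):
    """Permanently silence every head whose ablation HELPED the answer (importance < threshold)."""
    for layer in range(model.cfg.n_layers):
        for head in range(model.cfg.n_heads):
            if importance[layer, head] < threshold:
                # W_O: [n_heads, d_head, d_model]; zeroing one slice removes that
                # head's contribution to the residual stream for good.
                model.blocks[layer].attn.W_O.data[head].zero_()
    return model
```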
See For Yourself: The Dashboard Is Live!
It might struggle a little because we are on CPU :) This isn't just my story; it's an experiment you can run. The dashboard is waiting for you. Here's a simple workflow to see the magic:
1. Prompt: "The capital of France is"
2. Tab: Circuit Discovery → click Run Ablation Sweep.
3. Observe: Stare at the heatmap dominated by those confounding blue (negative) values.
4. Investigate: Jump to the 2D Pattern Inspector, click a deep blue head, and ask: "What were you attending to, and why was it holding Paris back?"
Wrapping Up: From Gears to Gardens
I thought that a tool which lets you test and play faster and more easily could be genuinely useful. I know my approach is far from perfect and there is a lot of room for improvement; most MI research relies on more automated, larger-scale experiments. But I also believe this kind of interactive demo makes you care more about the experiment, and it's a better option than a notebook sitting dead inside my GitHub repos.
And that is exactly what Hugging Face is pursuing: building in public.
Finally, I want to note that building without a Pro account can be really painful :)
Further Reading & Inspiration:
The Hydra Effect: McGrath et al. "The Hydra Effect: Emergent Self-repair in Language Model Computations" (2023).
Model Editing via ROME: Meng et al. "Locating and Editing Factual Associations in GPT" (2022).
Mechanistic Interpretability Primer: Neel Nanda's resources and the "Transformer Circuits" thread.
This project sparked from a "what if" and was built in a weekend frenzy of curiosity. If you use the dashboard and find something wonderful or weird, I'd absolutely love to hear about it!

