Beyond Layer-Wise Interpretability: Tracing Transformer Circuits and Advanced Intervention Techniques

John Crafts – Informational Blog

Introduction

Transformers have revolutionized AI with their performance, yet they remain black boxes – complex webs of attention and activation that defy easy explanation (Layer-Wise Sub-Model Interpretability for Transformers | artsen). Earlier in Layer-Wise Sub-Model Interpretability for Transformers, we explored breaking a transformer into interpretable layer-wise sub-models, treating each layer’s hidden state as an input to an explanatory module (Layer-Wise Sub-Model Interpretability for Transformers | artsen). In Advanced Techniques for Transformer Interpretability, we surveyed methods like saliency maps, attention visualization, causal probing, and neuron analysis (Advanced Techniques for Transformer Interpretability | artsen). While these methods shed light on what parts of the model matter, they often examine one layer or component at a time. The frontier of interpretability is now moving beyond isolated layers – towards tracing full computational circuits within models and performing precise interventions to test hypotheses.

In this post, we build on those foundations and incorporate cutting-edge techniques from very recent research (2023–2025) that aim to map out and understand transformer circuits – the networks of interactions spanning multiple layers. Notably, we’ll expand upon Anthropic’s March 27, 2025 release, “Tracing the thoughts of a large language model,” which showcases powerful methods for mechanism-level analysis: attention patching, path patching, mechanistic circuit tracing, and decomposing outputs into sub-computations. We’ll also survey model-agnostic advances in interpretability, such as new forms of path attribution and causal interventions, frameworks for identifying mechanistic circuits, tackling polysemantic neurons, activation steering for controlled generation, and concept attribution techniques. Throughout, we maintain technical rigor but with an eye toward broad accessibility for ML researchers. Diagrams and code snippets will illustrate key ideas, and we cite recent papers (primarily 2023–2025) to ground our discussion in the latest findings.

Ultimately, this post is a follow-up that transitions from interpreting transformers layer-by-layer to tracing their full reasoning process. We’ll see how the outputs of a model can be broken down and attributed to internal components, how entire chains of computation can be visualized as interpretable graphs, and how interventions on those chains let us verify and steer model behavior. This represents a shift from simply observing model internals to actively probing and controlling the mechanisms by which transformers “think.”

From Layers to Circuits: Mechanism-Level Tracing

Single-layer explanations provide a stepwise view of a transformer’s computation, but they can miss the forest for the trees. A complex behavior often arises from interactions across multiple layers and heads – what Chris Olah and colleagues refer to as circuits (Layer-Wise Sub-Model Interpretability for Transformers | artsen). A circuit is a subnetwork of neurons and weights that jointly implement a specific function or sub-task within the model (Layer-Wise Sub-Model Interpretability for Transformers | artsen). Recent work from Anthropic has made significant progress in tracing these circuits in large language models, effectively building an “AI microscope” to inspect the model’s latent thought processes (Tracing the thoughts of a large language model \ Anthropic).

Mechanism-Level Tracing with Attribution Graphs: In their March 2025 release, Anthropic introduced a technique to map out the computation a transformer uses for a given prompt as a directed graph (Circuit Tracing: Revealing Computational Graphs in Language Models). In Circuit Tracing: Revealing Computational Graphs in Language Models, they replace certain components of the model with an interpretable approximation and then track information flow through the network (Circuit Tracing: Revealing Computational Graphs in Language Models). Specifically, they train a replacement model that uses sparse, interpretable features (via a cross-layer transcoder) in place of the original model’s MLP layers (Circuit Tracing: Revealing Computational Graphs in Language Models). The result is an attribution graph: nodes represent human-interpretable features detected in the activations, and edges carry weights indicating how one feature influences another, ultimately leading to the model’s output (Circuit Tracing: Revealing Computational Graphs in Language Models). This graph decomposes the model’s output into sub-computations, showing which intermediate features (e.g. “concept X was present”) contributed to which subsequent features and, finally, to the output logits.

Such mechanism-level tracing allows researchers to identify high-level patterns the model is using. For example, in one case study the team examined how the model answers a geography question. They found a circuit involving features for “Texas” and “capital city” that activated a feature for Austin (the capital of Texas) when asked about Texas (Tracing the thoughts of a large language model \ Anthropic). By tracing the graph, they could literally see the model’s chain of thought: from recognizing the topic “Texas” to selecting “Austin” as the final answer. Importantly, these graphs are validated by interventions: researchers can intervene on a node in the graph and see if the output changes as expected. In this example, they patched the “Texas” feature to instead inject a “California” feature – and the model’s answer changed from Austin to Sacramento, the capital of California (Tracing the thoughts of a large language model \ Anthropic). This kind of causal test confirms that the graph’s relationships reflect the true mechanism the model uses, not just a coincidental correlation.

Attention Patching and Path Patching: To construct such attribution graphs (or in other independent experiments), a key set of tools involves patching parts of the model’s computation. Activation patching (also known as causal tracing) is a method where we take two inputs – e.g., a “clean” prompt and a “corrupted” prompt that yields a different outcome – run the model on both, then swap a specific activation (such as the output of a particular layer or attention head) from the clean run into the corrupted run (Attribution Patching: Activation Patching At Industrial Scale — Neel Nanda). If doing so causes the corrupted prompt’s output to move closer to the clean output, we infer that the patched activation contained critical information that the model needed for that behavior (Attribution Patching: Activation Patching At Industrial Scale — Neel Nanda). This pinpointing of where a certain piece of information or intermediate decision is carried in the network is extremely powerful. Researchers have used activation patching to find, for instance, which attention heads carry grammatical agreement information or which MLP neuron encodes a factual association (Attribution Patching: Activation Patching At Industrial Scale — Neel Nanda). Repeating this over many components yields a picture of the causal roles of different parts of the model. In Anthropic’s pipeline, patching is used to verify edges in the attribution graph: by intervening on a suspected cause and seeing the effect on downstream nodes or outputs, they ensure the graph reflects real causal influence (Circuit Tracing: Revealing Computational Graphs in Language Models).

Two specialized forms are worth noting: attention patching and path patching. In attention patching, one intervenes specifically on the attention mechanism – for example, replacing the attention patterns (the attention weight matrix or the context vector) from one run with that of another. This can test whether a particular attention head’s focusing behavior is necessary for an outcome. A recent mechanistic study used attention patching to investigate chain-of-thought reasoning heads, “patching” out certain heads’ attention maps and observing the impact on the model’s multi-step reasoning ([PDF] Iteration Head: A Mechanistic Study of Chain-of-Thought). The study found that removing or replacing the attention from specific “iteration heads” (that attend to previous tokens in a list) broke the model’s ability to follow a chain-of-thought, confirming those heads’ causal role ([PDF] Iteration Head: A Mechanistic Study of Chain-of-Thought). Path patching, on the other hand, involves patching a sequence of activations along a path in the computational graph. Instead of swapping one layer output in isolation, one might swap a linked set of activations (e.g. a specific neuron’s value at multiple layers, or an entire sub-network’s outputs) to see how a compound pathway of influence behaves. This generalizes the idea of intervention to test complex hypotheses: if a whole path (from early layer to late layer) conveys some concept, patching the entire path from a clean run might restore the concept in a corrupted run. Indeed, causal scrubbing (Chan et al., 2022) formalized this – it’s described as “a generalization of path patching that allows testing hypotheses about any connection between sets of components” (Open Problems in Mechanistic Interpretability). In practice, researchers use causal scrubbing to rigorously test entire interpretability hypotheses: you propose that “these components form the circuit for feature X”, then systematically replace or ablate those components’ activations. If the model’s output is unaffected (or changes in a predicted way), the hypothesis is supported (Open Problems in Mechanistic Interpretability).
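
To make attention patching concrete, here is a minimal sketch using TransformerLens hooks (Nanda, 2023). The prompts, layer, and head index are illustrative assumptions rather than a published circuit; the point is simply to overwrite one head’s attention pattern with the pattern it produced on a different input.

from transformer_lens import HookedTransformer, utils

model = HookedTransformer.from_pretrained("gpt2")

# Two prompts with identical token counts so the attention patterns have matching shapes
clean_prompt = "1 2 3 4 5 6"
corr_prompt  = "1 2 3 9 5 6"

_, clean_cache = model.run_with_cache(clean_prompt)

LAYER, HEAD = 5, 1  # hypothetical head of interest
hook_name = utils.get_act_name("pattern", LAYER)  # "blocks.5.attn.hook_pattern"

def patch_pattern(pattern, hook):
    # pattern has shape [batch, n_heads, query_pos, key_pos]
    # Replace only this head's attention pattern with the one from the clean run
    pattern[:, HEAD] = clean_cache[hook.name][:, HEAD]
    return pattern

corr_logits    = model(corr_prompt)
patched_logits = model.run_with_hooks(corr_prompt, fwd_hooks=[(hook_name, patch_pattern)])
# Comparing corr_logits and patched_logits tests whether this head's *focusing behavior*
# (as opposed to the values it reads) mattered for the output.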

Patching techniques have quickly become a staple for mechanistic investigations. Anthropic’s team explicitly notes: “activation patching, path patching, and distributed alignment search have been the most commonly used techniques for refining hypotheses and isolating causal pathways” in direct circuit analyses (Circuit Tracing: Revealing Computational Graphs in Language Models). These methods let us do “surgery” on a model’s computation, intervening at fine granularity. They move interpretability from passive observation (like just visualizing an attention map) to active experimentation, much like a neuroscientist might stimulate or lesion parts of a brain to see what happens. The ability to decompose model outputs into contributions from sub-computations is a direct result – for example, we can attribute how much of a translation model’s output is due to a specific neuron vs. an attention head by seeing how outputs change when each is patched (Circuit Tracing: Revealing Computational Graphs in Language Models).

Mechanistic Circuit Discovery vs. Attribution: One important nuance in Anthropic’s approach is the introduction of a replacement model. By training a simplified, interpretable model (using their cross-layer transcoder architecture) to mimic the behavior of parts of the original network, they obtain a description of the circuit in terms of human-meaningful features (Circuit Tracing: Revealing Computational Graphs in Language Models). This is a form of model compression for interpretability – compressing the network’s MLPs into a sparse linear approximation that we can analyze. The resulting circuit graph is approximate (and they quantify its fidelity), but it gives a clear picture: for each detected feature (node) you can often assign a semantic label (e.g. “contains the concept of X”) (Circuit Tracing: Revealing Computational Graphs in Language Models). This is reminiscent of earlier feature visualization work (Olah et al., 2017) where neuron activations in vision models were labeled with human descriptors, but here it’s done at scale for language models. A companion paper, “On the Biology of a Large Language Model,” applied this to a cutting-edge 34-layer model (Claude 3.5 “Haiku”), charting circuits for behaviors like multi-hop reasoning, code completion, and even model fallacies (e.g. hallucinations) (Circuit Tracing: Revealing Computational Graphs in Language Models). They term these studies “model biology” – analogous to dissecting a biological brain to map functions to structures (Tracing the thoughts of a large language model \ Anthropic). While this approach is currently labor-intensive and captures only a fraction of the full model’s complexity (Tracing the thoughts of a large language model \ Anthropic), it demonstrates a viable path toward fully tracing a model’s thoughts for certain tasks. It’s a significant step beyond attributing importance to an input token or a single neuron – instead, we’re getting something like a flowchart of the model’s computation on a given example.

Causal Interventions and Path Attribution Techniques

To better understand and verify these circuits, a range of causal intervention techniques have been developed recently. These are largely model-agnostic – they can be applied to any transformer or neural network – and they focus on attributing causal credit or responsibility to internal components for a given behavior.

Activation & Path Patching (Causal Tracing): As introduced above, activation patching identifies which activations matter by swapping them between runs (Attribution Patching: Activation Patching At Industrial Scale — Neel Nanda). Think of it as answering: “If I give the model the correct intermediate value here, does it fix the mistake?” If yes, that activation was missing or corrupted in the flawed run. Researchers in 2022–2023 applied this idea to identify circuits in GPT-2 and other models. For example, in the IOI (Indirect Object Identification) task (a known challenge where a model must resolve a pronoun to the correct antecedent), Nanda et al. systematically patched every possible activation in a small GPT-2 to find where the model decides which name the pronoun refers to (Attribution Patching: Activation Patching At Industrial Scale — Neel Nanda). They discovered a specific two-layer circuit (centering on particular attention heads in the final layers) that, when patched, flips the model’s answer – essentially pinpointing the “pronoun resolution circuit.” This kind of exhaustive causal tracing was computationally expensive: if a model has thousands of neurons per layer, iterating through each one with a separate forward pass is daunting (Attribution Patching: Activation Patching At Industrial Scale — Neel Nanda).

To scale this up, Attribution Patching was introduced by Neel Nanda in late 2022 (Attribution Patching: Activation Patching At Industrial Scale — Neel Nanda). The idea is to use gradients as a guide: instead of patching one activation at a time and doing a full forward pass, you run the model once (or a small number of times) and use the gradient of the output difference with respect to each activation to estimate what the effect of patching that activation would be (Attribution Patching: Activation Patching At Industrial Scale — Neel Nanda). This provides a fast ranking of important activations. In Nanda’s experiments, this gradient-based approximation could highlight the most influential heads/neurons for a task in a single backward pass, which can then be confirmed with a few targeted forward-pass patches (Attribution Patching: Activation Patching At Industrial Scale — Neel Nanda). While attribution patching is an approximation and can miss non-linear interactions (it worked well for “small” components like individual head outputs but not as well for whole-layer outputs (Attribution Patching: Activation Patching At Industrial Scale — Neel Nanda)), it bridges efficiency and interpretability. With such tools, even very large models can be probed – giving us hints of where to look without an exorbitant computational cost.
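
As a sketch of the idea, here is what a first-order attribution-patching estimate can look like, assuming TransformerLens’s forward/backward hook support; the prompts and single-logit metric are illustrative assumptions. The key line is the (clean − corrupted) × gradient product, which approximates the effect of patching each head without a separate forward pass per head.

from transformer_lens import HookedTransformer, utils

model = HookedTransformer.from_pretrained("gpt2")
clean = "The Eiffel Tower is located in the city of"
corr  = "The Colosseum is located in the city of"
target = model.to_single_token(" Paris")

_, clean_cache = model.run_with_cache(clean)  # clean activations (no gradients needed)

# Corrupted-run activations *and* gradients of the metric with respect to them
acts, grads = {}, {}
is_head_output = lambda name: name.endswith("attn.hook_z")
model.reset_hooks()
model.add_hook(is_head_output, lambda t, hook: acts.update({hook.name: t.detach()}), dir="fwd")
model.add_hook(is_head_output, lambda g, hook: grads.update({hook.name: g.detach()}), dir="bwd")
metric = model(corr)[0, -1, target]
metric.backward()
model.reset_hooks()

# First-order estimate of the effect of patching each head's output at the final position
for layer in range(model.cfg.n_layers):
    name  = utils.get_act_name("z", layer)
    delta = clean_cache[name][:, -1] - acts[name][:, -1]       # [batch, n_heads, d_head]
    est   = (delta * grads[name][:, -1]).sum(-1).squeeze(0)    # one score per head
    print(f"layer {layer}:", est)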

Causal Scrubbing: A complementary development is causal scrubbing (Chan et al., 2022) (Open Problems in Mechanistic Interpretability). Rather than focusing on one neuron or head at a time, causal scrubbing provides a framework to test entire interpretability hypotheses. You start with a hypothesis like “The model computes feature F by combining neurons A and B in layer 5, then passes it via head H in layer 6 to influence the output”. This hypothesis can be represented as an abstracted computation graph. Causal scrubbing then systematically ablates or substitutes parts of the model along that hypothesized graph: for example, replace the sub-computation you claim is “feature F” with a dummy or with the equivalent sub-computation from a different input that lacks feature F. If the model’s output stays the same under these replacements, it suggests that those parts weren’t actually used (your hypothesis might be wrong). If the output changes in the way predicted (e.g., the model now fails to produce the behavior associated with F), it supports the hypothesis (Open Problems in Mechanistic Interpretability). In essence, causal scrubbing generalizes patching into a rigorous search procedure over sets of activations. It’s powerful because it doesn’t require you to enumerate every neuron – you operate at the level of your theory of the circuit, testing if the model could be “rewired” to follow that theory without performance loss. Recent work used causal scrubbing to verify the well-known “induction head” circuit in GPT models (heads that help continue a repeating sequence seen earlier) (Takeaways from the Mechanistic Interpretability Challenges), and even to debug solutions in the NeurIPS 2022 interpretability challenge. As a general technique, it pushes us toward falsifiable science inside neural nets: you make a claim about how it works and then literally test that claim by intervention.
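
The basic move underlying causal scrubbing is a resampling ablation: replace a component’s activation with the activation it produces on a different input that your hypothesis claims is interchangeable, and check that behavior is preserved. Here is a minimal sketch of that single move; the layer, head, and “interchangeable” prompt pair are hypothetical assumptions, and real causal scrubbing applies many such swaps at once, organized by the hypothesis graph.

from transformer_lens import HookedTransformer, utils

model = HookedTransformer.from_pretrained("gpt2")

original  = "Paris is the capital of France."
resampled = "Berlin is the capital of Germany."  # same syntactic frame, different entities

_, resample_cache = model.run_with_cache(resampled)
LAYER, HEAD = 7, 3  # hypothetical head claimed to track only the syntactic frame

def resample_head(z, hook):
    # z: [batch, pos, n_heads, d_head]; swap in this head's activation from the resampled run
    z[:, :, HEAD, :] = resample_cache[hook.name][:, :, HEAD, :]
    return z

orig_logits     = model(original)
scrubbed_logits = model.run_with_hooks(
    original, fwd_hooks=[(utils.get_act_name("z", LAYER), resample_head)]
)
# If the hypothesis holds, the two output distributions should be nearly identical.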

Automated Circuit Discovery: While much of this discussion has involved human-guided analysis (researchers decide which activations to patch or which hypothesis to test), there is movement toward automating the discovery of circuits. One notable effort is by Conmy et al. (2023), who proposed methods for automatically finding circuit structure in transformers ([PDF] Towards Automated Circuit Discovery for Mechanistic Interpretability). These methods often build on iterative patching: they might start with a broad intervention (e.g. patch an entire layer) and then narrow down to specific nodes by splitting the intervention if the effect was significant (binary search through the network’s computation graph). Another approach uses beam search or evolutionary algorithms on sets of neurons, trying to find a minimal set of neurons that can be patched to flip an output. Although these procedures are still experimental, the goal is clear – reduce the human burden in identifying important pathways, letting the algorithm propose which components form a circuit. Distributed alignment search, referenced in Anthropic’s work (Circuit Tracing: Revealing Computational Graphs in Language Models), is one such strategy that attempts to align intermediate representations between models or between a model and an interpretable proxy, thereby discovering which groups of neurons align to the same concept. So far, fully automated circuit discovery hasn’t solved the interpretability problem, but it’s providing useful tools. For instance, even simple heuristics like “find the neuron most correlated with concept X in the dataset” can reveal neurons that strongly activate for a concept (e.g. a neuron that fires on programming-language keywords, indicating an “is code” detector). Modern techniques improve on this by considering combinations of neurons or nonlinear interactions, which single-neuron correlation would miss. As research progresses, we expect a convergence of automated search with human prior knowledge – a synergy where algorithms surface candidate circuits and human experts label and verify them.

Example – Patching in Code (Pseudo-Code): To make these ideas more concrete, here’s a simplified pseudo-code example of how one might implement activation patching in practice using a transformer library:

model = load_transformer_model("gpt2")  # Pseudocode: load a pretrained model

# Define a clean and corrupted input
clean_input = "Alice went to the market to buy apples."
corr_input  = "Alice went to the market to buy oranges."  # Differs from the clean prompt in one key token ("oranges" vs "apples"), yielding different behavior

# Run the model on both inputs, recording all intermediate activations
clean_acts = model.run(clean_input, return_activations=True)
corr_acts  = model.run(corr_input, return_activations=True)

# Say we want to test a hypothesis that layer5.head2 is crucial for distinguishing "apples" vs "oranges".
# We will patch the output of layer5.head2 from the clean run into the corrupted run:
patched_acts = corr_acts.copy()
patched_acts["layer5.head2"] = clean_acts["layer5.head2"]  # intervene on this activation

# Run the model forward from layer5 with the patched activation onwards
patched_output = model.forward_from(layer=5, activations=patched_acts)

print("Corrupted output:", model.run(corr_input))
print("Patched output:", patched_output)

In practice, libraries like TransformerLens (Nanda, 2023) provide convenient hooks to do this kind of intervention, and one can loop over many components to find which patch most increases the likelihood of the correct answer. The above pseudo-code illustrates the core idea: take an internal state from one run and inject it into another, then observe the difference in the final output. If patched_output now matches the clean output (or is closer to it), we learn that layer5.head2 was carrying information that “apples” was the intended item. This snippet hides many details (like ensuring shapes match, handling attention cache, etc.), but it reflects how researchers implement causal tracing experiments.
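
For instance, a runnable version of that loop with TransformerLens might look like the following. The prompt pair, the logit-difference metric, and the choice to patch only each head’s output at the final position are illustrative assumptions, but the structure – cache a clean run, patch one component at a time into the corrupted run, and record how much the metric recovers – is the standard recipe.

import torch
from transformer_lens import HookedTransformer, utils

model = HookedTransformer.from_pretrained("gpt2")

clean = "The Eiffel Tower is located in the city of"
corr  = "The Colosseum is located in the city of"
paris, rome = model.to_single_token(" Paris"), model.to_single_token(" Rome")

_, clean_cache = model.run_with_cache(clean)

def logit_diff(logits):
    # How strongly the model prefers the clean answer over the corrupted one
    return (logits[0, -1, paris] - logits[0, -1, rome]).item()

baseline = logit_diff(model(corr))
recovery = torch.zeros(model.cfg.n_layers, model.cfg.n_heads)

for layer in range(model.cfg.n_layers):
    hook_name = utils.get_act_name("z", layer)  # per-head outputs at this layer
    for head in range(model.cfg.n_heads):
        def patch_head(z, hook, head=head):
            # Patch only the final position, so the two prompts need not align token-by-token
            z[:, -1, head, :] = clean_cache[hook.name][:, -1, head, :]
            return z
        patched = model.run_with_hooks(corr, fwd_hooks=[(hook_name, patch_head)])
        recovery[layer, head] = logit_diff(patched) - baseline

print(recovery)  # a heatmap of this tensor highlights the heads that matter most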

Emergent Feature Interpretability: Neurons, Features, and Concepts

While causal methods pry into where information flows, another line of work focuses on what is represented inside model activations. In a transformer, each layer’s residual stream or each neuron’s activation can encode a combination of features – and often these are polysemantic, meaning a single neuron fires for multiple unrelated reasons (Anthropic drops an amazing report on LLM interpretability | by Lee Fischman | Mar, 2025 | Medium). This polysemanticity is a major obstacle to straightforward interpretability: we can’t just take the top-5 inputs that activate a neuron and assume it corresponds to a single clean concept. Recent research has attacked this problem with methods to decompose neuron representations into meaningful features.

Polysemantic Neurons and Monosemantic Features: Work by Olah et al. (2020) first highlighted polysemantic neurons in vision models – neurons that responded to both curves and dog faces, for example. In language models, we likewise find neurons that respond to multiple triggers (say, one activates on both financial news and any question format, doing double duty). A fascinating 2024 paper by Dreyer et al. introduced PURE (Polysemantic to Unit-Resolution), a method to turn polysemantic neurons into “pure” features ([2404.06453] PURE: Turning Polysemantic Neurons Into Pure Features by Identifying Relevant Circuits). The authors identify the different circuits that feed into a single neuron for different concepts, essentially splitting the neuron’s role by tracing its incoming connections. Each distinct cause through the network is then treated as a separate virtual neuron representing a single semantic feature ([2404.06453] PURE: Turning Polysemantic Neurons Into Pure Features by Identifying Relevant Circuits). They demonstrated this on image models (ResNets), where a single unit might respond to both “dog” and “ball” features; PURE could disentangle it into a dog-sensitive virtual unit and a ball-sensitive virtual unit by isolating the relevant sub-network (the set of filters and neurons upstream) for each ([2404.06453] PURE: Turning Polysemantic Neurons Into Pure Features by Identifying Relevant Circuits). While their work was in vision, the approach is general and could apply to transformers: imagine splitting a neuron that fires for “French language” and “formal tone” into two virtual features with separate downstream paths. This kind of circuit-based feature disentanglement directly leverages causal and path tracing techniques – it finds subgraphs relevant to each feature – and offers a path toward basis features that are more interpretable than individual neurons.

Another approach to tackle polysemanticity is through concept vectors and subspace learning. Instead of looking at single neurons, researchers search for directions in activation space that align with a concept. O’Mahony et al. (2023) proposed a method to find concept vectors that are linear combinations of neurons, representing disentangled concepts ([2304.09707] Disentangling Neuron Representations with Concept Vectors). They essentially perform a form of subspace decomposition: given a concept (defined by a set of examples, e.g. sentences about sports vs others), they find a direction in the model’s layer activation space that corresponds to that concept. This often uses techniques like PCA or more sophisticated optimization to ensure the vector is sparse or relies on only a few neurons. The key finding was that even if no single neuron exclusively encodes “sport-ness”, there might be a combination of neurons whose activation pattern does – a concept vector that is monosemantic. By searching for multiple vectors, they could disentangle polysemantic neurons into a few concept vectors, each capturing one facet ([2304.09707] Disentangling Neuron Representations with Concept Vectors). For instance, a neuron that fires on both sports and politics news might be part of two concept vectors: one that activates on sports, one on politics. This aligns with the idea that distributed representations can be linearly separated into basis features. Such methods echo the famous TCAV (Testing with Concept Activation Vectors) in spirit – providing a quantitative test for concepts within internal activations – but they are tailored to deep features and allow discovering concepts in an unsupervised or weakly-supervised manner (not just testing predefined ones).
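
A crude but illustrative way to obtain such a direction is a difference of class-mean activations – far simpler than the optimization-based procedure in the paper, but it conveys the idea. The layer choice and toy example texts below are assumptions for illustration.

import torch
from transformer_lens import HookedTransformer, utils

model = HookedTransformer.from_pretrained("gpt2")
LAYER = 8
hook = utils.get_act_name("resid_post", LAYER)

def mean_activation(texts):
    vecs = []
    for t in texts:
        _, cache = model.run_with_cache(t)
        vecs.append(cache[hook][0].mean(dim=0))  # average the residual stream over positions
    return torch.stack(vecs).mean(dim=0)

sport_texts = ["The striker scored a late goal.", "She won the tennis match in straight sets."]
other_texts = ["The senate passed the budget bill.", "The recipe calls for two eggs."]

concept_vec = mean_activation(sport_texts) - mean_activation(other_texts)
concept_vec = concept_vec / concept_vec.norm()

# Score a new sentence by projecting its activations onto the concept direction
_, cache = model.run_with_cache("The coach praised the goalkeeper.")
score = cache[hook][0].mean(dim=0) @ concept_vec
print(score.item())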

Dictionary Learning and Sparse Autoencoders: A major push from Anthropic’s interpretability team has been to apply dictionary learning to activations. By training a sparse autoencoder on the activations of a model, they obtain a dictionary of latent features that are more interpretable (Anthropic drops an amazing report on LLM interpretability | by Lee Fischman | Mar, 2025 | Medium). Starting in 2023 with a small one-layer model and scaling up in 2024 to a production model (Claude 3 Sonnet), Anthropic researchers extracted thousands – and later millions – of features using this approach. Each feature is like a prototype activation pattern that appears in the model. Many turned out to be human-understandable: e.g., one feature responds whenever a sentence is about food, another triggers on rhyming text, another on the presence of a question (Anthropic drops an amazing report on LLM interpretability | by Lee Fischman | Mar, 2025 | Medium). Because of the sparse prior, each feature tends to be localized (activated by a specific concept or cluster of tokens) rather than mixing many things. These features can then be treated as pseudo-neurons that are far closer to monosemantic. In their March 2025 paper, Anthropic continued this idea by using the cross-layer transcoder to produce such features layer by layer (Anthropic drops an amazing report on LLM interpretability | by Lee Fischman | Mar, 2025 | Medium). One intriguing observation they report is that even with these advanced techniques, polysemanticity still arises – “individual neurons can be polysemantic even in the absence of superposition” (superposition refers to multiple features crammed into one neuron due to capacity limits) (Decomposing Language Models With Dictionary Learning). In other words, even a sparse feature basis can have features that overlap in meaning. Nonetheless, this approach has proven extremely useful in practice: once you have a library of interpretable features, you can visualize what each feature does (by looking at texts that activate it, or projecting it into word space) and trace circuits at the level of features rather than individual neurons (Circuit Tracing: Revealing Computational Graphs in Language Models). This is exactly how the attribution graphs discussed earlier were constructed – features are nodes, not raw neurons. The success of dictionary learning for interpretability suggests a future where our basic units of explanation for networks are not neurons (too low-level) but semantically meaningful features.
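
The core training objective is simple: reconstruct the activation while penalizing the L1 norm of the feature activations. Below is a minimal sketch of that objective; the dictionary size, L1 coefficient, and the random stand-in for cached activations are assumptions, and production setups add details like decoder-weight normalization and resampling of dead features.

import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, d_dict: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_dict)
        self.decoder = nn.Linear(d_dict, d_model)

    def forward(self, x):
        f = torch.relu(self.encoder(x))  # sparse, non-negative feature activations
        x_hat = self.decoder(f)          # reconstruction of the original activation
        return x_hat, f

d_model, d_dict, l1_coeff = 768, 8192, 1e-3
sae = SparseAutoencoder(d_model, d_dict)
opt = torch.optim.Adam(sae.parameters(), lr=1e-4)

acts = torch.randn(4096, d_model)  # stand-in for cached residual-stream activations
for step in range(100):
    batch = acts[torch.randint(0, acts.shape[0], (256,))]
    x_hat, f = sae(batch)
    loss = (x_hat - batch).pow(2).mean() + l1_coeff * f.abs().mean()
    opt.zero_grad(); loss.backward(); opt.step()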

Concept Activation and Attribution: Beyond finding features, we might want to attribute a model’s decision to known high-level concepts. This is common in vision (for example, attribute a classification to “texture” vs “shape” features), but it’s emerging in NLP as well. For instance, suppose a transformer translates a sentence in a certain tone – we might ask, how much of that is due to the concept “formal language” vs “informal language” in its latent state? Methods like logit attribution decompose the final prediction logits back into contributions from each layer or head. Because the transformer’s output is a linear function of its final residual stream (via the output embedding matrix), one can take a target output logit (say the logit for the word “bonjour” in a translation) and express it as a sum of contributions from each component of the residual stream across all layers. This was done in studies of the induction heads, showing how early tokens’ information was amplified by specific heads to contribute to the final output (Takeaways from the Mechanistic Interpretability Challenges). Another approach is Shapley value attribution for neurons: treat each neuron as a “player” in a coalition game that produces the output, and compute how the output changes when including or excluding that neuron (or set of neurons). Some 2023 works experimented with Shapley at the neuron or head level to find the most “responsible” components for a decision. While exact Shapley values are expensive to compute in large networks (exponential subsets), approximate methods exist. These concept attribution methods are model-agnostic in that they don’t rely on the transformer’s specifics – any model’s output can in theory be attributed to its parts via Shapley values or integrated gradients – but scaling them to thousands of neurons requires clever approximation (similar to attribution patching’s motivation).
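
As a concrete illustration of logit attribution, here is a short sketch using TransformerLens’s cache utilities, assuming its decompose_resid and apply_ln_to_stack helpers; the prompt and target token are illustrative.

from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")
prompt = "The Eiffel Tower is located in the city of"
target = model.to_single_token(" Paris")

logits, cache = model.run_with_cache(prompt)

# Split the final residual stream into per-component pieces (embeddings, each attention
# block, each MLP block), keep the final position, apply the final LayerNorm's scaling,
# and project onto the unembedding column for the target token.
components, labels = cache.decompose_resid(return_labels=True)  # [component, batch, pos, d_model]
components = components[:, :, -1, :]                            # final position only
components = cache.apply_ln_to_stack(components, layer=-1, pos_slice=-1)
contributions = components[:, 0] @ model.W_U[:, target]         # one number per component

for label, value in zip(labels, contributions):
    print(f"{label:>12}  {value.item():+.3f}")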

In summary, the past two years have seen a convergence of mechanistic interpretability (circuits, causal testing) with representational interpretability (features, concept vectors). We are learning how to carve neural networks at the joints: identifying which collection of activations corresponds to a meaningful computation and how those computations compose. A striking example is the recent finding of a “universal latent language” inside GPT-style models: by tracing across languages, researchers found that models like Claude encode sentences from different languages into a shared conceptual space (a sort of language-agnostic thought), which then maps out to each language’s output (Tracing the thoughts of a large language model \ Anthropic). Such a finding comes from combining techniques – detecting the concept of “meaning of the sentence” as a feature, and showing it’s activated similarly for English, French, etc., and then causally patching that feature to see it drive outputs in multiple languages (Tracing the thoughts of a large language model \ Anthropic). This kind of insight was unthinkable a few years ago; it indicates we’re on the path to reverse-engineering aspects of these models’ cognition.

Activation Steering and Model Editing

Understanding circuits and features is not only an academic exercise – it enables us to steer and edit models in novel ways. If we know where a concept lives in the network, we can try to manipulate it: amplify it, suppress it, or swap it out. This has implications for safety (e.g. reduce a model’s tendency to follow undesirable thoughts) and for capability control (steer a model toward a desired style or correct reasoning path).

Activation Steering: This technique involves adding a vector to the model’s hidden states (activations) in order to change its behavior in a controlled way (Understanding “steering” in LLMs – AI Advances). In early experiments, researchers found they could, for instance, add a fixed vector to the activations of GPT-2 that would make its tone more positive or more negative, without additional training. This vector can be obtained by taking the difference in activations between prompts that differ only in tone, or by using PCA on the difference between a set of positive vs. negative texts – effectively creating a direction for sentiment. Then at generation time, you inject a scaled version of this vector at a certain layer each time step, and the model’s outputs are notably more positive in sentiment. This was famously demonstrated by Turner et al. and discussed on forums as “steering GPT-2-XL by adding an activation vector” (Steering GPT-2-XL by adding an activation vector). More recently, Feature Guided Activation Additions (FGAA) refined this idea using sparse autoencoders (Steering Large Language Models with Feature Guided Activation Additions). FGAA works by first learning an interpretable feature basis (again those sparse features) and then choosing which feature(s) to amplify for steering (Steering Large Language Models with Feature Guided Activation Additions). By operating in the latent space of a sparse autoencoder, they ensure the changes correspond to understandable features. For example, they could identify a “politeness” feature in a chatbot model and then add a small multiple of that feature’s activation vector whenever the model generates text. Compared to naive activation additions, FGAA was shown to better preserve coherence and not introduce unrelated quirks (Steering Large Language Models with Feature Guided Activation Additions). The idea of contrastive activation addition (CAA) is also leveraged – basically taking the difference between the presence and absence of a feature as a direction (Steering Large Language Models with Feature Guided Activation Additions).
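
In code, the contrastive recipe is only a few lines. Below is a minimal sketch in the plain activation-addition style (not FGAA itself); the layer, prompt pair, and scale are illustrative assumptions.

from transformer_lens import HookedTransformer, utils

model = HookedTransformer.from_pretrained("gpt2")
LAYER = 6
hook_name = utils.get_act_name("resid_post", LAYER)

# Steering direction: difference of residual-stream activations for a contrastive pair
_, pos_cache = model.run_with_cache("I love this, it is wonderful")
_, neg_cache = model.run_with_cache("I hate this, it is terrible")
steering_vec = pos_cache[hook_name][0, -1] - neg_cache[hook_name][0, -1]

def add_steering(resid, hook, scale=4.0):
    # resid: [batch, pos, d_model]; the steering vector broadcasts across positions
    return resid + scale * steering_vec

with model.hooks(fwd_hooks=[(hook_name, add_steering)]):
    steered = model.generate("The movie was", max_new_tokens=20, do_sample=False)
print(steered)

Sweeping the scale factor up and down makes the trade-off discussed next – steering strength versus unwanted side effects – directly measurable.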

Activation steering is promising as a fine-grained control mechanism. It is more direct than prompt engineering (which hopes the model internally steers itself) and more flexible than finetuning (which changes parameters globally for all contexts). However, it relies on interpretability: one needs to know what vector to add. Recent research has found axes for a range of behaviors: sycophancy (agreeableness to a user’s stated opinions), toxicity, formality, humor, factuality, etc. For instance, one can derive an “honesty” vector by contrasting truthful vs. intentionally misleading completions, and injecting this can reduce a language model’s hallucinations, effectively nudging it to double-check facts (Steering Language Models With Activation Engineering). IBM researchers in 2023 released a library for activation engineering that provides some pretrained steering vectors for GPT-2 and GPT-3 models (Steering Language Models With Activation Engineering). It’s worth noting that these interventions, while powerful, must be used carefully: adding a vector can have side effects (the features in networks are not perfectly orthogonal). Part of current research is ensuring steering does not break other model capabilities (Steering Large Language Models with Feature Guided Activation Additions) – essentially learning how big of a “steering dose” to apply for a desired change without causing distributional shift.

Model Editing and Knowledge Attribution: A related direction is editing model weights or activations to insert or remove knowledge, which is another form of interpretability-driven intervention. Tools like ROME (Rank-One Model Editing) by Meng et al. (2022) target the weights of transformer MLPs to edit factual associations (e.g. “Paris is the capital of [MASK]”) by locating which weights store that fact and nudging them to a new fact. Although weight editing is slightly tangential to interpretability, the discovery of which weights matter for a given fact is an interpretability question. ROME and its successors (like MEMIT in 2023) use activation patching as part of their pipeline: they measure which intermediate activations differ when a prompt contains a certain fact vs. not, helping localize the “fact storage” in the network (Attribution Patching: Activation Patching At Industrial Scale — Neel Nanda). This has uncovered that certain MLP layers (often mid-layer feed-forwards) act as key–value memories for facts (Circuit Tracing: Revealing Computational Graphs in Language Models), consistent with earlier findings that transformer FFNs can function like associative memories (Geva et al. 2020). From a broader perspective, these editing methods illustrate mechanistic interpretability used in service of functional change: once you know where a fact is stored, you can try to rewrite it. They also serve as an evaluation of understanding – if you edit the weight and the model cleanly changes one behavior (the fact) without wrecking others, it suggests a localized representation was correctly identified.

Finally, visualization deserves a mention in this context. Visualizing the effect of activation steering or patching can be as simple as plotting the probability of a certain output as you intervene more strongly. For example, one can graph how the logit for “honest answer” vs. “flattering answer” changes as we add an “honesty” activation vector in increments. Such plots often show an interpretable phase transition: beyond a certain magnitude of injection, the model flips from giving the flattering answer to giving the factual one. Other visualization tools include heatmaps over the model’s computation (to highlight which parts were most changed by an intervention) and interactive circuit diagrams where clicking on a node might show the effect of ablating it. The field is actively borrowing from neuroscience visualization – such as activity maps (which neurons are active for a set of inputs) and even animation of activations (watching activations in time as the model generates each token, to see where information flows). In Anthropic’s report, they present a compelling diagram of a circuit for a task (essentially a network graph annotated with feature names) (Circuit Tracing: Revealing Computational Graphs in Language Models). One could imagine an interactive version where hovering on an edge shows the result of patching that edge. While these tools are in prototype stages in research labs, they are likely to become more common in publications and blog posts to communicate how a model works internally.

Toward General-Purpose Interpretability

A key goal in recent research is model-agnostic interpretability – developing methods that are not tailor-made for a specific model or architecture, but could be applied broadly to understand any large neural network. The techniques we’ve discussed here are largely general: patching, scrubbing, feature extraction, concept vectors, etc., can in principle be used on vision transformers, BERT-like models, GPT-class large language models, and even non-transformer architectures. For example, causal scrubbing has been demonstrated on small RNN-based algorithmic models, and concept vector approaches have been used on vision convnets. This broad applicability is important because it means an investment in interpretability research today could transfer to future architectures (say, a transformer-decoder + diffusion hybrid model, or whatever comes next).

That said, each generation of models brings new challenges. Larger models have more neurons and deeper circuits, potentially more polysemantic mixing (though sometimes larger models learn more disentangled features, an open question). The interpretability techniques are racing to keep up – hence the flurry of work on scaling methods (like attribution patching for speed (Attribution Patching: Activation Patching At Industrial Scale — Neel Nanda), and semi-automated circuit discovery). Moreover, there are open questions about mechanistic faithfulness: how do we ensure the interpretations we get out (circuits, feature attributions) are truly faithful to the model and not artifacts of our analysis? Anthropic’s papers emphasize validating with interventions to confirm discovered mechanisms are real (Circuit Tracing: Revealing Computational Graphs in Language Models). The Open Problems in Mechanistic Interpretability survey (Sharkey et al., 2025) lists improving evaluation of interpretability methods as a top priority (Circuit Tracing: Revealing Computational Graphs in Language Models). We need better metrics to check if an interpretation captures the model’s actual computation. Some proposed metrics include mechanistic completeness (what fraction of the model’s forward computation is “explained” by the circuit we found) and consistency under distribution shift (does the interpretation hold on slightly different inputs, indicating it’s not just overfitting to one example).

Another frontier is “interactive interpretability” – tools that allow researchers (or even end-users) to query a model’s internals on the fly. Imagine asking a model not just for an answer but for an explanation trace, which it produces by actually examining its own activations with a learned probe. Some preliminary research is looking at training models to expose their hidden features in a human-readable form (a bit like how we trained interpretable sub-models for each layer in the first post). If such efforts succeed, future LLMs might come with a built-in circuit interpreter that can say “I solved this math problem by using a known addition circuit and then a comparison circuit.” We are not there yet, but the pieces – sparse features, causal graphs, etc. – are being assembled.

Conclusion

In this blog post, we moved beyond treating transformer layers as independent boxes and instead looked at the full circuit-level picture of transformer computations. We discussed how attention patching and path patching enable fine-grained causal experiments inside models (Circuit Tracing: Revealing Computational Graphs in Language Models) (Attribution Patching: Activation Patching At Industrial Scale — Neel Nanda), how mechanistic tracing can yield human-interpretable graphs of a model’s “thought process” (Circuit Tracing: Revealing Computational Graphs in Language Models), and how outputs can be decomposed into contributions from features and components. We also surveyed recent advances in understanding model internals: from tackling the challenge of polysemantic neurons by splitting them into meaningful features ([2404.06453] PURE: Turning Polysemantic Neurons Into Pure Features by Identifying Relevant Circuits), to using concept vectors and sparse autoencoders to find human-aligned directions in activation space ([2304.09707] Disentangling Neuron Representations with Concept Vectors), to steering models by manipulating activations along those directions (Steering Large Language Models with Feature Guided Activation Additions). The unifying theme is mechanistic interpretability: an approach that seeks not just to score features or attention, but to unravel algorithmic structure inside the network.

Crucially, these techniques are largely model-agnostic and general-purpose. They don’t rely on secret sauce specific to, say, GPT-4 or only work for vision models – they tap into principles of how high-dimensional representations can be analyzed and how causal structure can be probed. As such, they form an expanding toolkit for researchers to apply to any neural model they wish to understand. A researcher today can take an off-the-shelf LLM, apply activation patching, and see results that point to “this head in layer 20 is crucial for that behavior”. They can train a quick sparse autoencoder on its activations to surface several hundred features and perhaps recognize “feature 137 looks like a detector for programming language text”. They can attempt to trace a small circuit and validate it by intervention. Five years ago, almost none of this was possible – we mostly had to trust saliency maps or treat attention weights as explanations (which we now know are incomplete at best (Layer-Wise Sub-Model Interpretability for Transformers | artsen)). The field has come a long way, though many open problems remain (Open Problems in Mechanistic Interpretability). We still lack guarantees that our interpretations capture everything important, and scaling some of these methods to the largest models is non-trivial. Moreover, with models being used in high-stakes settings, we must ensure interpretability keeps up so that we can audit and align those models effectively.

Our journey from layer-wise sub-models to full circuit tracing reflects a maturing of interpretability research. Early efforts gave us local glimpses – a salient token here, an attention head there. Now, we’re beginning to see global structures: how pieces connect across the entire model. This holistic view is what we ultimately need to trust and control AI systems. By understanding the mechanism behind a model’s answer, we not only gain confidence in why it produced that answer, but we also become equipped to correct it if needed (by intervening in the mechanism) (Tracing the thoughts of a large language model \ Anthropic). There’s a virtuous cycle at play: better interpretability enables safer and more reliable AI, which in turn allows deploying AI in more critical domains, which increases the demand for interpretability.

In closing, the techniques highlighted here – attention/path patching, mechanism tracing, feature attribution, activation steering – represent the state-of-the-art for peering into transformers as of 2024–2025. They will likely evolve; new methods will arise (for example, coupling these methods with formal verification or symbolic modeling of circuits is an exciting avenue). But even in their current form, these tools allow us to trace the outlines of a model’s reasoning and sometimes, remarkably, catch a model “thinking” in real time (Tracing the thoughts of a large language model \ Anthropic). What was once considered hopeless (“inscrutable giant matrices”) is now an active scientific investigation. By building on layer-wise insights and pushing towards global circuit understanding, we move closer to demystifying the most powerful AI models – turning on the lights inside the transformers that power modern AI.

References: