LW - Causal Graphs of GPT-2-Small's Residual Stream by David Udell

Link to original article
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Causal Graphs of GPT-2-Small's Residual Stream, published by David Udell on July 10, 2024 on LessWrong.
Thanks to the many people I've chatted with about this over the past many months. And special thanks to Cunningham et al., Marks et al., Joseph Bloom, Trenton Bricken, Adrià Garriga-Alonso, and Johnny Lin, for crucial research artefacts and/or feedback.
Codebase: sparse_circuit_discovery
TL;DR: The residual stream in GPT-2-small, expanded with sparse autoencoders and systematically ablated, looks like the working memory of a forward pass. A few high-magnitude features causally propagate themselves through the model during inference, and these features are interpretable. We can see where in the forward pass, due to which transformer layer, those propagating features are written in and/or scrubbed out.
Introduction
What is GPT-2-small thinking about during an arbitrary forward pass?
I've been trying to isolate legible model circuits using sparse autoencoders. I was inspired by the following example, from the end of Cunningham et al. (2023):
I wanted to see whether naturalistic transformers[1] are generally this interpretable as circuits under sparse autoencoding. If this level of interpretability just abounds, then high-quality LLM mindreading & mindcontrol is in hand! If not, could I show how far we are from that kind of mindreading technology?
Related Work
As mentioned, I was led into this project by Cunningham et al. (2023), which established key early results about sparse autoencoding for LLM interpretability.
While I was working on this, Marks et al. (2024) developed an algorithm approximating the same causal graphs in constant time. Their result is what would make this scalable and squelch down the iteration loop on interpreting forward passes.
Methodology
A sparse autoencoder is a linear map whose shape is (autoencoder_dim, model_dim). I install sparse autoencoders at all of GPT-2-small's residual streams (one per model layer, 12 in total). Each sits at a pre_resid bottleneck that all prior information in that forward pass routes through.[2]
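A minimal sketch of this setup, under assumptions that are mine rather than the post's: hook names follow TransformerLens conventions, the expansion factor is illustrative, and the Linear encoder/decoder pairs are untrained stand-ins for the actual trained sparse autoencoders.

```python
# Minimal sketch (assumed setup, not the project's exact code): read GPT-2-small's
# pre-residual activations via TransformerLens hooks and encode them with one
# sparse autoencoder per layer.
import torch
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")  # GPT-2-small: 12 layers, d_model = 768
model_dim, autoencoder_dim = 768, 24576            # hypothetical expansion factor

# One (hypothetical, untrained) encoder/decoder pair per layer; each encoder's
# weight has shape (autoencoder_dim, model_dim), matching the text above.
encoders = [torch.nn.Linear(model_dim, autoencoder_dim) for _ in range(12)]
decoders = [torch.nn.Linear(autoencoder_dim, model_dim) for _ in range(12)]

prompt = "The quick brown fox jumps over the lazy dog"
_, cache = model.run_with_cache(prompt)

# Sparse feature activations at every layer's pre-residual bottleneck.
codes = [
    torch.relu(encoders[layer](cache[f"blocks.{layer}.hook_resid_pre"]))
    for layer in range(12)
]
```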
I fix a context, and choose one forward pass of interest in that context. In every autoencoder, I go through and independently ablate out all of the dimensions in autoencoder_dim during a "corrupted" forward pass. For every corrupted forward pass with a layer N sparse autoencoder dimension ablated, I cache effects at the layer N+1 autoencoder. Every vector of cached effects can then be reduced to a set of edges in a causal graph. Each edge has a signed scalar weight and connects a node in the layer N autoencoder to a node in the layer N+1 autoencoder.
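Continuing the sketch above (again hypothetical: the helper name and hook usage are mine, not the author's), one corrupted forward pass encodes the layer-N pre-residual, zeroes a single feature, decodes the result back into the stream, and records how the layer-N+1 features move relative to the clean run.

```python
# Sketch of one "corrupted" forward pass, reusing `model`, `encoders`,
# `decoders`, and the clean `codes` from the previous sketch.
def ablate_feature(layer_n: int, dim: int, prompt: str) -> torch.Tensor:
    """Signed effect of ablating feature (layer_n, dim) on layer-N+1 features.

    layer_n must be < 11 so that a layer N+1 residual stream exists.
    """
    captured = {}

    def patch_resid(resid, hook):
        # Encode the pre-residual, zero out one feature, and write the
        # decoded reconstruction back into the stream.
        code = torch.relu(encoders[layer_n](resid))
        code[..., dim] = 0.0
        return decoders[layer_n](code)

    def capture_next(resid, hook):
        # Record the layer-N+1 feature activations under the corruption.
        captured["code"] = torch.relu(encoders[layer_n + 1](resid))

    model.run_with_hooks(
        prompt,
        fwd_hooks=[
            (f"blocks.{layer_n}.hook_resid_pre", patch_resid),
            (f"blocks.{layer_n + 1}.hook_resid_pre", capture_next),
        ],
    )
    # Effect vector: corrupted minus clean layer-N+1 activations.
    return captured["code"] - codes[layer_n + 1]
```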
I keep only the top-k magnitude edges from each set of N→N+1 effects, where k is a fixed number of edges. Then, I keep only the set of edges that form paths with lengths >1.[3]
The output of that is a top-k causal graph, showing the largest-magnitude internal causal structure in GPT-2-small's residual stream during the forward pass you fixed.
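A rough sketch of that pruning step, under the same caveats: the graph-building helper is hypothetical, uses networkx, and assumes each cached effect has been reduced to a 1-D vector over layer-N+1 features (e.g. taken at the token position of interest).

```python
# Hypothetical graph-building helper (not the author's code): keep the k
# largest-magnitude edges per layer transition, then drop edges that do not
# lie on any path longer than a single edge.
import networkx as nx

def build_top_k_graph(effects: dict, k: int = 10) -> nx.DiGraph:
    """effects[(layer_n, dim)] is a 1-D effect vector over layer-N+1 features."""
    graph = nx.DiGraph()
    by_transition = {}
    for (layer_n, dim), effect in effects.items():
        for next_dim, weight in enumerate(effect.tolist()):
            by_transition.setdefault(layer_n, []).append(
                ((layer_n, dim), (layer_n + 1, next_dim), weight)
            )
    # Top-k magnitude edges for each N -> N+1 transition.
    for edges in by_transition.values():
        edges.sort(key=lambda e: abs(e[2]), reverse=True)
        for src, dst, weight in edges[:k]:
            graph.add_edge(src, dst, weight=weight)
    # Keep only edges on paths of length > 1: drop any edge whose source has
    # no parent and whose target has no child.
    isolated = [
        (u, v) for u, v in graph.edges
        if graph.in_degree(u) == 0 and graph.out_degree(v) == 0
    ]
    graph.remove_edges_from(isolated)
    return graph
```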
Causal Graphs Key
Consider the causal graph below:
Each box with a bolded label like 5.10603 is a dimension in a sparse autoencoder. 5 is the layer number, while 10603 is its column index in that autoencoder. You can always cross-reference more comprehensive interpretability data for any given dimension on Neuronpedia using those two indices.
Below the dimension indices, the blue-to-white highlighted contexts show how strongly a dimension activated following each of the tokens in that context (bluer means stronger).
At the bottom of the box, blue or red token boxes show the tokens most promoted (blue) and most suppressed (red) by that dimension.
Arrows between boxes plot the causal effects of an ablation on dimensions of the next layer's autoencoder. A red arrow means ablating dimension 1.x will also suppress downstream dimension 2.y. A blue arrow means ...