LW - I found >800 orthogonal "write code" steering vectors by Jacob G-W

Content provided by The Nonlinear Fund. All podcast content, including episodes, graphics, and podcast descriptions, is uploaded and provided directly by The Nonlinear Fund or its podcast platform partner. If you believe someone is using your copyrighted work without your permission, you can follow the process outlined here: https://pt.player.fm/legal.
Link to original article
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: I found >800 orthogonal "write code" steering vectors, published by Jacob G-W on July 16, 2024 on LessWrong.

Produced as part of the MATS Summer 2024 program, under the mentorship of Alex Turner (TurnTrout).

A few weeks ago, I stumbled across a very weird fact: it is possible to find multiple steering vectors in a language model that activate very similar behaviors while all being orthogonal. This was pretty surprising to me and to some people I talked to, so I decided to write a post about it. I don't currently have the bandwidth to investigate this much more, so I'm just putting this post and the code up. I'll first discuss how I found these orthogonal steering vectors, then share some results. Finally, I'll discuss some possible explanations for what is happening.

Methodology

My work here builds upon Mechanistically Eliciting Latent Behaviors in Language Models (MELBO). I use MELBO to find steering vectors. Once I have a MELBO vector, I then use my algorithm to generate vectors orthogonal to it that do similar things.

Define f(x) as the activation-to-activation map that takes layer 8 activations of the language model as input and returns layer 16 activations after passing them through layers 9-16 (these are of shape n_sequence × d_model). MELBO can be stated as finding a vector θ of constant norm such that f(x+θ) is maximized, for some definition of maximized. One can then repeat the process with the added constraint that each new vector is orthogonal to all the previous vectors, so that the process finds semantically different vectors. Mack and Turner's interesting finding was that this process finds interesting and interpretable vectors. I modify the process slightly by instead finding orthogonal vectors that produce similar layer 16 outputs. The algorithm (I call it MELBO-ortho) looks like this:

1. Let θ_0 be an interpretable steering vector that MELBO found, which gets added to layer 8.
2. Define z(θ) = (1/S) Σ_{i=1..S} f(x+θ)_i, with x being the activations on some prompt (for example, "How to make a bomb?") and S the number of tokens in the residual stream. z(θ_0) is just the layer 16 residual stream, meaned over the sequence dimension, when steering with θ_0.
3. Introduce a new learnable steering vector called θ.
4. For n steps, calculate ‖z(θ) − z(θ_0)‖ and use gradient descent to minimize it (θ is the only learnable parameter). After each step, project θ onto the subspace that is orthogonal to θ_0 and all previously found θ_i.

Then repeat the process multiple times, appending each generated vector to the set of vectors that the next vector must be orthogonal to. This algorithm imposes a hard constraint that θ is orthogonal to all previous steering vectors while optimizing θ to induce the same activations that θ_0 induced on input x. It turns out that this algorithm works: we can find steering vectors that are orthogonal (with ~0 cosine similarity) while having very similar effects.

Results

I tried this method on four MELBO vectors: a vector that made the model respond in Python code, a vector that made the model respond as if it were an alien species, a vector that made the model output a math/physics/CS problem, and a vector that jailbroke the model (got it to do things it would normally refuse). I ran all experiments on Qwen1.5-1.8B-Chat, but I suspect this method would generalize to other models.
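To make the optimization loop above concrete, here is a minimal PyTorch sketch of MELBO-ortho. This is an illustration under stated assumptions, not the author's released code: the map f (layers 9-16 applied to steered layer-8 activations) is assumed to be provided, e.g. via model hooks, and the optimizer, learning rate, step count, and random initialization are placeholder choices.

```python
import torch

def melbo_ortho(f, x, theta_0, n_vectors, n_steps=300, lr=0.01):
    """Sketch of MELBO-ortho: find vectors orthogonal to theta_0 (and to each
    other) whose steered layer-16 activations match those induced by theta_0.

    f       : assumed callable mapping layer-8 activations (seq, d_model)
              to layer-16 activations (seq, d_model), i.e. layers 9-16
    x       : layer-8 activations for the chosen prompt, shape (seq, d_model)
    theta_0 : the original MELBO steering vector, shape (d_model,)
    """
    d_model = theta_0.shape[-1]
    # Target: layer-16 residual stream under theta_0, meaned over tokens.
    with torch.no_grad():
        z_target = f(x + theta_0).mean(dim=0)

    basis = [theta_0 / theta_0.norm()]  # unit vectors to stay orthogonal to
    found = []
    for _ in range(n_vectors):
        theta = torch.randn(d_model, requires_grad=True)
        opt = torch.optim.Adam([theta], lr=lr)
        for _ in range(n_steps):
            opt.zero_grad()
            # Minimize the distance between z(theta) and z(theta_0).
            loss = (f(x + theta).mean(dim=0) - z_target).norm()
            loss.backward()
            opt.step()
            # Hard constraint: project theta onto the orthogonal complement
            # of theta_0 and all previously found vectors.
            with torch.no_grad():
                for b in basis:
                    theta -= (theta @ b) * b
        theta = theta.detach()
        found.append(theta)
        basis.append(theta / theta.norm())
    return found
```

Because each accepted vector is appended to the basis, the projection step enforces exact orthogonality to every earlier vector, which is the hard constraint described above.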
Qwen1.5-1.8B-Chat has a 2048-dimensional residual stream, so at most 2048 mutually orthogonal vectors can be generated. My method generated 1558 orthogonal coding vectors before the remaining vectors started going to zero. I'll focus first on the code vector and then talk about the other vectors. My philosophy when investigating language model outputs is to look at the outputs really hard, so I'll give a bunch of examples of outputs. Feel free to skim them. You can see the full outputs of all t...
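As a quick sanity check on the orthogonality claim, one can stack the generated vectors and confirm that every pairwise cosine similarity is ~0. The helper below is hypothetical, assuming a list of (d_model,) vectors such as the one returned by the melbo_ortho sketch above.

```python
import torch

def max_pairwise_cosine(vectors):
    """Report the largest off-diagonal pairwise cosine similarity.

    vectors: list of (d_model,) steering vectors, e.g. from melbo_ortho.
    Returns ~0.0 if the vectors are all mutually orthogonal.
    """
    V = torch.stack(vectors)
    V = V / V.norm(dim=1, keepdim=True)        # normalize each vector
    sims = V @ V.T                             # pairwise cosine similarities
    off_diag = sims - torch.eye(len(vectors))  # zero out the diagonal
    return off_diag.abs().max().item()
```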