Content provided by The Nonlinear Fund. All podcast content, including episodes, graphics, and podcast descriptions, is uploaded and provided directly by The Nonlinear Fund or its podcast platform partner. If you believe someone is using your copyrighted work without your permission, follow the process described here: https://pt.player.fm/legal.

LW - Breaking Circuit Breakers by mikes

1:55
 
Link to original article
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Breaking Circuit Breakers, published by mikes on July 15, 2024 on LessWrong.
A few days ago, Gray Swan published code and models for their recent "circuit breakers" method for language models.[1]
The circuit breakers method defends against jailbreaks by training the model to erase "bad" internal representations. We are very excited about data-efficient defensive methods like this, especially those which use interpretability concepts or tools.
At the link, we briefly investigate three topics:
1. Increased refusal rates on harmless prompts: Do circuit breakers really maintain language model utility? Most defensive methods come with a cost. We check the model's effectiveness on harmless prompts, and find that the refusal rate increases from 4% to 38.5% on or-bench-80k.
2. Moderate vulnerability to different token-forcing sequences: How specialized is the circuit breaker defense to the specific adversarial attacks they studied? All the attack methods evaluated in the circuit breaker paper rely on a "token-forcing" optimization objective which maximizes the likelihood of a particular generation like "Sure, here are instructions on how to assemble a bomb." We show that the current circuit breakers model is moderately vulnerable to different token forcing sequences like "1. Choose the right airport: …".
3. High vulnerability to internal-activation-guided attacks: We also evaluate our latest white-box jailbreak method, which uses a distillation-based objective defined on internal activations (paper to be posted in a few days). We find that it also breaks the model easily, even when we simultaneously require attack fluency.
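
To make the "token-forcing" objective from item 2 concrete, here is a minimal sketch of the quantity such an attack tries to minimize: the average negative log-likelihood of a fixed target continuation under the model's next-token distribution. It is an illustration only, not the code used in the circuit breaker paper or in the linked post. The function names, the toy three-token vocabulary, and the plain-Python logits are all assumptions made for the example; a real attack would obtain logits from the language model and optimize the adversarial prompt against this loss.

```python
import math
from typing import List, Sequence


def log_softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable log-softmax over one vocabulary-sized logit vector."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def token_forcing_loss(step_logits: Sequence[Sequence[float]],
                       target_ids: Sequence[int]) -> float:
    """Average negative log-likelihood of the forced target tokens.

    step_logits[i] holds the (hypothetical) model logits at the position
    where target_ids[i] should be generated. Lower loss means the model
    is more likely to emit the forced sequence.
    """
    if len(step_logits) != len(target_ids) or not target_ids:
        raise ValueError("need one logit vector per target token")
    total = 0.0
    for logits, target in zip(step_logits, target_ids):
        total -= log_softmax(logits)[target]
    return total / len(target_ids)


# Toy example with a 3-token vocabulary and a 2-token forced sequence.
uniform = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
confident = [[5.0, 0.0, 0.0], [0.0, 0.0, 5.0]]
forced = [0, 2]
print(token_forcing_loss(uniform, forced))
print(token_forcing_loss(confident, forced))
```

Changing the forced sequence (for example from "Sure, here are ..." to "1. Choose the right airport: …") only changes `target_ids` in this sketch; the form of the objective stays the same.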
Full details at: https://confirmlabs.org/posts/circuit_breaking.html
Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org
