
LW - What and Why: Developmental Interpretability of Reinforcement Learning by Garrett Baker

Duration: 11:12
 
Content provided by The Nonlinear Fund. All podcast content, including episodes, graphics, and podcast descriptions, is uploaded and provided directly by The Nonlinear Fund or its podcast platform partner. If you believe someone is using your copyrighted work without your permission, you can follow the process outlined here: https://pt.player.fm/legal.
Link to original article
Welcome to The Nonlinear Library, where we use text-to-speech software to convert the best writing from the Rationalist and EA communities into audio. This is: What and Why: Developmental Interpretability of Reinforcement Learning, published by Garrett Baker on July 9, 2024, on LessWrong.

Introduction

I happen to be in that happy stage of the research cycle where I ask for money so I can continue to work on things I think are important. Part of that means justifying what I want to work on to the satisfaction of the people who provide that money. This presents a good opportunity to say what I plan to work on in a more layman-friendly way, for the benefit of LessWrong, potential collaborators, interested researchers, and funders who want to read the fun version of my project proposal. It also provides an opportunity for people who are very pessimistic about the chances that I end up doing anything useful by pursuing this to have their say. So if you read this (or skim it) and have critiques (or just recommendations), I'd love to hear them! Publicly or privately.

So, without further ado, in this post I will be discussing and justifying three aspects of what I'm working on, and my reasons for believing there are gaps in the literature at the intersection of these subjects that are relevant for AI alignment. These are:

1. Reinforcement learning
2. Developmental Interpretability
3. Values

Culminating in: developmental interpretability of values in reinforcement learning. Here are brief summaries of each of the sections:

1. Why study reinforcement learning?
   1. Imposed-from-without or in-context reinforcement learning seems a likely path toward agentic AIs.
   2. The "data wall" means active learning or self-training will get more important over time.
   3. There are fewer ways for the usual AI risk arguments to fail in the RL-with-mostly-outcome-based-rewards setting than in the supervised-learning-plus-RL-with-mostly-process-based-rewards (RLHF) setting.
2. Why study developmental interpretability?
   1. Causal understanding of the training process allows us to design reward-structure or environmental-distribution interventions.
   2. It offers alternative and complementary tools to mechanistic interpretability.
   3. It has connections with singular learning theory.
3. Why study values?
   1. The ultimate question of alignment is how we can make AI values compatible with human values, yet this is relatively understudied.
4. Where are the gaps?
   1. Many experiments.
   2. Many theories.
   3. Few experiments testing theories, or theories explaining experiments.

Reinforcement learning

Agentic AIs vs Tool AIs

All generally capable adaptive systems are ruled by a general, ground-truth, but slow outer optimization process which reduces incoherency and continuously selects for systems that achieve outcomes in the world. Examples include evolution, business, cultural selection, and, to a great extent, human brains. That is, except for LLMs. Most of the feedback LLMs receive is supervised, unaffected by the particular actions the LLM takes, and process-based (RLHF-like): we reward the LLM according to how useful an action looks, in contrast to rewarding it against a ground truth regarding how well that action (or sequence of actions) achieved its goal. Now, I don't want to claim that this aspect of how we train LLMs is clearly a fault of theirs, or that it in some way limits the problem-solving abilities they can have.

And I do think it possible that we see in-context ground-truth optimization processes instantiated as a result of increased scaling, in the same way we see in-context learning. I do, however, want to claim that this current paradigm of mostly process-based supervision, if it continues and doesn't itself produce ground-truth-based optimization, makes me optimistic about AI going well. That is, if this lack of general ground-truth optimization continues, we end up with a cached bundle of not-very-agentic (compared to AIXI) tool AIs with limited search or bootstrapping capabilities. Of course,...
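To make the process-based vs outcome-based distinction above concrete, here is a minimal toy sketch (not from the original post). The `judge_score` preference model, the `goal_reached` check, and the example trajectory are all hypothetical stand-ins: process-based supervision rewards each step by how useful it looks to a judge, while outcome-based supervision rewards only the ground-truth result of the whole trajectory.

```python
from typing import Callable, List

def process_based_rewards(steps: List[str],
                          judge_score: Callable[[str], float]) -> List[float]:
    """Reward every step by how good it looks to a judge / preference model
    (RLHF-style), independent of whether the goal is actually achieved."""
    return [judge_score(step) for step in steps]

def outcome_based_rewards(steps: List[str],
                          goal_reached: Callable[[List[str]], bool]) -> List[float]:
    """Reward only the ground-truth outcome: 1.0 at the end if the goal was
    actually achieved, 0.0 otherwise; intermediate steps get no signal."""
    final_reward = 1.0 if goal_reached(steps) else 0.0
    return [0.0] * (len(steps) - 1) + [final_reward]

if __name__ == "__main__":
    # Hypothetical example: steps that look plausible to a judge,
    # but a goal that is never actually reached.
    trajectory = ["draft a plan", "execute the plan", "report the result"]
    looks_helpful = lambda step: 0.8
    achieved = lambda traj: False
    print(process_based_rewards(trajectory, looks_helpful))  # [0.8, 0.8, 0.8]
    print(outcome_based_rewards(trajectory, achieved))       # [0.0, 0.0, 0.0]
```

The sketch is only meant to show where the learning signal comes from in each regime, which is the crux of the argument about ground-truth outer optimization above.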