This video discusses the potential for an "intelligence explosion" in AI, focusing on self-improving AI and the implications of DeepMind's AlphaZero. Wes Roth interviews Demis Hassabis, co-founder of DeepMind, to explore the convergence of different AI approaches and the possibility of rapid advances in AI capabilities.
The transcript doesn't specify which evolutionary programming techniques were used in DeepMind's work; it only mentions that they were paired with the latest foundation models.
AlphaGo was trained on human game data, learning to replicate human strategies. AlphaGo Zero, by contrast, started from a blank slate and learned entirely by playing against itself (self-play), surpassing AlphaGo's performance within 36 hours and beating it 100-0 by the 72-hour mark.
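To make the self-play idea concrete, here is a minimal sketch using a toy game of Nim instead of Go. The game, hyperparameters, and tabular value update are illustrative assumptions, not DeepMind's actual method; AlphaGo Zero pairs deep neural networks with Monte Carlo tree search.

```python
import random

# A minimal self-play sketch on Nim: take 1-3 stones per turn; whoever
# takes the last stone wins. The agent plays both sides and updates a
# value table toward the final game outcome (a Monte Carlo target).

PILE = 15       # starting pile size (toy setting)
EPSILON = 0.2   # exploration rate
ALPHA = 0.1     # learning rate

# value[(pile, move)] = estimated win probability for the player making `move`
value = {}

def legal_moves(pile):
    return [m for m in (1, 2, 3) if m <= pile]

def choose(pile):
    moves = legal_moves(pile)
    if random.random() < EPSILON:
        return random.choice(moves)
    return max(moves, key=lambda m: value.get((pile, m), 0.5))

for episode in range(50_000):
    pile, history, player = PILE, [], 0
    while pile > 0:
        move = choose(pile)
        history.append((player, pile, move))
        pile -= move
        player ^= 1
    winner = history[-1][0]  # whoever took the last stone wins
    # Update every decision made during the game toward the final outcome.
    for who, state, move in history:
        target = 1.0 if who == winner else 0.0
        old = value.get((state, move), 0.5)
        value[(state, move)] = old + ALPHA * (target - old)

# The known optimal strategy is to leave the opponent a multiple of 4,
# i.e. take (pile % 4) stones whenever that is a legal move. After
# training, the greedy policy should recover this without any human data.
print({p: max(legal_moves(p), key=lambda m: value.get((p, m), 0.5))
       for p in range(1, PILE + 1)})
```

The key point mirrored from the video: no human games are involved; the only training signal is who won, which is exactly the blank-slate setup that let AlphaGo Zero overtake its human-trained predecessor.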
The "Absolute Reasoner" approach uses two models: a "proposer" that suggests coding problems or questions, and a "solver" that attempts to solve them. This creates a self-play loop where the solver's improvements inform the proposer to create more challenging problems, leading to iterative self-improvement in coding abilities.
The video highlights a shift in AI training. Initially, the focus was on massive pre-training compute; the speaker suggests the next major wave will be a large increase in reinforcement-learning (RL) compute. In the chart presented, the projected RL compute dwarfs pre-training compute, signaling a prioritization of reinforcement learning for improving AI performance. This is expected to drive substantial progress, especially in areas like coding, where self-improvement through reinforcement learning has already shown promise.