David Silver: AlphaGo, AlphaZero, and Deep Reinforcement Learning #86

Transcript

00:00:00 The following is a conversation with David Silver,

00:00:02 who leads the Reinforcement Learning Research Group

00:00:05 at DeepMind, and was the lead researcher

00:00:07 on AlphaGo, AlphaZero, and co led the AlphaStar

00:00:12 and MuZero efforts, and a lot of important work

00:00:14 in reinforcement learning in general.

00:00:17 I believe AlphaZero is one of the most important

00:00:20 accomplishments in the history of artificial intelligence.

00:00:24 And David is one of the key humans who brought AlphaZero

00:00:27 to life together with a lot of other great researchers

00:00:30 at DeepMind.

00:00:31 He’s humble, kind, and brilliant.

00:00:35 We were both jet lagged, but didn’t care and made it happen.

00:00:39 It was a pleasure and truly an honor to talk with David.

00:00:43 This conversation was recorded before the outbreak

00:00:45 of the pandemic.

00:00:46 For everyone feeling the medical, psychological,

00:00:49 and financial burden of this crisis,

00:00:51 I’m sending love your way.

00:00:53 Stay strong, we’re in this together, we’ll beat this thing.

00:00:57 This is the Artificial Intelligence Podcast.

00:01:00 If you enjoy it, subscribe on YouTube,

00:01:02 review it with five stars on Apple Podcast,

00:01:04 support on Patreon, or simply connect with me on Twitter

00:01:07 at Lex Fridman, spelled F R I D M A N.

00:01:12 As usual, I’ll do a few minutes of ads now

00:01:14 and never any ads in the middle

00:01:16 that can break the flow of the conversation.

00:01:18 I hope that works for you

00:01:19 and doesn’t hurt the listening experience.

00:01:22 Quick summary of the ads.

00:01:23 Two sponsors, Masterclass and Cash App.

00:01:27 Please consider supporting the podcast

00:01:29 by signing up to Masterclass at masterclass.com slash Lex

00:01:34 and downloading Cash App and using code LexPodcast.

00:01:38 This show is presented by Cash App,

00:01:41 the number one finance app in the app store.

00:01:43 When you get it, use code LexPodcast.

00:01:46 Cash App lets you send money to friends, buy Bitcoin,

00:01:50 and invest in the stock market with as little as $1.

00:01:53 Since Cash App allows you to buy Bitcoin,

00:01:56 let me mention that cryptocurrency

00:01:57 in the context of the history of money is fascinating.

00:02:01 I recommend Ascent of Money as a great book on this history.

00:02:05 Debits and credits on ledgers started around 30,000 years ago.

00:02:10 The US dollar created over 200 years ago,

00:02:12 and Bitcoin, the first decentralized cryptocurrency,

00:02:15 released just over 10 years ago.

00:02:18 So given that history, cryptocurrency is still very much

00:02:21 in its early days of development,

00:02:23 but it’s still aiming to and just might

00:02:26 redefine the nature of money.

00:02:29 So again, if you get Cash App from the app store or Google Play

00:02:32 and use the code LexPodcast, you get $10,

00:02:35 and Cash App will also donate $10 to FIRST,

00:02:38 an organization that is helping to advance robotics

00:02:41 and STEM education for young people around the world.

00:02:44 This show is sponsored by Masterclass.

00:02:46 Sign up at masterclass.com slash Lex

00:02:49 to get a discount and to support this podcast.

00:02:52 In fact, for a limited time now,

00:02:53 if you sign up for an all access pass for a year,

00:02:56 you get to get another all access pass

00:02:59 to share with a friend.

00:03:01 Buy one, get one free.

00:03:02 When I first heard about Masterclass,

00:03:04 I thought it was too good to be true.

00:03:06 For $180 a year, you get an all access pass

00:03:09 to watch courses from, to list some of my favorites,

00:03:12 Chris Hadfield on space exploration,

00:03:15 Neil deGrasse Tyson on scientific thinking and communication,

00:03:18 Will Wright, the creator of SimCity and The Sims, on game design,

00:03:22 Jane Goodall on conservation,

00:03:24 Carlos Santana on guitar.

00:03:26 His song Europa could be the most beautiful

00:03:29 guitar song ever written.

00:03:30 Garry Kasparov on chess, Daniel Negreanu on poker,

00:03:34 and many, many more.

00:03:35 Chris Hadfield explaining how rockets work

00:03:37 and the experience of being launched into space alone

00:03:40 is worth the money.

00:03:41 For me, the key is to not be overwhelmed

00:03:44 by the abundance of choice.

00:03:46 Pick three courses you want to complete,

00:03:48 watch each of them all the way through.

00:03:50 It’s not that long, but it’s an experience

00:03:51 that will stick with you for a long time, I promise.

00:03:55 It’s easily worth the money.

00:03:56 You can watch it on basically any device.

00:03:59 Once again, sign up on masterclass.com slash Lex

00:04:02 to get a discount and to support this podcast.

00:04:05 And now, here’s my conversation with David Silver.

00:04:09 What was the first program you’ve ever written?

00:04:12 And what programming language?

00:04:13 Do you remember?

00:04:14 I remember very clearly, yeah.

00:04:16 My parents brought home this BBC Model B microcomputer.

00:04:22 It was just this fascinating thing to me.

00:04:24 I was about seven years old and couldn’t resist

00:04:27 just playing around with it.

00:04:29 So I think first program ever was writing my name out

00:04:35 in different colors and getting it to loop and repeat that.

00:04:39 And there was something magical about that,

00:04:41 which just led to more and more.

00:04:43 How did you think about computers back then?

00:04:46 Like the magical aspect of it, that you can write a program

00:04:49 and there’s this thing that you just gave birth to

00:04:52 that’s able to create sort of visual elements

00:04:56 and live on its own.

00:04:57 Or did you not think of it in those romantic notions?

00:04:59 Was it more like, oh, that’s cool.

00:05:02 I can solve some puzzles.

00:05:05 It was always more than solving puzzles.

00:05:06 It was something where, you know,

00:05:08 there was this limitless possibilities.

00:05:13 Once you have a computer in front of you,

00:05:14 you can do anything with it.

00:05:16 I used to play with Lego with the same feeling.

00:05:18 You can make anything you want out of Lego,

00:05:20 but even more so with a computer, you know,

00:05:21 you’re not constrained by the amount of kit you’ve got.

00:05:24 And so I was fascinated by it and started pulling out

00:05:26 the user guide and the advanced user guide

00:05:29 and then learning.

00:05:30 So I started in BASIC and then later 6502 assembly.

00:05:34 My father also became interested in this machine

00:05:38 and gave up his career to go back to school

00:05:40 and study for a master’s degree

00:05:42 in artificial intelligence, funnily enough,

00:05:46 at Essex University when I was seven.

00:05:48 So I was exposed to those things at an early age.

00:05:52 He showed me how to program in Prolog

00:05:54 and do things like querying your family tree.

00:05:57 And those are some of my earliest memories

00:05:59 of trying to figure things out on a computer.

00:06:04 Those are the early steps in computer science programming,

00:06:07 but when did you first fall in love

00:06:09 with artificial intelligence or with the ideas,

00:06:12 the dreams of AI?

00:06:14 I think it was really when I went to study at university.

00:06:19 So I was an undergrad at Cambridge

00:06:20 and studying computer science.

00:06:23 And I really started to question,

00:06:27 you know, what really are the goals?

00:06:29 What’s the goal?

00:06:30 Where do we want to go with computer science?

00:06:32 And it seemed to me that the only step

00:06:37 of major significance to take was to try

00:06:40 and recreate something akin to human intelligence.

00:06:44 If we could do that, that would be a major leap forward.

00:06:47 And that idea, I certainly wasn’t the first to have it,

00:06:50 but it, you know, nestled within me somewhere

00:06:53 and became like a bug.

00:06:55 You know, I really wanted to crack that problem.

00:06:58 So you thought it was, like you had a notion

00:07:00 that this is something that human beings can do,

00:07:03 that it is possible to create an intelligent machine.

00:07:07 Well, I mean, unless you believe in something metaphysical,

00:07:11 then what are our brains doing?

00:07:13 Well, at some level they’re information processing systems,

00:07:17 which are able to take whatever information is in there,

00:07:22 transform it through some form of program

00:07:24 and produce some kind of output,

00:07:26 which enables that human being to do all the amazing things

00:07:29 that they can do in this incredible world.

00:07:31 So then do you remember the first time

00:07:35 you’ve written a program that,

00:07:37 because you also had an interest in games.

00:07:40 Do you remember the first time you've written a program

00:07:41 that beat you in a game?

00:07:45 Or, more than that, beat you at anything?

00:07:47 Sort of achieved super David Silver level performance?

00:07:54 So I used to work in the games industry.

00:07:56 So for five years I programmed games for my first job.

00:08:01 So it was an amazing opportunity

00:08:03 to get involved in a startup company.

00:08:05 And so I was involved in building AI at that time.

00:08:12 And so for sure there was a sense of building handcrafted,

00:08:18 what people used to call AI in the games industry,

00:08:20 which I think is not really what we might think of as AI

00:08:23 in its fullest sense,

00:08:24 but something which is able to take actions

00:08:29 and in a way which makes things interesting

00:08:31 and challenging for the human player.

00:08:35 And at that time I was able to build

00:08:38 these handcrafted agents,

00:08:39 which in certain limited cases could do things

00:08:41 better than I could,

00:08:45 but mostly in these kind of twitch-like scenarios

00:08:47 where they were able to do things faster

00:08:50 or because they had some pattern

00:08:51 which they were able to exploit repeatedly.

00:08:55 I think if we’re talking about real AI,

00:08:58 the first experience for me came after that

00:09:00 when I realized that this path I was on

00:09:05 wasn’t taking me towards,

00:09:06 it wasn’t dealing with that bug which I still had inside me

00:09:10 to really understand intelligence and try and solve it.

00:09:14 That everything people were doing in games

00:09:15 was short term fixes rather than long term vision.

00:09:19 And so I went back to study for my PhD,

00:09:22 which was, funnily enough, trying to apply reinforcement learning

00:09:26 to the game of Go.

00:09:27 And I built my first Go program using reinforcement learning,

00:09:31 a system which would by trial and error play against itself

00:09:35 and was able to learn which patterns were actually helpful

00:09:40 to predict whether it was gonna win or lose the game

00:09:42 and then choose the moves that led

00:09:44 to the combination of patterns

00:09:45 that would mean that you’re more likely to win.

00:09:47 And that system, that system beat me.

00:09:50 And how did that make you feel?

00:09:53 Made me feel good.

00:09:54 I mean, was there sort of the, yeah,

00:09:57 it’s a mix of a sort of excitement

00:09:59 and was there a tinge of sort of like,

00:10:02 almost like a fearful awe?

00:10:04 You know, it’s like in space, 2001 Space Odyssey

00:10:08 kind of realizing that you’ve created something that,

00:10:12 you know, that’s achieved human level intelligence

00:10:19 in this one particular little task.

00:10:21 And in that case, I suppose neural networks

00:10:23 weren’t involved.

00:10:24 There were no neural networks in those days.

00:10:26 This was pre deep learning revolution.

00:10:30 But it was a principled self learning system

00:10:33 based on a lot of the principles which people

00:10:36 are still using in deep reinforcement learning.

00:10:40 How did I feel?

00:10:41 I think I found it immensely satisfying

00:10:46 that a system which was able to learn

00:10:49 from first principles for itself

00:10:51 was able to reach the point

00:10:52 that it was understanding this domain

00:10:56 better than I could and able to outwit me.

00:11:00 I don’t think it was a sense of awe.

00:11:01 It was a sense that satisfaction,

00:11:04 that something I felt should work had worked.

00:11:08 So to me, AlphaGo, and I don’t know how else to put it,

00:11:11 but to me, AlphaGo and AlphaGo Zero,

00:11:14 mastering the game of Go is again, to me,

00:11:18 the most profound and inspiring moment

00:11:20 in the history of artificial intelligence.

00:11:23 So you’re one of the key people behind this achievement

00:11:26 and I’m Russian.

00:11:27 So I really felt the first sort of seminal achievement

00:11:31 when Deep Blue beat Garry Kasparov in 1997.

00:11:34 So as far as I know, the AI community at that point

00:11:40 largely saw the game of Go as unbeatable by AI

00:11:43 using the sort of state of the art

00:11:46 brute force methods, search methods.

00:11:48 Even if you consider, at least the way I saw it,

00:11:51 even if you consider arbitrary exponential scaling

00:11:55 of compute, Go would still not be solvable,

00:11:59 hence why it was thought to be impossible.

00:12:01 So given that the game of Go was impossible to master,

00:12:07 what was the dream for you?

00:12:09 You just mentioned your PhD thesis

00:12:11 of building the system that plays Go.

00:12:14 What was the dream for you that you could actually

00:12:16 build a computer program that achieves world class,

00:12:20 not necessarily beats the world champion,

00:12:21 but achieves that kind of level of playing Go?

00:12:24 First of all, thank you, that’s very kind words.

00:12:27 And funnily enough, I just came from a panel

00:12:31 where I was actually in a conversation

00:12:34 with Garry Kasparov and Murray Campbell,

00:12:36 who was one of the authors of Deep Blue.

00:12:38 And it was their first meeting together since the match.

00:12:43 So that just occurred yesterday.

00:12:44 So I’m literally fresh from that experience.

00:12:47 So these are amazing moments when they happen,

00:12:50 but where did it all start?

00:12:52 Well, for me, it started when I became fascinated

00:12:55 in the game of Go.

00:12:56 So Go for me, I’ve grown up playing games.

00:12:59 I’ve always had a fascination in board games.

00:13:01 I played chess as a kid, I played Scrabble as a kid.

00:13:06 When I was at university, I discovered the game of Go.

00:13:08 And to me, it just blew all of those other games

00:13:11 out of the water.

00:13:12 It was just so deep and profound in its complexity

00:13:15 with endless levels to it.

00:13:17 What I discovered was that I could devote

00:13:22 endless hours to this game.

00:13:25 And I knew in my heart of hearts

00:13:28 that no matter how many hours I would devote to it,

00:13:30 I would never become a grandmaster,

00:13:34 or there was another path.

00:13:35 And the other path was to try and understand

00:13:38 how you could get some other intelligence

00:13:40 to play this game better than I would be able to.

00:13:43 And so even in those days, I had this idea that,

00:13:46 what if, what if it was possible to build a program

00:13:49 that could crack this?

00:13:51 And as I started to explore the domain,

00:13:53 I discovered that this was really the domain

00:13:57 where people felt deeply that if progress

00:14:01 could be made in Go,

00:14:02 it would really mean a giant leap forward for AI.

00:14:06 It was the challenge where all other approaches had failed.

00:14:10 This is coming out of the era you mentioned,

00:14:13 which was in some sense, the golden era

00:14:15 for the classical methods of AI, like heuristic search.

00:14:19 In the 90s, they all fell one after another,

00:14:23 not just chess with deep blue, but checkers,

00:14:26 backgammon, Othello.

00:14:28 There were numerous cases where systems

00:14:33 built on top of heuristic search methods

00:14:35 with these high performance systems

00:14:37 had been able to defeat the human world champion

00:14:40 in each of those domains.

00:14:41 And yet in that same time period,

00:14:44 there was a million dollar prize available

00:14:47 for the game of Go, for the first system

00:14:50 to be a human professional player.

00:14:52 And at the end of that time period,

00:14:54 in the year 2000 when the prize expired,

00:14:57 the strongest Go program in the world

00:15:00 was defeated by a nine year old child

00:15:02 when that nine year old child was giving nine free moves

00:15:05 to the computer at the start of the game

00:15:07 to try and even things up.

00:15:09 And a computer Go expert beat that same strongest program

00:15:13 even when giving it 29 handicap stones, 29 free moves.

00:15:18 So that’s what the state of affairs was

00:15:20 when I became interested in this problem

00:15:23 in around 2003 when I started working on computer Go.

00:15:29 There was nothing, there was very, very little

00:15:33 in the way of progress towards meaningful performance,

00:15:36 again, anything approaching human level.

00:15:39 And so people, it wasn’t through lack of effort,

00:15:42 people had tried many, many things.

00:15:44 And so there was a strong sense

00:15:46 that something different would be required for Go

00:15:49 than had been needed for all of these other domains

00:15:52 where AI had been successful.

00:15:54 And maybe the single clearest example

00:15:56 is that Go, unlike those other domains,

00:15:59 had this kind of intuitive property

00:16:02 that a Go player would look at a position

00:16:04 and say, hey, here’s this mess of black and white stones.

00:16:09 But from this mess, oh, I can predict

00:16:12 that this part of the board has become my territory,

00:16:15 this part of the board has become your territory,

00:16:17 and I’ve got this overall sense that I’m gonna win

00:16:20 and that this is about the right move to play.

00:16:22 And that intuitive sense of judgment,

00:16:24 of being able to evaluate what’s going on in a position,

00:16:28 it was pivotal to humans being able to play this game

00:16:31 and something that people had no idea

00:16:33 how to put into computers.

00:16:35 So this question of how to evaluate a position,

00:16:37 how to come up with these intuitive judgments

00:16:40 was the key reason why Go was so hard

00:16:44 in addition to its enormous search space,

00:16:47 and the reason why methods

00:16:49 which had succeeded so well elsewhere failed in Go.

00:16:53 And so people really felt deep down that in order to crack Go

00:16:57 we would need to get something akin to human intuition.

00:17:00 And if we got something akin to human intuition,

00:17:02 we’d be able to solve many, many more problems in AI.

00:17:06 So for me, that was the moment where it’s like,

00:17:09 okay, this is not just about playing the game of Go,

00:17:11 this is about something profound.

00:17:13 And it was back to that bug

00:17:15 which had been itching me all those years.

00:17:17 This is the opportunity to do something meaningful

00:17:19 and transformative, and I guess a dream was born.

00:17:23 That’s a really interesting way to put it.

00:17:25 So almost this realization that you need to find,

00:17:29 formulate Go as a kind of a prediction problem

00:17:31 versus a search problem was the intuition.

00:17:34 I mean, maybe that’s the wrong crude term,

00:17:37 but to give it the ability to kind of intuit things

00:17:44 about positional structure of the board.

00:17:47 Now, okay, but what about the learning part of it?

00:17:51 Did you have a sense that you have to,

00:17:54 that learning has to be part of the system?

00:17:57 Again, something that hasn’t really as far as I think,

00:18:01 except with TD-Gammon in the 90s with RL a little bit,

00:18:05 hasn’t been part of those state of the art game playing

00:18:07 systems.

00:18:08 So I strongly felt that learning would be necessary.

00:18:12 And that’s why my PhD topic back then was trying

00:18:16 to apply reinforcement learning to the game of Go

00:18:20 and not just learning of any type,

00:18:21 but I felt that the only way to really have a system

00:18:26 to progress beyond human levels of performance

00:18:29 wouldn’t just be to mimic how humans do it,

00:18:31 but to understand for themselves.

00:18:33 And how else can a machine hope to understand

00:18:36 what’s going on except through learning?

00:18:39 If you’re not learning, what else are you doing?

00:18:40 Well, you’re putting all the knowledge into the system.

00:18:42 And that just feels like something which decades of AI

00:18:47 have told us is maybe not a dead end,

00:18:50 but certainly has a ceiling to the capabilities.

00:18:53 It’s known as the knowledge acquisition bottleneck,

00:18:55 that the more you try to put into something,

00:18:58 the more brittle the system becomes.

00:19:00 And so you just have to have learning.

00:19:02 You have to have learning.

00:19:03 That’s the only way you’re going to be able to get a system

00:19:06 which has sufficient knowledge in it,

00:19:10 millions and millions of pieces of knowledge,

00:19:11 billions, trillions of a form

00:19:14 that it can actually apply for itself

00:19:15 and understand how those billions and trillions

00:19:18 of pieces of knowledge can be leveraged in a way

00:19:20 which will actually lead it towards its goal

00:19:22 without conflict or other issues.

00:19:27 Yeah, I mean, if I put myself back in that time,

00:19:30 I just wouldn’t think like that.

00:19:33 Without a good demonstration of RL,

00:19:34 I would think more in the symbolic AI,

00:19:37 like not learning, but sort of a simulation

00:19:42 of knowledge base, like a growing knowledge base,

00:19:46 but it would still be sort of pattern based,

00:19:50 like basically have little rules

00:19:52 that you kind of assemble together

00:19:54 into a large knowledge base.

00:19:56 Well, in a sense, that was the state of the art back then.

00:19:59 So if you look at the Go programs,

00:20:01 which had been competing for this prize I mentioned,

00:20:05 they were an assembly of different specialized systems,

00:20:09 some of which used huge amounts of human knowledge

00:20:11 to describe how you should play the opening,

00:20:14 how you should, all the different patterns

00:20:16 that were required to play well in the game of Go,

00:20:21 end game theory, combinatorial game theory,

00:20:24 and combined with more principled search based methods,

00:20:28 which were trying to solve for particular sub parts

00:20:31 of the game, like life and death,

00:20:34 connecting groups together,

00:20:36 all these amazing sub problems

00:20:38 that just emerge in the game of Go,

00:20:40 there were different pieces all put together

00:20:43 into this like collage,

00:20:45 which together would try and play against a human.

00:20:49 And although not all of the pieces were handcrafted,

00:20:54 the overall effect was nevertheless still brittle,

00:20:56 and it was hard to make all these pieces work well together.

00:21:00 And so really, what I was pressing for

00:21:02 and the main innovation of the approach I took

00:21:05 was to go back to first principles and say,

00:21:08 well, let’s back off that

00:21:10 and try and find a principled approach

00:21:12 where the system can learn for itself,

00:21:16 just from the outcome, like learn for itself.

00:21:19 If you try something, did that help or did it not help?

00:21:22 And only through that procedure can you arrive at knowledge,

00:21:26 which is verified.

00:21:27 The system has to verify it for itself,

00:21:29 not relying on any other third party

00:21:31 to say this is right or this is wrong.

00:21:33 And so that principle was already very important

00:21:38 in those days, but unfortunately,

00:21:39 we were missing some important pieces back then.

00:21:43 So before we dive into maybe

00:21:46 discussing the beauty of reinforcement learning,

00:21:49 let’s take a step back, we kind of skipped it a bit,

00:21:52 but the rules of the game of Go,

00:21:55 what the elements of it perhaps contrasting to chess

00:22:02 that sort of you really enjoyed as a human being,

00:22:07 and also that make it really difficult

00:22:09 as a AI machine learning problem.

00:22:13 So the game of Go has remarkably simple rules.

00:22:16 In fact, so simple that people have speculated

00:22:19 that if we were to meet alien life at some point,

00:22:22 that we wouldn’t be able to communicate with them,

00:22:23 but we would be able to play Go with them.

00:22:26 They'd probably have discovered the same rule set.

00:22:28 So the game is played on a 19 by 19 grid,

00:22:32 and you play on the intersections of the grid

00:22:34 and the players take turns.

00:22:35 And the aim of the game is very simple.

00:22:37 It’s to surround as much territory as you can,

00:22:40 as many of these intersections with your stones

00:22:43 and to surround more than your opponent does.

00:22:46 And the only nuance to the game is that

00:22:48 if you fully surround your opponent’s piece,

00:22:50 then you get to capture it and remove it from the board

00:22:52 and it counts as your own territory.
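
As a concrete illustration of that capture rule, here is a minimal sketch of how a program might detect a fully surrounded group by flood-filling its stones and counting its liberties (the empty adjacent intersections). The board representation here is an assumption for illustration only.

```python
# Minimal sketch of the capture rule: a group of connected stones is captured
# when it has no liberties (no empty adjacent intersections).  The board is a
# dict mapping (row, col) -> "black" / "white"; intersections absent from the
# dict are empty.  Illustrative only.
def group_and_liberties(board, start, size=19):
    color = board[start]
    group, liberties, frontier = {start}, set(), [start]
    while frontier:
        r, c = frontier.pop()
        for nr, nc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
            if not (0 <= nr < size and 0 <= nc < size):
                continue
            if (nr, nc) not in board:
                liberties.add((nr, nc))              # empty point next to the group
            elif board[(nr, nc)] == color and (nr, nc) not in group:
                group.add((nr, nc))
                frontier.append((nr, nc))
    return group, liberties

def captured_groups(board, just_played_color, size=19):
    # After a move, any opposing group left with zero liberties is removed.
    dead = set()
    for point, color in board.items():
        if color != just_played_color and point not in dead:
            group, libs = group_and_liberties(board, point, size)
            if not libs:
                dead |= group
    return dead
```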

00:22:54 Now from those very simple rules, immense complexity arises.

00:22:58 There’s kind of profound strategies

00:22:59 in how to surround territory,

00:23:02 how to kind of trade off between

00:23:04 making solid territory yourself now

00:23:07 compared to building up influence

00:23:09 that will help you acquire territory later in the game,

00:23:11 how to connect groups together,

00:23:12 how to keep your own groups alive,

00:23:16 which patterns of stones are most useful

00:23:19 compared to others.

00:23:21 There’s just immense knowledge.

00:23:23 And human Go players have played this game for,

00:23:27 it was discovered thousands of years ago,

00:23:29 and human Go players have built up

00:23:30 this immense knowledge base over the years.

00:23:33 It’s studied very deeply and played by

00:23:36 something like 50 million players across the world,

00:23:38 mostly in China, Japan, and Korea,

00:23:41 where it’s an important part of the culture,

00:23:43 so much so that it’s considered one of the

00:23:45 four ancient arts that were required of Chinese scholars.

00:23:49 So there’s a deep history there.

00:23:51 But there’s interesting qualities.

00:23:53 So if I sort of compare to chess,

00:23:55 in the same way as it is in Chinese culture for Go,

00:23:59 chess in Russia is also considered

00:24:01 one of the sacred arts.

00:24:03 So if we contrast sort of Go with chess,

00:24:06 there’s interesting qualities about Go.

00:24:09 Maybe you can correct me if I’m wrong,

00:24:10 but the evaluation of a particular static board

00:24:15 is not as reliable.

00:24:18 Like you can’t, in chess you can kind of assign points

00:24:21 to the different units,

00:24:23 and it’s kind of a pretty good measure

00:24:26 of who’s winning, who’s losing.

00:24:27 It’s not so clear.

00:24:29 Yeah, so in the game of Go,

00:24:31 you find yourself in a situation where

00:24:33 both players have played the same number of stones.

00:24:36 Actually, captures at a strong level of play

00:24:38 happen very rarely, which means that

00:24:40 at any moment in the game,

00:24:41 you’ve got the same number of white stones and black stones.

00:24:43 And the only thing which differentiates

00:24:45 how well you’re doing is this intuitive sense

00:24:48 of where are the territories ultimately

00:24:50 going to form on this board?

00:24:52 And if you look at the complexity of a real Go position,

00:24:57 it’s mind boggling that kind of question

00:25:00 of what will happen in 300 moves from now

00:25:02 when you see just a scattering of 20 white

00:25:05 and black stones intermingled.

00:25:07 And so that challenge is the reason

00:25:12 why position evaluation is so hard in Go

00:25:15 compared to other games.

00:25:17 In addition to that, it has an enormous search space.

00:25:19 So there’s around 10 to the 170 positions

00:25:23 in the game of Go.

00:25:24 That’s an astronomical number.

00:25:26 And that search space is so great

00:25:28 that traditional heuristic search methods

00:25:30 that were so successful in things like Deep Blue

00:25:32 and chess programs just kind of fall over in Go.

00:25:36 So at which point did reinforcement learning

00:25:39 enter your life, your research life, your way of thinking?

00:25:43 We just talked about learning,

00:25:45 but reinforcement learning is a very particular

00:25:47 kind of learning.

00:25:49 One that’s both philosophically sort of profound,

00:25:53 but also one that’s pretty difficult to get to work

00:25:55 as if we look back in the early days.

00:25:58 So when did that enter your life

00:26:00 and how did that work progress?

00:26:02 So I had just finished working in the games industry

00:26:06 at this startup company.

00:26:07 And I took a year out to discover for myself

00:26:13 exactly which path I wanted to take.

00:26:14 I knew I wanted to study intelligence,

00:26:17 but I wasn’t sure what that meant at that stage.

00:26:19 I really didn’t feel I had the tools

00:26:21 to decide on exactly which path I wanted to follow.

00:26:24 So during that year, I read a lot.

00:26:27 And one of the things I read was Sutton and Barto,

00:26:31 the sort of seminal textbook

00:26:33 on an introduction to reinforcement learning.

00:26:35 And when I read that textbook,

00:26:39 I just had this resonating feeling

00:26:43 that this is what I understood intelligence to be.

00:26:47 And this was the path that I felt would be necessary

00:26:51 to go down to make progress in AI.

00:26:55 So I got in touch with Rich Sutton

00:27:00 and asked him if he would be interested

00:27:02 in supervising me on a PhD thesis in computer go.

00:27:07 And he basically said

00:27:11 that if he’s still alive, he’d be happy to.

00:27:15 But unfortunately, he’d been struggling

00:27:19 with very serious cancer for some years.

00:27:21 And he really wasn’t confident at that stage

00:27:23 that he’d even be around to see the end event.

00:27:26 But fortunately, that part of the story

00:27:28 worked out very happily.

00:27:29 And I found myself out there in Alberta.

00:27:32 They’ve got a great games group out there

00:27:34 with a history of fantastic work in board games,

00:27:38 as well as Rich Sutton, the father of RL.

00:27:40 So it was the natural place for me to go in some sense

00:27:43 to study this question.

00:27:45 And the more I looked into it,

00:27:48 the more strongly I felt that this

00:27:53 wasn’t just the path to progress in computer go.

00:27:56 But really, this was the thing I’d been looking for.

00:27:59 This was really an opportunity

00:28:04 to frame what intelligence means.

00:28:08 Like what are the goals of AI in a clear,

00:28:12 single clear problem definition,

00:28:14 such that if we’re able to solve

00:28:15 that clear single problem definition,

00:28:18 in some sense, we’ve cracked the problem of AI.

00:28:21 So to you, reinforcement learning ideas,

00:28:24 at least sort of echoes of it,

00:28:26 would be at the core of intelligence.

00:28:29 It is at the core of intelligence.

00:28:31 And if we ever create a human level intelligence system,

00:28:34 it would be at the core of that kind of system.

00:28:37 Let me say it this way, that I think it’s helpful

00:28:39 to separate out the problem from the solution.

00:28:42 So I see the problem of intelligence,

00:28:45 I would say it can be formalized

00:28:48 as the reinforcement learning problem,

00:28:50 and that that formalization is enough

00:28:52 to capture most, if not all of the things

00:28:56 that we mean by intelligence,

00:28:58 that they can all be brought within this framework

00:29:01 and gives us a way to access them in a meaningful way

00:29:03 that allows us as scientists to understand intelligence

00:29:08 and us as computer scientists to build them.

00:29:12 And so in that sense, I feel that it gives us a path,

00:29:16 maybe not the only path, but a path towards AI.

00:29:20 And so do I think that any system in the future

00:29:24 that’s solved AI would have to have RL within it?

00:29:29 Well, I think if you ask that,

00:29:30 you’re asking about the solution methods.

00:29:33 I would say that if we have such a thing,

00:29:35 it would be a solution to the RL problem.

00:29:37 Now, what particular methods have been used to get there?

00:29:41 Well, we should keep an open mind

00:29:42 about the best approaches to actually solve any problem.

00:29:45 And the things we have right now for reinforcement learning,

00:29:49 maybe I believe they’ve got a lot of legs,

00:29:53 but maybe we’re missing some things.

00:29:54 Maybe there’s gonna be better ideas.

00:29:56 I think we should keep, let’s remain modest

00:29:59 and we’re at the early days of this field

00:30:02 and there are many amazing discoveries ahead of us.

00:30:04 For sure, the specifics,

00:30:06 especially of the different kinds of RL approaches currently,

00:30:09 there could be other things that fall

00:30:11 into the very large umbrella of RL.

00:30:13 But if it’s okay, can we take a step back

00:30:16 and kind of ask the basic question

00:30:18 of what is to you reinforcement learning?

00:30:22 So reinforcement learning is the study

00:30:25 and the science and the problem of intelligence

00:30:31 in the form of an agent that interacts with an environment.

00:30:35 So the problem you’re trying to solve

00:30:36 is represented by some environment,

00:30:38 like the world in which that agent is situated.

00:30:40 And the goal of RL is clear

00:30:42 that the agent gets to take actions.

00:30:45 Those actions have some effect on the environment

00:30:47 and the environment gives back an observation

00:30:49 to the agent saying, this is what you see or sense.

00:30:52 And one special thing which it gives back

00:30:54 is called the reward signal,

00:30:56 how well it’s doing in the environment.

00:30:58 And the reinforcement learning problem

00:30:59 is to simply take actions over time

00:31:04 so as to maximize that reward signal.
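
Written out as code, the interaction loop just described looks something like the minimal sketch below. The `env` and `agent` objects and their methods are generic placeholders in the spirit of common RL toolkits, not any particular system.

```python
# Minimal sketch of the reinforcement learning interaction loop described above:
# the agent acts, the environment responds with an observation and a reward,
# and the agent's only objective is to maximise cumulative reward over time.
def run_episode(env, agent):
    observation = env.reset()
    total_reward, done = 0.0, False
    while not done:
        action = agent.act(observation)               # agent picks an action
        observation, reward, done = env.step(action)  # environment responds
        agent.learn(observation, reward, done)        # agent may update itself
        total_reward += reward
    return total_reward                               # the quantity being maximised
```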

00:31:07 So a couple of basic questions.

00:31:11 What types of RL approaches are there?

00:31:13 So I don’t know if there’s a nice brief inwards way

00:31:17 to paint the picture of sort of value based,

00:31:21 model based, policy based reinforcement learning.

00:31:25 Yeah, so now if we think about,

00:31:27 okay, so there’s this ambitious problem definition of RL.

00:31:31 It’s really, it’s truly ambitious.

00:31:33 It’s trying to capture and encircle

00:31:34 all of the things in which an agent interacts

00:31:36 with an environment and say, well,

00:31:38 how can we formalize and understand

00:31:39 what it means to crack that?

00:31:41 Now let’s think about the solution method.

00:31:43 Well, how do you solve a really hard problem like that?

00:31:46 Well, one approach you can take

00:31:48 is to decompose that very hard problem

00:31:51 into pieces that work together to solve that hard problem.

00:31:55 And so you can kind of look at the decomposition

00:31:58 that’s inside the agent’s head, if you like,

00:32:00 and ask, well, what form does that decomposition take?

00:32:03 And some of the most common pieces that people use

00:32:06 when they’re kind of putting

00:32:07 the solution method together,

00:32:09 some of the most common pieces that people use

00:32:11 are whether or not that solution has a value function.

00:32:14 That means, is it trying to predict,

00:32:16 explicitly trying to predict how much reward

00:32:18 it will get in the future?

00:32:20 Does it have a representation of a policy?

00:32:22 That means something which is deciding how to pick actions.

00:32:25 Is that decision making process explicitly represented?

00:32:28 And is there a model in the system?

00:32:31 Is there something which is explicitly trying to predict

00:32:34 what will happen in the environment?

00:32:36 And so those three pieces are, to me,

00:32:40 some of the most common building blocks.

00:32:42 And I understand the different choices in RL

00:32:47 as choices of whether or not to use those building blocks

00:32:49 when you’re trying to decompose the solution.

00:32:52 Should I have a value function represented?

00:32:54 Should I have a policy represented?

00:32:56 Should I have a model represented?

00:32:58 And there are combinations of those pieces

00:33:00 and, of course, other things that you could

00:33:01 add into the picture as well.

00:33:03 But those three fundamental choices

00:33:04 give rise to some of the branches of RL

00:33:06 with which we’re very familiar.

00:33:08 And so those, as you mentioned,

00:33:10 there is a choice of what’s specified

00:33:14 or modeled explicitly.

00:33:17 And the idea is that all of these

00:33:20 are somehow implicitly learned within the system.

00:33:23 So it’s almost a choice of how you approach a problem.

00:33:28 Do you see those as fundamental differences

00:33:30 or are these almost like small specifics,

00:33:35 like the details of how you solve a problem

00:33:37 but they’re not fundamentally different from each other?

00:33:40 I think the fundamental idea is maybe at the higher level.

00:33:45 The fundamental idea is the first step

00:33:48 of the decomposition is really to say,

00:33:50 well, how are we really gonna solve any kind of problem

00:33:55 where you’re trying to figure out how to take actions

00:33:57 and just from this stream of observations,

00:33:59 you’ve got some agent situated in its sensory motor stream

00:34:02 and getting all these observations in,

00:34:04 getting to take these actions, and what should it do?

00:34:06 How can you even broach that problem?

00:34:07 You know, maybe the complexity of the world is so great

00:34:10 that you can’t even imagine how to build a system

00:34:13 that would understand how to deal with that.

00:34:15 And so the first step of this decomposition is to say,

00:34:18 well, you have to learn.

00:34:19 The system has to learn for itself.

00:34:22 And so note that the reinforcement learning problem

00:34:24 doesn’t actually stipulate that you have to learn.

00:34:27 Like you could maximize your rewards without learning.

00:34:29 It would just, wouldn’t do a very good job of it.

00:34:32 So learning is required

00:34:34 because it’s the only way to achieve good performance

00:34:36 in any sufficiently large and complex environment.

00:34:40 So that’s the first step.

00:34:42 And so that step gives commonality

00:34:43 to all of the other pieces,

00:34:45 because now you might ask, well, what should you be learning?

00:34:48 What does learning even mean?

00:34:49 You know, in this sense, you know, learning might mean,

00:34:52 well, you’re trying to update the parameters

00:34:55 of some system, which is then the thing

00:34:59 that actually picks the actions.

00:35:00 And those parameters could be representing anything.

00:35:03 They could be parameterizing a value function or a model

00:35:06 or a policy.

00:35:08 And so in that sense, there’s a lot of commonality

00:35:10 in that whatever is being represented there

00:35:12 is the thing which is being learned,

00:35:13 and it’s being learned with the ultimate goal

00:35:15 of maximizing rewards.

00:35:17 But the way in which you decompose the problem

00:35:20 is really what gives the semantics to the whole system.

00:35:23 Like, are you trying to learn something to predict well,

00:35:27 like a value function or a model?

00:35:28 Are you learning something to perform well, like a policy?

00:35:31 And the form of that objective

00:35:34 is kind of giving the semantics to the system.

00:35:36 And so it really is, at the next level down,

00:35:39 a fundamental choice,

00:35:40 and we have to make those fundamental choices

00:35:42 as system designers or enable our algorithms

00:35:46 to be able to learn how to make those choices for themselves.

00:35:49 So then the next step you mentioned,

00:35:52 the very first thing you have to deal with is,

00:35:56 can you even take in this huge stream of observations

00:36:00 and do anything with it?

00:36:01 So the natural next basic question is,

00:36:05 what is deep reinforcement learning?

00:36:08 And what is this idea of using neural networks

00:36:11 to deal with this huge incoming stream?

00:36:14 So amongst all the approaches for reinforcement learning,

00:36:18 deep reinforcement learning

00:36:19 is one family of solution methods

00:36:23 that tries to utilize powerful representations

00:36:29 that are offered by neural networks

00:36:31 to represent any of these different components

00:36:35 of the solution, of the agent,

00:36:37 like whether it’s the value function

00:36:39 or the model or the policy.

00:36:41 The idea of deep learning is to say,

00:36:43 well, here’s a powerful toolkit that’s so powerful

00:36:46 that it’s universal in the sense

00:36:48 that it can represent any function

00:36:50 and it can learn any function.

00:36:52 And so if we can leverage that universality,

00:36:55 that means that whatever we need to represent

00:36:57 for our policy or for our value function or for a model,

00:37:00 deep learning can do it.

00:37:01 So that deep learning is one approach

00:37:04 that offers us a toolkit

00:37:06 that has no ceiling to its performance,

00:37:09 that as we start to put more resources into the system,

00:37:12 more memory and more computation and more data,

00:37:17 more experience, more interactions with the environment,

00:37:20 that these are systems that can just get better

00:37:22 and better and better at doing whatever the job is

00:37:24 they’ve asked them to do,

00:37:25 whatever we’ve asked that function to represent,

00:37:27 it can learn a function that does a better and better job

00:37:31 of representing that knowledge,

00:37:33 whether that knowledge be estimating

00:37:35 how well you’re gonna do in the world,

00:37:36 the value function,

00:37:37 whether it’s gonna be choosing what to do in the world,

00:37:40 the policy,

00:37:41 or whether it’s understanding the world itself,

00:37:43 what’s gonna happen next, the model.

00:37:45 Nevertheless, the fact that neural networks

00:37:49 are able to learn incredibly complex representations

00:37:53 that allow you to do the policy, the model

00:37:55 or the value function is, at least to my mind,

00:38:00 exceptionally beautiful and surprising.

00:38:02 Like, was it surprising to you?

00:38:07 Can you still believe it works as well as it does?

00:38:10 Do you have good intuition about why it works at all

00:38:13 and works as well as it does?

00:38:18 I think, let me take two parts to that question.

00:38:22 I think it’s not surprising to me

00:38:26 that the idea of reinforcement learning works

00:38:30 because in some sense, I think it’s the,

00:38:34 I feel it’s the only thing which can ultimately.

00:38:36 And so I feel we have to address it

00:38:39 and that success must be possible

00:38:41 because we have examples of intelligence.

00:38:44 And it must at some level be

00:38:47 possible to acquire experience

00:38:49 and use that experience to do better

00:38:51 in a way which is meaningful to environments

00:38:55 of the complexity that humans can deal with.

00:38:57 It must be.

00:38:58 Am I surprised that our current systems

00:39:00 can do as well as they can do?

00:39:03 I think one of the big surprises for me

00:39:05 and a lot of the community

00:39:09 is really the fact that deep learning

00:39:13 can continue to perform so well

00:39:18 despite the fact that these neural networks

00:39:21 that they’re representing

00:39:23 have these incredibly nonlinear kind of bumpy surfaces

00:39:27 which to our kind of low dimensional intuitions

00:39:30 make it feel like surely you’re just gonna get stuck

00:39:33 and learning will get stuck

00:39:34 because you won’t be able to make any further progress.

00:39:37 And yet the big surprise is that learning continues

00:39:42 and these what appear to be local optima

00:39:45 turn out not to be because in high dimensions

00:39:48 when we make really big neural nets,

00:39:49 there’s always a way out

00:39:51 and there’s a way to go even lower

00:39:52 and then you’re still not in a local optima

00:39:55 because there’s some other pathway

00:39:57 that will take you out and take you lower still.

00:39:59 And so no matter where you are,

00:40:00 learning can proceed and do better and better and better

00:40:04 without bound.

00:40:06 And so that is a surprising

00:40:09 and beautiful property of neural nets

00:40:13 which I find elegant and beautiful

00:40:16 and somewhat shocking that it turns out to be the case.

00:40:20 As you said, which I really like

00:40:22 to our low dimensional intuitions, that’s surprising.

00:40:27 Yeah, we’re very tuned to working

00:40:31 within a three dimensional environment.

00:40:33 And so to start to visualize

00:40:36 what a billion dimensional neural network surface

00:40:41 that you’re trying to optimize over,

00:40:42 what that even looks like is very hard for us.

00:40:45 And so I think that really,

00:40:47 if you try to account for the,

00:40:52 essentially the AI winter

00:40:54 where people gave up on neural networks,

00:40:56 I think it’s really down to that lack of ability

00:41:00 to generalize from low dimensions to high dimensions

00:41:03 because back then we were in the low dimensional case.

00:41:05 People could only build neural nets

00:41:07 with 50 nodes in them or something.

00:41:11 And to imagine that it might be possible

00:41:14 to build a billion dimensional neural net

00:41:15 and it might have a completely different,

00:41:17 qualitatively different property was very hard to anticipate.

00:41:21 And I think even now we’re starting to build the theory

00:41:24 to support that.

00:41:26 And it’s incomplete at the moment,

00:41:28 but all of the theory seems to be pointing in the direction

00:41:30 that indeed this is an approach which truly is universal

00:41:34 both in its representational capacity, which was known,

00:41:37 but also in its learning ability, which is surprising.

00:41:40 And it makes one wonder what else we’re missing

00:41:44 due to our low dimensional intuitions

00:41:47 that will seem obvious once it’s discovered.

00:41:51 I often wonder, when we one day do have AIs

00:41:57 which are superhuman in their abilities

00:42:00 to understand the world,

00:42:05 what will they think of the algorithms

00:42:07 that we're developing right now?

00:42:08 Will it be looking back at these days

00:42:11 and thinking that, will we look back and feel

00:42:17 that these algorithms were naive first steps

00:42:19 or will they still be the fundamental ideas

00:42:21 which are used even in 100,000, 10,000 years?

00:42:26 It’s hard to know.

00:42:27 They’ll watch back to this conversation

00:42:30 and with a smile, maybe a little bit of a laugh.

00:42:34 I mean, my sense is, I think just like when we used

00:42:40 to think that the sun revolved around the earth,

00:42:45 they’ll see our systems of today, reinforcement learning

00:42:49 as too complicated, that the answer was simple all along.

00:42:54 There’s something, just like you said in the game of Go,

00:42:58 I mean, I love the systems of like cellular automata,

00:43:01 that there’s simple rules from which incredible complexity

00:43:05 emerges, so it feels like there might be

00:43:08 some really simple approaches,

00:43:10 just like Rich Sutton says, right?

00:43:12 These simple methods with compute over time

00:43:17 seem to prove to be the most effective.

00:43:20 I 100% agree.

00:43:21 I think that if we try to anticipate

00:43:27 what will generalize well into the future,

00:43:30 I think it’s likely to be the case

00:43:32 that it’s the simple, clear ideas

00:43:35 which will have the longest legs

00:43:36 and which will carry us furthest into the future.

00:43:39 Nevertheless, we’re in a situation

00:43:40 where we need to make things work today,

00:43:43 and sometimes that requires putting together

00:43:44 more complex systems where we don’t have

00:43:47 the full answers yet as to what

00:43:49 those minimal ingredients might be.

00:43:51 So speaking of which, if we could take a step back to Go,

00:43:55 what was MoGo and what was the key idea behind the system?

00:44:00 So back during my PhD on Computer Go,

00:44:04 around about that time, there was a major new development

00:44:08 which actually happened in the context of Computer Go,

00:44:12 and it was really a revolution in the way

00:44:16 that heuristic search was done,

00:44:18 and the idea was essentially that

00:44:21 a position could be evaluated or a state in general

00:44:26 could be evaluated not by humans saying

00:44:30 whether that position is good or not,

00:44:33 or even humans providing rules

00:44:35 as to how you might evaluate it,

00:44:37 but instead by allowing the system

00:44:40 to randomly play out the game until the end multiple times

00:44:45 and taking the average of those outcomes

00:44:48 as the prediction of what will happen.

00:44:50 So for example, if you’re in the game of Go,

00:44:53 the intuition is that you take a position

00:44:55 and you get the system to kind of play random moves

00:44:58 against itself all the way to the end of the game

00:45:00 and you see who wins.

00:45:01 And if black ends up winning

00:45:03 more of those random games than white,

00:45:05 well, you say, hey, this is a position that favors black.

00:45:07 And if white ends up winning more of those random games

00:45:09 than black, then it favors white.

00:45:13 So that idea was known as Monte Carlo search,

00:45:18 and a particular form of Monte Carlo search

00:45:21 that became very effective and was developed in computer Go

00:45:24 first by Rémi Coulom in 2006,

00:45:26 and then taken further by others

00:45:29 was something called Monte Carlo tree search,

00:45:31 which basically takes that same idea

00:45:34 and uses that insight so that every node of a search tree

00:45:39 is evaluated by the average of the random playouts

00:45:42 from that node onwards.
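
In code, the core of that idea is very small. The sketch below evaluates a position by averaging random playouts and then picks the move whose resulting position scores best; full Monte Carlo tree search grows a tree and applies the same rollout-based evaluation at every node, concentrating playouts on promising branches. The game interface here is an assumed placeholder, not the MoGo implementation.

```python
import random

# Illustrative sketch of Monte Carlo evaluation: a position's value is the
# average outcome of many random self-play games ("rollouts") from that
# position.  The `game` interface is an assumed placeholder.
def rollout(game, state, player):
    while not game.is_over(state):
        state = game.play(state, random.choice(game.legal_moves(state)))
    return 1.0 if game.winner(state) == player else 0.0

def evaluate(game, state, player, num_rollouts=100):
    # Average of random playouts = estimated probability that `player` wins.
    return sum(rollout(game, state, player) for _ in range(num_rollouts)) / num_rollouts

def choose_move(game, state, player, num_rollouts=100):
    # One-ply Monte Carlo search: pick the move whose resulting position has
    # the best rollout estimate.  Monte Carlo *tree* search extends this by
    # growing a tree and evaluating every node the same way.
    return max(game.legal_moves(state),
               key=lambda m: evaluate(game, game.play(state, m), player, num_rollouts))
```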

00:45:44 And this idea, when you think about it,

00:45:46 was very powerful

00:45:49 and suddenly led to huge leaps forward

00:45:51 in the strength of computer Go playing programs.

00:45:55 And among those, the strongest of the Go playing programs

00:45:58 in those days was a program called MoGo,

00:46:00 which was the first program to actually reach

00:46:03 human master level on small boards, nine by nine boards.

00:46:07 And so this was a program by someone called Sylvain Gelly,

00:46:11 who’s a good colleague of mine,

00:46:13 and I worked with him a little bit in those days,

00:46:16 as part of my PhD thesis.

00:46:18 And MoGo was a first step towards the latest successes

00:46:23 we saw in computer Go,

00:46:25 but it was still missing a key ingredient.

00:46:28 MoGo was evaluating purely by random rollouts against itself.

00:46:33 And in a way, it’s truly remarkable

00:46:36 that random play should give you anything at all.

00:46:39 Why in this perfectly deterministic game

00:46:42 that’s very precise and involves these very exact sequences,

00:46:46 why is it that randomization is helpful?

00:46:52 And so the intuition is that randomization

00:46:54 captures something about the nature of the search tree,

00:46:59 from a position that you’re understanding

00:47:01 the nature of the search tree from that node onwards

00:47:04 by using randomization.

00:47:06 And this was a very powerful idea.

00:47:09 And I’ve seen this in other spaces,

00:47:12 talked to Richard Karp and so on,

00:47:14 randomized algorithms somehow magically

00:47:17 are able to do exceptionally well

00:47:19 and simplify the problem somehow.

00:47:23 Makes you wonder about the fundamental nature

00:47:25 of randomness in our universe.

00:47:27 It seems to be a useful thing.

00:47:29 But so from that moment,

00:47:32 can you maybe tell the origin story

00:47:33 and the journey of AlphaGo?

00:47:36 Yeah, so programs based on Monte Carlo tree search

00:47:39 were a first revolution

00:47:41 in the sense that they led to suddenly programs

00:47:44 that could play the game to any reasonable level,

00:47:47 but they plateaued.

00:47:50 It seemed that no matter how much effort

00:47:51 people put into these techniques,

00:47:53 they couldn’t exceed the level

00:47:54 of amateur dan level Go players.

00:47:58 So strong players,

00:47:59 but not anywhere near the level of professionals,

00:48:02 nevermind the world champion.

00:48:04 And so that brings us to the birth of AlphaGo,

00:48:08 which happened in the context of a startup company

00:48:12 known as DeepMind.

00:48:14 I heard of them.

00:48:15 Where a project was born.

00:48:19 And the project was really a scientific investigation

00:48:23 where myself and Aja Huang

00:48:27 and an intern, Chris Maddison,

00:48:30 were exploring a scientific question.

00:48:33 And that scientific question was really,

00:48:37 is there another fundamentally different approach

00:48:39 to this key question of Go,

00:48:42 the key challenge of how can you build that intuition

00:48:45 and how can you just have a system

00:48:47 that could look at a position

00:48:48 and understand what move to play

00:48:51 or how well you’re doing in that position,

00:48:53 who’s gonna win?

00:48:54 And so the deep learning revolution had just begun.

00:48:59 Competitions like ImageNet had suddenly been won

00:49:03 by deep learning techniques back in 2012.

00:49:06 And following that, it was natural to ask,

00:49:08 well, if deep learning is able to scale up so effectively

00:49:12 with images to understand them enough to classify them,

00:49:16 well, why not go?

00:49:17 Why not take the black and white stones of the Go board

00:49:22 and build a system which can understand for itself

00:49:25 what that means in terms of what move to pick

00:49:27 or who’s gonna win the game, black or white?

00:49:31 And so that was our scientific question

00:49:32 which we were probing and trying to understand.

00:49:35 And as we started to look at it,

00:49:37 we discovered that we could build a system.

00:49:40 So in fact, our very first paper on AlphaGo

00:49:43 was actually a pure deep learning system

00:49:47 which was trying to answer this question.

00:49:49 And we showed that actually a pure deep learning system

00:49:52 with no search at all was actually able

00:49:54 to reach human dan level, master level

00:49:58 at the full game of Go, 19 by 19 boards.

00:50:01 And so without any search at all,

00:50:04 suddenly we had systems which were playing

00:50:06 at the level of the best Monte Carlo tree search systems,

00:50:10 the ones with randomized rollouts.
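
A minimal sketch of how such a pure deep learning move predictor is typically trained: supervised learning on positions from human games, with a cross-entropy loss that pushes the network toward the move the expert actually played. The network class (the two-headed sketch shown earlier) and the data iterator are assumed placeholders, not the actual AlphaGo training code.

```python
import torch
import torch.nn.functional as F

# Illustrative supervised training loop: given batches of (board_planes,
# expert_move) pairs taken from human games, train the policy head to
# predict the expert's move.  `PolicyValueNet` is the earlier sketch network;
# `expert_batches` is an assumed placeholder data iterator.
def train_move_predictor(net, expert_batches, epochs=1, lr=1e-3):
    optimizer = torch.optim.Adam(net.parameters(), lr=lr)
    for _ in range(epochs):
        for board_planes, expert_moves in expert_batches:
            move_logits, _ = net(board_planes)                 # ignore the value head here
            loss = F.cross_entropy(move_logits, expert_moves)  # match the human's move
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return net
```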

00:50:11 So first of all, sorry to interrupt,

00:50:13 but that’s kind of a groundbreaking notion.

00:50:16 That’s like basically a definitive step away

00:50:20 from a couple of decades

00:50:22 of essentially search dominating AI.

00:50:26 So how did that make you feel?

00:50:28 Was it surprising from a scientific perspective in general,

00:52:33 how did it make you feel?

00:50:33 I found this to be profoundly surprising.

00:50:37 In fact, it was so surprising that we had a bet back then.

00:50:41 And like many good projects, bets are quite motivating.

00:50:44 And the bet was whether it was possible

00:50:47 for a system based purely on deep learning,

00:50:52 with no search at all to beat a dan level human player.

00:50:55 And so we had someone who joined our team

00:51:00 who was a dan level player.

00:51:01 He came in and we had this first match against him and…

00:51:06 Which side of the bet were you on, by the way?

00:51:09 The losing or the winning side?

00:51:11 I tend to be an optimist with the power

00:51:14 of deep learning and reinforcement learning.

00:51:18 So the system won,

00:51:21 and we were able to beat this human dan level player.

00:51:24 And for me, that was the moment where it was like,

00:51:26 okay, something special is afoot here.

00:51:29 We have a system which without search

00:51:32 is able to already just look at this position

00:51:36 and understand things as well as a strong human player.

00:51:39 And from that point onwards,

00:51:41 I really felt that reaching the top levels of human play,

00:51:49 professional level, world champion level,

00:51:50 I felt it was actually an inevitability.

00:51:56 And if it was an inevitable outcome,

00:51:59 I was rather keen that it would be us that achieved it.

00:52:03 So we scaled up.

00:52:05 This was something where,

00:52:06 so I had lots of conversations back then

00:52:09 with Demis Hassabis, the head of DeepMind,

00:52:14 who was extremely excited.

00:52:16 And we made the decision to scale up the project,

00:52:21 brought more people on board.

00:52:23 And so AlphaGo became something where we had a clear goal,

00:52:30 which was to try and crack this outstanding challenge of AI

00:52:33 to see if we could beat the world’s best players.

00:52:37 And this led within the space of not so many months

00:52:42 to playing against the European champion Fan Hui

00:52:45 in a match which became memorable in history

00:52:48 as the first time a Go program

00:52:50 had ever beaten a professional player.

00:52:53 And at that time we had to make a judgment

00:52:56 as to when and whether we should go

00:52:59 and challenge the world champion.

00:53:01 And this was a difficult decision to make.

00:53:04 Again, we were basing our predictions on our own progress

00:53:08 and had to estimate based on the rapidity

00:53:11 of our own progress when we thought we would exceed

00:53:15 the level of the human world champion.

00:53:17 And we tried to make an estimate and set up a match

00:53:20 and that became the AlphaGo versus Lee Sedol match in 2016.

00:53:27 And we should say, spoiler alert,

00:53:29 that AlphaGo was able to defeat Lee Sedol.

00:53:33 That’s right, yeah.

00:53:34 So maybe we could take even a broader view.

00:53:39 AlphaGo involves both learning from expert games

00:53:45 and as far as I remember, a self play component

00:53:51 to where it learns by playing against itself.

00:53:54 But in your sense, what was the role of learning

00:53:57 from expert games there?

00:53:59 And in terms of your self evaluation,

00:54:01 whether you can take on the world champion,

00:54:04 what was the thing that you’re trying to do more of?

00:54:06 Sort of train more on expert games

00:54:09 or was there’s now another,

00:54:12 I’m asking so many poorly phrased questions,

00:54:15 but did you have a hope or dream that self play

00:54:19 would be the key component at that moment yet?

00:54:24 So in the early days of AlphaGo,

00:54:26 we used human data to explore the science

00:54:29 of what deep learning can achieve.

00:54:31 And so when we had our first paper that showed

00:54:34 that it was possible to predict the winner of the game,

00:54:37 that it was possible to suggest moves,

00:54:39 that was done using human data.

00:54:41 Solely human data.

00:54:42 Yeah, and so the reason that we did it that way

00:54:45 was at that time we were exploring separately

00:54:47 the deep learning aspect

00:54:48 from the reinforcement learning aspect.

00:54:51 That was the part which was new and unknown

00:54:53 to me at that time: how far could that be stretched?

00:54:58 Once we had that, it then became natural

00:55:00 to try and use that same representation

00:55:03 and see if we could learn for ourselves

00:55:04 using that same representation.

00:55:06 And so right from the beginning,

00:55:08 actually our goal had been to build a system

00:55:11 using self play.

00:55:14 And to us, the human data right from the beginning

00:55:16 was an expedient step to help us for pragmatic reasons

00:55:20 to go faster towards the goals of the project

00:55:24 than we might be able to starting solely from self play.

00:55:27 And so in those days, we were very aware

00:55:29 that we were choosing to use human data

00:55:32 and that might not be the longterm holy grail of AI,

00:55:37 but that it was something which was extremely useful to us.

00:55:40 It helped us to understand the system.

00:55:42 It helped us to build deep learning representations

00:55:44 which were clear and simple and easy to use.

00:55:48 And so really I would say it served a purpose

00:55:51 not just as part of the algorithm,

00:55:53 but something which I continue to use in our research today,

00:55:56 which is trying to break down a very hard challenge

00:56:00 into pieces which are easier to understand for us

00:56:02 as researchers and develop.

00:56:04 So if you use a component based on human data,

00:56:07 it can help you to understand the system

00:56:10 such that then you can build

00:56:11 the more principled version later that does it for itself.

00:56:15 So as I said, the AlphaGo victory,

00:56:19 and I don’t think I’m being sort of romanticizing this notion.

00:56:23 I think it’s one of the greatest moments

00:56:25 in the history of AI.

00:56:26 So were you cognizant of this magnitude

00:56:29 of the accomplishment at the time?

00:56:32 I mean, are you cognizant of it even now?

00:56:35 Because to me, I feel like it’s something that would,

00:56:38 we mentioned what the AGI systems of the future

00:56:41 will look back.

00:56:42 I think they’ll look back at the AlphaGo victory

00:56:46 as like, holy crap, they figured it out.

00:56:49 This is where it started.

00:56:51 Well, thank you again.

00:56:52 I mean, it’s funny because I guess I’ve been working on,

00:56:56 I’ve been working on ComputerGo for a long time.

00:56:58 So I’d been working at the time of the AlphaGo match

00:57:00 on computer Go for more than a decade.

00:57:03 And throughout that decade, I’d had this dream

00:57:06 of what would it be like to, what would it be like really

00:57:08 to actually be able to build a system

00:57:12 that could play against the world champion.

00:57:14 And I imagined that that would be an interesting moment

00:57:17 that maybe some people might care about that

00:57:20 and that this might be a nice achievement.

00:57:24 But I think when I arrived in Seoul

00:57:27 and discovered the legions of journalists

00:57:31 that were following us around and the 100 million people

00:57:34 that were watching the match online live,

00:57:37 I realized that I’d been off in my estimation

00:57:40 of how significant this moment was

00:57:41 by several orders of magnitude.

00:57:43 And so there was definitely an adjustment process

00:57:48 to realize that this was something

00:57:53 which the world really cared about

00:57:55 and which was a watershed moment.

00:57:57 And I think there was that moment of realization.

00:58:01 But it’s also a little bit scary

00:58:02 because if you go into something thinking

00:58:05 it’s gonna be maybe of interest

00:58:08 and then discover that 100 million people are watching,

00:58:10 it suddenly makes you worry about

00:58:12 whether some of the decisions you’d made

00:58:13 were really the best ones or the wisest,

00:58:16 or were going to lead to the best outcome.

00:58:18 And we knew for sure that there were still imperfections

00:58:20 in AlphaGo, which were gonna be exposed

00:58:22 to the whole world watching.

00:58:24 And so, yeah, it was I think a great experience

00:58:28 and I feel privileged to have been part of it,

00:58:32 privileged to have led that amazing team.

00:58:35 I feel privileged to have been in a moment of history

00:58:38 like you say, but also lucky that in a sense

00:58:43 I was insulated from the knowledge of,

00:58:46 I think it would have been harder to focus on the research

00:58:48 if the full kind of reality of what was gonna come to pass

00:58:52 had been known to me and the team.

00:58:55 I think it was, we were in our bubble

00:58:57 and we were working on research

00:58:58 and we were trying to answer the scientific questions

00:59:01 and then bam, the public sees it.

00:59:04 And I think it was better that way in retrospect.

00:59:07 Were you confident that, I guess,

00:59:10 what were the chances that you could get the win?

00:59:13 So just like you said, I’m a little bit more familiar

00:59:19 with another accomplishment

00:59:20 that we may not even get a chance to talk to.

00:59:22 I talked to Oriol Vinyals about AlphaStar

00:59:24 which is another incredible accomplishment,

00:59:26 but there, with AlphaStar beating StarCraft,

00:59:31 there was already a track record with AlphaGo.

00:59:34 This was really the first time

00:59:36 you get to see reinforcement learning

00:59:39 face the best human in the world.

00:59:41 So what was your confidence like, what was the odds?

00:59:45 Well, we actually. Was there a bet?

00:59:47 Funnily enough, there was.

00:59:49 So just before the match,

00:59:52 we weren’t betting on anything concrete,

00:59:54 but we all held out a hand.

00:59:56 Everyone in the team held out a hand

00:59:57 at the beginning of the match.

00:59:59 And the number of fingers that they had out on their hand

01:00:01 was supposed to represent how many games

01:00:03 they thought we would win against Lee Sedol.

01:00:06 And there was an amazing spread in the team’s predictions.

01:00:10 But I have to say, I predicted four to one.

01:00:15 And the reason was based purely on data.

01:00:18 So I’m a scientist first and foremost.

01:00:20 And one of the things which we had established

01:00:23 was that AlphaGo in around one in five games

01:00:27 would develop something which we called a delusion,

01:00:29 which was a kind of hole in its knowledge

01:00:31 where it wasn’t able to fully understand

01:00:34 everything about the position.

01:00:36 And that hole in its knowledge would persist

01:00:38 for tens of moves throughout the game.

01:00:41 And we knew two things.

01:00:42 We knew that if there were no delusions,

01:00:44 that AlphaGo seemed to be playing at a level

01:00:46 that was far beyond any human capabilities.

01:00:49 But we also knew that if there were delusions,

01:00:52 the opposite was true.

01:00:53 And in fact, that’s what came to pass.

01:00:58 We saw all of those outcomes.

01:01:00 And Lee Sedol in one of the games

01:01:02 played a really beautiful sequence

01:01:04 that AlphaGo just hadn’t predicted.

01:01:08 And after that, it led it into this situation

01:01:11 where it was unable to really understand the position fully

01:01:14 and found itself in one of these delusions.

01:01:17 So indeed, yeah, four to one was the outcome.
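
As a back-of-the-envelope reading of that prediction (an assumption about how the one-in-five figure maps onto a score, not a description of the team's actual reasoning): if roughly one game in five triggers a delusion, and a delusion roughly costs the game, a five-game match comes out at about four wins to one loss.

```python
# Back-of-the-envelope version of the four-to-one prediction.
games = 5
delusion_rate = 1 / 5                      # roughly one game in five
expected_losses = games * delusion_rate    # assume a delusion costs the game
expected_wins = games - expected_losses
print(expected_wins, expected_losses)      # 4.0 1.0
```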

01:01:20 So yeah, and can you maybe speak to it a little bit more?

01:01:23 What were the five games?

01:01:25 What happened?

01:01:26 Is there interesting things that come to memory

01:01:29 in terms of the play of the human or the machine?

01:01:33 So I remember all of these games vividly, of course.

01:01:37 Moments like these don’t come too often

01:01:39 in the lifetime of a scientist.

01:01:42 And the first game was magical because it was the first time

01:01:49 that a computer program had defeated a world

01:01:53 champion in this grand challenge of Go.

01:01:57 And there was a moment where AlphaGo invaded Lee Sedol’s

01:02:04 territory towards the end of the game.

01:02:07 And that’s quite an audacious thing to do.

01:02:09 It’s like saying, hey, you thought

01:02:11 this was going to be your territory in the game,

01:02:12 but I’m going to stick a stone right in the middle of it

01:02:14 and prove to you that I can break it up.

01:02:17 And Lee Sedol’s face just dropped.

01:02:20 He wasn’t expecting a computer to do something that audacious.

01:02:26 The second game became famous for a move known as move 37.

01:02:30 This was a move that was played by AlphaGo that broke

01:02:36 all of the conventions of Go, that the Go players were

01:02:39 so shocked by this.

01:02:40 They thought that maybe the operator had made a mistake.

01:02:45 They thought that there was something crazy going on.

01:02:48 And it just broke every rule that Go players

01:02:50 are taught from a very young age.

01:02:52 They’re just taught this kind of move called a shoulder hit.

01:02:55 You can only play it on the third line or the fourth line,

01:02:58 and AlphaGo played it on the fifth line.

01:03:00 And it turned out to be a brilliant move

01:03:03 and made this beautiful pattern in the middle of the board that

01:03:06 ended up winning the game.

01:03:08 And so this really was a clear instance

01:03:12 where we could say computers exhibited creativity,

01:03:16 that this was really a move that was something

01:03:18 humans hadn’t known about, hadn’t anticipated.

01:03:22 And computers discovered this idea.

01:03:24 They were the ones to say, actually, here’s

01:03:27 a new idea, something new, not in the domains

01:03:30 of human knowledge of the game.

01:03:33 And now the humans think this is a reasonable thing to do.

01:03:38 And it’s part of Go knowledge now.

01:03:41 The third game, something special

01:03:44 happens when you play against a human world champion, which,

01:03:46 again, I hadn’t anticipated before going there,

01:03:48 which is these players are amazing.

01:03:53 Lee Sedol was a true champion, 18 time world champion,

01:03:56 and had this amazing ability to probe AlphaGo

01:04:01 for weaknesses of any kind.

01:04:03 And in the third game, he was losing,

01:04:06 and we felt we were sailing comfortably to victory.

01:04:09 But he managed to, from nothing, stir up this fight

01:04:14 and build what’s called a double co,

01:04:17 these kind of repetitive positions.

01:04:20 And he knew that historically, no computer Go program had ever

01:04:24 been able to deal correctly with double ko positions.

01:04:26 And he managed to summon one out of nothing.

01:04:29 And so for us, this was a real challenge.

01:04:33 Would AlphaGo be able to deal with this,

01:04:35 or would it just kind of crumble in the face of this situation?

01:04:38 And fortunately, it dealt with it perfectly.

01:04:41 The fourth game was amazing in that Lee Sedol

01:04:46 appeared to be losing this game.

01:04:48 AlphaGo thought it was winning.

01:04:49 And then Lee Sedol did something,

01:04:52 which I think only a true world champion can do,

01:04:55 which is he found a brilliant sequence

01:04:57 in the middle of the game, a brilliant sequence

01:04:59 that led him to really just transform the position.

01:05:05 He kind of found just a piece of genius, really.

01:05:10 And after that, AlphaGo, its evaluation just tumbled.

01:05:15 It thought it was winning this game.

01:05:17 And all of a sudden, it tumbled and said, oh, now

01:05:20 I’ve got no chance.

01:05:21 And it started to behave rather oddly at that point.

01:05:24 In the final game, for some reason, we as a team

01:05:27 were convinced, having seen AlphaGo in the previous game,

01:05:30 suffer from delusions.

01:05:31 We as a team were convinced that it

01:05:34 was suffering from another delusion.

01:05:35 We were convinced that it was misevaluating the position

01:05:38 and that something was going terribly wrong.

01:05:41 And it was only in the last few moves of the game

01:05:43 that we realized that actually, although it

01:05:46 had been predicting it was going to win all the way through,

01:05:49 it really was.

01:05:51 And so somehow, it just taught us yet again

01:05:54 that you have to have faith in your systems.

01:05:56 When they exceed your own level of ability

01:05:58 and your own judgment, you have to trust in them

01:06:01 to know better than you, the designer, once you’ve

01:06:06 bestowed in them the ability to judge better than you can,

01:06:10 then trust the system to do so.

01:06:13 So just like in the case of Deep Blue beating Garry Kasparov,

01:06:18 so for Garry, I think, that was the first time he'd ever lost, actually,

01:06:23 to anybody.

01:06:24 And I mean, there’s a similar situation with Lee Sedol.

01:06:27 It’s a tragic loss for humans, but a beautiful one,

01:06:36 I think, that’s kind of, from the tragedy,

01:06:40 sort of emerges over time, emerges

01:06:45 a kind of inspiring story.

01:06:47 But Lee Sedol recently announced his retirement.

01:06:52 I don’t know if we can look too deeply into it,

01:06:56 but he did say that even if I become number one,

01:06:59 there’s an entity that cannot be defeated.

01:07:02 So what do you think about these words?

01:07:05 What do you think about his retirement from the game of Go?

01:07:08 Well, let me take you back, first of all,

01:07:09 to the first part of your comment about Garry Kasparov,

01:07:12 because actually, at the panel yesterday,

01:07:15 he specifically said that when he first lost to Deep Blue,

01:07:19 he viewed it as a failure.

01:07:22 He viewed that this had been a failure of his.

01:07:24 But later on in his career, he said

01:07:27 he’d come to realize that actually, it was a success.

01:07:30 It was a success for everyone, because this marked

01:07:33 a transformational moment for AI.

01:07:37 And so even for Garry Kasparov, he

01:07:39 came to realize that that moment was pivotal

01:07:42 and actually meant something much more

01:07:45 than his personal loss in that moment.

01:07:49 Lee Sedol, I think, was much more cognizant of that,

01:07:53 even at the time.

01:07:54 And so in his closing remarks to the match,

01:07:57 he really felt very strongly that what

01:08:01 had happened in the AlphaGo match

01:08:02 was not only meaningful for AI, but for humans as well.

01:08:06 And he felt as a Go player that it had opened his horizons

01:08:09 and meant that he could start exploring new things.

01:08:12 It brought his joy back for the game of Go,

01:08:14 because it had broken all of the conventions and barriers

01:08:18 and meant that suddenly, anything was possible again.

01:08:23 So I was sad to hear that he’d retired,

01:08:26 but he’s been a great world champion over many, many years.

01:08:31 And I think he’ll be remembered for that ever more.

01:08:36 He’ll be remembered as the last person to beat AlphaGo.

01:08:39 I mean, after that, we increased the power of the system.

01:08:43 And the next version of AlphaGo beat other strong human

01:08:49 players 60 games to nil.

01:08:52 So what a great moment for him and something

01:08:55 to be remembered for.

01:08:58 It’s interesting that you spent time at AAAI on a panel

01:09:02 with Garry Kasparov.

01:09:05 What, I mean, it’s almost, I’m just

01:09:07 curious to learn the conversations you’ve

01:09:12 had with Garry, because he’s also now,

01:09:15 he’s written a book about artificial intelligence.

01:09:17 He’s thinking about AI.

01:09:18 He has kind of a view of it.

01:09:21 And he talks about AlphaGo a lot.

01:09:23 What’s your sense?

01:09:26 Arguably, I’m not just being Russian,

01:09:28 but I think Garry is the greatest chess player

01:09:31 of all time, probably one of the greatest game

01:09:34 players of all time.

01:09:36 And you were sort of at the center of creating

01:09:41 a system that beats one of the greatest players of all time.

01:09:45 So what is that conversation like?

01:09:46 Is there anything, any interesting digs, any bets,

01:09:50 any funny things, any profound things?

01:09:53 So Garry Kasparov has an incredible respect

01:09:58 for what we did with AlphaGo.

01:10:01 And it’s an amazing tribute coming from him of all people

01:10:07 that he really appreciates and respects what we’ve done.

01:10:11 And I think he feels that the progress which has happened

01:10:14 in computer chess, where later, after AlphaGo,

01:10:19 we built the AlphaZero system, which

01:10:23 defeated the world’s strongest chess programs.

01:10:26 And to Garry Kasparov, that moment in computer chess

01:10:29 was more profound than Deep Blue.

01:10:32 And the reason he believes it mattered more

01:10:35 was because it was done with learning

01:10:37 and a system which was able to discover for itself

01:10:39 new principles, new ideas, which were

01:10:42 able to play the game in a way which he hadn’t always

01:10:47 known about or anyone.

01:10:50 And in fact, one of the things I discovered at this panel

01:10:53 was that the current world champion, Magnus Carlsen,

01:10:56 apparently recently commented on his improvement

01:11:00 in performance.

01:11:01 And he attributed it to AlphaZero,

01:11:03 that he’s been studying the games of AlphaZero.

01:11:05 And he’s changed his style to play more like AlphaZero.

01:11:08 And it’s led to him actually increasing his rating

01:11:13 to a new peak.

01:11:15 Yeah, I guess to me, just like to Garry,

01:11:18 the inspiring thing is that, and just like you said,

01:11:21 with reinforcement learning, reinforcement learning

01:11:25 and deep learning, machine learning

01:11:26 feels like what intelligence is.

01:11:29 And you could attribute it to a bitter viewpoint

01:11:35 from Garry’s perspective, from us humans perspective,

01:11:39 saying that pure search that IBM Deep Blue was doing

01:11:43 is not really intelligence, but somehow it didn’t feel like it.

01:11:47 And so that’s the magical.

01:11:49 I’m not sure what it is about learning that

01:11:50 feels like intelligence, but it does.

01:11:54 So I think we should not demean the achievements of what

01:11:58 was done in previous eras of AI.

01:12:00 I think that Deep Blue was an amazing achievement in itself.

01:12:04 And that heuristic search of the kind that was used by Deep

01:12:07 Blue had some powerful ideas that were in there,

01:12:11 but it also missed some things.

01:12:13 So the fact that the evaluation function, the way

01:12:16 that the chess position was understood,

01:12:18 was created by humans and not by the machine

01:12:22 is a limitation, which means that there’s

01:12:26 a ceiling on how well it can do.

01:12:28 But maybe more importantly, it means

01:12:30 that the same idea cannot be applied in other domains

01:12:33 where we don’t have access to the human grandmasters

01:12:38 and that ability to encode exactly their knowledge

01:12:41 into an evaluation function.

01:12:43 And the reality is that the story of AI

01:12:45 is that most domains turn out to be of the second type

01:12:48 where knowledge is messy, it’s hard to extract from experts,

01:12:52 or it isn’t even available.

01:12:53 And so we need to solve problems in a different way.

01:12:59 And I think AlphaGo is a step towards solving things

01:13:02 in a way which puts learning as a first class citizen

01:13:07 and says systems need to understand for themselves

01:13:11 how to understand the world, how to judge the value of any action

01:13:19 that they might take within that world

01:13:20 and any state they might find themselves in.

01:13:22 And in order to do that, we make progress towards AI.

01:13:29 Yeah, so one of the nice things about taking a learning

01:13:32 approach to the game of Go or game playing

01:13:36 is that the things you learn, the things you figure out,

01:13:39 are actually going to be applicable to other problems

01:13:42 that are real world problems.

01:13:44 That’s ultimately, I mean, there’s

01:13:47 two really interesting things about AlphaGo.

01:13:49 One is the science of it, just the science of learning,

01:13:52 the science of intelligence.

01:13:54 And then the other is while you’re actually

01:13:56 learning, figuring out how to build systems that

01:13:59 would be potentially applicable in other applications,

01:14:04 medical, autonomous vehicles, robotics,

01:14:06 I mean, it’s just open the door to all kinds of applications.

01:14:10 So the next incredible step, really the profound step

01:14:16 is probably AlphaGo Zero.

01:14:18 I mean, it’s arguable.

01:14:20 I kind of see them all as the same place.

01:14:22 But really, and perhaps you were already

01:14:24 thinking that AlphaGo Zero was the natural next step,

01:14:26 that it was always going to be the next step.

01:14:29 But it’s removing the reliance on human expert games

01:14:33 for pre training, as you mentioned.

01:14:35 So how big of an intellectual leap

01:14:38 was this that self play could achieve superhuman level

01:14:43 performance on its own?

01:14:45 And maybe could you also say, what is self play?

01:14:48 You've kind of mentioned it a few times.

01:14:51 So let me start with self play.

01:14:55 So the idea of self play is something

01:14:58 which is really about systems learning for themselves,

01:15:01 but in the situation where there’s more than one agent.

01:15:05 And so if you’re in a game, and the game

01:15:08 is played between two players, then self play

01:15:11 is really about understanding that game just

01:15:15 by playing games against yourself

01:15:17 rather than against any actual real opponent.

01:15:19 And so it’s a way to kind of discover strategies

01:15:23 without having to actually need to go out and play

01:15:27 against any particular human player, for example.
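
A minimal sketch of that idea, with hypothetical placeholder names for the game and the agent (not any DeepMind codebase): the same learnable agent plays both sides of the game, and every finished game becomes training data labelled with its outcome.

```python
# A minimal self-play training loop. `make_game` and `agent` are hypothetical
# placeholders for a two-player game and a learnable policy.
def self_play_training(make_game, agent, num_games=1000):
    for _ in range(num_games):
        game = make_game()
        history = []
        while not game.is_over():
            move = agent.choose_move(game)        # the same agent plays both sides
            history.append((game.state(), move))
            game.apply(move)
        outcome = game.result()                   # e.g. +1 / -1 for the first player
        # Every position the agent saw becomes a training example labelled
        # with the eventual result, so it can correct its own judgments.
        agent.learn(history, outcome)
    return agent
```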

01:15:36 The main idea of AlphaZero was really

01:15:38 to try and step back from any of the knowledge

01:15:45 that we put into the system and ask the question,

01:15:47 is it possible to come up with a single elegant principle

01:15:52 by which a system can learn for itself all of the knowledge

01:15:57 which it requires to play a game such as Go?

01:16:00 Importantly, by taking knowledge out,

01:16:03 you not only make the system less brittle in the sense

01:16:08 that perhaps the knowledge you were putting in

01:16:10 was just getting in the way and maybe stopping the system

01:16:13 learning for itself, but also you make it more general.

01:16:17 The more knowledge you put in, the harder

01:16:20 it is for a system to actually be

01:16:23 taken out of the domain in which it's been designed

01:16:26 and placed in some other domain that maybe would need

01:16:29 a completely different knowledge base to understand

01:16:31 and perform well.

01:16:32 And so the real goal here is to strip out all of the knowledge

01:16:36 that we put in to the point that we can just plug it

01:16:39 into something totally different.

01:16:41 And that, to me, is really the promise of AI

01:16:45 is that we can have systems such as that which,

01:16:47 no matter what the goal is, no matter what goal

01:16:51 we set to the system, we can come up

01:16:53 with an algorithm which can be placed into that world,

01:16:57 into that environment, and can succeed

01:16:59 in achieving that goal.

01:17:01 And then that, to me, is almost the essence of intelligence

01:17:06 if we can achieve that.

01:17:07 And so AlphaZero is a step towards that.

01:17:11 And it’s a step that was taken in the context of two player

01:17:15 perfect information games like Go and chess.

01:17:18 We also applied it to Japanese chess.

01:17:21 So just to clarify, the first step

01:17:23 was AlphaGo Zero.

01:17:25 The first step was to try and take all of the knowledge out

01:17:29 of AlphaGo in such a way that it could

01:17:32 play in a fully self discovered way, purely from self play.

01:17:39 And to me, the motivation for that

01:17:41 was always that we could then plug it into other domains.

01:17:44 But we saved that until later.

01:17:48 Well, in fact, I mean, just for fun,

01:17:52 I could tell you exactly the moment

01:17:54 where the idea for AlphaZero occurred to me.

01:17:57 Because I think there’s maybe a lesson there for researchers

01:18:00 who are too deeply embedded in their research

01:18:03 and working 24/7 to try and come up with the next idea,

01:18:08 which is it actually occurred to me on honeymoon.

01:18:13 And I was at my most fully relaxed state,

01:18:17 really enjoying myself, and just bing,

01:18:22 the algorithm for AlphaZero just appeared in its full form.

01:18:29 And this was actually before we played against Lee Sedol.

01:18:33 But we just didn’t.

01:18:35 I think we were so busy trying to make sure

01:18:39 we could beat the world champion that it was only later

01:18:43 that we had the opportunity to step back and start

01:18:47 examining that sort of deeper scientific question of whether

01:18:51 this could really work.

01:18:52 So nevertheless, so self play is probably

01:18:56 one of the most profound ideas that represents, to me at least,

01:19:03 artificial intelligence.

01:19:05 But the fact that you could use that kind of mechanism

01:19:09 to, again, beat world class players,

01:19:13 that’s very surprising.

01:19:14 So to me, it feels like you have to train

01:19:19 on a large number of expert games.

01:19:21 So was it surprising to you?

01:19:22 What was the intuition?

01:19:23 Can you sort of think, not necessarily at that time,

01:19:26 even now, what’s your intuition?

01:19:27 Why this thing works so well?

01:19:30 Why it’s able to learn from scratch?

01:19:31 Well, let me first say why we tried it.

01:19:34 So we tried it both because I feel

01:19:36 that it was the deeper scientific question

01:19:38 to be asking to make progress towards AI,

01:19:42 and also because, in general, in my research,

01:19:44 I don’t like to do research on questions for which we already

01:19:49 know the likely outcome.

01:19:51 I don’t see much value in running an experiment where

01:19:53 you’re 95% confident that you will succeed.

01:19:57 And so we could have tried maybe to take AlphaGo and do

01:20:02 something which we knew for sure it would succeed on.

01:20:05 But much more interesting to me was to try it on the things

01:20:07 which we weren’t sure about.

01:20:09 And one of the big questions on our minds

01:20:12 back then was, could you really do this with self play alone?

01:20:16 How far could that go?

01:20:17 Would it be as strong?

01:20:19 And honestly, we weren’t sure.

01:20:22 It was 50/50, I think.

01:20:25 If you’d asked me, I wasn’t confident

01:20:27 that it could reach the same level as these systems,

01:20:30 but it felt like the right question to ask.

01:20:33 And even if it had not achieved the same level,

01:20:36 I felt that that was an important direction

01:20:41 to be studying.

01:20:42 And so then, lo and behold, it actually

01:20:48 ended up outperforming the previous version of AlphaGo

01:20:52 and indeed was able to beat it by 100 games to zero.

01:20:55 So what’s the intuition as to why?

01:20:59 I think the intuition to me is clear,

01:21:02 that whenever you have errors in a system, as we did in AlphaGo,

01:21:09 AlphaGo suffered from these delusions.

01:21:11 Occasionally, it would misunderstand

01:21:13 what was going on in a position and misevaluate it.

01:21:15 How can you remove all of these errors?

01:21:19 Errors arise from many sources.

01:21:21 For us, they were arising both starting from the human data,

01:21:25 but also from the nature of the search

01:21:27 and the nature of the algorithm itself.

01:21:29 But the only way to address them in any complex system

01:21:33 is to give the system the ability

01:21:36 to correct its own errors.

01:21:37 It must be able to correct them.

01:21:39 It must be able to learn for itself

01:21:41 when it’s doing something wrong and correct for it.

01:21:44 And so it seemed to me that the way to correct delusions

01:21:47 was indeed to have more iterations of reinforcement

01:21:51 learning, that no matter where you start,

01:21:53 you should be able to correct those errors

01:21:55 until it gets to play that out and understand,

01:21:58 oh, well, I thought that I was going to win in this situation,

01:22:01 but then I ended up losing.

01:22:03 That suggests that I was misevaluating something.

01:22:05 There’s a hole in my knowledge, and now the system

01:22:07 can correct for itself and understand how to do better.

01:22:11 Now, if you take that same idea and trace it back

01:22:14 all the way to the beginning, it should

01:22:16 be able to take you from no knowledge,

01:22:19 from completely random starting point,

01:22:21 all the way to the highest levels of knowledge

01:22:24 that you can achieve in a domain.

01:22:27 And the principle is the same, that if you bestow a system

01:22:30 with the ability to correct its own errors,

01:22:33 then it can take you from random to something slightly

01:22:36 better than random because it sees the stupid things

01:22:39 that the random is doing, and it can correct them.

01:22:41 And then it can take you from that slightly better system

01:22:43 and understand, well, what’s that doing wrong?

01:22:45 And it takes you on to the next level and the next level.

01:22:49 And this progress can go on indefinitely.
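
That error-correction step can be sketched very compactly; the function names here are illustrative placeholders for a learned evaluator and its training update, not the actual AlphaGo Zero code:

```python
# Illustrative error-correction step. `predict_value` and `update_toward`
# stand in for a learned evaluator and its training update.
def correct_delusions(positions, final_outcome, predict_value, update_toward):
    for position in positions:
        predicted = predict_value(position)
        # "I thought I was going to win in this situation, but I ended up losing":
        # the gap between prediction and outcome is the training signal.
        if predicted != final_outcome:
            update_toward(position, target=final_outcome)
```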

01:22:52 And indeed, what would have happened

01:22:55 if we’d carried on training AlphaGo Zero for longer?

01:22:59 We saw no sign of it slowing down its improvements,

01:23:03 or at least it was certainly carrying on to improve.

01:23:06 And presumably, if you had the computational resources,

01:23:11 this could lead to better and better systems

01:23:14 that discover more and more.

01:23:15 So your intuition is fundamentally

01:23:18 there’s not a ceiling to this process.

01:23:21 One of the surprising things, just like you said,

01:23:24 is the process of patching errors.

01:23:27 It intuitively makes sense that this is,

01:23:31 that reinforcement learning should be part of that process.

01:23:33 But what is surprising is in the process

01:23:36 of patching your own lack of knowledge,

01:23:39 you don’t open up other patches.

01:23:41 You keep sort of, like there’s a monotonic decrease

01:23:46 of your weaknesses.

01:23:48 Well, let me back this up.

01:23:50 I think science always should make falsifiable hypotheses.

01:23:53 So let me back up this claim with a falsifiable hypothesis,

01:23:57 which is that if someone was to, in the future,

01:23:59 take AlphaZero as an algorithm

01:24:02 and run it with greater computational resources

01:24:07 than we have available today,

01:24:10 then I would predict that they would be able

01:24:12 to beat the previous system 100 games to zero.

01:24:15 And that if they were then to do the same thing

01:24:17 a couple of years later,

01:24:19 that that would beat that previous system 100 games to zero,

01:24:22 and that that process would continue indefinitely

01:24:25 throughout at least my human lifetime.

01:24:27 Presumably the game of Go would set the ceiling.

01:24:31 I mean.

01:24:31 The game of Go would set the ceiling,

01:24:33 but the game of Go has 10 to the 170 states in it.

01:24:35 So the ceiling is unreachable by any computational device

01:24:40 that can be built out of the 10 to the 80 atoms

01:24:44 in the universe.

01:24:46 You asked a really good question,

01:24:47 which is, do you not open up other errors

01:24:51 when you correct your previous ones?

01:24:53 And the answer is yes, you do.

01:24:56 And so it’s a remarkable fact

01:24:58 about this class of two player game

01:25:02 and also true of single agent games

01:25:05 that essentially progress will always lead you to,

01:25:11 if you have sufficient representational resource,

01:25:15 like imagine you

01:25:16 could represent every state of the game in a big table,

01:25:20 then we know for sure that a progress of self improvement

01:25:24 will lead all the way in the single agent case

01:25:27 to the optimal possible behavior,

01:25:29 and in the two player case to the minimax optimal behavior.

01:25:31 And that is the best way that I can play

01:25:35 knowing that you’re playing perfectly against me.

01:25:38 And so for those cases,

01:25:39 we know that even if you do open up some new error,

01:25:44 that in some sense you’ve made progress.

01:25:46 You’re progressing towards the best that can be done.

01:25:50 So AlphaGo was initially trained on expert games

01:25:55 with some self play.

01:25:56 AlphaGo Zero removed the need to be trained on expert games.

01:26:00 And then another incredible step for me,

01:26:03 because I just love chess,

01:26:05 is to generalize that further, in AlphaZero,

01:26:09 to be able to play the game of Go,

01:26:12 beating AlphaGo Zero and AlphaGo,

01:26:14 and then also being able to play the game of chess

01:26:18 and others.

01:26:19 So what was that step like?

01:26:20 What’s the interesting aspects there

01:26:23 that required to make that happen?

01:26:26 I think the remarkable observation,

01:26:29 which we saw with AlphaZero,

01:26:31 was that actually without modifying the algorithm at all,

01:26:35 it was able to play and crack

01:26:37 some of AI’s greatest previous challenges.

01:26:41 In particular, we dropped it into the game of chess.

01:26:44 And unlike the previous systems like Deep Blue,

01:26:47 which had been worked on for years and years,

01:26:50 we were able to beat

01:26:52 the world’s strongest computer chess program convincingly

01:26:57 using a system that was fully discovered

01:27:00 from scratch with its own principles.

01:27:04 And in fact, one of the nice things that we found

01:27:08 was that in fact, we also achieved the same result

01:27:11 in Japanese chess, a variant of chess

01:27:13 where you get to capture pieces

01:27:15 and then place them back down on your own side

01:27:17 as an extra piece.

01:27:18 So a much more complicated variant of chess.

01:27:21 And we also beat the world’s strongest programs

01:27:24 and reached superhuman performance in that game too.

01:27:28 And the very first time that we'd ever run the system

01:27:32 on that particular game

01:27:34 was the version that we published

01:27:35 in the paper on AlphaZero.

01:27:38 It just worked out of the box, literally, no touching it.

01:27:41 We didn’t have to do anything.

01:27:42 And there it was, superhuman performance,

01:27:45 no tweaking, no twiddling.

01:27:47 And so I think there’s something beautiful

01:27:49 about that principle that you can take an algorithm

01:27:52 and without twiddling anything, it just works.
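
One way to picture why no tweaking is needed: the learning algorithm only ever talks to the game through a thin, game-agnostic interface, so everything Go-, chess-, or shogi-specific lives behind it. An illustrative sketch of such an interface (not the published API):

```python
# An illustrative game-agnostic interface: the learner is written once
# against this, and Go, chess, or shogi are plugged in behind it.
from abc import ABC, abstractmethod

class TwoPlayerGame(ABC):
    @abstractmethod
    def legal_moves(self):
        """All moves available in the current position."""

    @abstractmethod
    def successor(self, move):
        """The position after `move` is played."""

    @abstractmethod
    def is_over(self):
        """True once the game has ended."""

    @abstractmethod
    def result(self):
        """+1 / 0 / -1 from the first player's point of view."""
```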

01:27:57 Now, to go beyond AlphaZero, what’s required?

01:28:02 AlphaZero is just a step.

01:28:05 And there’s a long way to go beyond that

01:28:06 to really crack the deep problems of AI.

01:28:10 But one of the important steps is to acknowledge

01:28:13 that the world is a really messy place.

01:28:16 It’s this rich, complex, beautiful,

01:28:18 but messy environment that we live in.

01:28:21 And no one gives us the rules.

01:28:23 Like no one knows the rules of the world.

01:28:26 At least maybe we understand that it operates

01:28:28 according to Newtonian or quantum mechanics

01:28:31 at the micro level or according to relativity

01:28:34 at the macro level.

01:28:35 But that’s not a model that’s useful for us as people

01:28:38 to operate in it.

01:28:40 Somehow the agent needs to understand the world for itself

01:28:43 in a way where no one tells it the rules of the game.

01:28:46 And yet it can still figure out what to do in that world,

01:28:50 deal with this stream of observations coming in,

01:28:53 rich sensory input coming in,

01:28:55 actions going out in a way that allows it to reason

01:28:58 in the way that AlphaGo or AlphaZero can reason

01:29:01 in the way that these go and chess playing programs

01:29:03 can reason.

01:29:04 But in a way that allows it to take actions

01:29:07 in that messy world to achieve its goals.

01:29:11 And so this led us to the most recent step

01:29:15 in the story of AlphaGo,

01:29:17 which was a system called MuZero.

01:29:19 And MuZero is a system which learns for itself

01:29:23 even when the rules are not given to it.

01:29:25 It actually can be dropped into a system

01:29:28 with messy perceptual inputs.

01:29:29 We actually tried it in some Atari games,

01:29:33 the canonical domains of Atari

01:29:36 that have been used for reinforcement learning.

01:29:38 And this system learned to build a model

01:29:42 of these Atari games that was sufficiently rich

01:29:46 and useful enough for it to be able to plan successfully.

01:29:51 And in fact, that system not only went on

01:29:53 to beat the state of the art in Atari,

01:29:56 but the same system without modification

01:29:59 was able to reach the same level of superhuman performance

01:30:02 in Go, chess, and shogi that we'd seen in AlphaZero,

01:30:06 showing that even without the rules,

01:30:08 the system can learn for itself just by trial and error,

01:30:11 just by playing this game of Go.

01:30:13 And no one tells you what the rules are,

01:30:15 but you just get to the end and someone says win or loss.

01:30:19 You play this game of chess and someone says win or loss,

01:30:22 or you play a game of breakout in Atari

01:30:25 and someone just tells you your score at the end.

01:30:28 And the system for itself figures out

01:30:30 essentially the rules of the system,

01:30:31 the dynamics of the world, how the world works.

01:30:35 And not in any explicit way, but just implicitly,

01:30:39 enough understanding for it to be able to plan

01:30:41 in that system in order to achieve its goals.
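
In the published description, MuZero does this with three learned functions: a representation function that encodes raw observations into a hidden state, a dynamics function that predicts the next hidden state and reward for a chosen action, and a prediction function that outputs a policy and value from a hidden state. A compressed sketch of that structure, with illustrative layer shapes rather than the real architecture:

```python
# Compressed sketch of MuZero's three learned functions; layer shapes and
# names are illustrative, not the published architecture.
import torch
import torch.nn as nn

class MuZeroSketch(nn.Module):
    def __init__(self, obs_dim, hidden_dim, num_actions):
        super().__init__()
        # h: raw observation -> hidden state (no access to the game's rules).
        self.representation = nn.Linear(obs_dim, hidden_dim)
        # g: (hidden state, action) -> next hidden state and predicted reward.
        self.dynamics = nn.Linear(hidden_dim + num_actions, hidden_dim + 1)
        # f: hidden state -> policy logits and value.
        self.prediction = nn.Linear(hidden_dim, num_actions + 1)

    def imagine_one_step(self, observation, action_onehot):
        state = torch.relu(self.representation(observation))
        step = self.dynamics(torch.cat([state, action_onehot], dim=-1))
        next_state, reward = step[..., :-1], step[..., -1]
        out = self.prediction(next_state)
        policy_logits, value = out[..., :-1], out[..., -1]
        return policy_logits, value, reward
```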

01:30:45 And that’s the fundamental process

01:30:48 that you have to go through when you're facing

01:30:49 any uncertain kind of environment,

01:30:51 as you would in the real world,

01:30:53 is figuring out the sort of the rules,

01:30:55 the basic rules of the game.

01:30:56 That’s right.

01:30:57 So that allows it to be applicable

01:31:00 to basically any domain that could be digitized

01:31:05 in the way that it needs to in order to be consumable,

01:31:10 sort of in order for the reinforcement learning framework

01:31:12 to be able to sense the environment,

01:31:13 to be able to act in the environment and so on.

01:31:15 The full reinforcement learning problem

01:31:16 needs to deal with worlds that are unknown and complex

01:31:21 and the agent needs to learn for itself

01:31:23 how to deal with that.

01:31:24 And so MuZero is a further step in that direction.

01:31:29 One of the things that inspired the general public

01:31:32 and just in conversations I have like with my parents

01:31:34 or something with my mom, who just loves what was done

01:31:38 is kind of at least the notion

01:31:40 that there was some display of creativity,

01:31:42 some new strategies, new behaviors that were created.

01:31:45 That again has echoes of intelligence.

01:31:48 So is there something that stands out?

01:31:50 Do you see it the same way that there’s creativity

01:31:52 and there’s some behaviors, patterns that you saw

01:31:57 that AlphaZero was able to display that are truly creative?

01:32:01 So let me start by saying that I think we should ask

01:32:06 what creativity really means.

01:32:08 So to me, creativity means discovering something

01:32:13 which wasn’t known before, something unexpected,

01:32:16 something outside of our norms.

01:32:19 And so in that sense, the process of reinforcement learning

01:32:24 or the self play approach that was used by AlphaZero

01:32:29 is the essence of creativity.

01:32:31 It’s really saying at every stage,

01:32:34 you’re playing according to your current norms

01:32:36 and you try something and if it works out,

01:32:39 you say, hey, here’s something great,

01:32:42 I’m gonna start using that.

01:32:44 And then that process, it’s like a micro discovery

01:32:47 that happens millions and millions of times

01:32:49 over the course of the algorithm’s life

01:32:51 where it just discovers some new idea,

01:32:54 oh, this pattern, this pattern’s working really well for me,

01:32:56 I’m gonna start using that.

01:32:58 And now, oh, here’s this other thing I can do,

01:33:00 I can start to connect these stones together in this way

01:33:03 or I can start to sacrifice stones or give up on pieces

01:33:08 or play shoulder hits on the fifth line or whatever it is.

01:33:12 The system’s discovering things like this for itself

01:33:13 continually, repeatedly, all the time.

01:33:16 And so it should come as no surprise to us then

01:33:19 when if you leave these systems going,

01:33:21 that they discover things that are not known to humans,

01:33:25 that to the human norms are considered creative.

01:33:30 And we’ve seen this several times.

01:33:32 In fact, in AlphaGo Zero,

01:33:35 we saw this beautiful timeline of discovery

01:33:39 where what we saw was that there are these opening patterns

01:33:44 that humans play called joseki,

01:33:45 these are like the patterns that humans learn

01:33:47 to play in the corners and they’ve been developed

01:33:49 and refined over literally thousands of years

01:33:51 in the game of Go.

01:33:53 And what we saw was in the course of the training,

01:33:57 AlphaGo Zero, over the course of the 40 days

01:34:00 that we trained this system,

01:34:01 it starts to discover exactly these patterns

01:34:05 that human players play.

01:34:06 And over time, we found that all of the joseki

01:34:10 that humans played were discovered by the system

01:34:13 through this process of self play

01:34:15 and this sort of essential notion of creativity.

01:34:19 But what was really interesting was that over time,

01:34:22 it then starts to discard some of these

01:34:24 in favor of its own joseki that humans didn’t know about.

01:34:28 And it starts to say, oh, well,

01:33:29 you thought that the knight's move pincer joseki

01:34:33 was a great idea,

01:34:35 but here’s something different you can do there

01:34:37 which makes some new variation

01:34:38 that humans didn’t know about.

01:34:40 And actually now the human Go players

01:34:42 study the joseki that AlphaGo played

01:34:44 and they become the new norms

01:34:46 that are used in today’s top level Go competitions.

01:34:51 That never gets old.

01:34:52 Even just that first part, to me,

01:34:54 maybe just makes me feel good as a human being

01:34:58 that a self play mechanism that knows nothing about us humans

01:35:01 discovers patterns that we humans do.

01:35:04 That’s just like an affirmation

01:35:06 that we’re doing okay as humans.

01:35:08 Yeah.

01:35:09 We’ve, in this domain and other domains,

01:35:12 we figured out it’s like the Churchill quote

01:35:14 about democracy.

01:35:15 It’s the, you know, it sucks,

01:35:18 but it’s the best one we’ve tried.

01:35:20 So in general, taking a step outside of Go

01:35:24 and you’ve like a million accomplishment

01:35:27 that I have no time to talk about

01:35:29 with AlphaStar and so on and the current work.

01:35:32 But in general, this self play mechanism

01:35:36 that you’ve inspired the world with

01:35:38 by beating the world champion Go player.

01:35:40 Do you see that as,

01:35:43 do you see it being applied in other domains?

01:35:47 Do you have sort of dreams and hopes

01:35:50 that it’s applied in both the simulated environments

01:35:53 and the constrained environments of games?

01:35:56 Constrained, I mean, AlphaStar really demonstrates

01:35:59 that you can remove a lot of the constraints,

01:36:00 but nevertheless, it’s in a digital simulated environment.

01:36:04 Do you have a hope, a dream that it starts being applied

01:36:07 in the robotics environment?

01:36:09 And maybe even in domains that are safety critical

01:36:12 and so on and have, you know,

01:36:15 have a real impact in the real world,

01:36:16 like autonomous vehicles, for example,

01:36:18 which seems like a very far out dream at this point.

01:36:21 So I absolutely do hope and imagine

01:36:25 that we will get to the point where ideas

01:36:27 just like these are used in all kinds of different domains.

01:36:31 In fact, one of the most satisfying things

01:36:32 as a researcher is when you start to see other people

01:36:35 use your algorithms in unexpected ways.

01:36:39 So in the last couple of years, there have been,

01:36:41 you know, a couple of Nature papers

01:36:43 where different teams, unbeknownst to us,

01:36:47 took AlphaZero and applied exactly those same algorithms

01:36:51 and ideas to real world problems of huge meaning to society.

01:36:57 So one of them was the problem of chemical synthesis,

01:37:00 and they were able to beat the state of the art

01:37:02 in finding pathways of how to actually synthesize chemicals,

01:37:08 chemical retrosynthesis.

01:37:11 And the second paper actually just came out

01:37:14 a couple of weeks ago in Nature,

01:37:16 showed that in quantum computation,

01:37:19 you know, one of the big questions is how to understand

01:37:22 the nature of the function in quantum computation

01:37:27 and a system based on AlphaZero beat the state of the art

01:37:30 by quite some distance there again.

01:37:32 So these are just examples.

01:37:34 And I think, you know, the lesson,

01:37:36 which we’ve seen elsewhere in machine learning

01:37:38 time and time again, is that if you make something general,

01:37:42 it will be used in all kinds of ways.

01:37:44 You know, you provide a really powerful tool to society,

01:37:47 and those tools can be used in amazing ways.

01:37:51 And so I think we’re just at the beginning,

01:37:53 and for sure, I hope that we see all kinds of outcomes.

01:37:58 So the other side of the question of reinforcement

01:38:03 learning framework is, you know,

01:38:05 you usually want to specify a reward function

01:38:07 and an objective function.

01:38:11 What do you think about sort of ideas of intrinsic rewards

01:38:13 of when we’re not really sure about, you know,

01:38:19 if we take, you know, human beings as existence proof

01:38:23 that we don’t seem to be operating

01:38:25 according to a single reward,

01:38:27 do you think that there’s interesting ideas

01:38:32 for when you don’t know how to truly specify the reward,

01:38:35 you know, that there’s some flexibility

01:38:38 for discovering it intrinsically or so on

01:38:40 in the context of reinforcement learning?

01:38:42 So I think, you know, when we think about intelligence,

01:38:45 it’s really important to be clear

01:38:46 about the problem of intelligence.

01:38:48 And I think it’s clearest to understand that problem

01:38:51 in terms of some ultimate goal

01:38:52 that we want the system to try and solve for.

01:38:55 And after all, if we don’t understand the ultimate purpose

01:38:57 of the system, do we really even have

01:39:00 a clearly defined problem that we’re solving at all?

01:39:04 Now, within that, as with your example for humans,

01:39:10 the system may choose to create its own motivations

01:39:13 and subgoals that help the system

01:39:16 to achieve its ultimate goal.

01:39:19 And that may indeed be a hugely important mechanism

01:39:22 to achieve those ultimate goals,

01:39:23 but there is still some ultimate goal

01:39:25 that I think the system needs to be measured

01:39:27 and evaluated against.

01:39:29 And even for humans, I mean, humans,

01:39:31 we’re incredibly flexible.

01:39:32 We feel that we can, you know, any goal that we’re given,

01:39:35 we feel we can master to some degree.

01:39:40 But if we think of those goals, really, you know,

01:39:41 like the goal of being able to pick up an object

01:39:44 or the goal of being able to communicate

01:39:47 or influence people to do things in a particular way

01:39:50 or whatever those goals are, really, they’re subgoals,

01:39:56 really, that we set ourselves.

01:39:58 You know, we choose to pick up the object.

01:40:00 We choose to communicate.

01:40:02 We choose to influence someone else.

01:40:05 And we choose those because we think it will lead us

01:40:07 to something later on.

01:40:10 We think that’s helpful to us to achieve some ultimate goal.

01:40:15 Now, I don’t want to speculate whether or not humans

01:40:18 as a system necessarily have a singular overall goal

01:40:20 of survival or whatever it is.

01:40:23 But I think the principle for understanding

01:40:25 and implementing intelligence has to be

01:40:28 that if we’re trying to understand intelligence

01:40:30 or implement our own,

01:40:31 there has to be a well defined problem.

01:40:33 Otherwise, if it’s not, I think it’s like an admission

01:40:37 of defeat, that for there to be hope for understanding

01:40:41 or implementing intelligence, we have to know what we’re doing.

01:40:44 We have to know what we’re asking the system to do.

01:40:46 Otherwise, if you don’t have a clearly defined purpose,

01:40:48 you’re not going to get a clearly defined answer.

01:40:51 The ridiculous big question that has to naturally follow,

01:40:56 because I have to pin you down on this thing,

01:41:00 that nevertheless, one of the big silly

01:41:03 or big real questions before humans is the meaning of life,

01:41:08 is us trying to figure out our own reward function.

01:41:11 And you just kind of mentioned that if you want to build

01:41:13 intelligent systems and you know what you’re doing,

01:41:16 you should be at least cognizant to some degree

01:41:18 of what the reward function is.

01:41:20 So the natural question is what do you think

01:41:23 is the reward function of human life,

01:41:26 the meaning of life for us humans,

01:41:29 the meaning of our existence?

01:41:32 I think I’d be speculating beyond my own expertise,

01:41:36 but just for fun, let me do that.

01:41:38 Yes, please.

01:41:39 And say, I think that there are many levels

01:41:41 at which you can understand a system

01:41:43 and you can understand something as optimizing

01:41:46 for a goal at many levels.

01:41:48 And so you can understand the,

01:41:52 let’s start with the universe.

01:41:54 Does the universe have a purpose?

01:41:55 Well, it feels like it’s just at one level

01:41:58 just following certain mechanical laws of physics

01:42:02 and that that’s led to the development of the universe.

01:42:04 But at another level, you can view it as actually,

01:42:08 there’s the second law of thermodynamics that says

01:42:10 that this is increasing in entropy over time forever.

01:42:13 And now there’s a view that’s been developed

01:42:15 by certain people at MIT that this,

01:42:17 you can think of this as almost like a goal of the universe,

01:42:20 that the purpose of the universe is to maximize entropy.

01:42:24 So there are multiple levels

01:42:26 at which you can understand a system.

01:42:28 The next level down, you might say,

01:42:30 well, if the goal is to maximize entropy,

01:42:34 well, how can that be done by a particular system?

01:42:40 And maybe evolution is something that the universe

01:42:42 discovered in order to kind of dissipate energy

01:42:45 as efficiently as possible.

01:42:48 And by the way, I’m borrowing from Max Tegmark

01:42:49 for some of these metaphors, the physicist.

01:42:53 But if you can think of evolution

01:42:55 as a mechanism for dispersing energy,

01:42:59 then evolution, you might say, becomes a goal,

01:43:04 which is if evolution disperses energy

01:43:06 by reproducing as efficiently as possible,

01:43:09 what’s evolution then?

01:43:10 Well, it’s now got its own goal within that,

01:43:13 which is to actually reproduce as effectively as possible.

01:43:19 And now how does reproduction,

01:43:22 how is that made as effective as possible?

01:43:25 Well, you need entities within that

01:43:27 that can survive and reproduce as effectively as possible.

01:43:29 And so it’s natural that in order to achieve

01:43:31 that high level goal, those individual organisms

01:43:33 discover brains, intelligences,

01:43:37 which enable them to support the goals of evolution.

01:43:43 And those brains, what do they do?

01:43:45 Well, perhaps the early brains,

01:43:47 maybe they were controlling things at some direct level.

01:43:51 Maybe they were the equivalent of preprogrammed systems,

01:43:54 which were directly controlling what was going on

01:43:57 and setting certain things in order

01:43:59 to achieve these particular goals.

01:44:03 But that led to another level of discovery,

01:44:05 which was learning systems.

01:44:07 There are parts of the brain

01:44:08 which are able to learn for themselves

01:44:10 and learn how to program themselves to achieve any goal.

01:44:13 And presumably there are parts of the brain

01:44:16 that set goals for other parts of that system,

01:44:20 and that provides this very flexible notion of intelligence

01:44:23 that we as humans presumably have,

01:44:25 which is the reason we feel

01:44:26 that we can achieve any goal.

01:44:30 So it’s a very long-winded answer to say that

01:44:32 I think there are many perspectives

01:44:34 and many levels at which intelligence can be understood.

01:44:38 And at each of those levels,

01:44:40 you can take multiple perspectives.

01:44:42 You can view the system as something

01:44:43 which is optimizing for a goal,

01:44:45 which is understanding it at a level

01:44:47 at which we can maybe implement it

01:44:49 and understand it as AI researchers or computer scientists,

01:44:53 or you can understand it at the level

01:44:54 of the mechanistic thing that’s going on,

01:44:56 that there are these atoms bouncing around in the brain,

01:44:58 and the fact that they lead to the outcome of that system

01:45:01 is not in contradiction with the fact

01:45:02 that it’s also a decision-making system

01:45:07 that’s optimizing for some goal and purpose.

01:45:10 I’ve never heard the description of the meaning of life

01:45:14 structured so beautifully in layers,

01:45:16 but you did miss one layer, which is the next step,

01:45:19 which you’re responsible for,

01:45:21 which is creating the artificial intelligence layer

01:45:27 on top of that.

01:45:28 And I can’t wait to see, well, I may not be around,

01:45:31 but I can’t wait to see what the next layer beyond that will be.

01:45:36 Well, let’s just take that argument

01:45:39 and pursue it to its natural conclusion.

01:45:41 So the next level, indeed, is: how can our learning brain

01:45:46 achieve its goals most effectively?

01:45:49 Well, maybe it does so by us as learning beings

01:45:56 building a system which is able to solve for those goals

01:46:00 more effectively than we can.

01:46:02 And so when we build a system to play the game of Go,

01:46:05 when I said that I wanted to build a system

01:46:06 that can play Go better than I can,

01:46:08 I’ve enabled myself to achieve that goal of playing Go

01:46:12 better than I could have by directly playing it

01:46:14 and learning it myself.

01:46:15 And so now a new layer has been created,

01:46:18 which is systems which are able to achieve goals

01:46:21 for themselves.

01:46:22 And ultimately there may be layers beyond that

01:46:25 where they set subgoals for parts of their own system

01:46:28 in order to achieve those and so forth.

01:46:32 So the story of intelligence, I think,

01:46:36 is a multi-layered one and a multi-perspective one.
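
Since the answer leans on the idea of building systems around an explicit reward function, here is a minimal sketch in Python of what being cognizant of the reward function might look like for a Go-style win/loss objective. It is purely illustrative: the function names, the +1/-1 encoding, and the discounting convention are assumptions, not anything specified in the conversation or taken from AlphaGo's actual code.

    # Illustrative sketch only: an explicit, hand-written reward function
    # for a goal-directed agent, here a Go-style win/loss objective.

    def go_reward(winner, agent_color):
        """Terminal reward: +1 if the agent wins, -1 if it loses, 0 otherwise."""
        if winner is None:
            return 0.0  # game not yet decided
        return 1.0 if winner == agent_color else -1.0

    def episode_return(rewards, gamma=1.0):
        """Discounted sum of rewards; with gamma = 1 and a single terminal
        reward, this is simply +1 or -1, the one number the agent optimizes."""
        return sum((gamma ** t) * r for t, r in enumerate(rewards))

    # Example: the agent plays black and wins the game.
    print(episode_return([0.0, 0.0, go_reward("black", "black")]))  # -> 1.0

The point of the sketch is only that, in an engineered system, everything the layered story above calls "setting a goal" ultimately has to be written down as a concrete signal like this one.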

01:46:39 We live in an incredible universe.

01:46:41 David, thank you so much, first of all,

01:46:43 for dreaming of using learning to solve Go

01:46:47 and building intelligent systems

01:46:50 and for actually making it happen

01:46:52 and for inspiring millions of people in the process.

01:46:56 It’s truly an honor.

01:46:57 Thank you so much for talking today.

01:46:58 Okay, thank you.

01:46:59 Thanks for listening to this conversation

01:47:01 with David Silver and thank you to our sponsors,

01:47:04 Masterclass and Cash App.

01:47:05 Please consider supporting the podcast

01:47:07 by signing up to Masterclass at masterclass.com slash Lex

01:47:12 and downloading Cash App and using code LexPodcast.

01:47:15 If you enjoy this podcast, subscribe on YouTube,

01:47:18 review it with five stars on Apple Podcast,

01:47:20 support it on Patreon,

01:47:21 or simply connect with me on Twitter at Lex Fridman.

01:47:25 And now let me leave you with some words from David Silver.

01:47:28 My personal belief is that we’ve seen something

01:47:31 of a turning point where we’re starting to understand

01:47:34 that many abilities like intuition and creativity

01:47:38 that we’ve previously thought were in the domain only

01:47:40 of the human mind are actually accessible

01:47:43 to machine intelligence as well.

01:47:45 And I think that’s a really exciting moment in history.

01:47:48 Thank you for listening and hope to see you next time.