Transcript
00:00:00 The following is a conversation with David Silver,
00:00:02 who leads the Reinforcement Learning Research Group
00:00:05 at DeepMind, and was the lead researcher
00:00:07 on AlphaGo, AlphaZero, and co-led the AlphaStar
00:00:12 and MuZero efforts, and a lot of important work
00:00:14 in reinforcement learning in general.
00:00:17 I believe AlphaZero is one of the most important
00:00:20 accomplishments in the history of artificial intelligence.
00:00:24 And David is one of the key humans who brought AlphaZero
00:00:27 to life together with a lot of other great researchers
00:00:30 at DeepMind.
00:00:31 He’s humble, kind, and brilliant.
00:00:35 We were both jet lagged, but didn’t care and made it happen.
00:00:39 It was a pleasure and truly an honor to talk with David.
00:00:43 This conversation was recorded before the outbreak
00:00:45 of the pandemic.
00:00:46 For everyone feeling the medical, psychological,
00:00:49 and financial burden of this crisis,
00:00:51 I’m sending love your way.
00:00:53 Stay strong, we’re in this together, we’ll beat this thing.
00:00:57 This is the Artificial Intelligence Podcast.
00:01:00 If you enjoy it, subscribe on YouTube,
00:01:02 review it with five stars on Apple Podcasts,
00:01:04 support on Patreon, or simply connect with me on Twitter
00:01:07 at Lex Fridman, spelled F R I D M A N.
00:01:12 As usual, I’ll do a few minutes of ads now
00:01:14 and never any ads in the middle
00:01:16 that can break the flow of the conversation.
00:01:18 I hope that works for you
00:01:19 and doesn’t hurt the listening experience.
00:01:22 Quick summary of the ads.
00:01:23 Two sponsors, Masterclass and Cash App.
00:01:27 Please consider supporting the podcast
00:01:29 by signing up to Masterclass at masterclass.com slash Lex
00:01:34 and downloading Cash App and using code LexPodcast.
00:01:38 This show is presented by Cash App,
00:01:41 the number one finance app in the app store.
00:01:43 When you get it, use code LexPodcast.
00:01:46 Cash App lets you send money to friends, buy Bitcoin,
00:01:50 and invest in the stock market with as little as $1.
00:01:53 Since Cash App allows you to buy Bitcoin,
00:01:56 let me mention that cryptocurrency
00:01:57 in the context of the history of money is fascinating.
00:02:01 I recommend The Ascent of Money as a great book on this history.
00:02:05 Debits and credits on ledgers started around 30,000 years ago.
00:02:10 The US dollar was created over 200 years ago,
00:02:12 and Bitcoin, the first decentralized cryptocurrency,
00:02:15 released just over 10 years ago.
00:02:18 So given that history, cryptocurrency is still very much
00:02:21 in its early days of development,
00:02:23 but it’s still aiming to and just might
00:02:26 redefine the nature of money.
00:02:29 So again, if you get Cash App from the app store or Google Play
00:02:32 and use the code LexPodcast, you get $10,
00:02:35 and Cash App will also donate $10 to FIRST,
00:02:38 an organization that is helping to advance robotics
00:02:41 and STEM education for young people around the world.
00:02:44 This show is sponsored by Masterclass.
00:02:46 Sign up at masterclass.com slash Lex
00:02:49 to get a discount and to support this podcast.
00:02:52 In fact, for a limited time now,
00:02:53 if you sign up for an all access pass for a year,
00:02:56 you get another all access pass
00:02:59 to share with a friend.
00:03:01 Buy one, get one free.
00:03:02 When I first heard about Masterclass,
00:03:04 I thought it was too good to be true.
00:03:06 For $180 a year, you get an all access pass
00:03:09 to watch courses from, to list some of my favorites,
00:03:12 Chris Hadfield on space exploration,
00:03:15 Neil deGrasse Tyson on scientific thinking and communication,
00:03:18 Will Wright, the creator of SimCity and The Sims, on game design,
00:03:22 Jane Goodall on conservation,
00:03:24 Carlos Santana on guitar.
00:03:26 His song Europa could be the most beautiful
00:03:29 guitar song ever written.
00:03:30 Garry Kasparov on chess, Daniel Negreanu on poker,
00:03:34 and many, many more.
00:03:35 Chris Hadfield explaining how rockets work
00:03:37 and the experience of being launched into space alone
00:03:40 is worth the money.
00:03:41 For me, the key is to not be overwhelmed
00:03:44 by the abundance of choice.
00:03:46 Pick three courses you want to complete,
00:03:48 watch each of them all the way through.
00:03:50 It’s not that long, but it’s an experience
00:03:51 that will stick with you for a long time, I promise.
00:03:55 It’s easily worth the money.
00:03:56 You can watch it on basically any device.
00:03:59 Once again, sign up on masterclass.com slash Lex
00:04:02 to get a discount and to support this podcast.
00:04:05 And now, here’s my conversation with David Silver.
00:04:09 What was the first program you’ve ever written?
00:04:12 And what programming language?
00:04:13 Do you remember?
00:04:14 I remember very clearly, yeah.
00:04:16 My parents brought home this BBC Model B microcomputer.
00:04:22 It was just this fascinating thing to me.
00:04:24 I was about seven years old and couldn’t resist
00:04:27 just playing around with it.
00:04:29 So I think first program ever was writing my name out
00:04:35 in different colors and getting it to loop and repeat that.
00:04:39 And there was something magical about that,
00:04:41 which just led to more and more.
00:04:43 How did you think about computers back then?
00:04:46 Like the magical aspect of it, that you can write a program
00:04:49 and there’s this thing that you just gave birth to
00:04:52 that’s able to create sort of visual elements
00:04:56 and live on its own.
00:04:57 Or did you not think of it in those romantic notions?
00:04:59 Was it more like, oh, that’s cool.
00:05:02 I can solve some puzzles.
00:05:05 It was always more than solving puzzles.
00:05:06 It was something where, you know,
00:05:08 there were these limitless possibilities.
00:05:13 Once you have a computer in front of you,
00:05:14 you can do anything with it.
00:05:16 I used to play with Lego with the same feeling.
00:05:18 You can make anything you want out of Lego,
00:05:20 but even more so with a computer, you know,
00:05:21 you’re not constrained by the amount of kit you’ve got.
00:05:24 And so I was fascinated by it and started pulling out
00:05:26 the user guide and the advanced user guide
00:05:29 and then learning.
00:05:30 So I started in BASIC and then later 6502.
00:05:34 My father also became interested in this machine
00:05:38 and gave up his career to go back to school
00:05:40 and study for a master’s degree
00:05:42 in artificial intelligence, funnily enough,
00:05:46 at Essex University when I was seven.
00:05:48 So I was exposed to those things at an early age.
00:05:52 He showed me how to program in Prolog
00:05:54 and do things like querying your family tree.
00:05:57 And those are some of my earliest memories
00:05:59 of trying to figure things out on a computer.
00:06:04 Those are the early steps in computer science programming,
00:06:07 but when did you first fall in love
00:06:09 with artificial intelligence or with the ideas,
00:06:12 the dreams of AI?
00:06:14 I think it was really when I went to study at university.
00:06:19 So I was an undergrad at Cambridge
00:06:20 and studying computer science.
00:06:23 And I really started to question,
00:06:27 you know, what really are the goals?
00:06:29 What’s the goal?
00:06:30 Where do we want to go with computer science?
00:06:32 And it seemed to me that the only step
00:06:37 of major significance to take was to try
00:06:40 and recreate something akin to human intelligence.
00:06:44 If we could do that, that would be a major leap forward.
00:06:47 And that idea, I certainly wasn’t the first to have it,
00:06:50 but it, you know, nestled within me somewhere
00:06:53 and became like a bug.
00:06:55 You know, I really wanted to crack that problem.
00:06:58 So you thought it was, like you had a notion
00:07:00 that this is something that human beings can do,
00:07:03 that it is possible to create an intelligent machine.
00:07:07 Well, I mean, unless you believe in something metaphysical,
00:07:11 then what are our brains doing?
00:07:13 Well, at some level they’re information processing systems,
00:07:17 which are able to take whatever information is in there,
00:07:22 transform it through some form of program
00:07:24 and produce some kind of output,
00:07:26 which enables that human being to do all the amazing things
00:07:29 that they can do in this incredible world.
00:07:31 So then do you remember the first time
00:07:35 you’ve written a program that,
00:07:37 because you also had an interest in games.
00:07:40 Do you remember the first time you wrote a program
00:07:41 that beat you in a game?
00:07:45 Or beat you at anything?
00:07:47 Sort of achieved super David Silver level performance?
00:07:54 So I used to work in the games industry.
00:07:56 So for five years I programmed games for my first job.
00:08:01 So it was an amazing opportunity
00:08:03 to get involved in a startup company.
00:08:05 And so I was involved in building AI at that time.
00:08:12 And so for sure there was a sense of building handcrafted,
00:08:18 what people used to call AI in the games industry,
00:08:20 which I think is not really what we might think of as AI
00:08:23 in its fullest sense,
00:08:24 but something which is able to take actions
00:08:29 and in a way which makes things interesting
00:08:31 and challenging for the human player.
00:08:35 And at that time I was able to build
00:08:38 these handcrafted agents,
00:08:39 which in certain limited cases could do things
00:08:41 which were able to do better than me,
00:08:45 but mostly in these kind of twitch-like scenarios
00:08:47 where they were able to do things faster
00:08:50 or because they had some pattern
00:08:51 which they were able to exploit repeatedly.
00:08:55 I think if we’re talking about real AI,
00:08:58 the first experience for me came after that
00:09:00 when I realized that this path I was on
00:09:05 wasn’t taking me towards,
00:09:06 it wasn’t dealing with that bug which I still had inside me
00:09:10 to really understand intelligence and try and solve it.
00:09:14 That everything people were doing in games
00:09:15 was short term fixes rather than long term vision.
00:09:19 And so I went back to study for my PhD,
00:09:22 which was, funnily enough, trying to apply reinforcement learning
00:09:26 to the game of Go.
00:09:27 And I built my first Go program using reinforcement learning,
00:09:31 a system which would by trial and error play against itself
00:09:35 and was able to learn which patterns were actually helpful
00:09:40 to predict whether it was gonna win or lose the game
00:09:42 and then choose the moves that led
00:09:44 to the combination of patterns
00:09:45 that would mean that you’re more likely to win.
00:09:47 And that system, that system beat me.
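
The system described here can be illustrated with a toy sketch. This is not David Silver's actual Go program; it applies the same idea, learning state values from self-play outcomes by trial and error and then preferring moves that lead to good states, to a simple subtraction game, with made-up constants:

```python
import random

# A toy self-play learner: players alternately take 1-3 stones from a
# pile; taking the last stone wins. V[s] estimates the probability that
# the player to move from state s wins, learned purely from outcomes.
V = {}
ALPHA, EPS = 0.1, 0.2   # learning rate and exploration rate (assumed)

def opponent_value(pile):
    # An empty pile means the opponent to move has already lost.
    return 0.0 if pile == 0 else V.get(pile, 0.5)

def choose(pile):
    options = [m for m in (1, 2, 3) if m <= pile]
    if random.random() < EPS:                 # trial and error
        return random.choice(options)
    # Prefer moves that leave the opponent in the worst state.
    return min(options, key=lambda m: opponent_value(pile - m))

for _ in range(20000):
    pile, visited = 15, []
    while pile > 0:                           # one game against itself
        visited.append(pile)
        pile -= choose(pile)
    # The player who moved last won; outcomes alternate going backwards.
    for i, s in enumerate(reversed(visited)):
        outcome = 1.0 if i % 2 == 0 else 0.0
        V.setdefault(s, 0.5)
        V[s] += ALPHA * (outcome - V[s])      # update toward the result

# States that are multiples of 4 should drift toward low values:
# they are losing for the player to move.
print({s: round(v, 2) for s, v in sorted(V.items())})
```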
00:09:50 And how did that make you feel?
00:09:53 Made me feel good.
00:09:54 I mean, was there sort of the, yeah,
00:09:57 it’s a mix of a sort of excitement
00:09:59 and was there a tinge of sort of like,
00:10:02 almost like a fearful awe?
00:10:04 You know, it’s like in space, 2001 Space Odyssey
00:10:08 kind of realizing that you’ve created something that,
00:10:12 you know, that’s achieved human level intelligence
00:10:19 in this one particular little task.
00:10:21 And in that case, I suppose neural networks
00:10:23 weren’t involved.
00:10:24 There were no neural networks in those days.
00:10:26 This was pre deep learning revolution.
00:10:30 But it was a principled self learning system
00:10:33 based on a lot of the principles which people
00:10:36 are still using in deep reinforcement learning.
00:10:40 How did I feel?
00:10:41 I think I found it immensely satisfying
00:10:46 that a system which was able to learn
00:10:49 from first principles for itself
00:10:51 was able to reach the point
00:10:52 that it was understanding this domain
00:10:56 better than I could and able to outwit me.
00:11:00 I don’t think it was a sense of awe.
00:11:01 It was a sense of satisfaction,
00:11:04 that something I felt should work had worked.
00:11:08 So to me, AlphaGo, and I don’t know how else to put it,
00:11:11 but to me, AlphaGo and AlphaGo Zero,
00:11:14 mastering the game of Go is again, to me,
00:11:18 the most profound and inspiring moment
00:11:20 in the history of artificial intelligence.
00:11:23 So you’re one of the key people behind this achievement
00:11:26 and I’m Russian.
00:11:27 So I really felt the first sort of seminal achievement
00:11:31 when Deep Blue beat Garry Kasparov in 1997.
00:11:34 So as far as I know, the AI community at that point
00:11:40 largely saw the game of Go as unbeatable by AI
00:11:43 using the sort of the state of the art
00:11:46 brute force methods, search methods.
00:11:48 Even if you consider, at least the way I saw it,
00:11:51 even if you consider arbitrary exponential scaling
00:11:55 of compute, Go would still not be solvable,
00:11:59 hence why it was thought to be impossible.
00:12:01 So given that the game of Go was impossible to master,
00:12:07 what was the dream for you?
00:12:09 You just mentioned your PhD thesis
00:12:11 of building the system that plays Go.
00:12:14 What was the dream for you that you could actually
00:12:16 build a computer program that achieves world class,
00:12:20 not necessarily beats the world champion,
00:12:21 but achieves that kind of level of playing Go?
00:12:24 First of all, thank you, that’s very kind words.
00:12:27 And funnily enough, I just came from a panel
00:12:31 where I was actually in a conversation
00:12:34 with Garry Kasparov and Murray Campbell,
00:12:36 who was one of the authors of Deep Blue.
00:12:38 And it was their first meeting together since the match.
00:12:43 So that just occurred yesterday.
00:12:44 So I’m literally fresh from that experience.
00:12:47 So these are amazing moments when they happen,
00:12:50 but where did it all start?
00:12:52 Well, for me, it started when I became fascinated
00:12:55 in the game of Go.
00:12:56 So Go for me, I’ve grown up playing games.
00:12:59 I’ve always had a fascination in board games.
00:13:01 I played chess as a kid, I played Scrabble as a kid.
00:13:06 When I was at university, I discovered the game of Go.
00:13:08 And to me, it just blew all of those other games
00:13:11 out of the water.
00:13:12 It was just so deep and profound in its complexity
00:13:15 with endless levels to it.
00:13:17 What I discovered was that I could devote
00:13:22 endless hours to this game.
00:13:25 And I knew in my heart of hearts
00:13:28 that no matter how many hours I would devote to it,
00:13:30 I would never become a grandmaster,
00:13:34 But there was another path.
00:13:35 And the other path was to try and understand
00:13:38 how you could get some other intelligence
00:13:40 to play this game better than I would be able to.
00:13:43 And so even in those days, I had this idea that,
00:13:46 what if, what if it was possible to build a program
00:13:49 that could crack this?
00:13:51 And as I started to explore the domain,
00:13:53 I discovered that this was really the domain
00:13:57 where people felt deeply that if progress
00:14:01 could be made in Go,
00:14:02 it would really mean a giant leap forward for AI.
00:14:06 It was the challenge where all other approaches had failed.
00:14:10 This is coming out of the era you mentioned,
00:14:13 which was in some sense, the golden era
00:14:15 for the classical methods of AI, like heuristic search.
00:14:19 In the 90s, they all fell one after another,
00:14:23 not just chess with deep blue, but checkers,
00:14:26 backgammon, Othello.
00:14:28 There were numerous cases where systems
00:14:33 built on top of heuristic search methods
00:14:35 with these high performance systems
00:14:37 had been able to defeat the human world champion
00:14:40 in each of those domains.
00:14:41 And yet in that same time period,
00:14:44 there was a million dollar prize available
00:14:47 for the game of Go, for the first system
00:14:50 to beat a human professional player.
00:14:52 And at the end of that time period,
00:14:54 in the year 2000 when the prize expired,
00:14:57 the strongest Go program in the world
00:15:00 was defeated by a nine year old child
00:15:02 when that nine year old child was giving nine free moves
00:15:05 to the computer at the start of the game
00:15:07 to try and even things up.
00:15:09 And a computer Go expert beat that same strongest program
00:15:13 with 29 handicap stones, 29 free moves.
00:15:18 So that’s what the state of affairs was
00:15:20 when I became interested in this problem
00:15:23 in around 2003 when I started working on computer Go.
00:15:29 There was nothing, there was very, very little
00:15:33 in the way of progress towards meaningful performance,
00:15:36 or anything approaching human level.
00:15:39 And so people, it wasn’t through lack of effort,
00:15:42 people had tried many, many things.
00:15:44 And so there was a strong sense
00:15:46 that something different would be required for Go
00:15:49 than had been needed for all of these other domains
00:15:52 where AI had been successful.
00:15:54 And maybe the single clearest example
00:15:56 is that Go, unlike those other domains,
00:15:59 had this kind of intuitive property
00:16:02 that a Go player would look at a position
00:16:04 and say, hey, here’s this mess of black and white stones.
00:16:09 But from this mess, oh, I can predict
00:16:12 that this part of the board has become my territory,
00:16:15 this part of the board has become your territory,
00:16:17 and I’ve got this overall sense that I’m gonna win
00:16:20 and that this is about the right move to play.
00:16:22 And that intuitive sense of judgment,
00:16:24 of being able to evaluate what’s going on in a position,
00:16:28 it was pivotal to humans being able to play this game
00:16:31 and something that people had no idea
00:16:33 how to put into computers.
00:16:35 So this question of how to evaluate a position,
00:16:37 how to come up with these intuitive judgments
00:16:40 was the key reason why Go was so hard
00:16:44 in addition to its enormous search space,
00:16:47 and the reason why methods
00:16:49 which had succeeded so well elsewhere failed in Go.
00:16:53 And so people really felt deep down that in order to crack Go
00:16:57 we would need to get something akin to human intuition.
00:17:00 And if we got something akin to human intuition,
00:17:02 we’d be able to solve many, many more problems in AI.
00:17:06 So for me, that was the moment where it’s like,
00:17:09 okay, this is not just about playing the game of Go,
00:17:11 this is about something profound.
00:17:13 And it was back to that bug
00:17:15 which had been itching me all those years.
00:17:17 This is the opportunity to do something meaningful
00:17:19 and transformative, and I guess a dream was born.
00:17:23 That’s a really interesting way to put it.
00:17:25 So almost this realization that you need to
00:17:29 formulate Go as a kind of a prediction problem
00:17:31 versus a search problem was the intuition.
00:17:34 I mean, maybe that’s the wrong crude term,
00:17:37 but to give it the ability to kind of intuit things
00:17:44 about positional structure of the board.
00:17:47 Now, okay, but what about the learning part of it?
00:17:51 Did you have a sense that you have to,
00:17:54 that learning has to be part of the system?
00:17:57 Again, something that, as far as I think,
00:18:01 except for TD-Gammon in the 90s, which used RL a little bit,
00:18:05 hasn't really been part of those state of the art
00:18:07 game playing systems.
00:18:08 So I strongly felt that learning would be necessary.
00:18:12 And that’s why my PhD topic back then was trying
00:18:16 to apply reinforcement learning to the game of Go
00:18:20 and not just learning of any type,
00:18:21 but I felt that the only way to really have a system
00:18:26 to progress beyond human levels of performance
00:18:29 wouldn’t just be to mimic how humans do it,
00:18:31 but to understand for themselves.
00:18:33 And how else can a machine hope to understand
00:18:36 what’s going on except through learning?
00:18:39 If you’re not learning, what else are you doing?
00:18:40 Well, you’re putting all the knowledge into the system.
00:18:42 And that just feels like something which decades of AI
00:18:47 have told us is maybe not a dead end,
00:18:50 but certainly has a ceiling to the capabilities.
00:18:53 It’s known as the knowledge acquisition bottleneck,
00:18:55 that the more you try to put into something,
00:18:58 the more brittle the system becomes.
00:19:00 And so you just have to have learning.
00:19:02 You have to have learning.
00:19:03 That’s the only way you’re going to be able to get a system
00:19:06 which has sufficient knowledge in it,
00:19:10 millions and millions of pieces of knowledge,
00:19:11 billions, trillions, in a form
00:19:14 that it can actually apply for itself
00:19:15 and understand how those billions and trillions
00:19:18 of pieces of knowledge can be leveraged in a way
00:19:20 which will actually lead it towards its goal
00:19:22 without conflict or other issues.
00:19:27 Yeah, I mean, if I put myself back in that time,
00:19:30 I just wouldn’t think like that.
00:19:33 Without a good demonstration of RL,
00:19:34 I would think more in the symbolic AI,
00:19:37 like not learning, but sort of a simulation
00:19:42 of knowledge base, like a growing knowledge base,
00:19:46 but it would still be sort of pattern based,
00:19:50 like basically have little rules
00:19:52 that you kind of assemble together
00:19:54 into a large knowledge base.
00:19:56 Well, in a sense, that was the state of the art back then.
00:19:59 So if you look at the Go programs,
00:20:01 which had been competing for this prize I mentioned,
00:20:05 they were an assembly of different specialized systems,
00:20:09 some of which used huge amounts of human knowledge
00:20:11 to describe how you should play the opening,
00:20:14 how you should, all the different patterns
00:20:16 that were required to play well in the game of Go,
00:20:21 end game theory, combinatorial game theory,
00:20:24 and combined with more principled search based methods,
00:20:28 which were trying to solve for particular sub parts
00:20:31 of the game, like life and death,
00:20:34 connecting groups together,
00:20:36 all these amazing sub problems
00:20:38 that just emerge in the game of Go,
00:20:40 there were different pieces all put together
00:20:43 into this like collage,
00:20:45 which together would try and play against a human.
00:20:49 And although not all of the pieces were handcrafted,
00:20:54 the overall effect was nevertheless still brittle,
00:20:56 and it was hard to make all these pieces work well together.
00:21:00 And so really, what I was pressing for
00:21:02 and the main innovation of the approach I took
00:21:05 was to go back to first principles and say,
00:21:08 well, let’s back off that
00:21:10 and try and find a principled approach
00:21:12 where the system can learn for itself,
00:21:16 just from the outcome, like learn for itself.
00:21:19 If you try something, did that help or did it not help?
00:21:22 And only through that procedure can you arrive at knowledge,
00:21:26 which is verified.
00:21:27 The system has to verify it for itself,
00:21:29 not relying on any other third party
00:21:31 to say this is right or this is wrong.
00:21:33 And so that principle was already very important
00:21:38 in those days, but unfortunately,
00:21:39 we were missing some important pieces back then.
00:21:43 So before we dive into maybe
00:21:46 discussing the beauty of reinforcement learning,
00:21:49 let’s take a step back, we kind of skipped it a bit,
00:21:52 but the rules of the game of Go,
00:21:55 what are the elements of it, perhaps contrasting to chess,
00:22:02 that you really enjoyed as a human being,
00:22:07 and also that make it really difficult
00:22:09 as an AI machine learning problem.
00:22:13 So the game of Go has remarkably simple rules.
00:22:16 In fact, so simple that people have speculated
00:22:19 that if we were to meet alien life at some point,
00:22:22 that we wouldn’t be able to communicate with them,
00:22:23 but we would be able to play Go with them.
00:22:26 They'd probably have discovered the same rule set.
00:22:28 So the game is played on a 19 by 19 grid,
00:22:32 and you play on the intersections of the grid
00:22:34 and the players take turns.
00:22:35 And the aim of the game is very simple.
00:22:37 It’s to surround as much territory as you can,
00:22:40 as many of these intersections with your stones
00:22:43 and to surround more than your opponent does.
00:22:46 And the only nuance to the game is that
00:22:48 if you fully surround your opponent’s piece,
00:22:50 then you get to capture it and remove it from the board
00:22:52 and it counts as your own territory.
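
The capture rule just described is easy to express in code. A minimal sketch on a toy 5 by 5 board, not a full Go implementation: a group's liberties are found by flood fill, and a group with no liberties is removed:

```python
# '.' is an empty point; 'B' and 'W' are stones. The white pair in the
# middle is fully surrounded by black, so it has no liberties.
board = [list(row) for row in [
    ".B...",
    "BWB..",
    "BWB..",
    ".B...",
    ".....",
]]
SIZE = 5

def group_and_liberties(r, c):
    """Flood-fill the group at (r, c); return its stones and liberties."""
    color, group, liberties = board[r][c], set(), set()
    stack = [(r, c)]
    while stack:
        y, x = stack.pop()
        if (y, x) in group:
            continue
        group.add((y, x))
        for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
            if 0 <= ny < SIZE and 0 <= nx < SIZE:
                if board[ny][nx] == '.':
                    liberties.add((ny, nx))      # an adjacent empty point
                elif board[ny][nx] == color:
                    stack.append((ny, nx))       # same-colored neighbor
    return group, liberties

group, liberties = group_and_liberties(1, 1)     # the white group
if not liberties:                                # fully surrounded: captured
    for y, x in group:
        board[y][x] = '.'
print(f"captured {len(group)} stones")           # -> captured 2 stones
```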
00:22:54 Now from those very simple rules, immense complexity arises.
00:22:58 There’s kind of profound strategies
00:22:59 in how to surround territory,
00:23:02 how to kind of trade off between
00:23:04 making solid territory yourself now
00:23:07 compared to building up influence
00:23:09 that will help you acquire territory later in the game,
00:23:11 how to connect groups together,
00:23:12 how to keep your own groups alive,
00:23:16 which patterns of stones are most useful
00:23:19 compared to others.
00:23:21 There’s just immense knowledge.
00:23:23 And human Go players have played this game for,
00:23:27 it was discovered thousands of years ago,
00:23:29 and human Go players have built up
00:23:30 this immense knowledge base over the years.
00:23:33 It’s studied very deeply and played by
00:23:36 something like 50 million players across the world,
00:23:38 mostly in China, Japan, and Korea,
00:23:41 where it’s an important part of the culture,
00:23:43 so much so that it’s considered one of the
00:23:45 four ancient arts that were required of Chinese scholars.
00:23:49 So there’s a deep history there.
00:23:51 But there’s interesting qualities.
00:23:53 So if I sort of compare it to chess,
00:23:55 in the same way that Go is central to Chinese culture,
00:23:59 chess in Russia is also considered
00:24:01 one of the sacred arts.
00:24:03 So if we contrast sort of Go with chess,
00:24:06 there’s interesting qualities about Go.
00:24:09 Maybe you can correct me if I’m wrong,
00:24:10 but the evaluation of a particular static board
00:24:15 is not as reliable.
00:24:18 Like you can’t, in chess you can kind of assign points
00:24:21 to the different units,
00:24:23 and it’s kind of a pretty good measure
00:24:26 of who’s winning, who’s losing.
00:24:27 It’s not so clear.
00:24:29 Yeah, so in the game of Go,
00:24:31 you find yourself in a situation where
00:24:33 both players have played the same number of stones.
00:24:36 Actually, captures at a strong level of play
00:24:38 happen very rarely, which means that
00:24:40 at any moment in the game,
00:24:41 you’ve got the same number of white stones and black stones.
00:24:43 And the only thing which differentiates
00:24:45 how well you’re doing is this intuitive sense
00:24:48 of where are the territories ultimately
00:24:50 going to form on this board?
00:24:52 And if you look at the complexity of a real Go position,
00:24:57 it’s mind boggling that kind of question
00:25:00 of what will happen in 300 moves from now
00:25:02 when you see just a scattering of 20 white
00:25:05 and black stones intermingled.
00:25:07 And so that challenge is the reason
00:25:12 why position evaluation is so hard in Go
00:25:15 compared to other games.
00:25:17 In addition to that, it has an enormous search space.
00:25:19 So there’s around 10 to the 170 positions
00:25:23 in the game of Go.
00:25:24 That’s an astronomical number.
00:25:26 And that search space is so great
00:25:28 that traditional heuristic search methods
00:25:30 that were so successful in things like Deep Blue
00:25:32 and chess programs just kind of fall over in Go.
00:25:36 So at which point did reinforcement learning
00:25:39 enter your life, your research life, your way of thinking?
00:25:43 We just talked about learning,
00:25:45 but reinforcement learning is a very particular
00:25:47 kind of learning.
00:25:49 One that’s both philosophically sort of profound,
00:25:53 but also one that’s pretty difficult to get to work
00:25:55 as if we look back in the early days.
00:25:58 So when did that enter your life
00:26:00 and how did that work progress?
00:26:02 So I had just finished working in the games industry
00:26:06 at this startup company.
00:26:07 And I took a year out to discover for myself
00:26:13 exactly which path I wanted to take.
00:26:14 I knew I wanted to study intelligence,
00:26:17 but I wasn’t sure what that meant at that stage.
00:26:19 I really didn’t feel I had the tools
00:26:21 to decide on exactly which path I wanted to follow.
00:26:24 So during that year, I read a lot.
00:26:27 And one of the things I read was Sutton and Barto,
00:26:31 the sort of seminal textbook,
00:26:33 an introduction to reinforcement learning.
00:26:35 And when I read that textbook,
00:26:39 I just had this resonating feeling
00:26:43 that this is what I understood intelligence to be.
00:26:47 And this was the path that I felt would be necessary
00:26:51 to go down to make progress in AI.
00:26:55 So I got in touch with Rich Sutton
00:27:00 and asked him if he would be interested
00:27:02 in supervising me on a PhD thesis in computer Go.
00:27:07 And he basically said
00:27:11 that if he’s still alive, he’d be happy to.
00:27:15 But unfortunately, he’d been struggling
00:27:19 with very serious cancer for some years.
00:27:21 And he really wasn’t confident at that stage
00:27:23 that he’d even be around to see the end event.
00:27:26 But fortunately, that part of the story
00:27:28 worked out very happily.
00:27:29 And I found myself out there in Alberta.
00:27:32 They’ve got a great games group out there
00:27:34 with a history of fantastic work in board games,
00:27:38 as well as Rich Sutton, the father of RL.
00:27:40 So it was the natural place for me to go in some sense
00:27:43 to study this question.
00:27:45 And the more I looked into it,
00:27:48 the more strongly I felt that this
00:27:53 wasn’t just the path to progress in computer go.
00:27:56 But really, this was the thing I’d been looking for.
00:27:59 This was really an opportunity
00:28:04 to frame what intelligence means.
00:28:08 Like what are the goals of AI in a clear,
00:28:12 single clear problem definition,
00:28:14 such that if we’re able to solve
00:28:15 that clear single problem definition,
00:28:18 in some sense, we’ve cracked the problem of AI.
00:28:21 So to you, reinforcement learning ideas,
00:28:24 at least sort of echoes of it,
00:28:26 would be at the core of intelligence.
00:28:29 It is at the core of intelligence.
00:28:31 And if we ever create a human level intelligence system,
00:28:34 it would be at the core of that kind of system.
00:28:37 Let me say it this way, that I think it’s helpful
00:28:39 to separate out the problem from the solution.
00:28:42 So I see the problem of intelligence,
00:28:45 I would say it can be formalized
00:28:48 as the reinforcement learning problem,
00:28:50 and that that formalization is enough
00:28:52 to capture most, if not all of the things
00:28:56 that we mean by intelligence,
00:28:58 that they can all be brought within this framework
00:29:01 and gives us a way to access them in a meaningful way
00:29:03 that allows us as scientists to understand intelligence
00:29:08 and us as computer scientists to build them.
00:29:12 And so in that sense, I feel that it gives us a path,
00:29:16 maybe not the only path, but a path towards AI.
00:29:20 And so do I think that any system in the future
00:29:24 that’s solved AI would have to have RL within it?
00:29:29 Well, I think if you ask that,
00:29:30 you’re asking about the solution methods.
00:29:33 I would say that if we have such a thing,
00:29:35 it would be a solution to the RL problem.
00:29:37 Now, what particular methods have been used to get there?
00:29:41 Well, we should keep an open mind
00:29:42 about the best approaches to actually solve any problem.
00:29:45 And the things we have right now for reinforcement learning,
00:29:49 I believe they've got a lot of legs,
00:29:53 but maybe we’re missing some things.
00:29:54 Maybe there’s gonna be better ideas.
00:29:56 I think we should keep an open mind, let's remain modest,
00:29:59 and we’re at the early days of this field
00:30:02 and there are many amazing discoveries ahead of us.
00:30:04 For sure, the specifics,
00:30:06 especially of the different kinds of RL approaches currently,
00:30:09 there could be other things that fall
00:30:11 into the very large umbrella of RL.
00:30:13 But if it’s okay, can we take a step back
00:30:16 and kind of ask the basic question
00:30:18 of what is to you reinforcement learning?
00:30:22 So reinforcement learning is the study
00:30:25 and the science and the problem of intelligence
00:30:31 in the form of an agent that interacts with an environment.
00:30:35 So the problem you’re trying to solve
00:30:36 is represented by some environment,
00:30:38 like the world in which that agent is situated.
00:30:40 And the goal of RL is clear:
00:30:42 the agent gets to take actions.
00:30:45 Those actions have some effect on the environment
00:30:47 and the environment gives back an observation
00:30:49 to the agent saying, this is what you see or sense.
00:30:52 And one special thing which it gives back
00:30:54 is called the reward signal,
00:30:56 how well it’s doing in the environment.
00:30:58 And the reinforcement learning problem
00:30:59 is to simply take actions over time
00:31:04 so as to maximize that reward signal.
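
In code, that loop looks roughly like this. A minimal sketch in Python, with a toy corridor environment and a random agent standing in for the real thing; everything here is an illustrative assumption, not any particular DeepMind system:

```python
import random

class CorridorEnv:
    """A toy environment: the agent starts at cell 0 and is rewarded
    for reaching cell 5. This stands in for 'the world'."""
    def __init__(self):
        self.state = 0

    def step(self, action):                  # action: -1 (left) or +1 (right)
        self.state = max(0, self.state + action)
        reward = 1.0 if self.state == 5 else 0.0   # the reward signal
        done = self.state == 5
        return self.state, reward, done      # observation, reward, episode end

class RandomAgent:
    """A placeholder agent; a real one would learn from the rewards."""
    def act(self, observation):
        return random.choice([-1, +1])

env, agent = CorridorEnv(), RandomAgent()
obs, total_reward, done = 0, 0.0, False
while not done:
    action = agent.act(obs)                  # the agent takes an action
    obs, reward, done = env.step(action)     # the environment responds
    total_reward += reward                   # the quantity RL maximizes
print("return:", total_reward)
```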
00:31:07 So a couple of basic questions.
00:31:11 What types of RL approaches are there?
00:31:13 So I don’t know if there’s a nice brief inwards way
00:31:17 to paint the picture of sort of value based,
00:31:21 model based, policy based reinforcement learning.
00:31:25 Yeah, so now if we think about,
00:31:27 okay, so there’s this ambitious problem definition of RL.
00:31:31 It’s really, it’s truly ambitious.
00:31:33 It’s trying to capture and encircle
00:31:34 all of the things in which an agent interacts
00:31:36 with an environment and say, well,
00:31:38 how can we formalize and understand
00:31:39 what it means to crack that?
00:31:41 Now let’s think about the solution method.
00:31:43 Well, how do you solve a really hard problem like that?
00:31:46 Well, one approach you can take
00:31:48 is to decompose that very hard problem
00:31:51 into pieces that work together to solve that hard problem.
00:31:55 And so you can kind of look at the decomposition
00:31:58 that’s inside the agent’s head, if you like,
00:32:00 and ask, well, what form does that decomposition take?
00:32:03 And some of the most common pieces that people use
00:32:06 when they’re kind of putting
00:32:07 the solution method together,
00:32:09 some of the most common pieces that people use
00:32:11 are whether or not that solution has a value function.
00:32:14 That means, is it trying to predict,
00:32:16 explicitly trying to predict how much reward
00:32:18 it will get in the future?
00:32:20 Does it have a representation of a policy?
00:32:22 That means something which is deciding how to pick actions.
00:32:25 Is that decision making process explicitly represented?
00:32:28 And is there a model in the system?
00:32:31 Is there something which is explicitly trying to predict
00:32:34 what will happen in the environment?
00:32:36 And so those three pieces are, to me,
00:32:40 some of the most common building blocks.
00:32:42 And I understand the different choices in RL
00:32:47 as choices of whether or not to use those building blocks
00:32:49 when you’re trying to decompose the solution.
00:32:52 Should I have a value function represented?
00:32:54 Should I have a policy represented?
00:32:56 Should I have a model represented?
00:32:58 And there are combinations of those pieces
00:33:00 and, of course, other things that you could
00:33:01 add into the picture as well.
00:33:03 But those three fundamental choices
00:33:04 give rise to some of the branches of RL
00:33:06 with which we’re very familiar.
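
In their simplest tabular form, those three building blocks might look like the following sketch; this is illustrative only, since real agents learn these functions rather than filling tables by hand:

```python
# The three optional components of an RL agent, as plain lookup tables.
value_function = {}   # state -> predicted future reward from that state
policy = {}           # state -> which action to pick in that state
model = {}            # (state, action) -> predicted (next state, reward)

def predict_return(state):
    # Value function: explicitly predicts how much reward lies ahead.
    return value_function.get(state, 0.0)

def select_action(state, default=0):
    # Policy: the explicit decision-making component.
    return policy.get(state, default)

def imagine_step(state, action):
    # Model: explicitly predicts what the environment will do next.
    return model.get((state, action), (state, 0.0))

# Value-based methods keep value_function and derive actions from it;
# policy-based methods keep policy directly; model-based methods keep
# model and plan with it. Combinations of the three are common.
```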
00:33:08 And so those, as you mentioned,
00:33:10 there is a choice of what’s specified
00:33:14 or modeled explicitly.
00:33:17 And the idea is that all of these
00:33:20 are somehow implicitly learned within the system.
00:33:23 So it’s almost a choice of how you approach a problem.
00:33:28 Do you see those as fundamental differences
00:33:30 or are these almost like small specifics,
00:33:35 like the details of how you solve a problem
00:33:37 but they’re not fundamentally different from each other?
00:33:40 I think the fundamental idea is maybe at the higher level.
00:33:45 The fundamental idea is the first step
00:33:48 of the decomposition is really to say,
00:33:50 well, how are we really gonna solve any kind of problem
00:33:55 where you’re trying to figure out how to take actions
00:33:57 and just from this stream of observations,
00:33:59 you’ve got some agent situated in its sensory motor stream
00:34:02 and getting all these observations in,
00:34:04 getting to take these actions, and what should it do?
00:34:06 How can you even broach that problem?
00:34:07 You know, maybe the complexity of the world is so great
00:34:10 that you can’t even imagine how to build a system
00:34:13 that would understand how to deal with that.
00:34:15 And so the first step of this decomposition is to say,
00:34:18 well, you have to learn.
00:34:19 The system has to learn for itself.
00:34:22 And so note that the reinforcement learning problem
00:34:24 doesn’t actually stipulate that you have to learn.
00:34:27 Like you could maximize your rewards without learning.
00:34:29 It just wouldn't do a very good job of it.
00:34:32 So learning is required
00:34:34 because it’s the only way to achieve good performance
00:34:36 in any sufficiently large and complex environment.
00:34:40 So that’s the first step.
00:34:42 And so that step gives commonality
00:34:43 to all of the other pieces,
00:34:45 because now you might ask, well, what should you be learning?
00:34:48 What does learning even mean?
00:34:49 You know, in this sense, you know, learning might mean,
00:34:52 well, you’re trying to update the parameters
00:34:55 of some system, which is then the thing
00:34:59 that actually picks the actions.
00:35:00 And those parameters could be representing anything.
00:35:03 They could be parameterizing a value function or a model
00:35:06 or a policy.
00:35:08 And so in that sense, there’s a lot of commonality
00:35:10 in that whatever is being represented there
00:35:12 is the thing which is being learned,
00:35:13 and it’s being learned with the ultimate goal
00:35:15 of maximizing rewards.
00:35:17 But the way in which you decompose the problem
00:35:20 is really what gives the semantics to the whole system.
00:35:23 Like, are you trying to learn something to predict well,
00:35:27 like a value function or a model?
00:35:28 Are you learning something to perform well, like a policy?
00:35:31 And the form of that objective
00:35:34 is kind of giving the semantics to the system.
00:35:36 And so it really is, at the next level down,
00:35:39 a fundamental choice,
00:35:40 and we have to make those fundamental choices
00:35:42 as system designers or enable our algorithms
00:35:46 to be able to learn how to make those choices for themselves.
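
The difference between learning to predict well and learning to perform well shows up directly in the update rules. A minimal numerical sketch, with single scalar parameters and made-up numbers:

```python
import math

alpha = 0.1                                  # learning rate (assumed)

# Predict well: a value estimate v is nudged toward an observed return G,
# i.e. a gradient step on the squared prediction error (G - v)^2.
v, G = 0.0, 1.0
for _ in range(3):
    v += alpha * (G - v)
print("value estimate:", round(v, 3))        # moves toward the target 1.0

# Perform well: a policy parameter theta is nudged to make a rewarded
# action more probable (a REINFORCE-style update on log-probability).
theta, reward = 0.0, 1.0
p = 1 / (1 + math.exp(-theta))               # probability of the action taken
theta += alpha * reward * (1 - p)            # d/dtheta of log p, times reward
print("new action probability:", round(1 / (1 + math.exp(-theta)), 3))
```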
00:35:49 So then the next step you mentioned,
00:35:52 the very first thing you have to deal with is,
00:35:56 can you even take in this huge stream of observations
00:36:00 and do anything with it?
00:36:01 So the natural next basic question is,
00:36:05 what is deep reinforcement learning?
00:36:08 And what is this idea of using neural networks
00:36:11 to deal with this huge incoming stream?
00:36:14 So amongst all the approaches for reinforcement learning,
00:36:18 deep reinforcement learning
00:36:19 is one family of solution methods
00:36:23 that tries to utilize powerful representations
00:36:29 that are offered by neural networks
00:36:31 to represent any of these different components
00:36:35 of the solution, of the agent,
00:36:37 like whether it’s the value function
00:36:39 or the model or the policy.
00:36:41 The idea of deep learning is to say,
00:36:43 well, here’s a powerful toolkit that’s so powerful
00:36:46 that it’s universal in the sense
00:36:48 that it can represent any function
00:36:50 and it can learn any function.
00:36:52 And so if we can leverage that universality,
00:36:55 that means that whatever we need to represent
00:36:57 for our policy or for our value function or for a model,
00:37:00 deep learning can do it.
00:37:01 So that deep learning is one approach
00:37:04 that offers us a toolkit
00:37:06 that has no ceiling to its performance,
00:37:09 that as we start to put more resources into the system,
00:37:12 more memory and more computation and more data,
00:37:17 more experience, more interactions with the environment,
00:37:20 that these are systems that can just get better
00:37:22 and better and better at doing whatever the job is
00:37:24 they’ve asked them to do,
00:37:25 whatever we’ve asked that function to represent,
00:37:27 it can learn a function that does a better and better job
00:37:31 of representing that knowledge,
00:37:33 whether that knowledge be estimating
00:37:35 how well you’re gonna do in the world,
00:37:36 the value function,
00:37:37 whether it’s gonna be choosing what to do in the world,
00:37:40 the policy,
00:37:41 or whether it’s understanding the world itself,
00:37:43 what’s gonna happen next, the model.
00:37:45 Nevertheless, the fact that neural networks
00:37:49 are able to learn incredibly complex representations
00:37:53 that allow you to do the policy, the model
00:37:55 or the value function is, at least to my mind,
00:38:00 exceptionally beautiful and surprising.
00:38:02 Like, was it surprising to you?
00:38:07 Can you still believe it works as well as it does?
00:38:10 Do you have good intuition about why it works at all
00:38:13 and works as well as it does?
00:38:18 I think, let me take two parts to that question.
00:38:22 I think it’s not surprising to me
00:38:26 that the idea of reinforcement learning works
00:38:30 because in some sense, I feel it's the only thing
00:38:34 which can ultimately work.
00:38:36 And so I feel we have to address it,
00:38:39 and success must be possible,
00:38:41 because we have examples of intelligence.
00:38:44 And it must at some level be possible
00:38:47 to acquire experience
00:38:49 and use that experience to do better
00:38:51 in a way which is meaningful, in environments
00:38:55 of the complexity that humans can deal with.
00:38:57 It must be.
00:38:58 Am I surprised that our current systems
00:39:00 can do as well as they can do?
00:39:03 I think one of the big surprises for me
00:39:05 and a lot of the community
00:39:09 is really the fact that deep learning
00:39:13 can continue to perform so well
00:39:18 despite the fact that these neural networks
00:39:21 that they’re representing
00:39:23 have these incredibly nonlinear kind of bumpy surfaces
00:39:27 which to our kind of low dimensional intuitions
00:39:30 make it feel like surely you’re just gonna get stuck
00:39:33 and learning will get stuck
00:39:34 because you won’t be able to make any further progress.
00:39:37 And yet the big surprise is that learning continues
00:39:42 and these what appear to be local optima
00:39:45 turn out not to be because in high dimensions
00:39:48 when we make really big neural nets,
00:39:49 there’s always a way out
00:39:51 and there’s a way to go even lower
00:39:52 and then you’re still not in a local optima
00:39:55 because there’s some other pathway
00:39:57 that will take you out and take you lower still.
00:39:59 And so no matter where you are,
00:40:00 learning can proceed and do better and better and better
00:40:04 without bound.
00:40:06 And so that is a surprising
00:40:09 and beautiful property of neural nets
00:40:13 which I find elegant and beautiful
00:40:16 and somewhat shocking that it turns out to be the case.
00:40:20 As you said, which I really like
00:40:22 to our low dimensional intuitions, that’s surprising.
00:40:27 Yeah, we’re very tuned to working
00:40:31 within a three dimensional environment.
00:40:33 And so to start to visualize
00:40:36 what a billion dimensional neural network surface
00:40:41 that you’re trying to optimize over,
00:40:42 what that even looks like is very hard for us.
00:40:45 And so I think that really,
00:40:47 if you try to account for the,
00:40:52 essentially the AI winter
00:40:54 where people gave up on neural networks,
00:40:56 I think it’s really down to that lack of ability
00:41:00 to generalize from low dimensions to high dimensions
00:41:03 because back then we were in the low dimensional case.
00:41:05 People could only build neural nets
00:41:07 with 50 nodes in them or something.
00:41:11 And to imagine that it might be possible
00:41:14 to build a billion dimensional neural net
00:41:15 and it might have a completely different,
00:41:17 qualitatively different property was very hard to anticipate.
00:41:21 And I think even now we’re starting to build the theory
00:41:24 to support that.
00:41:26 And it’s incomplete at the moment,
00:41:28 but all of the theory seems to be pointing in the direction
00:41:30 that indeed this is an approach which truly is universal
00:41:34 both in its representational capacity, which was known,
00:41:37 but also in its learning ability, which is surprising.
00:41:40 And it makes one wonder what else we’re missing
00:41:44 due to our low dimensional intuitions
00:41:47 that will seem obvious once it’s discovered.
00:41:51 I often wonder, when we one day do have AIs
00:41:57 which are superhuman in their abilities
00:42:00 to understand the world,
00:42:05 what will they think of the algorithms
00:42:07 that we developed back now?
00:42:08 Will they be looking back at these days
00:42:11 and thinking that these algorithms
00:42:17 were naive first steps,
00:42:19 or will they still be the fundamental ideas
00:42:21 which are used even in 10,000, 100,000 years?
00:42:26 It’s hard to know.
00:42:27 They’ll watch back to this conversation
00:42:30 and with a smile, maybe a little bit of a laugh.
00:42:34 I mean, my sense is, I think just like when we used
00:42:40 to think that the sun revolved around the earth,
00:42:45 they’ll see our systems of today, reinforcement learning
00:42:49 as too complicated, that the answer was simple all along.
00:42:54 There’s something, just like you said in the game of Go,
00:42:58 I mean, I love the systems of like cellular automata,
00:43:01 that there’s simple rules from which incredible complexity
00:43:05 emerges, so it feels like there might be
00:43:08 some really simple approaches,
00:43:10 just like Rich Sutton says, right?
00:43:12 These simple methods with compute over time
00:43:17 seem to prove to be the most effective.
00:43:20 I 100% agree.
00:43:21 I think that if we try to anticipate
00:43:27 what will generalize well into the future,
00:43:30 I think it’s likely to be the case
00:43:32 that it’s the simple, clear ideas
00:43:35 which will have the longest legs
00:43:36 and which will carry us furthest into the future.
00:43:39 Nevertheless, we’re in a situation
00:43:40 where we need to make things work today,
00:43:43 and sometimes that requires putting together
00:43:44 more complex systems where we don’t have
00:43:47 the full answers yet as to what
00:43:49 those minimal ingredients might be.
00:43:51 So speaking of which, if we could take a step back to Go,
00:43:55 what was MoGo and what was the key idea behind the system?
00:44:00 So back during my PhD on Computer Go,
00:44:04 around about that time, there was a major new development
00:44:08 which actually happened in the context of Computer Go,
00:44:12 and it was really a revolution in the way
00:44:16 that heuristic search was done,
00:44:18 and the idea was essentially that
00:44:21 a position could be evaluated or a state in general
00:44:26 could be evaluated not by humans saying
00:44:30 whether that position is good or not,
00:44:33 or even humans providing rules
00:44:35 as to how you might evaluate it,
00:44:37 but instead by allowing the system
00:44:40 to randomly play out the game until the end multiple times
00:44:45 and taking the average of those outcomes
00:44:48 as the prediction of what will happen.
00:44:50 So for example, if you’re in the game of Go,
00:44:53 the intuition is that you take a position
00:44:55 and you get the system to kind of play random moves
00:44:58 against itself all the way to the end of the game
00:45:00 and you see who wins.
00:45:01 And if black ends up winning
00:45:03 more of those random games than white,
00:45:05 well, you say, hey, this is a position that favors black.
00:45:07 And if white ends up winning more of those random games
00:45:09 than black, then it favors white.
00:45:13 So that idea was known as Monte Carlo search,
00:45:18 and a particular form of Monte Carlo search
00:45:21 that became very effective and was developed in computer Go
00:45:24 first by Rémi Coulom in 2006,
00:45:26 and then taken further by others
00:45:29 was something called Monte Carlo tree search,
00:45:31 which basically takes that same idea
00:45:34 and uses that insight so that every node of a search tree
00:45:39 is evaluated by the average of the random playouts
00:45:42 from that node onwards.
00:45:44 And this idea, when you think about it,
00:45:46 was very powerful
00:45:49 and suddenly led to huge leaps forward
00:45:51 in the strength of computer Go playing programs.
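
The idea is compact enough to sketch. Here Monte Carlo evaluation is applied to a toy game, Nim, where players alternately take one to three stones and taking the last stone wins, rather than to Go; applying this same estimator at every node of a search tree is the essence of Monte Carlo tree search:

```python
import random

def random_playout(pile, to_move):
    """Play random moves to the end of the game; return who won (0 or 1)."""
    while pile > 0:
        pile -= random.randint(1, min(3, pile))
        if pile == 0:
            return to_move            # this player took the last stone
        to_move = 1 - to_move
    return 1 - to_move                # only reached if pile starts at 0

def evaluate(pile, to_move, n_rollouts=10000):
    """The average of random outcomes as a prediction of what will happen."""
    wins = sum(random_playout(pile, to_move) == to_move
               for _ in range(n_rollouts))
    return wins / n_rollouts

# Estimated win rate for the player to move from a pile of 8, under
# random play; a real MCTS would compute this at every search node.
print(evaluate(8, 0))
```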
00:45:55 And among those, the strongest of the Go playing programs
00:45:58 in those days was a program called MoGo,
00:46:00 which was the first program to actually reach
00:46:03 human master level on small boards, nine by nine boards.
00:46:07 And so this was a program by someone called Sylvain Gelly,
00:46:11 who’s a good colleague of mine,
00:46:13 and I worked with him a little bit in those days
00:46:16 as part of my PhD thesis.
00:46:18 And MoGo was a first step towards the latest successes
00:46:23 we saw in computer Go,
00:46:25 but it was still missing a key ingredient.
00:46:28 MoGo was evaluating purely by random rollouts against itself.
00:46:33 And in a way, it’s truly remarkable
00:46:36 that random play should give you anything at all.
00:46:39 Why in this perfectly deterministic game
00:46:42 that’s very precise and involves these very exact sequences,
00:46:46 why is it that randomization is helpful?
00:46:52 And so the intuition is that randomization
00:46:54 captures something about the nature of the search tree,
00:46:59 that from a position, you're understanding
00:47:01 the nature of the search tree from that node onwards
00:47:04 by using randomization.
00:47:06 And this was a very powerful idea.
00:47:09 And I’ve seen this in other spaces,
00:47:12 talked to Richard Karp and so on,
00:47:14 randomized algorithms somehow magically
00:47:17 are able to do exceptionally well
00:47:19 and simplify the problem somehow.
00:47:23 Makes you wonder about the fundamental nature
00:47:25 of randomness in our universe.
00:47:27 It seems to be a useful thing.
00:47:29 But so from that moment,
00:47:32 can you maybe tell the origin story
00:47:33 and the journey of AlphaGo?
00:47:36 Yeah, so programs based on Monte Carlo tree search
00:47:39 were a first revolution
00:47:41 in the sense that they led to suddenly programs
00:47:44 that could play the game to any reasonable level,
00:47:47 but they plateaued.
00:47:50 It seemed that no matter how much effort
00:47:51 people put into these techniques,
00:47:53 they couldn’t exceed the level
00:47:54 of amateur dan level Go players.
00:47:58 So strong players,
00:47:59 but not anywhere near the level of professionals,
00:48:02 nevermind the world champion.
00:48:04 And so that brings us to the birth of AlphaGo,
00:48:08 which happened in the context of a startup company
00:48:12 known as DeepMind.
00:48:14 I heard of them.
00:48:15 Where a project was born.
00:48:19 And the project was really a scientific investigation
00:48:23 where myself and Aja Huang
00:48:27 and an intern, Chris Maddison,
00:48:30 were exploring a scientific question.
00:48:33 And that scientific question was really,
00:48:37 is there another fundamentally different approach
00:48:39 to this key question of Go,
00:48:42 the key challenge of how can you build that intuition
00:48:45 and how can you just have a system
00:48:47 that could look at a position
00:48:48 and understand what move to play
00:48:51 or how well you’re doing in that position,
00:48:53 who’s gonna win?
00:48:54 And so the deep learning revolution had just begun.
00:48:59 Competitions like ImageNet had suddenly been won
00:49:03 by deep learning techniques back in 2012.
00:49:06 And following that, it was natural to ask,
00:49:08 well, if deep learning is able to scale up so effectively
00:49:12 with images to understand them enough to classify them,
00:49:16 well, why not go?
00:49:17 Why not take the black and white stones of the Go board
00:49:22 and build a system which can understand for itself
00:49:25 what that means in terms of what move to pick
00:49:27 or who’s gonna win the game, black or white?
00:49:31 And so that was our scientific question
00:49:32 which we were probing and trying to understand.
00:49:35 And as we started to look at it,
00:49:37 we discovered that we could build a system.
00:49:40 So in fact, our very first paper on AlphaGo
00:49:43 was actually a pure deep learning system
00:49:47 which was trying to answer this question.
00:49:49 And we showed that actually a pure deep learning system
00:49:52 with no search at all was actually able
00:49:54 to reach human dan level, master level
00:49:58 at the full game of Go, 19 by 19 boards.
00:50:01 And so without any search at all,
00:50:04 suddenly we had systems which were playing
00:50:06 at the level of the best Monte Carlo tree search systems,
00:50:10 the ones with randomized rollouts.
00:50:11 So first of all, sorry to interrupt,
00:50:13 but that’s kind of a groundbreaking notion.
00:50:16 That’s like basically a definitive step away
00:50:20 from a couple of decades
00:50:22 of essentially search dominating AI.
00:50:26 So how did that make you feel?
00:50:28 Was it surprising from a scientific perspective in general,
00:50:33 how did it make you feel?
00:50:33 I found this to be profoundly surprising.
00:50:37 In fact, it was so surprising that we had a bet back then.
00:50:41 And like many good projects, bets are quite motivating.
00:50:44 And the bet was whether it was possible
00:50:47 for a system based purely on deep learning,
00:50:52 with no search at all to beat a dan level human player.
00:50:55 And so we had someone who joined our team
00:51:00 who was a dan level player.
00:51:01 He came in and we had this first match against him and…
00:51:06 Which side of the bet were you on, by the way?
00:51:09 The losing or the winning side?
00:51:11 I tend to be an optimist about the power
00:51:14 of deep learning and reinforcement learning.
00:51:18 So the system won,
00:51:21 and we were able to beat this human down level player.
00:51:24 And for me, that was the moment where it was like,
00:51:26 okay, something special is afoot here.
00:51:29 We have a system which without search
00:51:32 is able to already just look at this position
00:51:36 and understand things as well as a strong human player.
00:51:39 And from that point onwards,
00:51:41 I really felt that reaching the top levels of human play,
00:51:49 professional level, world champion level,
00:51:50 I felt it was actually an inevitability.
00:51:56 And if it was an inevitable outcome,
00:51:59 I was rather keen that it would be us that achieved it.
00:52:03 So we scaled up.
00:52:05 This was something where,
00:52:06 so I had lots of conversations back then
00:52:09 with Demis Hassabis, the head of DeepMind,
00:52:14 who was extremely excited.
00:52:16 And we made the decision to scale up the project,
00:52:21 brought more people on board.
00:52:23 And so AlphaGo became something where we had a clear goal,
00:52:30 which was to try and crack this outstanding challenge of AI
00:52:33 to see if we could beat the world’s best players.
00:52:37 And this led within the space of not so many months
00:52:42 to playing against the European champion Fan Hui
00:52:45 in a match which became memorable in history
00:52:48 as the first time a Go program
00:52:50 had ever beaten a professional player.
00:52:53 And at that time we had to make a judgment
00:52:56 as to when and whether we should go
00:52:59 and challenge the world champion.
00:53:01 And this was a difficult decision to make.
00:53:04 Again, we were basing our predictions on our own progress
00:53:08 and had to estimate based on the rapidity
00:53:11 of our own progress when we thought we would exceed
00:53:15 the level of the human world champion.
00:53:17 And we tried to make an estimate and set up a match
00:53:20 and that became the AlphaGo versus Lee Sedol match in 2016.
00:53:27 And we should say, spoiler alert,
00:53:29 that AlphaGo was able to defeat Lee Sedol.
00:53:33 That’s right, yeah.
00:53:34 So maybe we could take even a broader view.
00:53:39 AlphaGo involves both learning from expert games
00:53:45 and as far as I remember, a self play component
00:53:51 to where it learns by playing against itself.
00:53:54 But in your sense, what was the role of learning
00:53:57 from expert games there?
00:53:59 And in terms of your self evaluation,
00:54:01 whether you can take on the world champion,
00:54:04 what was the thing that you’re trying to do more of?
00:54:06 Sort of train more on expert games
00:54:09 or was there another direction?
00:54:12 I’m asking so many poorly phrased questions,
00:54:15 but did you have a hope or dream that self play
00:54:19 would be the key component at that moment yet?
00:54:24 So in the early days of AlphaGo,
00:54:26 we used human data to explore the science
00:54:29 of what deep learning can achieve.
00:54:31 And so when we had our first paper that showed
00:54:34 that it was possible to predict the winner of the game,
00:54:37 that it was possible to suggest moves,
00:54:39 that was done using human data.
00:54:41 Solely human data.
00:54:42 Yeah, and so the reason that we did it that way
00:54:45 was at that time we were exploring separately
00:54:47 the deep learning aspect
00:54:48 from the reinforcement learning aspect.
00:54:51 The part which was new and unknown to me
00:54:53 at that time was how far that could be stretched.
00:54:58 Once we had that, it then became natural
00:55:00 to try and use that same representation
00:55:03 and see if we could learn for ourselves
00:55:04 using that same representation.
00:55:06 And so right from the beginning,
00:55:08 actually our goal had been to build a system
00:55:11 using self play.
00:55:14 And to us, the human data right from the beginning
00:55:16 was an expedient step to help us for pragmatic reasons
00:55:20 to go faster towards the goals of the project
00:55:24 than we might be able to starting solely from self play.
00:55:27 And so in those days, we were very aware
00:55:29 that we were choosing to use human data
00:55:32 and that might not be the longterm holy grail of AI,
00:55:37 but that it was something which was extremely useful to us.
00:55:40 It helped us to understand the system.
00:55:42 It helped us to build deep learning representations
00:55:44 which were clear and simple and easy to use.
00:55:48 And so really I would say it served a purpose
00:55:51 not just as part of the algorithm,
00:55:53 but something which I continue to use in our research today,
00:55:56 which is trying to break down a very hard challenge
00:56:00 into pieces which are easier to understand for us
00:56:02 as researchers and develop.
00:56:04 So if you use a component based on human data,
00:56:07 it can help you to understand the system
00:56:10 such that then you can build
00:56:11 the more principled version later that does it for itself.
00:56:15 So as I said, the AlphaGo victory,
00:56:19 and I don’t think I’m romanticizing this notion.
00:56:23 I think it’s one of the greatest moments
00:56:25 in the history of AI.
00:56:26 So were you cognizant of this magnitude
00:56:29 of the accomplishment at the time?
00:56:32 I mean, are you cognizant of it even now?
00:56:35 Because to me, I feel like it’s something that,
00:56:38 as we mentioned, the AGI systems of the future
00:56:41 will look back on.
00:56:42 I think they’ll look back at the AlphaGo victory
00:56:46 as like, holy crap, they figured it out.
00:56:49 This is where it started.
00:56:51 Well, thank you again.
00:56:52 I mean, it’s funny because I guess
00:56:56 I’ve been working on computer Go for a long time.
00:56:58 So I’d been working at the time of the AlphaGo match
00:57:00 on computer Go for more than a decade.
00:57:03 And throughout that decade, I’d had this dream
00:57:06 of what it would really be like
00:57:08 to actually be able to build a system
00:57:12 that could play against the world champion.
00:57:14 And I imagined that that would be an interesting moment
00:57:17 that maybe some people might care about that
00:57:20 and that this might be a nice achievement.
00:57:24 But I think when I arrived in Seoul
00:57:27 and discovered the legions of journalists
00:57:31 that were following us around and the 100 million people
00:57:34 that were watching the match online live,
00:57:37 I realized that I’d been off in my estimation
00:57:40 of how significant this moment was
00:57:41 by several orders of magnitude.
00:57:43 And so there was definitely an adjustment process
00:57:48 to realize that this was something
00:57:53 which the world really cared about
00:57:55 and which was a watershed moment.
00:57:57 And I think there was that moment of realization.
00:58:01 But it’s also a little bit scary
00:58:02 because if you go into something thinking
00:58:05 it’s gonna be maybe of interest
00:58:08 and then discover that 100 million people are watching,
00:58:10 it suddenly makes you worry about
00:58:12 whether some of the decisions you’d made
00:58:13 were really the best ones or the wisest,
00:58:16 or were going to lead to the best outcome.
00:58:18 And we knew for sure that there were still imperfections
00:58:20 in AlphaGo, which were gonna be exposed
00:58:22 to the whole world watching.
00:58:24 And so, yeah, it was I think a great experience
00:58:28 and I feel privileged to have been part of it,
00:58:32 privileged to have led that amazing team.
00:58:35 I feel privileged to have been in a moment of history
00:58:38 like you say, but also lucky that in a sense
00:58:43 I was insulated from the knowledge of it all.
00:58:46 I think it would have been harder to focus on the research
00:58:48 if the full kind of reality of what was gonna come to pass
00:58:52 had been known to me and the team.
00:58:55 I think it was, we were in our bubble
00:58:57 and we were working on research
00:58:58 and we were trying to answer the scientific questions
00:59:01 and then bam, the public sees it.
00:59:04 And I think it was better that way in retrospect.
00:59:07 Were you confident that, I guess,
00:59:10 what were the chances that you could get the win?
00:59:13 So just like you said, I’m a little bit more familiar
00:59:19 with another accomplishment
00:59:20 that we may not even get a chance to talk to.
00:59:22 I talked to Oriol Vinyals about AlphaStar
00:59:24 which is another incredible accomplishment,
00:59:26 but here with AlphaStar and beating StarCraft,
00:59:31 there was already a track record with AlphaGo.
00:59:34 This was really the first time
00:59:36 you get to see reinforcement learning
00:59:39 face the best human in the world.
00:59:41 So what was your confidence like, what were the odds?
00:59:45 Well, we actually. Was there a bet?
00:59:47 Funnily enough, there was.
00:59:49 So just before the match,
00:59:52 we weren’t betting on anything concrete,
00:59:54 but we all held out a hand.
00:59:56 Everyone in the team held out a hand
00:59:57 at the beginning of the match.
00:59:59 And the number of fingers that they had out on their hand
01:00:01 was supposed to represent how many games
01:00:03 they thought we would win against Lee Sedol.
01:00:06 And there was an amazing spread in the team’s predictions.
01:00:10 But I have to say, I predicted four to one.
01:00:15 And the reason was based purely on data.
01:00:18 So I’m a scientist first and foremost.
01:00:20 And one of the things which we had established
01:00:23 was that AlphaGo in around one in five games
01:00:27 would develop something which we called a delusion,
01:00:29 which was a kind of hole in its knowledge
01:00:31 where it wasn’t able to fully understand
01:00:34 everything about the position.
01:00:36 And that hole in its knowledge would persist
01:00:38 for tens of moves throughout the game.
01:00:41 And we knew two things.
01:00:42 We knew that if there were no delusions,
01:00:44 that AlphaGo seemed to be playing at a level
01:00:46 that was far beyond any human capabilities.
01:00:49 But we also knew that if there were delusions,
01:00:52 the opposite was true.
01:00:53 And in fact, that’s what came to pass.
01:00:58 We saw all of those outcomes.
01:01:00 And Lee Sedol in one of the games
01:01:02 played a really beautiful sequence
01:01:04 that AlphaGo just hadn’t predicted.
01:01:08 And after that, it led it into this situation
01:01:11 where it was unable to really understand the position fully
01:01:14 and found itself in one of these delusions.
01:01:17 So indeed, yeah, four to one was the outcome.
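The four-to-one prediction follows from simple arithmetic, under the assumptions stated above: each game independently suffers a delusion with probability p = 1/5, a delusion-free game is a win, and a deluded game is a loss.

```latex
\mathbb{E}[\text{wins in 5 games}] \;=\; 5\,(1 - p) \;=\; 5 \times \tfrac{4}{5} \;=\; 4
```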
01:01:20 So yeah, and can you maybe speak to it a little bit more?
01:01:23 What were the five games?
01:01:25 What happened?
01:01:26 Is there interesting things that come to memory
01:01:29 in terms of the play of the human or the machine?
01:01:33 So I remember all of these games vividly, of course.
01:01:37 Moments like these don’t come too often
01:01:39 in the lifetime of a scientist.
01:01:42 And the first game was magical because it was the first time
01:01:49 that a computer program had defeated a world
01:01:53 champion in this grand challenge of Go.
01:01:57 And there was a moment where AlphaGo invaded Lee Sedol’s
01:02:04 territory towards the end of the game.
01:02:07 And that’s quite an audacious thing to do.
01:02:09 It’s like saying, hey, you thought
01:02:11 this was going to be your territory in the game,
01:02:12 but I’m going to stick a stone right in the middle of it
01:02:14 and prove to you that I can break it up.
01:02:17 And Lee Sedol’s face just dropped.
01:02:20 He wasn’t expecting a computer to do something that audacious.
01:02:26 The second game became famous for a move known as move 37.
01:02:30 This was a move that was played by AlphaGo that broke
01:02:36 all of the conventions of Go, that the Go players were
01:02:39 so shocked by this.
01:02:40 They thought that maybe the operator had made a mistake.
01:02:45 They thought that there was something crazy going on.
01:02:48 And it just broke every rule that Go players
01:02:50 are taught from a very young age.
01:02:52 They’re just taught this kind of move called a shoulder hit.
01:02:55 You can only play it on the third line or the fourth line,
01:02:58 and AlphaGo played it on the fifth line.
01:03:00 And it turned out to be a brilliant move
01:03:03 and made this beautiful pattern in the middle of the board that
01:03:06 ended up winning the game.
01:03:08 And so this really was a clear instance
01:03:12 where we could say computers exhibited creativity,
01:03:16 that this was really a move that was something
01:03:18 humans hadn’t known about, hadn’t anticipated.
01:03:22 And computers discovered this idea.
01:03:24 They were the ones to say, actually, here’s
01:03:27 a new idea, something new, not in the domains
01:03:30 of human knowledge of the game.
01:03:33 And now the humans think this is a reasonable thing to do.
01:03:38 And it’s part of Go knowledge now.
01:03:41 The third game, something special
01:03:44 happens when you play against a human world champion, which,
01:03:46 again, I hadn’t anticipated before going there,
01:03:48 which is these players are amazing.
01:03:53 Lee Sedol was a true champion, an 18-time world champion,
01:03:56 and had this amazing ability to probe AlphaGo
01:04:01 for weaknesses of any kind.
01:04:03 And in the third game, he was losing,
01:04:06 and we felt we were sailing comfortably to victory.
01:04:09 But he managed to, from nothing, stir up this fight
01:04:14 and build what’s called a double ko,
01:04:17 these kind of repetitive positions.
01:04:20 And he knew that historically, no computer Go program had ever
01:04:24 been able to deal correctly with double ko positions.
01:04:26 And he managed to summon one out of nothing.
01:04:29 And so for us, this was a real challenge.
01:04:33 Would AlphaGo be able to deal with this,
01:04:35 or would it just kind of crumble in the face of this situation?
01:04:38 And fortunately, it dealt with it perfectly.
01:04:41 The fourth game was amazing in that Lee Sedol
01:04:46 appeared to be losing this game.
01:04:48 AlphaGo thought it was winning.
01:04:49 And then Lee Sedol did something,
01:04:52 which I think only a true world champion can do,
01:04:55 which is he found a brilliant sequence
01:04:57 in the middle of the game, a brilliant sequence
01:04:59 that led him to really just transform the position.
01:05:05 He kind of found just a piece of genius, really.
01:05:10 And after that, AlphaGo, its evaluation just tumbled.
01:05:15 It thought it was winning this game.
01:05:17 And all of a sudden, it tumbled and said, oh, now
01:05:20 I’ve got no chance.
01:05:21 And it started to behave rather oddly at that point.
01:05:24 In the final game, for some reason,
01:05:27 having seen AlphaGo in the previous game
01:05:30 suffer from delusions,
01:05:31 we as a team were convinced that it
01:05:34 was suffering from another delusion.
01:05:35 We were convinced that it was misevaluating the position
01:05:38 and that something was going terribly wrong.
01:05:41 And it was only in the last few moves of the game
01:05:43 that we realized that actually, although it
01:05:46 had been predicting it was going to win all the way through,
01:05:49 it really was.
01:05:51 And so somehow, it just taught us yet again
01:05:54 that you have to have faith in your systems.
01:05:56 When they exceed your own level of ability
01:05:58 and your own judgment, you have to trust in them
01:06:01 to know better than you, the designer, once you’ve
01:06:06 bestowed in them the ability to judge better than you can,
01:06:10 then trust the system to do so.
01:06:13 So just like in the case of Deep Blue
01:06:18 beating Garry Kasparov, which was, I think,
01:06:23 the first time he had ever lost to anybody.
01:06:24 And I mean, there’s a similar situation with Lee Sedol.
01:06:27 It’s a tragic loss for humans, but a beautiful one,
01:06:36 I think, that’s kind of, from the tragedy,
01:06:40 sort of emerges over time
01:06:45 a kind of inspiring story.
01:06:47 But Lee Sedol recently announced his retirement.
01:06:52 I don’t know if we can look too deeply into it,
01:06:56 but he did say that even if I become number one,
01:06:59 there’s an entity that cannot be defeated.
01:07:02 So what do you think about these words?
01:07:05 What do you think about his retirement from the game of Go?
01:07:08 Well, let me take you back, first of all,
01:07:09 to the first part of your comment about Garry Kasparov,
01:07:12 because actually, at the panel yesterday,
01:07:15 he specifically said that when he first lost to Deep Blue,
01:07:19 he viewed it as a failure.
01:07:22 He viewed that this had been a failure of his.
01:07:24 But later on in his career, he said
01:07:27 he’d come to realize that actually, it was a success.
01:07:30 It was a success for everyone, because this marked
01:07:33 a transformational moment for AI.
01:07:37 And so even for Garry Kasparov, he
01:07:39 came to realize that that moment was pivotal
01:07:42 and actually meant something much more
01:07:45 than his personal loss in that moment.
01:07:49 Lee Sedol, I think, was much more cognizant of that,
01:07:53 even at the time.
01:07:54 And so in his closing remarks to the match,
01:07:57 he really felt very strongly that what
01:08:01 had happened in the AlphaGo match
01:08:02 was not only meaningful for AI, but for humans as well.
01:08:06 And he felt as a Go player that it had opened his horizons
01:08:09 and meant that he could start exploring new things.
01:08:12 It brought his joy back for the game of Go,
01:08:14 because it had broken all of the conventions and barriers
01:08:18 and meant that suddenly, anything was possible again.
01:08:23 So I was sad to hear that he’d retired,
01:08:26 but he’s been a great world champion over many, many years.
01:08:31 And I think he’ll be remembered for that ever more.
01:08:36 He’ll be remembered as the last person to beat AlphaGo.
01:08:39 I mean, after that, we increased the power of the system.
01:08:43 And the next version of AlphaGo beat other strong human
01:08:49 players 60 games to nil.
01:08:52 So what a great moment for him and something
01:08:55 to be remembered for.
01:08:58 It’s interesting that you spent time at AAAI on a panel
01:09:02 with Garry Kasparov.
01:09:05 I’m just
01:09:07 curious to learn about the conversations you’ve
01:09:12 had with Garry, because he’s also now,
01:09:15 he’s written a book about artificial intelligence.
01:09:17 He’s thinking about AI.
01:09:18 He has kind of a view of it.
01:09:21 And he talks about AlphaGo a lot.
01:09:23 What’s your sense?
01:09:26 Arguably, I’m not just being Russian,
01:09:28 but I think Garry is the greatest chess player
01:09:31 of all time, probably one of the greatest game
01:09:34 players of all time.
01:09:36 And you were sort of at the center of creating
01:09:41 a system that beats one of the greatest players of all time.
01:09:45 So what is that conversation like?
01:09:46 Is there anything, any interesting digs, any bets,
01:09:50 any funny things, any profound things?
01:09:53 So Garry Kasparov has an incredible respect
01:09:58 for what we did with AlphaGo.
01:10:01 And it’s an amazing tribute coming from him of all people
01:10:07 that he really appreciates and respects what we’ve done.
01:10:11 And I think he feels strongly about the progress
01:10:14 which has happened in computer chess, where later,
01:10:19 after AlphaGo, we built the AlphaZero system, which
01:10:23 defeated the world’s strongest chess programs.
01:10:26 And to Garry Kasparov, that moment in computer chess
01:10:29 was more profound than Deep Blue.
01:10:32 And the reason he believes it mattered more
01:10:35 was because it was done with learning
01:10:37 and a system which was able to discover for itself
01:10:39 new principles, new ideas, which were
01:10:42 able to play the game in a way which he, or anyone else,
01:10:47 hadn’t known about.
01:10:50 And in fact, one of the things I discovered at this panel
01:10:53 was that the current world champion, Magnus Carlsen,
01:10:56 apparently recently commented on his improvement
01:11:00 in performance.
01:11:01 And he attributed it to AlphaZero,
01:11:03 that he’s been studying the games of AlphaZero.
01:11:05 And he’s changed his style to play more like AlphaZero.
01:11:08 And it’s led to him actually increasing his rating
01:11:13 to a new peak.
01:11:15 Yeah, I guess to me, just like to Garry,
01:11:18 the inspiring thing is that, and just like you said,
01:11:21 with reinforcement learning, reinforcement learning
01:11:25 and deep learning, machine learning
01:11:26 feels like what intelligence is.
01:11:29 And you could attribute it to a bitter viewpoint
01:11:35 from Garry’s perspective, from us humans’ perspective,
01:11:39 saying that the pure search that IBM’s Deep Blue was doing
01:11:43 is not really intelligence, but somehow it just didn’t feel like it.
01:11:47 And so that’s the magical thing.
01:11:49 I’m not sure what it is about learning that
01:11:50 feels like intelligence, but it does.
01:11:54 So I think we should not demean the achievements of what
01:11:58 was done in previous eras of AI.
01:12:00 I think that Deep Blue was an amazing achievement in itself.
01:12:04 And that heuristic search of the kind that was used by Deep
01:12:07 Blue had some powerful ideas that were in there,
01:12:11 but it also missed some things.
01:12:13 So the fact that the evaluation function, the way
01:12:16 that the chess position was understood,
01:12:18 was created by humans and not by the machine
01:12:22 is a limitation, which means that there’s
01:12:26 a ceiling on how well it can do.
01:12:28 But maybe more importantly, it means
01:12:30 that the same idea cannot be applied in other domains
01:12:33 where we don’t have access to the human grandmasters
01:12:38 and that ability to encode exactly their knowledge
01:12:41 into an evaluation function.
01:12:43 And the reality is that the story of AI
01:12:45 is that most domains turn out to be of the second type
01:12:48 where knowledge is messy, it’s hard to extract from experts,
01:12:52 or it isn’t even available.
01:12:53 And so we need to solve problems in a different way.
01:12:59 And I think AlphaGo is a step towards solving things
01:13:02 in a way which puts learning as a first class citizen
01:13:07 and says systems need to understand for themselves
01:13:11 how to understand the world, how to judge the value of any action
01:13:19 that they might take within that world
01:13:20 and any state they might find themselves in.
01:13:22 And in order to do that, we make progress towards AI.
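In standard reinforcement learning notation, "judging the value of any action in any state" is the state-action value function. This is the generic textbook definition with a discount factor, not any system-specific formula:

```latex
Q^{\pi}(s, a) \;=\; \mathbb{E}_{\pi}\!\left[\, \sum_{t=0}^{\infty} \gamma^{t} r_{t} \;\middle|\; s_{0} = s,\; a_{0} = a \right]
```

A system that can estimate this quantity for itself, in any state it encounters, is exactly one that has learned how to judge the value of any action.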
01:13:29 Yeah, so one of the nice things about taking a learning
01:13:32 approach to the game of Go or game playing
01:13:36 is that the things you learn, the things you figure out,
01:13:39 are actually going to be applicable to other problems
01:13:42 that are real world problems.
01:13:44 That’s ultimately, I mean, there’s
01:13:47 two really interesting things about AlphaGo.
01:13:49 One is the science of it, just the science of learning,
01:13:52 the science of intelligence.
01:13:54 And then the other is that while you’re actually
01:13:56 learning, you’re figuring out how to build systems that
01:13:59 would be potentially applicable in other applications,
01:14:04 medical, autonomous vehicles, robotics,
01:14:06 I mean, it just opens the door to all kinds of applications.
01:14:10 So the next incredible step, really the profound step
01:14:16 is probably AlphaGo Zero.
01:14:18 I mean, it’s arguable.
01:14:20 I kind of see them all as the same place.
01:14:22 But really, and perhaps you were already
01:14:24 thinking that AlphaGo Zero was the natural next step.
01:14:26 It was always going to be the next step.
01:14:29 But it’s removing the reliance on human expert games
01:14:33 for pre training, as you mentioned.
01:14:35 So how big of an intellectual leap
01:14:38 was this that self play could achieve superhuman level
01:14:43 performance on its own?
01:14:45 And maybe could you also say, what is self play?
01:14:48 Kind of mention it a few times.
01:14:51 So let me start with self play.
01:14:55 So the idea of self play is something
01:14:58 which is really about systems learning for themselves,
01:15:01 but in the situation where there’s more than one agent.
01:15:05 And so if you’re in a game, and the game
01:15:08 is played between two players, then self play
01:15:11 is really about understanding that game just
01:15:15 by playing games against yourself
01:15:17 rather than against any actual real opponent.
01:15:19 And so it’s a way to kind of discover strategies
01:15:23 without having to actually need to go out and play
01:15:27 against any particular human player, for example.
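As a minimal sketch of that idea, with a toy game and a random stand-in policy; the Game and Agent classes here are hypothetical placeholders, not DeepMind code:

```python
# Sketch of self-play: one agent supplies both sides of the game,
# and every visited position is credited with the final result.
import random

class Game:
    """Toy two-player game: nine alternating moves, then a result."""
    def __init__(self):
        self.history = []
    def legal_moves(self):
        return [m for m in range(9) if m not in self.history]
    def play(self, move):
        self.history.append(move)
    def over(self):
        return len(self.history) == 9
    def outcome(self):
        return random.choice([+1, -1])   # placeholder win/loss rule

class Agent:
    def select(self, game):
        return random.choice(game.legal_moves())   # stand-in for a learned policy

def self_play_episode(agent):
    game, positions = Game(), []
    while not game.over():
        positions.append(list(game.history))
        game.play(agent.select(game))
    z = game.outcome()   # +1 if the first mover won, -1 otherwise
    # label each position with the result from that mover's perspective
    return [(pos, z if i % 2 == 0 else -z) for i, pos in enumerate(positions)]

training_data = self_play_episode(Agent())
```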
01:15:36 The main idea of AlphaZero was really
01:15:38 to try and step back from any of the knowledge
01:15:45 that we put into the system and ask the question,
01:15:47 is it possible to come up with a single elegant principle
01:15:52 by which a system can learn for itself all of the knowledge
01:15:57 which it requires to play a game such as Go?
01:16:00 Importantly, by taking knowledge out,
01:16:03 you not only make the system less brittle in the sense
01:16:08 that perhaps the knowledge you were putting in
01:16:10 was just getting in the way and maybe stopping the system
01:16:13 learning for itself, but also you make it more general.
01:16:17 The more knowledge you put in, the harder
01:16:20 it is for a system to actually be
01:16:23 taken out of the domain in which it’s kind of been designed
01:16:26 and placed in some other domain that maybe would need
01:16:29 a completely different knowledge base to understand
01:16:31 and perform well.
01:16:32 And so the real goal here is to strip out all of the knowledge
01:16:36 that we put in to the point that we can just plug it
01:16:39 into something totally different.
01:16:41 And that, to me, is really the promise of AI
01:16:45 is that we can have systems such as that which,
01:16:47 no matter what the goal is, no matter what goal
01:16:51 we set to the system, we can come up
01:16:53 with an algorithm which can be placed into that world,
01:16:57 into that environment, and can succeed
01:16:59 in achieving that goal.
01:17:01 And then that, to me, is almost the essence of intelligence
01:17:06 if we can achieve that.
01:17:07 And so AlphaZero is a step towards that.
01:17:11 And it’s a step that was taken in the context of two player
01:17:15 perfect information games like Go and chess.
01:17:18 We also applied it to Japanese chess.
01:17:21 So just to clarify, the first step
01:17:23 was AlphaGo Zero.
01:17:25 The first step was to try and take all of the knowledge out
01:17:29 of AlphaGo in such a way that it could
01:17:32 play in a fully self discovered way, purely from self play.
01:17:39 And to me, the motivation for that
01:17:41 was always that we could then plug it into other domains.
01:17:44 But we saved that until later.
01:17:48 Well, in fact, I mean, just for fun,
01:17:52 I could tell you exactly the moment
01:17:54 where the idea for AlphaZero occurred to me.
01:17:57 Because I think there’s maybe a lesson there for researchers
01:18:00 who are too deeply embedded in their research
01:18:03 and working 24/7 to try and come up with the next idea,
01:18:08 which is it actually occurred to me on my honeymoon.
01:18:13 And I was at my most fully relaxed state,
01:18:17 really enjoying myself, and just bing,
01:18:22 the algorithm for AlphaZero just appeared in its full form.
01:18:29 And this was actually before we played against Lee Sedol.
01:18:33 But we just didn’t.
01:18:35 I think we were so busy trying to make sure
01:18:39 we could beat the world champion that it was only later
01:18:43 that we had the opportunity to step back and start
01:18:47 examining that sort of deeper scientific question of whether
01:18:51 this could really work.
01:18:52 So nevertheless, so self play is probably
01:18:56 one of the most profound ideas that represents, to me at least,
01:19:03 artificial intelligence.
01:19:05 But the fact that you could use that kind of mechanism
01:19:09 to, again, beat world class players,
01:19:13 that’s very surprising.
01:19:14 So to me, it feels like you have to train
01:19:19 in a large number of expert games.
01:19:21 So was it surprising to you?
01:19:22 What was the intuition?
01:19:23 Can you sort of think, not necessarily at that time,
01:19:26 even now, what’s your intuition?
01:19:27 Why this thing works so well?
01:19:30 Why it’s able to learn from scratch?
01:19:31 Well, let me first say why we tried it.
01:19:34 So we tried it both because I feel
01:19:36 that it was the deeper scientific question
01:19:38 to be asking to make progress towards AI,
01:19:42 and also because, in general, in my research,
01:19:44 I don’t like to do research on questions for which we already
01:19:49 know the likely outcome.
01:19:51 I don’t see much value in running an experiment where
01:19:53 you’re 95% confident that you will succeed.
01:19:57 And so we could have tried maybe to take AlphaGo and do
01:20:02 something which we knew for sure it would succeed on.
01:20:05 But much more interesting to me was to try it on the things
01:20:07 which we weren’t sure about.
01:20:09 And one of the big questions on our minds
01:20:12 back then was, could you really do this with self play alone?
01:20:16 How far could that go?
01:20:17 Would it be as strong?
01:20:19 And honestly, we weren’t sure.
01:20:22 It was 50-50, I think.
01:20:25 If you’d asked me, I wasn’t confident
01:20:27 that it could reach the same level as these systems,
01:20:30 but it felt like the right question to ask.
01:20:33 And even if it had not achieved the same level,
01:20:36 I felt that that was an important direction
01:20:41 to be studying.
01:20:42 And so then, lo and behold, it actually
01:20:48 ended up outperforming the previous version of AlphaGo
01:20:52 and indeed was able to beat it by 100 games to zero.
01:20:55 So what’s the intuition as to why?
01:20:59 I think the intuition to me is clear,
01:21:02 that whenever you have errors in a system, as we did in AlphaGo,
01:21:09 AlphaGo suffered from these delusions.
01:21:11 Occasionally, it would misunderstand
01:21:13 what was going on in a position and misevaluate it.
01:21:15 How can you remove all of these errors?
01:21:19 Errors arise from many sources.
01:21:21 For us, they were arising both starting from the human data,
01:21:25 but also from the nature of the search
01:21:27 and the nature of the algorithm itself.
01:21:29 But the only way to address them in any complex system
01:21:33 is to give the system the ability
01:21:36 to correct its own errors.
01:21:37 It must be able to correct them.
01:21:39 It must be able to learn for itself
01:21:41 when it’s doing something wrong and correct for it.
01:21:44 And so it seemed to me that the way to correct delusions
01:21:47 was indeed to have more iterations of reinforcement
01:21:51 learning, that no matter where you start,
01:21:53 you should be able to correct those errors
01:21:55 until it gets to play that out and understand,
01:21:58 oh, well, I thought that I was going to win in this situation,
01:22:01 but then I ended up losing.
01:22:03 That suggests that I was misevaluating something.
01:22:05 There’s a hole in my knowledge, and now the system
01:22:07 can correct for itself and understand how to do better.
01:22:11 Now, if you take that same idea and trace it back
01:22:14 all the way to the beginning, it should
01:22:16 be able to take you from no knowledge,
01:22:19 from completely random starting point,
01:22:21 all the way to the highest levels of knowledge
01:22:24 that you can achieve in a domain.
01:22:27 And the principle is the same, that if you bestow a system
01:22:30 with the ability to correct its own errors,
01:22:33 then it can take you from random to something slightly
01:22:36 better than random because it sees the stupid things
01:22:39 that the random is doing, and it can correct them.
01:22:41 And then it can take you from that slightly better system
01:22:43 and understand, well, what’s that doing wrong?
01:22:45 And it takes you on to the next level and the next level.
01:22:49 And this progress can go on indefinitely.
01:22:52 And indeed, what would have happened
01:22:55 if we’d carried on training AlphaGo Zero for longer?
01:22:59 We saw no sign of it slowing down its improvements,
01:23:03 or at least it was certainly carrying on to improve.
01:23:06 And presumably, if you had the computational resources,
01:23:11 this could lead to better and better systems
01:23:14 that discover more and more.
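One way to read that error-correcting step as code, as a hedged sketch: regress the network's value prediction toward the actual self-play outcome, so a confidently wrong evaluation, a delusion, produces a large gradient and gets pulled back toward reality. The network shape and board encoding are illustrative assumptions.

```python
# Sketch of correcting misevaluations by regressing the predicted value
# toward the actual game outcome z in {-1, +1}.
import torch
import torch.nn as nn

value_net = nn.Sequential(nn.Linear(361, 128), nn.ReLU(),
                          nn.Linear(128, 1), nn.Tanh())
opt = torch.optim.SGD(value_net.parameters(), lr=1e-2)

def correct_toward_outcome(positions, z):
    """positions: (N, 361) float tensor of positions from one game;
    z: (N, 1) final result from each mover's perspective."""
    opt.zero_grad()
    v = value_net(positions)        # predicted result in [-1, 1]
    loss = ((v - z) ** 2).mean()    # largest exactly where the net was deluded
    loss.backward()
    opt.step()
    return loss.item()
```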
01:23:15 So your intuition is fundamentally
01:23:18 there’s not a ceiling to this process.
01:23:21 One of the surprising things, just like you said,
01:23:24 is the process of patching errors.
01:23:27 It intuitively makes sense that this is,
01:23:31 that reinforcement learning should be part of that process.
01:23:33 But what is surprising is in the process
01:23:36 of patching your own lack of knowledge,
01:23:39 you don’t open up other holes.
01:23:41 You keep sort of, like there’s a monotonic decrease
01:23:46 of your weaknesses.
01:23:48 Well, let me back this up.
01:23:50 I think science always should make falsifiable hypotheses.
01:23:53 So let me back up this claim with a falsifiable hypothesis,
01:23:57 which is that if someone was to, in the future,
01:23:59 take AlphaZero as an algorithm
01:24:02 and run it with greater computational resources
01:24:07 than we had available today,
01:24:10 then I would predict that they would be able
01:24:12 to beat the previous system 100 games to zero.
01:24:15 And that if they were then to do the same thing
01:24:17 a couple of years later,
01:24:19 that that would beat that previous system 100 games to zero,
01:24:22 and that that process would continue indefinitely
01:24:25 throughout at least my human lifetime.
01:24:27 Presumably the game of Go would set the ceiling.
01:24:31 I mean.
01:24:31 The game of Go would set the ceiling,
01:24:33 but the game of Go has 10 to the 170 states in it.
01:24:35 So the ceiling is unreachable by any computational device
01:24:40 that can be built out of the 10 to the 80 atoms
01:24:44 in the universe.
01:24:46 You asked a really good question,
01:24:47 which is, do you not open up other errors
01:24:51 when you correct your previous ones?
01:24:53 And the answer is yes, you do.
01:24:56 And so it’s a remarkable fact
01:24:58 about this class of two player game
01:25:02 and also true of single agent games
01:25:05 that essentially progress will always lead you to,
01:25:11 if you have sufficient representational resource,
01:25:15 like imagine you
01:25:16 could represent every state in a big table of the game,
01:25:20 then we know for sure that a progress of self improvement
01:25:24 will lead all the way in the single agent case
01:25:27 to the optimal possible behavior,
01:25:29 and in the two player case to the minimax optimal behavior.
01:25:31 And that is the best way that I can play
01:25:35 knowing that you’re playing perfectly against me.
01:25:38 And so for those cases,
01:25:39 we know that even if you do open up some new error,
01:25:44 that in some sense you’ve made progress.
01:25:46 You’re progressing towards the best that can be done.
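In equation form, the tabular claim reads as the standard minimax recursion over game states s, where next(s, a) is the position reached by move a and z(s) is the result of a finished game:

```latex
V^{*}(s) =
\begin{cases}
z(s) & \text{if } s \text{ is terminal,} \\
\max_{a} V^{*}(\mathrm{next}(s, a)) & \text{if it is our move in } s, \\
\min_{a} V^{*}(\mathrm{next}(s, a)) & \text{if it is the opponent's move in } s.
\end{cases}
```

This is the value the answer refers to: the best you can do against an opponent who plays perfectly against you.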
01:25:50 So AlphaGo was initially trained on expert games
01:25:55 with some self play.
01:25:56 AlphaGo Zero removed the need to be trained on expert games.
01:26:00 And then another incredible step for me,
01:26:03 because I just love chess,
01:26:05 is to generalize that further to be in AlphaZero
01:26:09 to be able to play the game of Go,
01:26:12 beating AlphaGo Zero and AlphaGo,
01:26:14 and then also being able to play the game of chess
01:26:18 and others.
01:26:19 So what was that step like?
01:26:20 What’s the interesting aspects there
01:26:23 that required to make that happen?
01:26:26 I think the remarkable observation,
01:26:29 which we saw with AlphaZero,
01:26:31 was that actually without modifying the algorithm at all,
01:26:35 it was able to play and crack
01:26:37 some of AI’s greatest previous challenges.
01:26:41 In particular, we dropped it into the game of chess.
01:26:44 And unlike the previous systems like Deep Blue,
01:26:47 which had been worked on for years and years,
01:26:50 we were able to beat
01:26:52 the world’s strongest computer chess program convincingly
01:26:57 using a system that was fully discovered
01:27:00 from scratch with its own principles.
01:27:04 And in fact, one of the nice things that we found
01:27:08 was that in fact, we also achieved the same result
01:27:11 in Japanese chess, shogi, a variant of chess
01:27:13 where you get to capture pieces
01:27:15 and then place them back down on your own side
01:27:17 as an extra piece.
01:27:18 So a much more complicated variant of chess.
01:27:21 And we also beat the world’s strongest programs
01:27:24 and reached superhuman performance in that game too.
01:27:28 And the very first time that we ever ran the system
01:27:32 on that particular game
01:27:34 was the version that we published
01:27:35 in the paper on AlphaZero.
01:27:38 It just worked out of the box, literally, no touching it.
01:27:41 We didn’t have to do anything.
01:27:42 And there it was, superhuman performance,
01:27:45 no tweaking, no twiddling.
01:27:47 And so I think there’s something beautiful
01:27:49 about that principle that you can take an algorithm
01:27:52 and without twiddling anything, it just works.
01:27:57 Now, to go beyond AlphaZero, what’s required?
01:28:02 AlphaZero is just a step.
01:28:05 And there’s a long way to go beyond that
01:28:06 to really crack the deep problems of AI.
01:28:10 But one of the important steps is to acknowledge
01:28:13 that the world is a really messy place.
01:28:16 It’s this rich, complex, beautiful,
01:28:18 but messy environment that we live in.
01:28:21 And no one gives us the rules.
01:28:23 Like no one knows the rules of the world.
01:28:26 At least maybe we understand that it operates
01:28:28 according to Newtonian or quantum mechanics
01:28:31 at the micro level or according to relativity
01:28:34 at the macro level.
01:28:35 But that’s not a model that’s useful for us as people
01:28:38 to operate in it.
01:28:40 Somehow the agent needs to understand the world for itself
01:28:43 in a way where no one tells it the rules of the game.
01:28:46 And yet it can still figure out what to do in that world,
01:28:50 deal with this stream of observations coming in,
01:28:53 rich sensory input coming in,
01:28:55 actions going out in a way that allows it to reason
01:28:58 in the way that AlphaGo or AlphaZero can reason
01:29:01 in the way that these go and chess playing programs
01:29:03 can reason.
01:29:04 But in a way that allows it to take actions
01:29:07 in that messy world to achieve its goals.
01:29:11 And so this led us to the most recent step
01:29:15 in the story of AlphaGo,
01:29:17 which was a system called MuZero.
01:29:19 And MuZero is a system which learns for itself
01:29:23 even when the rules are not given to it.
01:29:25 It actually can be dropped into a system
01:29:28 with messy perceptual inputs.
01:29:29 We actually tried it in some Atari games,
01:29:33 the canonical domains of Atari
01:29:36 that have been used for reinforcement learning.
01:29:38 And this system learned to build a model
01:29:42 of these Atari games that was sufficiently rich
01:29:46 and useful enough for it to be able to plan successfully.
01:29:51 And in fact, that system not only went on
01:29:53 to beat the state of the art in Atari,
01:29:56 but the same system without modification
01:29:59 was able to reach the same level of superhuman performance
01:30:02 in Go, chess, and shogi that we’d seen in AlphaZero,
01:30:06 showing that even without the rules,
01:30:08 the system can learn for itself just by trial and error,
01:30:11 just by playing this game of Go.
01:30:13 And no one tells you what the rules are,
01:30:15 but you just get to the end and someone says win or loss.
01:30:19 You play this game of chess and someone says win or loss,
01:30:22 or you play a game of breakout in Atari
01:30:25 and someone just tells you your score at the end.
01:30:28 And the system for itself figures out
01:30:30 essentially the rules of the system,
01:30:31 the dynamics of the world, how the world works.
01:30:35 And not in any explicit way, but just implicitly,
01:30:39 enough understanding for it to be able to plan
01:30:41 in that system in order to achieve its goals.
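As a hedged sketch of the decomposition published in the MuZero work: a representation function encodes the observation into a hidden state, a dynamics function imagines that state forward given an action, and a prediction function reads out policy and value, so planning happens entirely inside the learned model, never touching the real rules. Sizes and module choices below are illustrative assumptions.

```python
# Sketch of the MuZero-style decomposition: three learned functions
# (sizes and single-layer modules are illustrative stand-ins).
import torch
import torch.nn as nn

OBS, HIDDEN, ACTIONS = 128, 64, 4

representation = nn.Linear(OBS, HIDDEN)             # h: observation -> hidden state
dynamics = nn.Linear(HIDDEN + ACTIONS, HIDDEN + 1)  # g: (state, action) -> state', reward
prediction = nn.Linear(HIDDEN, ACTIONS + 1)         # f: state -> policy logits, value

def imagine(obs, actions):
    """Unroll a plan inside the learned model: encode once, step forward."""
    s = representation(obs)
    trajectory = []
    for a in actions:
        a_onehot = nn.functional.one_hot(torch.tensor(a), ACTIONS).float()
        out = dynamics(torch.cat([s, a_onehot]))
        s, reward = out[:HIDDEN], out[HIDDEN]
        head = prediction(s)
        trajectory.append((reward, head[:ACTIONS], head[ACTIONS]))
    return trajectory

plan = imagine(torch.randn(OBS), [0, 2, 1])   # imagined rollout of three actions
```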
01:30:45 And that’s the fundamental process
01:30:48 that you have to go through when you’re acting
01:30:49 in any uncertain kind of environment,
01:30:51 as you would in the real world,
01:30:53 is figuring out the sort of the rules,
01:30:55 the basic rules of the game.
01:30:56 That’s right.
01:30:57 So that allows it to be applicable
01:31:00 to basically any domain that could be digitized
01:31:05 in the way that it needs to in order to be consumable,
01:31:10 sort of in order for the reinforcement learning framework
01:31:12 to be able to sense the environment,
01:31:13 to be able to act in the environment and so on.
01:31:15 The full reinforcement learning problem
01:31:16 needs to deal with worlds that are unknown and complex
01:31:21 and the agent needs to learn for itself
01:31:23 how to deal with that.
01:31:24 And so MuZero is a further step in that direction.
01:31:29 One of the things that inspired the general public
01:31:32 and that comes up in conversations I have, like with my parents,
01:31:34 or with my mom, who just loves what was done,
01:31:38 is kind of at least the notion
01:31:40 that there was some display of creativity,
01:31:42 some new strategies, new behaviors that were created.
01:31:45 That again has echoes of intelligence.
01:31:48 So is there something that stands out?
01:31:50 Do you see it the same way that there’s creativity
01:31:52 and there’s some behaviors, patterns that you saw
01:31:57 that AlphaZero was able to display that are truly creative?
01:32:01 So let me start by saying that I think we should ask
01:32:06 what creativity really means.
01:32:08 So to me, creativity means discovering something
01:32:13 which wasn’t known before, something unexpected,
01:32:16 something outside of our norms.
01:32:19 And so in that sense, the process of reinforcement learning
01:32:24 or the self play approach that was used by AlphaZero
01:32:29 is the essence of creativity.
01:32:31 It’s really saying at every stage,
01:32:34 you’re playing according to your current norms
01:32:36 and you try something and if it works out,
01:32:39 you say, hey, here’s something great,
01:32:42 I’m gonna start using that.
01:32:44 And then that process, it’s like a micro discovery
01:32:47 that happens millions and millions of times
01:32:49 over the course of the algorithm’s life
01:32:51 where it just discovers some new idea,
01:32:54 oh, this pattern, this pattern’s working really well for me,
01:32:56 I’m gonna start using that.
01:32:58 And now, oh, here’s this other thing I can do,
01:33:00 I can start to connect these stones together in this way
01:33:03 or I can start to sacrifice stones or give up on pieces
01:33:08 or play shoulder hits on the fifth line or whatever it is.
01:33:12 The system’s discovering things like this for itself
01:33:13 continually, repeatedly, all the time.
01:33:16 And so it should come as no surprise to us then
01:33:19 that if you leave these systems going,
01:33:21 they discover things that are not known to humans,
01:33:25 that to the human norms are considered creative.
01:33:30 And we’ve seen this several times.
01:33:32 In fact, in AlphaGo Zero,
01:33:35 we saw this beautiful timeline of discovery
01:33:39 where what we saw was that there are these opening patterns
01:33:44 that humans play called joseki,
01:33:45 these are like the patterns that humans learn
01:33:47 to play in the corners and they’ve been developed
01:33:49 and refined over literally thousands of years
01:33:51 in the game of Go.
01:33:53 And what we saw was in the course of the training,
01:33:57 AlphaGo Zero, over the course of the 40 days
01:34:00 that we trained this system,
01:34:01 it starts to discover exactly these patterns
01:34:05 that human players play.
01:34:06 And over time, we found that all of the joseki
01:34:10 that humans played were discovered by the system
01:34:13 through this process of self play
01:34:15 and this sort of essential notion of creativity.
01:34:19 But what was really interesting was that over time,
01:34:22 it then starts to discard some of these
01:34:24 in favor of its own joseki that humans didn’t know about.
01:34:28 And it starts to say, oh, well,
01:34:29 you thought that the knight’s move pincer joseki
01:34:33 was a great idea,
01:34:35 but here’s something different you can do there
01:34:37 which makes some new variation
01:34:38 that humans didn’t know about.
01:34:40 And actually now the human Go players
01:34:42 study the joseki that AlphaGo played
01:34:44 and they become the new norms
01:34:46 that are used in today’s top level Go competitions.
01:34:51 That never gets old.
01:34:52 Even just the first to me,
01:34:54 maybe just makes me feel good as a human being
01:34:58 that a self play mechanism that knows nothing about us humans
01:35:01 discovers patterns that we humans do.
01:35:04 That’s just like an affirmation
01:35:06 that we’re doing okay as humans.
01:35:08 Yeah.
01:35:09 We’ve, in this domain and other domains,
01:35:12 we figured out it’s like the Churchill quote
01:35:14 about democracy.
01:35:15 It’s the, you know, it sucks,
01:35:18 but it’s the best one we’ve tried.
01:35:20 So in general, taking a step outside of Go
01:35:24 and you have like a million accomplishments
01:35:27 that I have no time to talk about
01:35:29 with AlphaStar and so on and the current work.
01:35:32 But in general, this self play mechanism
01:35:36 that you’ve inspired the world with
01:35:38 by beating the world champion Go player.
01:35:40 Do you see that as,
01:35:43 do you see it being applied in other domains?
01:35:47 Do you have sort of dreams and hopes
01:35:50 that it’s applied in both the simulated environments
01:35:53 and the constrained environments of games?
01:35:56 Constrained, I mean, AlphaStar really demonstrates
01:35:59 that you can remove a lot of the constraints,
01:36:00 but nevertheless, it’s in a digital simulated environment.
01:36:04 Do you have a hope, a dream that it starts being applied
01:36:07 in the robotics environment?
01:36:09 And maybe even in domains that are safety critical
01:36:12 and so on and have, you know,
01:36:15 have a real impact in the real world,
01:36:16 like autonomous vehicles, for example,
01:36:18 which seems like a very far out dream at this point.
01:36:21 So I absolutely do hope and imagine
01:36:25 that we will get to the point where ideas
01:36:27 just like these are used in all kinds of different domains.
01:36:31 In fact, one of the most satisfying things
01:36:32 as a researcher is when you start to see other people
01:36:35 use your algorithms in unexpected ways.
01:36:39 So in the last couple of years, there have been,
01:36:41 you know, a couple of nature papers
01:36:43 where different teams, unbeknownst to us,
01:36:47 took AlphaZero and applied exactly those same algorithms
01:36:51 and ideas to real world problems of huge meaning to society.
01:36:57 So one of them was the problem of chemical synthesis,
01:37:00 and they were able to beat the state of the art
01:37:02 in finding pathways of how to actually synthesize chemicals,
01:37:08 retrosynthesis.
01:37:11 And the second paper actually just came out
01:37:14 a couple of weeks ago in Nature,
01:37:16 showed that in quantum computation,
01:37:19 you know, one of the big questions is how to understand
01:37:22 the nature of the wave function in quantum computation
01:37:27 and a system based on AlphaZero beat the state of the art
01:37:30 by quite some distance there again.
01:37:32 So these are just examples.
01:37:34 And I think, you know, the lesson,
01:37:36 which we’ve seen elsewhere in machine learning
01:37:38 time and time again, is that if you make something general,
01:37:42 it will be used in all kinds of ways.
01:37:44 You know, you provide a really powerful tool to society,
01:37:47 and those tools can be used in amazing ways.
01:37:51 And so I think we’re just at the beginning,
01:37:53 and for sure, I hope that we see all kinds of outcomes.
01:37:58 So the other side of the question of reinforcement
01:38:03 learning framework is, you know,
01:38:05 you usually want to specify a reward function
01:38:07 and an objective function.
01:38:11 What do you think about ideas of intrinsic rewards?
01:38:13 You know, we’re not really sure about the reward.
01:38:19 If we take human beings as an existence proof
01:38:23 that we don’t seem to be operating
01:38:25 according to a single reward,
01:38:27 do you think that there’s interesting ideas
01:38:32 for when you don’t know how to truly specify the reward,
01:38:35 you know, that there’s some flexibility
01:38:38 for discovering it intrinsically or so on
01:38:40 in the context of reinforcement learning?
01:38:42 So I think, you know, when we think about intelligence,
01:38:45 it’s really important to be clear
01:38:46 about the problem of intelligence.
01:38:48 And I think it’s clearest to understand that problem
01:38:51 in terms of some ultimate goal
01:38:52 that we want the system to try and solve for.
01:38:55 And after all, if we don’t understand the ultimate purpose
01:38:57 of the system, do we really even have
01:39:00 a clearly defined problem that we’re solving at all?
01:39:04 Now, within that, as with your example for humans,
01:39:10 the system may choose to create its own motivations
01:39:13 and subgoals that help the system
01:39:16 to achieve its ultimate goal.
01:39:19 And that may indeed be a hugely important mechanism
01:39:22 to achieve those ultimate goals,
01:39:23 but there is still some ultimate goal
01:39:25 I think the system needs to be measured
01:39:27 and evaluated against.
01:39:29 And even for humans, I mean, humans,
01:39:31 we’re incredibly flexible.
01:39:32 We feel that we can, you know, any goal that we’re given,
01:39:35 we feel we can master to some degree.
01:39:40 But if we think of those goals, really, you know,
01:39:41 like the goal of being able to pick up an object
01:39:44 or the goal of being able to communicate
01:39:47 or influence people to do things in a particular way
01:39:50 or whatever those goals are, they’re really subgoals
01:39:56 that we set ourselves.
01:39:58 You know, we choose to pick up the object.
01:40:00 We choose to communicate.
01:40:02 We choose to influence someone else.
01:40:05 And we choose those because we think it will lead us
01:40:07 to something later on.
01:40:10 We think that’s helpful to us to achieve some ultimate goal.
01:40:15 Now, I don’t want to speculate whether or not humans
01:40:18 as a system necessarily have a singular overall goal
01:40:20 of survival or whatever it is.
01:40:23 But I think the principle for understanding
01:40:25 and implementing intelligence has to be
01:40:28 that if we’re trying to understand intelligence
01:40:30 or implement our own,
01:40:31 there has to be a well defined problem.
01:40:33 Otherwise, if it’s not, I think it’s like an admission
01:40:37 of defeat, that for there to be hope for understanding
01:40:41 or implementing intelligence, we have to know what we’re doing.
01:40:44 We have to know what we’re asking the system to do.
01:40:46 Otherwise, if you don’t have a clearly defined purpose,
01:40:48 you’re not going to get a clearly defined answer.
01:40:51 The ridiculous big question that has to naturally follow,
01:40:56 because I have to pin you down on this thing,
01:41:00 that nevertheless, one of the big silly
01:41:03 or big real questions before humans is the meaning of life,
01:41:08 is us trying to figure out our own reward function.
01:41:11 And you just kind of mentioned that if you want to build
01:41:13 intelligent systems and you know what you’re doing,
01:41:16 you should be at least cognizant to some degree
01:41:18 of what the reward function is.
01:41:20 So the natural question is what do you think
01:41:23 is the reward function of human life,
01:41:26 the meaning of life for us humans,
01:41:29 the meaning of our existence?
01:41:32 I think I’d be speculating beyond my own expertise,
01:41:36 but just for fun, let me do that.
01:41:38 Yes, please.
01:41:39 And say, I think that there are many levels
01:41:41 at which you can understand a system
01:41:43 and you can understand something as optimizing
01:41:46 for a goal at many levels.
01:41:48 And so,
01:41:52 let’s start with the universe.
01:41:54 Does the universe have a purpose?
01:41:55 Well, it feels like it’s just at one level
01:41:58 just following certain mechanical laws of physics
01:42:02 and that that’s led to the development of the universe.
01:42:04 But at another level, you can view it as actually,
01:42:08 there’s the second law of thermodynamics that says
01:42:10 that the universe is increasing in entropy over time forever.
01:42:13 And now there’s a view that’s been developed
01:42:15 by certain people at MIT that this,
01:42:17 you can think of this as almost like a goal of the universe,
01:42:20 that the purpose of the universe is to maximize entropy.
01:42:24 So there are multiple levels
01:42:26 at which you can understand a system.
01:42:28 The next level down, you might say,
01:42:30 well, if the goal is to maximize entropy,
01:42:34 well, how can that be done by a particular system?
01:42:40 And maybe evolution is something that the universe
01:42:42 discovered in order to kind of dissipate energy
01:42:45 as efficiently as possible.
01:42:48 And by the way, I’m borrowing some of these metaphors
01:42:49 from Max Tegmark, the physicist.
01:42:53 But if you can think of evolution
01:42:55 as a mechanism for dispersing energy,
01:42:59 then evolution, you might say, becomes a goal,
01:43:04 which is if evolution disperses energy
01:43:06 by reproducing as efficiently as possible,
01:43:09 what’s evolution then?
01:43:10 Well, it’s now got its own goal within that,
01:43:13 which is to actually reproduce as effectively as possible.
01:43:19 And now how does reproduction,
01:43:22 how is that made as effective as possible?
01:43:25 Well, you need entities within that
01:43:27 that can survive and reproduce as effectively as possible.
01:43:29 And so it’s natural that in order to achieve
01:43:31 that high level goal, those individual organisms
01:43:33 discover brains, intelligences,
01:43:37 which enable them to support the goals of evolution.
01:43:43 And those brains, what do they do?
01:43:45 Well, perhaps the early brains,
01:43:47 maybe they were controlling things at some direct level.
01:43:51 Maybe they were the equivalent of preprogrammed systems,
01:43:54 which were directly controlling what was going on
01:43:57 and setting certain things in order
01:43:59 to achieve these particular goals.
01:44:03 But that led to another level of discovery,
01:44:05 which was learning systems.
01:44:07 There are parts of the brain
01:44:08 which are able to learn for themselves
01:44:10 and learn how to program themselves to achieve any goal.
01:44:13 And presumably there are parts of the brain
01:44:16 where goals are set to parts of that system
01:44:20 and provides this very flexible notion of intelligence
01:44:23 that we as humans presumably have,
01:44:25 which is, kind of,
01:44:26 the reason we feel that we can achieve any goal.
01:44:30 So it’s a very long winded answer to say that,
01:44:32 I think there are many perspectives
01:44:34 and many levels at which intelligence can be understood.
01:44:38 And at each of those levels,
01:44:40 you can take multiple perspectives.
01:44:42 You can view the system as something
01:44:43 which is optimizing for a goal,
01:44:45 which is understanding it at a level
01:44:47 by which we can maybe implement it
01:44:49 and understand it as AI researchers or computer scientists,
01:44:53 or you can understand it at the level
01:44:54 of the mechanistic thing which is going on
01:44:56 that there are these atoms bouncing around in the brain
01:44:58 and they lead to the outcome of that system.
01:45:01 And that is not in contradiction with the fact
01:45:02 that it’s also a decision making system
01:45:07 that’s optimizing for some goal and purpose.
01:45:10 I’ve never heard the description of the meaning of life
01:45:14 structured so beautifully in layers,
01:45:16 but you did miss one layer, which is the next step,
01:45:19 which you’re responsible for,
01:45:21 which is creating the artificial intelligence layer
01:45:27 on top of that.
01:45:28 And I can’t wait to see, well, I may not be around,
01:45:31 but I can’t wait to see what the next layer beyond that will be.
01:45:36 Well, let’s just take that argument
01:45:39 and pursue it to its natural conclusion.
01:45:41 So the next level indeed is for how can our learning brain
01:45:46 achieve its goals most effectively?
01:45:49 Well, maybe it does so by us as learning beings
01:45:56 building a system which is able to solve for those goals
01:46:00 more effectively than we can.
01:46:02 And so when we build a system to play the game of Go,
01:46:05 when I said that I wanted to build a system
01:46:06 that can play Go better than I can,
01:46:08 I’ve enabled myself to achieve that goal of playing Go
01:46:12 better than I could by directly playing it
01:46:14 and learning it myself.
01:46:15 And so now a new layer has been created,
01:46:18 which is systems which are able to achieve goals
01:46:21 for themselves.
01:46:22 And ultimately there may be layers beyond that
01:46:25 where they set sub goals to parts of their own system
01:46:28 in order to achieve those and so forth.
01:46:32 So the story of intelligence, I think,
01:46:36 is a multi layered one and a multi perspective one.
01:46:39 We live in an incredible universe.
01:46:41 David, thank you so much, first of all,
01:46:43 for dreaming of using learning to solve Go
01:46:47 and building intelligent systems
01:46:50 and for actually making it happen
01:46:52 and for inspiring millions of people in the process.
01:46:56 It’s truly an honor.
01:46:57 Thank you so much for talking today.
01:46:58 Okay, thank you.
01:46:59 Thanks for listening to this conversation
01:47:01 with David Silver and thank you to our sponsors,
01:47:04 Masterclass and Cash App.
01:47:05 Please consider supporting the podcast
01:47:07 by signing up to Masterclass at masterclass.com slash Lex
01:47:12 and downloading Cash App and using code LexPodcast.
01:47:15 If you enjoy this podcast, subscribe on YouTube,
01:47:18 review it with five stars on Apple Podcast,
01:47:20 support it on Patreon,
01:47:21 or simply connect with me on Twitter at LexFridman.
01:47:25 And now let me leave you with some words from David Silver.
01:47:28 My personal belief is that we’ve seen something
01:47:31 of a turning point where we’re starting to understand
01:47:34 that many abilities like intuition and creativity
01:47:38 that we’ve previously thought were in the domain only
01:47:40 of the human mind are actually accessible
01:47:43 to machine intelligence as well.
01:47:45 And I think that’s a really exciting moment in history.
01:47:48 Thank you for listening and hope to see you next time.