Pieter Abbeel: Deep Reinforcement Learning #10

Transcript

00:00:00 The following is a conversation with Pieter Abbeel.

00:00:03 He’s a professor at UC Berkeley

00:00:04 and the director of the Berkeley Robot Learning Lab.

00:00:07 He’s one of the top researchers in the world

00:00:10 working on how we make robots understand

00:00:13 and interact with the world around them,

00:00:15 especially using imitation and deep reinforcement learning.

00:00:19 This conversation is part of the MIT course

00:00:22 on Artificial General Intelligence

00:00:24 and the Artificial Intelligence podcast.

00:00:26 If you enjoy it, please subscribe on YouTube,

00:00:29 iTunes, or your podcast provider of choice,

00:00:31 or simply connect with me on Twitter at Lex Fridman,

00:00:34 spelled F R I D.

00:00:36 And now, here’s my conversation with Pieter Abbeel.

00:00:41 You’ve mentioned that if there was one person

00:00:44 you could meet, it would be Roger Federer.

00:00:46 So let me ask, when do you think we’ll have a robot

00:00:50 that fully autonomously can beat Roger Federer at tennis,

00:00:54 or a Roger Federer level player at tennis?

00:00:57 Well, first, if you can make it happen for me to meet Roger,

00:01:00 let me know.

00:01:01 In terms of getting a robot to beat him at tennis,

00:01:07 it’s kind of an interesting question

00:01:08 because for a lot of the challenges we think about in AI,

00:01:14 the software is really the missing piece,

00:01:16 but for something like this,

00:01:18 the hardware is nowhere near either.

00:01:22 To really have a robot that can physically run around,

00:01:26 the Boston Dynamics robots are starting to get there,

00:01:28 but still not really human level ability to run around

00:01:33 and then swing a racket.

00:01:36 So you think that’s a hardware problem?

00:01:38 I don’t think it’s a hardware problem only.

00:01:39 I think it’s a hardware and a software problem.

00:01:41 I think it’s both.

00:01:43 And I think they’ll have independent progress.

00:01:45 So I’d say the hardware maybe in 10, 15 years.

00:01:51 On clay, not grass.

00:01:52 I mean, grass is probably harder.

00:01:53 With the sliding?

00:01:54 Yeah.

00:01:55 With the clay, I’m not sure what’s harder, grass or clay.

00:01:58 The clay involves sliding,

00:02:01 which might be harder to master actually, yeah.

00:02:06 But you’re not limited to a bipedal.

00:02:08 I mean, I’m sure there’s no…

00:02:09 Well, if we can build a machine,

00:02:11 it’s a whole different question, of course.

00:02:13 If you can say, okay, this robot can be on wheels,

00:02:16 it can move around on wheels and can be designed differently,

00:02:19 then I think that can be done sooner probably

00:02:23 than a full humanoid type of setup.

00:02:26 What do you think about swinging a racket?

00:02:27 So you’ve worked on basic manipulation.

00:02:31 How hard do you think the task of swinging a racket is,

00:02:34 to be able to hit a nice backhand or a forehand?

00:02:39 Let’s say we just set up stationary,

00:02:42 a nice robot arm, let’s say, a standard industrial arm,

00:02:46 and it can watch the ball come and then swing the racket.

00:02:50 It’s a good question.

00:02:51 I’m not sure it would be super hard to do.

00:02:56 I mean, I’m sure it would require a lot,

00:02:58 if we do it with reinforcement learning,

00:03:00 it would require a lot of trial and error.

00:03:01 It’s not gonna swing it right the first time around,

00:03:03 but yeah, I don’t see why it couldn’t

00:03:07 swing it the right way.

00:03:09 I think it’s learnable.

00:03:10 I think if you set up a ball machine,

00:03:12 let’s say on one side,

00:03:13 and then a robot with a tennis racket on the other side,

00:03:17 I think it’s learnable

00:03:20 and maybe a little bit of pre-training in simulation.

00:03:22 Yeah, I think that’s feasible.

00:03:25 I think the swinging of the racket is feasible.

00:03:27 It’d be very interesting to see how much precision

00:03:28 it can get.

00:03:31 Cause I mean, that’s where, I mean,

00:03:35 some of the human players can hit it on the lines,

00:03:37 which is very high precision.

00:03:39 With spin, the spin is an interesting question,

00:03:42 whether RL can learn to put a spin on the ball.

00:03:45 Well, you got me interested.

00:03:46 Maybe someday we’ll set this up.

00:03:48 Sure, you got me intrigued.

00:03:51 Your answer is basically, okay,

00:03:52 for this problem, it sounds fascinating,

00:03:54 but for the general problem of a tennis player,

00:03:56 we might be a little bit farther away.

00:03:58 What’s the most impressive thing you’ve seen a robot do

00:04:01 in the physical world?

00:04:04 So physically for me,

00:04:06 it’s the Boston Dynamics videos.

00:04:10 They always just hit home, and I’m just super impressed.

00:04:15 Recently, the robot running up the stairs,

00:04:17 doing the parkour type thing.

00:04:19 I mean, yes, we don’t know what’s underneath.

00:04:22 They don’t really write a lot of detail,

00:04:23 but even if it’s hard coded underneath,

00:04:27 which it might or might not be, just the physical abilities

00:04:29 of doing that parkour, that’s very impressive.

00:04:32 So have you met Spot Mini

00:04:34 or any of those robots in person?

00:04:36 I met Spot Mini last year in April at the MARS event

00:04:41 that Jeff Bezos organizes.

00:04:42 They brought it out there

00:04:44 and it was nicely following Jeff around.

00:04:47 When Jeff left the room, they had it follow him along,

00:04:50 which is pretty impressive.

00:04:52 So I think we can say with some confidence

00:04:55 that there’s no learning going on in those robots.

00:04:58 The psychology of it, so while knowing that,

00:05:00 while knowing there’s no learning,

00:05:01 or that if there’s any learning going on, it’s very limited.

00:05:04 I met Spot Mini earlier this year

00:05:06 and knowing everything that’s going on,

00:05:09 having one on one interaction,

00:05:11 so I got to spend some time alone and there’s immediately

00:05:15 a deep connection on the psychological level.

00:05:18 Even though you know the fundamentals, how it works,

00:05:21 there’s something magical.

00:05:23 So do you think about the psychology of interacting

00:05:27 with robots in the physical world?

00:05:29 Even you just showed me the PR2, the robot,

00:05:33 and there was a little bit something like a face,

00:05:36 it had a little bit something like a face.

00:05:38 There’s something that immediately draws you to it.

00:05:40 Do you think about that aspect of the robotics problem?

00:05:45 Well, it’s very hard with BRETT here.

00:05:48 We gave him a name, BRETT, Berkeley Robot

00:05:50 for the Elimination of Tedious Tasks.

00:05:52 It’s very hard to not think of the robot as a person

00:05:56 and it seems like everybody calls him a he

00:05:58 for whatever reason, but that also makes it more of a person

00:06:01 than if it was an it, and it seems pretty natural

00:06:06 to think of it that way.

00:06:07 This past weekend really struck me.

00:06:08 I’ve seen Pepper many times on videos,

00:06:13 but then I was at an event organized by,

00:06:15 this was by Fidelity, and they had scripted Pepper

00:06:18 to help moderate some sessions,

00:06:22 and they had scripted Pepper

00:06:23 to have the personality of a child a little bit,

00:06:26 and it was very hard to not think of it

00:06:28 as its own person in some sense

00:06:31 because it would just jump in the conversation,

00:06:34 making it very interactive.

00:06:35 The moderator would be saying something, Pepper would just jump in,

00:06:37 hold on, how about me?

00:06:40 Can I participate in this too?

00:06:41 And you’re just like, okay, this is like a person,

00:06:43 and that was 100% scripted, and even then it was hard

00:06:46 not to have that sense of somehow there is something there.

00:06:50 So as we have robots interact in this physical world,

00:06:54 is that a signal that could be used

00:06:56 in reinforcement learning?

00:06:57 You’ve worked a little bit in this direction,

00:07:00 but do you think that psychology can be somehow pulled in?

00:07:04 Yes, that’s a question I would say

00:07:07 a lot of people ask, and I think part of why they ask it

00:07:11 is they’re thinking about how unique

00:07:14 are we really still as people?

00:07:16 Like after they see some results,

00:07:18 they see a computer play Go, they see a computer do this,

00:07:21 that, they’re like, okay, but can it really have emotion?

00:07:23 Can it really interact with us in that way?

00:07:26 And then once you’re around robots,

00:07:29 you already start feeling it,

00:07:30 and I think that kind of, maybe,

00:07:33 the way that I think of it is

00:07:34 if you run something like reinforcement learning,

00:07:37 it’s about optimizing some objective,

00:07:39 and there’s no reason that the objective

00:07:45 couldn’t be tied into how much does a person like

00:07:49 interacting with this system,

00:07:50 and why could not the reinforcement learning system

00:07:53 optimize for the robot being fun to be around?

00:07:56 And why wouldn’t it then naturally become

00:07:58 more and more interactive and more and more

00:08:01 maybe like a person or like a pet?

00:08:03 I don’t know what it would exactly be,

00:08:04 but more and more have those features

00:08:06 and acquire them automatically.

00:08:08 As long as you can formalize an objective

00:08:10 of what it means to like something,

00:08:13 what, how you exhibit, what’s the ground truth?

00:08:16 How do you get the reward from human?

00:08:19 Because you have to somehow collect

00:08:20 that information from you, the human.

00:08:22 But you’re saying if you can formulate it as an objective,

00:08:26 it can be learned.

00:08:27 There’s no reason it couldn’t emerge through learning,

00:08:29 and maybe one way to formulate it as an objective,

00:08:31 you wouldn’t have to necessarily score it explicitly,

00:08:33 so standard rewards are numbers,

00:08:36 and numbers are hard to come by.

00:08:38 This is a 1.5 or a 1.7 on some scale.

00:08:41 It’s very hard to do for a person,

00:08:43 but much easier is for a person to say,

00:08:45 okay, what you did the last five minutes

00:08:47 was much nicer than what you did the previous five minutes,

00:08:51 and that now gives a comparison.

00:08:53 And in fact, there have been some results on that.

00:08:55 For example, Paul Christiano and collaborators at OpenAI

00:08:57 had the Hopper, the MuJoCo Hopper, a one legged robot,

00:09:02 doing backflips purely from feedback.

00:09:05 I like this better than that.

00:09:06 That’s kind of equally good,

00:09:08 and after a bunch of interactions,

00:09:10 it figured out what the person was asking for,

00:09:13 namely a backflip.

00:09:14 And so I think the same thing.

00:09:15 Oh, it wasn’t trying to do a backflip.

00:09:18 It was just getting a comparison score

00:09:20 from the person based on?

00:09:23 The person had in mind, in their own mind,

00:09:26 I want it to do a backflip,

00:09:27 but the robot didn’t know what it was supposed to be doing.

00:09:30 It just knew that sometimes the person said,

00:09:32 this is better, this is worse,

00:09:34 and then the robot figured out

00:09:36 what the person was actually after was a backflip.

00:09:38 And I’d imagine the same would be true

00:09:40 for things like more interactive robots,

00:09:43 that the robot would figure out over time,

00:09:45 oh, this kind of thing apparently is appreciated more

00:09:48 than this other kind of thing.
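
As a rough sketch of the comparison-based reward learning described above, in the spirit of the Christiano et al. result Pieter mentions rather than their actual implementation, one might fit a reward model to pairwise preferences as below. The linear reward model, the random toy trajectories, and the simulated human labeler are all assumptions made purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def traj_return(w, traj):
    # Linear reward model: total reward is the sum over steps of w . features.
    return float(np.sum(traj @ w))

def prob_prefers_a(w, traj_a, traj_b):
    # Bradley-Terry style probability that the human prefers trajectory A.
    return 1.0 / (1.0 + np.exp(traj_return(w, traj_b) - traj_return(w, traj_a)))

true_w = np.array([1.0, 0.0, 0.0])   # hidden preference the simulated "human" has in mind
w = np.zeros(3)                      # learned reward weights
lr = 0.05

for step in range(2000):
    traj_a = rng.normal(size=(10, 3))            # two random 10-step trajectories,
    traj_b = rng.normal(size=(10, 3))            # 3 features per step
    label = float(traj_return(true_w, traj_a) > traj_return(true_w, traj_b))
    # Gradient ascent on the log-likelihood of the observed comparison.
    p_a = prob_prefers_a(w, traj_a, traj_b)
    w += lr * (label - p_a) * (traj_a.sum(axis=0) - traj_b.sum(axis=0))

# Roughly recovers the direction of the hidden preference from comparisons alone.
print("learned reward direction:", np.round(w / np.linalg.norm(w), 2))
```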

00:09:50 So when I first picked up Sutton’s,

00:09:54 Richard Sutton’s reinforcement learning book,

00:09:56 before sort of this deep learning,

00:10:01 before the reemergence of neural networks

00:10:03 as a powerful mechanism for machine learning,

00:10:05 RL seemed to me like magic.

00:10:08 It was beautiful.

00:10:10 So that seemed like what intelligence is,

00:10:13 RL reinforcement learning.

00:10:15 So how do you think we can possibly learn anything

00:10:20 about the world when the reward for the actions

00:10:22 is delayed, is so sparse?

00:10:25 Like where is, why do you think RL works?

00:10:30 Why do you think you can learn anything

00:10:32 under such sparse rewards,

00:10:35 whether it’s regular reinforcement learning

00:10:36 or deep reinforcement learning?

00:10:38 What’s your intuition?

00:10:40 The counterpart of that is why is RL,

00:10:44 why does it need so many samples,

00:10:47 so many experiences to learn from?

00:10:49 Because really what’s happening is

00:10:50 when you have a sparse reward,

00:10:53 you do something maybe for like, I don’t know,

00:10:55 you take 100 actions and then you get a reward.

00:10:57 And maybe you get like a score of three.

00:10:59 And I’m like okay, three, not sure what that means.

00:11:03 You go again and now you get two.

00:11:05 And now you know that that sequence of 100 actions

00:11:07 that you did the second time around

00:11:08 somehow was worse than the sequence of 100 actions

00:11:10 you did the first time around.

00:11:11 But it’s tough to now know which of those actions

00:11:14 were better or worse.

00:11:15 Some might have been good and some bad in either one.

00:11:17 And so that’s why it needs so many experiences.

00:11:19 But once you have enough experiences,

00:11:21 effectively RL is teasing that apart.

00:11:23 It’s trying to say okay, what is consistently there

00:11:26 when you get a higher reward

00:11:27 and what’s consistently there when you get a lower reward?

00:11:30 And then kind of the magic of, let’s say,

00:11:32 the policy gradient update is to say

00:11:34 now let’s update the neural network

00:11:37 to make the actions that were kind of present

00:11:39 when things are good more likely

00:11:41 and make the actions that are present

00:11:43 when things are not as good less likely.
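
A minimal sketch of the policy gradient idea just described: actions present in better-than-average episodes get made more likely, the rest less likely. The toy problem here (a delayed score after 100 actions that secretly counts one particular action) and the tabular softmax policy are invented stand-ins for a real environment and a neural network.

```python
import numpy as np

rng = np.random.default_rng(0)
n_actions = 4
logits = np.zeros(n_actions)   # stand-in for the policy "network"
lr = 0.1

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def run_episode(probs, horizon=100):
    actions = rng.choice(n_actions, size=horizon, p=probs)
    # Sparse, delayed reward: a single score after 100 actions,
    # which secretly depends on how often action 2 was chosen.
    return actions, np.mean(actions == 2)

baseline = 0.0
for episode in range(500):
    probs = softmax(logits)
    actions, reward = run_episode(probs)
    baseline = 0.9 * baseline + 0.1 * reward       # running-average baseline
    advantage = reward - baseline
    grad = np.zeros(n_actions)
    for a in actions:                              # d log pi(a) / d logits = e_a - probs
        grad += -probs
        grad[a] += 1.0
    # Make actions from better-than-average episodes more likely, and vice versa.
    logits += lr * advantage * grad / len(actions)

print("final action probabilities:", np.round(softmax(logits), 2))
```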

00:11:45 So that is the counterpoint,

00:11:47 but it seems like you would need to run it

00:11:49 a lot more than you do.

00:11:50 Right now people could say

00:11:52 that RL is very inefficient,

00:11:54 but it seems to be way more efficient

00:11:56 than one would imagine on paper.

00:11:58 That the simple updates to the policy,

00:12:02 the policy gradient, that somehow you can learn,

00:12:04 exactly as you just said, what are the common actions

00:12:07 that seem to produce some good results?

00:12:09 That that somehow can learn anything.

00:12:12 It seems counterintuitive at least.

00:12:15 Is there some intuition behind it?

00:12:16 Yeah, so I think there’s a few ways to think about this.

00:12:21 The way I tend to think about it mostly originally,

00:12:26 so when we started working on deep reinforcement learning

00:12:29 here at Berkeley, which was maybe 2011, 12, 13,

00:12:32 around that time, John Schulman was a PhD student

00:12:36 initially kind of driving it forward here.

00:12:39 And the way we thought about it at the time was

00:12:44 if you think about rectified linear units

00:12:47 or kind of rectifier type neural networks,

00:12:50 what do you get?

00:12:51 You get something that’s piecewise linear feedback control.

00:12:55 And if you look at the literature,

00:12:57 linear feedback control is extremely successful,

00:12:59 can solve many, many problems surprisingly well.

00:13:03 I remember, for example, when we did helicopter flight,

00:13:05 if you’re in a stationary flight regime,

00:13:07 not a non stationary, but a stationary flight regime

00:13:10 like hover, you can use linear feedback control

00:13:12 to stabilize a helicopter, very complex dynamical system,

00:13:15 but the controller is relatively simple.

00:13:18 And so I think that’s a big part of it is that

00:13:20 if you do feedback control, even though the system

00:13:23 you control can be very, very complex,

00:13:25 often relatively simple control architectures

00:13:28 can already do a lot.

00:13:30 But then also just linear is not good enough.

00:13:32 And so one way you can think of these neural networks

00:13:35 is that sometimes they tile the space,

00:13:37 which people were already trying to do more by hand

00:13:39 or with finite state machines,

00:13:41 say this linear controller here,

00:13:42 this linear controller here.

00:13:43 Neural network learns to tile the space

00:13:45 and say linear controller here,

00:13:46 another linear controller here,

00:13:48 but it’s more subtle than that.

00:13:50 And so it’s benefiting from this linear control aspect,

00:13:52 it’s benefiting from the tiling,

00:13:53 but it’s somehow tiling it one dimension at a time.

00:13:57 Because if let’s say you have a two layer network,

00:13:59 if in that hidden layer, you make a transition

00:14:03 from active to inactive or the other way around,

00:14:06 that is essentially one axis, but not axis aligned,

00:14:09 but one direction that you change.

00:14:12 And so you have this kind of very gradual tiling

00:14:14 of the space where you have a lot of sharing

00:14:16 between the linear controllers that tile the space.

00:14:19 And that was always my intuition as to why

00:14:21 to expect that this might work pretty well.

00:14:24 It’s essentially leveraging the fact

00:14:26 that linear feedback control is so good,

00:14:28 but of course not enough.

00:14:29 And this is a gradual tiling of the space

00:14:31 with linear feedback controls

00:14:33 that share a lot of expertise across them.
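
A tiny numerical sketch of the tiling intuition above: within each ReLU activation pattern, a two-layer network policy is exactly a linear feedback controller u = Kx + k, and the gains change as you cross into a neighboring region. The network sizes and random weights below are arbitrary, chosen only to show the algebra.

```python
import numpy as np

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(8, 2)), rng.normal(size=8)   # hidden layer
W2, b2 = rng.normal(size=(1, 8)), rng.normal(size=1)   # output layer (1-D control)

def policy(x):
    h = np.maximum(W1 @ x + b1, 0.0)                   # ReLU hidden units
    return W2 @ h + b2

def local_linear_controller(x):
    """Within one activation region, the net reduces to u = K x + k."""
    active = (W1 @ x + b1 > 0).astype(float)            # which units are "on" here
    K = W2 @ (np.diag(active) @ W1)
    k = W2 @ (active * b1) + b2
    return K, k, active

for x in [np.array([0.5, -1.0]), np.array([-2.0, 1.5])]:
    K, k, active = local_linear_controller(x)
    assert np.allclose(policy(x), K @ x + k)            # exact match inside the region
    print("state", x, "active units", active.astype(int), "local gain K", np.round(K, 2))
```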

00:14:36 So that’s really nice intuition,

00:14:39 but do you think that scales to the more

00:14:41 and more general problems, when you start going up

00:14:44 in the number of dimensions, when you start

00:14:49 going down in terms of how often

00:14:52 you get a clean reward signal?

00:14:55 Does that intuition carry forward to those crazier,

00:14:58 weirder worlds that we think of as the real world?

00:15:03 So I think where things get really tricky

00:15:08 in the real world compared to the things

00:15:09 we’ve looked at so far with great success

00:15:11 in reinforcement learning is the time scales,

00:15:17 which takes us to an extreme.

00:15:18 So when you think about the real world,

00:15:21 I mean, I don’t know, maybe some student

00:15:24 decided to do a PhD here, right?

00:15:26 Okay, that’s a decision.

00:15:28 That’s a very high level decision.

00:15:30 But if you think about their lives,

00:15:32 I mean, any person’s life,

00:15:34 it’s a sequence of muscle fiber contractions

00:15:37 and relaxations, and that’s how you interact with the world.

00:15:40 And that’s a very high frequency control thing,

00:15:42 but it’s ultimately what you do

00:15:44 and how you affect the world,

00:15:46 until I guess we have brain readings

00:15:48 and you can maybe do it slightly differently.

00:15:49 But typically that’s how you affect the world.

00:15:52 And the decision of doing a PhD is so abstract

00:15:56 relative to what you’re actually doing in the world.

00:15:59 And I think that’s where credit assignment

00:16:01 becomes just completely beyond

00:16:04 what any current RL algorithm can do.

00:16:06 And we need hierarchical reasoning

00:16:09 at a level that is just not available at all yet.

00:16:12 Where do you think we can pick up hierarchical reasoning?

00:16:14 By which mechanisms?

00:16:16 Yeah, so maybe let me highlight

00:16:18 what I think the limitations are

00:16:20 of what already was done 20, 30 years ago.

00:16:26 In fact, you’ll find reasoning systems

00:16:27 that reason over relatively long horizons,

00:16:30 but the problem is that they were not grounded

00:16:32 in the real world.

00:16:34 So people would have to hand design

00:16:39 some kind of logical, dynamical descriptions of the world

00:16:43 and that didn’t tie into perception.

00:16:46 And so it didn’t tie into real objects and so forth.

00:16:49 And so that was a big gap.

00:16:51 Now with deep learning, we start having the ability

00:16:53 to really see with sensors, process that

00:16:59 and understand what’s in the world.

00:17:01 And so it’s a good time to try

00:17:02 to bring these things together.

00:17:04 I see a few ways of getting there.

00:17:06 One way to get there would be to say

00:17:08 deep learning can get bolted on somehow

00:17:10 to some of these more traditional approaches.

00:17:12 Now bolted on would probably mean

00:17:14 you need to do some kind of end to end training

00:17:16 where you say my deep learning processing

00:17:18 somehow leads to a representation

00:17:20 that in turn uses some kind of traditional

00:17:24 underlying dynamical systems that can be used for planning.

00:17:29 And that’s, for example, the direction Aviv Tamar

00:17:32 and Thanard Kurutach here have been pushing

00:17:34 with Causal InfoGAN, and of course other people too.

00:17:36 That’s one way.

00:17:38 Can we somehow force it into the form factor

00:17:41 that is amenable to reasoning?

00:17:43 Another direction we’ve been thinking about

00:17:46 for a long time and didn’t make any progress on

00:17:50 was more information theoretic approaches.

00:17:53 So the idea there was that what it means

00:17:56 to take a high level action is to

00:17:59 choose a latent variable now

00:18:02 that tells you a lot about what’s gonna be the case

00:18:04 in the future.

00:18:05 Because that’s what it means to take a high level action.

00:18:09 I say okay, I decide I’m gonna navigate

00:18:13 to the gas station because I need to get gas for my car.

00:18:15 Well, that’ll now take five minutes to get there.

00:18:17 But the fact that I get there,

00:18:19 I could already tell that from the high level action

00:18:22 I took much earlier.

00:18:24 That we had a very hard time getting success with.

00:18:28 Not saying it’s a dead end necessarily,

00:18:30 but we had a lot of trouble getting that to work.

00:18:33 And then we started revisiting the notion

00:18:34 of what are we really trying to achieve?

00:18:37 What we’re trying to achieve is not necessarily hierarchy

00:18:40 per se, but you could think about

00:18:41 what does hierarchy give us?

00:18:44 What we hope it would give us is better credit assignment.

00:18:49 What is better credit assignment?

00:18:51 It’s giving us, it gives us faster learning, right?

00:18:55 And so faster learning is ultimately maybe what we’re after.

00:18:59 And so that’s where we ended up with the RL squared paper

00:19:03 on learning to reinforcement learn,

00:19:06 which at the time Rocky Duan led.

00:19:08 And that’s exactly the meta learning approach

00:19:11 where you say, okay, we don’t know how to design hierarchy.

00:19:14 We know what we want to get from it.

00:19:15 Let’s just end to end optimize for what we want to get

00:19:18 from it and see if it might emerge.

00:19:20 And we saw things emerge.

00:19:21 The maze navigation had consistent motion down hallways,

00:19:26 which is what you want.

00:19:27 A hierarchical controller should say,

00:19:28 I want to go down this hallway.

00:19:29 And then when there is an option to take a turn,

00:19:31 I can decide whether to take a turn or not and repeat.

00:19:33 It even had the notion of whether you have been somewhere before or not,

00:19:37 to not revisit places you’ve been before.

00:19:39 It still didn’t scale yet

00:19:42 to the real world kind of scenarios I think you had in mind,

00:19:46 but it was some sign of life

00:19:47 that maybe you can meta learn these hierarchical concepts.

00:19:51 I mean, it seems like through these meta learning concepts,

00:19:56 you get at what I think is one of the hardest

00:19:59 and most important problems of AI,

00:20:02 which is transfer learning.

00:20:04 So it’s generalization.

00:20:06 How far along this journey

00:20:08 towards building general systems are we?

00:20:11 Being able to do transfer learning well.

00:20:13 So there’s some signs that you can generalize a little bit,

00:20:17 but do you think we’re on the right path

00:20:19 or are totally different breakthroughs needed

00:20:23 to be able to transfer knowledge

00:20:26 between different learned models?

00:20:31 Yeah, I’m pretty torn on this in that

00:20:33 I think there are some very impressive.

00:20:35 Well, there’s just some very impressive results already.

00:20:40 I mean, I would say when,

00:20:44 even with the initial kind of big breakthrough in 2012

00:20:47 with AlexNet, the initial thing is okay, great.

00:20:52 This does better on ImageNet, hence image recognition.

00:20:55 But then immediately thereafter,

00:20:57 there was of course the notion that,

00:21:00 wow, with what was learned on ImageNet,

00:21:03 if you now wanna solve a new task,

00:21:05 you can fine tune AlexNet for new tasks.

00:21:09 And that was often found to be the even bigger deal

00:21:12 that you learn something that was reusable,

00:21:14 which was not often the case before.

00:21:16 Usually machine learning, you learn something

00:21:17 for one scenario and that was it.

00:21:19 And that’s really exciting.

00:21:20 I mean, that’s a huge application.

00:21:22 That’s probably the biggest success

00:21:23 of transfer learning today in terms of scope and impact.

00:21:27 That was a huge breakthrough.
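
A loose sketch of the ImageNet-pretrain-then-fine-tune recipe being described, using torchvision's AlexNet as the example backbone (the weights argument assumes a recent torchvision); the 5-class "new task" and the single fake training step are invented for illustration only.

```python
import torch
import torch.nn as nn
from torchvision import models

# Start from the ImageNet-pretrained AlexNet weights.
model = models.alexnet(weights="DEFAULT")

# Freeze the pretrained convolutional feature extractor...
for p in model.features.parameters():
    p.requires_grad = False

# ...and swap the last layer for the new 5-class task.
model.classifier[6] = nn.Linear(model.classifier[6].in_features, 5)

optimizer = torch.optim.Adam(
    [p for p in model.parameters() if p.requires_grad], lr=1e-4)
loss_fn = nn.CrossEntropyLoss()

# One fake training step, just to show the shape of the fine-tuning loop.
images = torch.randn(8, 3, 224, 224)       # stand-in for the new task's data
labels = torch.randint(0, 5, (8,))
loss = loss_fn(model(images), labels)
optimizer.zero_grad()
loss.backward()
optimizer.step()
print("one fine-tuning step done, loss =", float(loss))
```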

00:21:29 And then recently, I feel like similar kind of,

00:21:33 by scaling things up, it seems like

00:21:34 this has been expanded upon.

00:21:36 Like people training even bigger networks,

00:21:37 they might transfer even better.

00:21:39 If you looked at, for example,

00:21:41 some of the OpenAI results on language models

00:21:43 and some of the recent Google results on language models,

00:21:47 they’re learned for just prediction

00:21:51 and then they get reused for other tasks.

00:21:54 And so I think there is something there

00:21:56 where somehow if you train a big enough model

00:21:58 on enough things, it seems to transfer.

00:22:01 Some DeepMind results that I thought were very impressive,

00:22:03 the UNREAL results, where it learned to navigate mazes

00:22:09 in ways where it wasn’t just doing reinforcement learning,

00:22:11 but it had other objectives it was optimizing for.

00:22:14 So I think there’s a lot of interesting results already.

00:22:17 I think maybe where it’s hard to wrap my head around it is,

00:22:22 to which extent, or when, do we call something generalization?

00:22:26 Or the levels of generalization in the real world,

00:22:29 or the levels of generalization involved

00:22:31 in these different tasks, right?

00:22:36 You draw this, by the way, just to frame things.

00:22:39 I’ve heard you say somewhere, it’s the difference

00:22:41 between learning to master versus learning to generalize,

00:22:44 that it’s a nice line to think about.

00:22:47 And I guess you’re saying that it’s a gray area

00:22:50 between learning to master and learning to generalize,

00:22:53 where one ends and the other starts.

00:22:54 I think I might have heard this.

00:22:56 I might have heard it somewhere else.

00:22:57 And I think it might’ve been one of your interviews,

00:23:00 maybe the one with Yoshua Bengio, I’m not 100% sure.

00:23:03 But I liked the example, I’m not sure who it was,

00:23:08 but the example was essentially,

00:23:10 if you use current deep learning techniques,

00:23:13 what we’re doing to predict, let’s say,

00:23:17 the relative motion of our planets, it would do pretty well.

00:23:22 But then now if a massive new mass enters our solar system,

00:23:28 it would probably not predict what will happen, right?

00:23:32 And that’s a different kind of generalization.

00:23:33 That’s a generalization that relies

00:23:34 on the ultimate simplest, simplest explanation

00:23:38 that we have available today

00:23:40 to explain the motion of planets,

00:23:41 whereas just pattern recognition could predict

00:23:43 our current solar system motion pretty well, no problem.

00:23:47 And so I think that’s an example

00:23:48 of a kind of generalization that is a little different

00:23:52 from what we’ve achieved so far.

00:23:54 And it’s not clear if just regularizing more

00:23:59 and forcing it to come up with a simpler, simpler,

00:24:01 simpler explanation would get us that kind of generalization.

00:24:03 But that’s what physics researchers do, right?

00:24:05 They say, can I make this even simpler?

00:24:08 How simple can I get this?

00:24:09 What’s the simplest equation that can explain everything?

00:24:12 The master equation for the entire dynamics of the universe,

00:24:15 we haven’t really pushed that direction as hard

00:24:17 in deep learning, I would say.

00:24:20 Not sure if it should be pushed,

00:24:22 but it seems a kind of generalization you get from that

00:24:24 that you don’t get in our current methods so far.

00:24:27 So I just talked to Vladimir Vapnik, for example,

00:24:30 who’s one of the founders of statistical learning theory,

00:24:34 and he kind of dreams of creating

00:24:37 the E equals MC squared for learning, right?

00:24:41 The general theory of learning.

00:24:42 Do you think that’s a fruitless pursuit

00:24:44 in the near term, within the next several decades?

00:24:51 I think that’s a really interesting pursuit

00:24:53 in the following sense, in that there is a lot of evidence

00:24:58 that the brain is pretty modular.

00:25:03 And so I wouldn’t maybe think of it as the theory,

00:25:05 maybe the underlying theory, but more kind of the principle

00:25:10 where there have been findings where

00:25:12 people who are blind will use the part of the brain

00:25:16 usually used for vision for other functions.

00:25:21 And even after some kind of,

00:25:24 if people get rewired in some way,

00:25:26 they might be able to reuse parts of their brain

00:25:28 for other functions.

00:25:30 And so what that suggests is some kind of modularity.

00:25:35 And I think it is a pretty natural thing to strive for

00:25:39 to see, can we find that modularity?

00:25:41 Can we find this thing?

00:25:43 Of course, every part of the brain is not exactly the same.

00:25:45 Not everything can be rewired arbitrarily.

00:25:48 But if you think of things like the neocortex,

00:25:50 which is a pretty big part of the brain,

00:25:52 that seems fairly modular from the findings so far.

00:25:56 Can you design something equally modular?

00:25:59 And if you can just grow it,

00:26:00 it becomes more capable probably.

00:26:02 I think that would be the kind of interesting

00:26:04 underlying principle to shoot for that is not unrealistic.

00:26:09 Do you think you prefer math or empirical trial and error

00:26:15 for the discovery of the essence of what it means

00:26:17 to do something intelligent?

00:26:19 So reinforcement learning embodies both groups, right?

00:26:22 To prove that something converges, prove the bounds.

00:26:26 And then at the same time, a lot of those successes are,

00:26:29 well, let’s try this and see if it works.

00:26:31 So which do you gravitate towards?

00:26:33 How do you think of those two parts of your brain?

00:26:39 Maybe I would prefer we could make the progress

00:26:44 with mathematics.

00:26:45 And the reason maybe I would prefer that is because often

00:26:48 if you have something you can mathematically formalize,

00:26:52 you can leapfrog a lot of experimentation.

00:26:55 And experimentation takes a long time to get through.

00:26:58 And a lot of trial and error,

00:27:01 it’s kind of reinforcement learning, your research process,

00:27:04 where you need to do a lot of trial and error

00:27:05 before you get to a success.

00:27:06 So if you can leapfrog that, to my mind,

00:27:08 that’s what the math is about.

00:27:10 And hopefully once you do a bunch of experiments,

00:27:13 you start seeing a pattern.

00:27:14 You can do some derivations that leapfrog some experiments.

00:27:18 But I agree with you.

00:27:19 I mean, in practice, a lot of the progress has been such

00:27:21 that we have not been able to find the math

00:27:23 that allows you to leapfrog ahead.

00:27:25 And we are kind of making gradual progress

00:27:28 one step at a time, a new experiment here,

00:27:30 a new experiment there that gives us new insights

00:27:32 and gradually building up,

00:27:34 but not getting to something yet where we’re just,

00:27:36 okay, here’s an equation that now explains what,

00:27:39 you know, would otherwise

00:27:40 have been two years of experimentation to get there,

00:27:42 but this tells us what the result’s going to be.

00:27:45 Unfortunately, not so much yet.

00:27:47 Not so much yet, but your hope is there.

00:27:50 In trying to teach robots or systems

00:27:53 to do everyday tasks or even in simulation,

00:27:58 what do you think you’re more excited about?

00:28:02 Imitation learning or self play?

00:28:04 So letting robots learn from humans

00:28:08 or letting robots play on their own

00:28:11 to try to figure it out in their own way

00:28:13 and eventually play, eventually interact with humans

00:28:18 or solve whatever the problem is.

00:28:20 What’s more exciting to you?

00:28:21 What’s more promising you think as a research direction?

00:28:24 So when we look at self play,

00:28:32 what’s so beautiful about it is it goes back

00:28:34 to kind of the challenges in reinforcement learning.

00:28:37 So the challenge of reinforcement learning

00:28:38 is getting signal.

00:28:40 And if you never succeed, you don’t get any signal.

00:28:43 In self play, you’re on both sides.

00:28:46 So one of you succeeds.

00:28:48 And the beauty is also one of you fails.

00:28:49 And so you see the contrast.

00:28:51 You see the one version of me that did better

00:28:53 than the other version.

00:28:54 So every time you play yourself, you get signal.

00:28:57 And so whenever you can turn something into self play,

00:29:00 you’re in a beautiful situation

00:29:02 where you can naturally learn much more quickly

00:29:04 than in most other reinforcement learning environments.

00:29:07 So I think if somehow we can turn more

00:29:12 reinforcement learning problems

00:29:13 into self play formulations,

00:29:15 that would go really, really far.
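
A toy sketch of the self-play signal just described: one shared policy plays both sides, so every game produces a winner and a loser and therefore a learning signal. The game here (each side picks a number, the higher number wins) is deliberately trivial so the dynamics are easy to see; it is not meant to resemble Go or any real self-play system.

```python
import numpy as np

rng = np.random.default_rng(0)
n_actions = 10
logits = np.zeros(n_actions)      # one shared policy for both players
lr = 0.05

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

for game in range(3000):
    probs = softmax(logits)
    a = rng.choice(n_actions, p=probs)     # "me"
    b = rng.choice(n_actions, p=probs)     # the other copy of me
    if a == b:
        continue                            # a draw carries no signal in this toy game
    winner, loser = (a, b) if a > b else (b, a)
    # Reinforce the winning copy's action, discourage the losing copy's action.
    for action, sign in [(winner, +1.0), (loser, -1.0)]:
        grad = -probs                       # d log pi(action) / d logits = e_action - probs
        grad[action] += 1.0
        logits += lr * sign * grad

# The probability mass shifts toward the highest numbers: both sides kept teaching each other.
print("policy after self-play:", np.round(softmax(logits), 2))
```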

00:29:17 So far, self play has been largely around games

00:29:20 where there are natural opponents.

00:29:22 But if we could do self play for other things,

00:29:24 and let’s say, I don’t know,

00:29:25 a robot learns to build a house.

00:29:26 I mean, that’s a pretty advanced thing

00:29:28 to try to do for a robot,

00:29:29 but maybe it tries to build a hut or something.

00:29:31 If that can be done through self play,

00:29:34 it would learn a lot more quickly

00:29:35 if somebody can figure that out.

00:29:36 And I think that would be something

00:29:37 where it goes closer to kind of the mathematical leapfrogging

00:29:41 where somebody figures out a formalism to say,

00:29:43 okay, any RL problem, by applying this and this idea,

00:29:47 you can turn it into a self play problem

00:29:48 where you get signal a lot more easily.

00:29:50 Reality is, many problems we don’t know

00:29:52 how to turn into self play.

00:29:53 And so either we need to provide detailed reward.

00:29:56 That doesn’t just give reward for achieving a goal,

00:29:58 but rewards for making progress,

00:30:00 and that becomes time consuming.

00:30:02 And once you’re starting to do that,

00:30:03 let’s say you want a robot to do something,

00:30:05 you need to give all this detailed reward.

00:30:07 Well, why not just give a demonstration?

00:30:09 Because why not just show the robot?

00:30:11 And now the question is, how do you show the robot?

00:30:14 One way to show is to teleoperate the robot,

00:30:16 and then the robot really experiences things.

00:30:19 And that’s nice, because that’s really high signal

00:30:21 to noise ratio data, and we’ve done a lot of that.

00:30:23 And you can teach your robot skills, in just 10 minutes

00:30:26 you can teach your robot a new basic skill,

00:30:27 like okay, pick up the bottle, place it somewhere else.

00:30:30 That’s a skill, no matter where the bottle starts,

00:30:32 maybe it always goes onto a target or something.

00:30:34 That’s fairly easy to teach your robot with teleop.
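
A rough sketch of the learn-from-teleoperation recipe described here: treat the teleop demonstrations as (observation, action) pairs and fit the policy with supervised learning (behavioral cloning). The fake pick-and-place "demonstrations" and the linear least-squares policy below are stand-ins for real data and a neural network.

```python
import numpy as np

rng = np.random.default_rng(0)

# Pretend these came from ~10 minutes of teleoperating a pick-and-place:
# observations are (bottle_x, bottle_y, gripper_x, gripper_y),
# demonstrated actions move the gripper toward the bottle.
obs = rng.uniform(-1, 1, size=(500, 4))
actions = 0.5 * (obs[:, :2] - obs[:, 2:])           # "expert" velocity commands

# Behavioral cloning reduces to ordinary least squares for this linear policy.
X = np.hstack([obs, np.ones((len(obs), 1))])        # add a bias column
W, *_ = np.linalg.lstsq(X, actions, rcond=None)

# The cloned policy now handles bottle positions it hasn't seen,
# as long as they look like the demonstration distribution.
new_obs = np.array([0.3, -0.2, -0.5, 0.1, 1.0])
print("commanded gripper velocity:", np.round(new_obs @ W, 3))
```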

00:30:38 Now, what’s even more interesting is

00:30:40 if you can now teach your robot

00:30:41 through third person learning,

00:30:43 where the robot watches you do something

00:30:45 and doesn’t experience it, but just kind of watches you.

00:30:48 It doesn’t experience it, but just watches it

00:30:49 and says, okay, well, if you’re showing me that,

00:30:52 that means I should be doing this.

00:30:53 And I’m not gonna be using your hand,

00:30:55 because I don’t get to control your hand,

00:30:57 but I’m gonna use my hand, I do that mapping.

00:30:59 And so that’s where I think one of the big breakthroughs

00:31:02 has happened this year.

00:31:03 This was led by Chelsea Finn here.

00:31:06 It’s almost like learning a machine translation

00:31:08 for demonstrations, where you have a human demonstration,

00:31:11 and the robot learns to translate it

00:31:12 into what it means for the robot to do it.

00:31:15 And that was a meta learning formulation,

00:31:17 learn from one to get the other.

00:31:20 And that, I think, opens up a lot of opportunities

00:31:23 to learn a lot more quickly.

00:31:24 So my focus is on autonomous vehicles.

00:31:26 Do you think this approach of third person watching,

00:31:29 is autonomous driving amenable

00:31:31 to this kind of approach?

00:31:33 So for autonomous driving,

00:31:36 I would say third person is slightly easier.

00:31:41 And the reason I’m gonna say it’s slightly easier

00:31:43 to do with third person is because

00:31:46 the car dynamics are very well understood.

00:31:49 So the…

00:31:51 Easier than first person, you mean?

00:31:53 Or easier than…

00:31:55 So I think the distinction between third person

00:31:57 and first person is not a very important distinction

00:32:00 for autonomous driving.

00:32:01 They’re very similar.

00:32:03 Because the distinction is really about

00:32:06 who turns the steering wheel.

00:32:09 Or maybe, let me put it differently.

00:32:12 How to get from a point where you are now

00:32:14 to a point, let’s say, a couple meters in front of you.

00:32:17 And that’s a problem that’s very well understood.

00:32:19 And that’s the only distinction

00:32:20 between third and first person there.

00:32:21 Whereas with the robot manipulation,

00:32:23 interaction forces are very complex.

00:32:25 And it’s still a very different thing.

00:32:27 For autonomous driving,

00:32:29 I think there is still the question,

00:32:31 imitation versus RL.

00:32:34 So imitation gives you a lot more signal.

00:32:36 I think where imitation is lacking

00:32:38 and needs some extra machinery is,

00:32:42 it doesn’t, in its normal format,

00:32:45 doesn’t think about goals or objectives.

00:32:48 And of course, there are versions of imitation learning

00:32:51 inverse reinforcement learning type imitation learning,

00:32:52 which also think about goals.

00:32:54 I think then we’re getting much closer.

00:32:57 But I think it’s very hard to think of a

00:32:59 fully reactive car generalizing well,

00:33:04 if it really doesn’t have a notion of objectives,

00:33:05 to generalize well to the kind of generality

00:33:08 that you would want.

00:33:09 You’d want more than just that reactivity

00:33:12 that you get from just behavioral cloning

00:33:13 slash supervised learning.

00:33:17 So a lot of the work,

00:33:19 whether it’s self play or even imitation learning,

00:33:22 would benefit significantly from simulation,

00:33:24 from effective simulation.

00:33:26 And you’re doing a lot of stuff

00:33:27 in the physical world and in simulation.

00:33:29 Do you have hope for greater and greater

00:33:33 power of simulation being boundless eventually

00:33:38 to where most of what we need to operate

00:33:40 in the physical world could be simulated

00:33:43 to a degree that’s directly transferable

00:33:46 to the physical world?

00:33:47 Or are we still very far away from that?

00:33:51 So I think we could even rephrase that question

00:33:57 in some sense.

00:33:58 Please.

00:34:00 And so the power of simulation, right?

00:34:04 As simulators get better and better,

00:34:06 of course, it becomes stronger

00:34:08 and we can learn more in simulation.

00:34:11 But there’s also another version

00:34:12 which is where you say the simulator

00:34:13 doesn’t even have to be that precise.

00:34:15 As long as it’s somewhat representative

00:34:18 and instead of trying to get one simulator

00:34:21 that is sufficiently precise to learn in

00:34:23 and transfer really well to the real world,

00:34:25 I’m gonna build many simulators.

00:34:27 Ensemble of simulators?

00:34:28 Ensemble of simulators.

00:34:29 Not any single one of them is sufficiently representative

00:34:33 of the real world such that it would work

00:34:36 if you train in there.

00:34:37 But if you train in all of them,

00:34:40 then there is something that’s good in all of them.

00:34:43 The real world will just be another one of them

00:34:47 that’s not identical to any one of them

00:34:49 but just another one of them.

00:34:50 Another sample from the distribution of simulators.

00:34:53 Exactly.
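
A sketch of the ensemble-of-simulators idea (domain randomization): train one controller across many randomized versions of a toy simulator, then check it on a held-out parameter setting standing in for the real world. The 1-D point-mass dynamics, cost function, and random-search training below are all invented purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def rollout_cost(gains, mass, friction, horizon=200, dt=0.05):
    """1-D point mass we want to hold at the origin with u = -k1*x - k2*v."""
    x, v, cost = 1.0, 0.0, 0.0
    for _ in range(horizon):
        u = -gains[0] * x - gains[1] * v
        a = (u - friction * v) / mass
        v += a * dt
        x += v * dt
        cost += x * x + 0.01 * u * u
    return cost

# Ensemble of simulators: same structure, randomized physical parameters.
ensemble = [(rng.uniform(0.5, 2.0), rng.uniform(0.0, 0.5)) for _ in range(20)]

def ensemble_cost(gains):
    return np.mean([rollout_cost(gains, m, f) for m, f in ensemble])

# Crude random-search "training" of the single shared controller.
best = np.array([1.0, 1.0])
best_cost = ensemble_cost(best)
for _ in range(300):
    cand = best + rng.normal(scale=0.3, size=2)
    c = ensemble_cost(cand)
    if c < best_cost:
        best, best_cost = cand, c

# The "real world": a parameter setting that was never in the ensemble.
print("cost on held-out real world:", round(rollout_cost(best, mass=1.3, friction=0.27), 2))
```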

00:34:54 We do live in a simulation,

00:34:54 so this is just one other one.

00:34:57 I’m not sure about that, but yeah.

00:35:01 It’s definitely a very advanced simulator if it is.

00:35:03 Yeah, it’s a pretty good one.

00:35:05 I’ve talked to Stuart Russell.

00:35:07 It’s something you think about a little bit too.

00:35:09 Of course, you’re really trying to build these systems,

00:35:12 but do you think about the future of AI?

00:35:13 A lot of people have concern about safety.

00:35:16 How do you think about AI safety?

00:35:18 As you build robots that are operating in the physical world,

00:35:21 what is, yeah, how do you approach this problem

00:35:25 in an engineering kind of way, in a systematic way?

00:35:29 So when a robot is doing things,

00:35:32 you kind of have a few notions of safety to worry about.

00:35:36 One is that the robot is physically strong

00:35:39 and of course could do a lot of damage.

00:35:42 Same for cars, which we can think of as robots too

00:35:44 in some way.

00:35:46 And this could be completely unintentional.

00:35:48 So it could be not the kind of longterm AI safety concerns

00:35:51 that, okay, AI is smarter than us and now what do we do?

00:35:54 But it could be just very practical.

00:35:55 Okay, this robot, if it makes a mistake,

00:35:58 what are the results going to be?

00:36:00 Of course, simulation comes in a lot there

00:36:02 to test in simulation. It’s a difficult question.

00:36:07 And I’m always wondering, like, I always wonder,

00:36:09 let’s say you look at, let’s go back to driving

00:36:12 because a lot of people know driving well, of course.

00:36:15 What do we do to test somebody for driving, right?

00:36:18 Get a driver’s license. What do they really do?

00:36:21 I mean, you fill out some tests and then you drive.

00:36:26 And I mean, it’s suburban California.

00:36:29 That driving test is just you drive around the block,

00:36:32 pull over, you do a stop sign successfully,

00:36:36 and then you pull over again and you’re pretty much done.

00:36:40 And you’re like, okay, if a self driving car did that,

00:36:44 would you trust it that it can drive?

00:36:46 And I’d be like, no, that’s not enough for me to trust it.

00:36:48 But somehow for humans, we’ve figured out

00:36:51 that somebody being able to do that is representative

00:36:55 of them being able to do a lot of other things.

00:36:57 And so I think somehow for humans,

00:36:59 we figured out representative tests

00:37:02 of what it means if you can do this, what you can really do.

00:37:05 Of course, testing humans,

00:37:07 humans don’t wanna be tested at all times.

00:37:09 Self driving cars or robots

00:37:10 could be tested more often probably.

00:37:11 You can have replicas that get tested

00:37:13 that are known to be identical

00:37:14 because they use the same neural net and so forth.

00:37:17 But still, I feel like we don’t have this kind of unit tests

00:37:21 or proper tests for robots.

00:37:24 And I think there’s something very interesting

00:37:25 to be thought about there,

00:37:26 especially as you update things.

00:37:28 Your software improves,

00:37:29 you have a better self driving car suite, you update it.

00:37:32 How do you know it’s indeed more capable on everything

00:37:35 than what you had before,

00:37:37 that you didn’t have any bad things creep into it?

00:37:41 So I think that’s a very interesting direction of research

00:37:43 that there is no real solution yet,

00:37:46 except that somehow for humans we do.

00:37:47 Because we say, okay, you have a driving test, you passed,

00:37:50 you can go on the road now,

00:37:51 and humans have accidents every like a million

00:37:54 or 10 million miles, something pretty phenomenal

00:37:57 compared to that short test that is being done.

00:38:01 So let me ask, you’ve mentioned that Andrew Ng by example

00:38:06 showed you the value of kindness.

00:38:10 Do you think the space of policies,

00:38:14 good policies for humans and for AI

00:38:17 is populated by policies that are full of kindness,

00:38:22 or ones that are the opposite, exploitation, even evil?

00:38:28 So if you just look at the sea of policies

00:38:30 we operate under as human beings,

00:38:32 or if an AI system had to operate in this real world,

00:38:35 do you think it’s really easy to find policies

00:38:38 that are full of kindness,

00:38:39 like we naturally fall into them?

00:38:41 Or is it like a very hard optimization problem?

00:38:48 I mean, there is kind of two optimizations

00:38:50 happening for humans, right?

00:38:52 So for humans, there’s kind of the very long term

00:38:54 optimization which evolution has done for us

00:38:56 and we’re kind of predisposed to like certain things.

00:39:00 And that’s in some sense what makes our learning easier

00:39:02 because I mean, we know things like pain

00:39:05 and hunger and thirst.

00:39:08 And the fact that we know about those

00:39:10 is not something that we were taught, that’s kind of innate.

00:39:12 When we’re hungry, we’re unhappy.

00:39:14 When we’re thirsty, we’re unhappy.

00:39:16 When we have pain, we’re unhappy.

00:39:18 And ultimately evolution built that into us

00:39:21 to think about those things.

00:39:22 And so I think there is a notion that

00:39:24 it seems somehow humans evolved in general

00:39:28 to prefer to get along in some ways,

00:39:32 but at the same time also to be very territorial

00:39:36 and kind of centric to their own tribe.

00:39:41 Like it seems like that’s the kind of space

00:39:43 we converged onto.

00:39:44 I mean, I’m not an expert in anthropology,

00:39:46 but it seems like we’re very kind of good

00:39:49 within our own tribe, but need to be taught

00:39:52 to be nice to other tribes.

00:39:54 Well, if you look at Steven Pinker,

00:39:56 he highlights this pretty nicely in

00:40:00 Better Angels of Our Nature,

00:40:02 where he talks about violence decreasing over time

00:40:04 consistently.

00:40:05 So whatever tension, whatever teams we pick,

00:40:08 it seems that the long arc of history

00:40:11 goes towards us getting along more and more.

00:40:14 So. I hope so.

00:40:15 So do you think that, do you think it’s possible

00:40:20 to teach RL based robots this kind of kindness,

00:40:26 this kind of ability to interact with humans,

00:40:28 this kind of policy, even to, let me ask a fun one.

00:40:32 Do you think it’s possible to teach RL based robot

00:40:35 to love a human being and to inspire that human

00:40:38 to love the robot back?

00:40:40 So, like, an RL based algorithm that leads to a happy marriage.

00:40:47 That’s an interesting question.

00:40:48 Maybe I’ll answer it with another question, right?

00:40:52 Because I mean, but I’ll come back to it.

00:40:56 So another question you can have is okay.

00:40:58 I mean, how close does some people’s happiness get

00:41:03 from interacting with just a really nice dog?

00:41:07 Like, I mean, dogs, you come home,

00:41:09 that’s what dogs do.

00:41:10 They greet you, they’re excited,

00:41:12 makes you happy when you come home to your dog.

00:41:14 You’re just like, okay, this is exciting.

00:41:16 They’re always happy when I’m here.

00:41:18 And if they don’t greet you, cause maybe whatever,

00:41:21 your partner took them on a trip or something,

00:41:23 you might not be nearly as happy when you get home, right?

00:41:26 And so the kind of, it seems like the level of reasoning

00:41:30 a dog has is pretty sophisticated,

00:41:32 but then it’s still not yet at the level of human reasoning.

00:41:35 And so it seems like we don’t even need to achieve

00:41:37 human level reasoning to get like very strong affection

00:41:40 with humans.

00:41:41 And so my thinking is why not, right?

00:41:44 Why couldn’t, with an AI, couldn’t we achieve

00:41:47 the kind of level of affection that humans feel

00:41:51 among each other or with friendly animals and so forth?

00:41:57 So question, is it a good thing for us or not?

00:41:59 That’s another thing, right?

00:42:01 Because I mean, but I don’t see why not.

00:42:05 Why not, yeah, so Elon Musk says love is the answer.

00:42:09 Maybe he should say love is the objective function

00:42:12 and then RL is the answer, right?

00:42:14 Well, maybe.

00:42:17 Oh, Pieter, thank you so much.

00:42:18 I don’t want to take up more of your time.

00:42:20 Thank you so much for talking today.

00:42:21 Well, thanks for coming by.

00:42:23 Great to have you visit.