Transcript
00:00:00 The following is a conversation with Pieter Abbeel.
00:00:03 He’s a professor at UC Berkeley
00:00:04 and the director of the Berkeley Robot Learning Lab.
00:00:07 He’s one of the top researchers in the world
00:00:10 working on how we make robots understand
00:00:13 and interact with the world around them,
00:00:15 especially using imitation and deep reinforcement learning.
00:00:19 This conversation is part of the MIT course
00:00:22 on Artificial General Intelligence
00:00:24 and the Artificial Intelligence podcast.
00:00:26 If you enjoy it, please subscribe on YouTube,
00:00:29 iTunes, or your podcast provider of choice,
00:00:31 or simply connect with me on Twitter at Lex Fridman,
00:00:34 spelled F R I D.
00:00:36 And now, here’s my conversation with Pieter Abbeel.
00:00:41 You’ve mentioned that if there was one person
00:00:44 you could meet, it would be Roger Federer.
00:00:46 So let me ask, when do you think we’ll have a robot
00:00:50 that fully autonomously can beat Roger Federer at tennis?
00:00:54 Roger Federer level player at tennis?
00:00:57 Well, first, if you can make it happen for me to meet Roger,
00:01:00 let me know.
00:01:01 In terms of getting a robot to beat him at tennis,
00:01:07 it’s kind of an interesting question
00:01:08 because for a lot of the challenges we think about in AI,
00:01:14 the software is really the missing piece,
00:01:16 but for something like this,
00:01:18 the hardware is nowhere near either.
00:01:22 To really have a robot that can physically run around,
00:01:26 the Boston Dynamics robots are starting to get there,
00:01:28 but still not really human level ability to run around
00:01:33 and then swing a racket.
00:01:36 So you think that’s a hardware problem?
00:01:38 I don’t think it’s a hardware problem only.
00:01:39 I think it’s a hardware and a software problem.
00:01:41 I think it’s both.
00:01:43 And I think they’ll have independent progress.
00:01:45 So I’d say the hardware maybe in 10, 15 years.
00:01:51 On clay, not grass.
00:01:52 I mean, grass is probably harder.
00:01:53 With the sliding?
00:01:54 Yeah.
00:01:55 With the clay, I’m not sure what’s harder, grass or clay.
00:01:58 The clay involves sliding,
00:02:01 which might be harder to master actually, yeah.
00:02:06 But you’re not limited to a bipedal.
00:02:08 I mean, I’m sure there’s no…
00:02:09 Well, if we can build a machine,
00:02:11 it’s a whole different question, of course.
00:02:13 If you can say, okay, this robot can be on wheels,
00:02:16 it can move around on wheels and can be designed differently,
00:02:19 then I think that can be done sooner probably
00:02:23 than a full humanoid type of setup.
00:02:26 What do you think of swinging a racket?
00:02:27 So you’ve worked on basic manipulation.
00:02:31 How hard do you think the task of swinging a racket is,
00:02:34 to be able to hit a nice backhand or a forehand?
00:02:39 Let’s say we just set up stationary,
00:02:42 a nice robot arm, let’s say, a standard industrial arm,
00:02:46 and it can watch the ball come and then swing the racket.
00:02:50 It’s a good question.
00:02:51 I’m not sure it would be super hard to do.
00:02:56 I mean, I’m sure it would require a lot,
00:02:58 if we do it with reinforcement learning,
00:03:00 it would require a lot of trial and error.
00:03:01 It’s not gonna swing it right the first time around,
00:03:03 but yeah, I don’t see why it couldn’t
00:03:07 swing it the right way.
00:03:09 I think it’s learnable.
00:03:10 I think if you set up a ball machine,
00:03:12 let’s say on one side,
00:03:13 and then a robot with a tennis racket on the other side,
00:03:17 I think it’s learnable
00:03:20 maybe with a little bit of pre-training in simulation.
00:03:22 Yeah, I think that’s feasible.
00:03:25 I think swinging the racket is feasible.
00:03:27 It’d be very interesting to see how much precision
00:03:28 it can get.
00:03:31 Cause I mean, that’s where, I mean,
00:03:35 some of the human players can hit it on the lines,
00:03:37 which is very high precision.
00:03:39 With spin, the spin is an interesting question,
00:03:42 whether RL can learn to put a spin on the ball.
00:03:45 Well, you got me interested.
00:03:46 Maybe someday we’ll set this up.
00:03:48 Sure, you got me intrigued.
00:03:51 Your answer is basically, okay,
00:03:52 for this problem, it sounds fascinating,
00:03:54 but for the general problem of a tennis player,
00:03:56 we might be a little bit farther away.
00:03:58 What’s the most impressive thing you’ve seen a robot do
00:04:01 in the physical world?
00:04:04 So physically for me,
00:04:06 it’s the Boston Dynamics videos.
00:04:10 They always just hit home, and I’m just super impressed.
00:04:15 Recently, the robot running up the stairs,
00:04:17 doing the parkour type thing.
00:04:19 I mean, yes, we don’t know what’s underneath.
00:04:22 They don’t really write a lot of detail,
00:04:23 but even if it’s hard coded underneath,
00:04:27 which it might or might not be, just the physical abilities
00:04:29 of doing that parkour, that’s very impressive.
00:04:32 So have you met Spot Mini
00:04:34 or any of those robots in person?
00:04:36 I met Spot Mini last year in April at the MARS event
00:04:41 that Jeff Bezos organizes.
00:04:42 They brought it out there
00:04:44 and it was nicely following around Jeff.
00:04:47 When Jeff left the room, they had it follow him along,
00:04:50 which is pretty impressive.
00:04:52 So I think we can say with some confidence
00:04:55 that there’s no learning going on in those robots.
00:04:58 The psychology of it, so while knowing that,
00:05:00 while knowing there’s not,
00:05:01 if there’s any learning going on, it’s very limited.
00:05:04 I met Spot Mini earlier this year
00:05:06 and knowing everything that’s going on,
00:05:09 having one on one interaction,
00:05:11 so I got to spend some time alone and there’s immediately
00:05:15 a deep connection on the psychological level.
00:05:18 Even though you know the fundamentals, how it works,
00:05:21 there’s something magical.
00:05:23 So do you think about the psychology of interacting
00:05:27 with robots in the physical world?
00:05:29 Even you just showed me the PR2, the robot,
00:05:33 and there was a little bit something like a face,
00:05:36 had a little bit something like a face.
00:05:38 There’s something that immediately draws you to it.
00:05:40 Do you think about that aspect of the robotics problem?
00:05:45 Well, it’s very hard with BRETT here.
00:05:48 We gave him a name, BRETT, the Berkeley Robot
00:05:50 for the Elimination of Tedious Tasks.
00:05:52 It’s very hard to not think of the robot as a person
00:05:56 and it seems like everybody calls him a he
00:05:58 for whatever reason, but that also makes it more a person
00:06:01 than if it was a it, and it seems pretty natural
00:06:06 to think of it that way.
00:06:07 This past weekend really struck me.
00:06:08 I’ve seen Pepper many times on videos,
00:06:13 but then I was at an event organized by,
00:06:15 this was by Fidelity, and they had scripted Pepper
00:06:18 to help moderate some sessions,
00:06:22 and they had scripted Pepper
00:06:23 to have the personality of a child a little bit,
00:06:26 and it was very hard to not think of it
00:06:28 as its own person in some sense
00:06:31 because it would just jump in the conversation,
00:06:34 making it very interactive.
00:06:35 The moderator would be saying something, and Pepper would just jump in,
00:06:37 hold on, how about me?
00:06:40 Can I participate in this too?
00:06:41 And you’re just like, okay, this is like a person,
00:06:43 and that was 100% scripted, and even then it was hard
00:06:46 not to have that sense of somehow there is something there.
00:06:50 So as we have robots interact in this physical world,
00:06:54 is that a signal that could be used
00:06:56 in reinforcement learning?
00:06:57 You’ve worked a little bit in this direction,
00:07:00 but do you think that psychology can be somehow pulled in?
00:07:04 Yes, that’s a question I would say
00:07:07 a lot of people ask, and I think part of why they ask it
00:07:11 is they’re thinking about how unique
00:07:14 are we really still as people?
00:07:16 Like after they see some results,
00:07:18 they see a computer play Go, they see a computer do this,
00:07:21 that, they’re like, okay, but can it really have emotion?
00:07:23 Can it really interact with us in that way?
00:07:26 And then once you’re around robots,
00:07:29 you already start feeling it,
00:07:30 and I think that, kind of maybe methodologically,
00:07:33 the way that I think of it is
00:07:34 if you run something like reinforcement learning,
00:07:37 it’s about optimizing some objective,
00:07:39 and there’s no reason that the objective
00:07:45 couldn’t be tied into how much does a person like
00:07:49 interacting with this system,
00:07:50 and why could not the reinforcement learning system
00:07:53 optimize for the robot being fun to be around?
00:07:56 And why wouldn’t it then naturally become
00:07:58 more and more interactive and more and more
00:08:01 maybe like a person or like a pet?
00:08:03 I don’t know what it would exactly be,
00:08:04 but more and more have those features
00:08:06 and acquire them automatically.
00:08:08 As long as you can formalize an objective
00:08:10 of what it means to like something,
00:08:13 how do you exhibit it, what’s the ground truth?
00:08:16 How do you get the reward from the human?
00:08:19 Because you have to somehow collect
00:08:20 that information from the human.
00:08:22 But you’re saying if you can formulate it as an objective,
00:08:26 it can be learned.
00:08:27 There’s no reason it couldn’t emerge through learning,
00:08:29 and maybe one way to formulate it as an objective,
00:08:31 you wouldn’t have to necessarily score it explicitly,
00:08:33 so standard rewards are numbers,
00:08:36 and numbers are hard to come by.
00:08:38 This is a 1.5 or a 1.7 on some scale.
00:08:41 It’s very hard to do for a person,
00:08:43 but much easier is for a person to say,
00:08:45 okay, what you did the last five minutes
00:08:47 was much nicer than what you did the previous five minutes,
00:08:51 and that now gives a comparison.
00:08:53 And in fact, there have been some results on that.
00:08:55 For example, Paul Christiano and collaborators at OpenAI
00:08:57 had the Hopper, the MuJoCo Hopper, a one legged robot,
00:09:02 going through backflips purely from feedback.
00:09:05 I like this better than that.
00:09:06 That’s kind of equally good,
00:09:08 and after a bunch of interactions,
00:09:10 it figured out what it was that the person was asking for,
00:09:13 namely a backflip.
00:09:14 And so I think the same thing.
00:09:15 Oh, it wasn’t trying to do a backflip.
00:09:18 It was just getting a comparison score
00:09:20 from the person based on?
00:09:23 The person had in mind, in their own mind,
00:09:26 I want it to do a backflip,
00:09:27 but the robot didn’t know what it was supposed to be doing.
00:09:30 It just knew that sometimes the person said,
00:09:32 this is better, this is worse,
00:09:34 and then the robot figured out
00:09:36 what the person was actually after was a backflip.
00:09:38 And I’d imagine the same would be true
00:09:40 for things like more interactive robots,
00:09:43 that the robot would figure out over time,
00:09:45 oh, this kind of thing apparently is appreciated more
00:09:48 than this other kind of thing.
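A minimal sketch of the pairwise-preference setup described above, in the spirit of the Christiano et al. work mentioned here: a reward model is trained from "this segment was better than that one" comparisons and then used as the reward signal for an ordinary RL loop. The network sizes, data shapes, and function names are illustrative assumptions, not the original implementation.

```python
# Learn a reward model from pairwise human preferences (Bradley-Terry style).
import torch
import torch.nn as nn

reward_model = nn.Sequential(nn.Linear(8, 64), nn.ReLU(), nn.Linear(64, 1))
opt = torch.optim.Adam(reward_model.parameters(), lr=1e-3)

def preference_update(segment_a, segment_b, a_preferred):
    """segment_*: (T, obs_dim) tensors; a_preferred: 1.0 if the person liked A more."""
    # Sum predicted reward over each segment, then treat the comparison as a
    # logistic classification problem: which segment should win?
    r_a = reward_model(segment_a).sum()
    r_b = reward_model(segment_b).sum()
    logit = r_a - r_b
    target = torch.tensor(float(a_preferred))
    loss = nn.functional.binary_cross_entropy_with_logits(logit, target)
    opt.zero_grad()
    loss.backward()
    opt.step()
    # The trained reward_model then stands in for the hand-written reward
    # when running reinforcement learning on the robot.
```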
00:09:50 So when I first picked up Sutton’s,
00:09:54 Richard Sutton’s reinforcement learning book,
00:09:56 before sort of this deep learning,
00:10:01 before the reemergence of neural networks
00:10:03 as a powerful mechanism for machine learning,
00:10:05 RL seemed to me like magic.
00:10:08 It was beautiful.
00:10:10 So that seemed like what intelligence is,
00:10:13 RL reinforcement learning.
00:10:15 So how do you think we can possibly learn anything
00:10:20 about the world when the reward for the actions
00:10:22 is delayed, is so sparse?
00:10:25 Like where is, why do you think RL works?
00:10:30 Why do you think you can learn anything
00:10:32 under such sparse rewards,
00:10:35 whether it’s regular reinforcement learning
00:10:36 or deep reinforcement learning?
00:10:38 What’s your intuition?
00:10:40 The counterpart of that is why is RL,
00:10:44 why does it need so many samples,
00:10:47 so many experiences to learn from?
00:10:49 Because really what’s happening is
00:10:50 when you have a sparse reward,
00:10:53 you do something maybe for like, I don’t know,
00:10:55 you take 100 actions and then you get a reward.
00:10:57 And maybe you get like a score of three.
00:10:59 And I’m like okay, three, not sure what that means.
00:11:03 You go again and now you get two.
00:11:05 And now you know that that sequence of 100 actions
00:11:07 that you did the second time around
00:11:08 somehow was worse than the sequence of 100 actions
00:11:10 you did the first time around.
00:11:11 But it’s tough to now know which of those actions
00:11:14 were better or worse.
00:11:15 Some might have been good and some bad in either one.
00:11:17 And so that’s why it needs so many experiences.
00:11:19 But once you have enough experiences,
00:11:21 effectively RL is teasing that apart.
00:11:23 It’s trying to say okay, what is consistently there
00:11:26 when you get a higher reward
00:11:27 and what’s consistently there when you get a lower reward?
00:11:30 And then kind of the magic of, for example,
00:11:32 the policy gradient update is to say
00:11:34 now let’s update the neural network
00:11:37 to make the actions that were kind of present
00:11:39 when things are good more likely
00:11:41 and make the actions that are present
00:11:43 when things are not as good less likely.
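A minimal sketch of the policy-gradient update just described: actions that were present when the return was higher than usual get pushed up, and actions present when it was lower get pushed down. The environment, network shapes, and the simple average-reward baseline below are illustrative assumptions.

```python
# REINFORCE-style update over a batch of whole-episode returns.
import torch
import torch.nn as nn

policy = nn.Sequential(nn.Linear(4, 32), nn.ReLU(), nn.Linear(32, 2))
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)

def reinforce_update(episodes):
    """episodes: list of (states, actions, total_reward) tuples,
    where states is (T, obs_dim) and actions is (T,) of action indices."""
    # Reward relative to the batch average acts as a crude baseline, so
    # better-than-usual trajectories are reinforced and worse ones suppressed.
    baseline = sum(r for _, _, r in episodes) / len(episodes)
    loss = 0.0
    for states, actions, total_reward in episodes:
        logits = policy(states)                              # (T, num_actions)
        log_probs = torch.log_softmax(logits, dim=-1)
        chosen = log_probs.gather(1, actions.unsqueeze(1)).squeeze(1)
        loss = loss - (total_reward - baseline) * chosen.sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```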
00:11:45 So that is the counterpoint,
00:11:47 but it seems like you would need to run it
00:11:49 a lot more than you do.
00:11:50 Even though right now people could say
00:11:52 that RL is very inefficient,
00:11:54 but it seems to be way more efficient
00:11:56 than one would imagine on paper.
00:11:58 That the simple updates to the policy,
00:12:02 the policy gradient, that somehow you can learn,
00:12:04 exactly as you just said, what are the common actions
00:12:07 that seem to produce some good results?
00:12:09 That that somehow can learn anything.
00:12:12 It seems counterintuitive at least.
00:12:15 Is there some intuition behind it?
00:12:16 Yeah, so I think there’s a few ways to think about this.
00:12:21 The way I tend to think about it mostly originally,
00:12:26 so when we started working on deep reinforcement learning
00:12:29 here at Berkeley, which was maybe 2011, 12, 13,
00:12:32 around that time, John Schulman was a PhD student
00:12:36 initially kind of driving it forward here.
00:12:39 And the way we thought about it at the time was
00:12:44 if you think about rectified linear units
00:12:47 or kind of rectifier type neural networks,
00:12:50 what do you get?
00:12:51 You get something that’s piecewise linear feedback control.
00:12:55 And if you look at the literature,
00:12:57 linear feedback control is extremely successful,
00:12:59 can solve many, many problems surprisingly well.
00:13:03 I remember, for example, when we did helicopter flight,
00:13:05 if you’re in a stationary flight regime,
00:13:07 not a non stationary, but a stationary flight regime
00:13:10 like hover, you can use linear feedback control
00:13:12 to stabilize a helicopter, very complex dynamical system,
00:13:15 but the controller is relatively simple.
00:13:18 And so I think that’s a big part of it is that
00:13:20 if you do feedback control, even though the system
00:13:23 you control can be very, very complex,
00:13:25 often relatively simple control architectures
00:13:28 can already do a lot.
00:13:30 But then also just linear is not good enough.
00:13:32 And so one way you can think of these neural networks
00:13:35 is that sometimes they tile the space,
00:13:37 which people were already trying to do more by hand
00:13:39 or with finite state machines,
00:13:41 say this linear controller here,
00:13:42 this linear controller here.
00:13:43 Neural network learns to tile the space
00:13:45 and say linear controller here,
00:13:46 another linear controller here,
00:13:48 but it’s more subtle than that.
00:13:50 And so it’s benefiting from this linear control aspect,
00:13:52 it’s benefiting from the tiling,
00:13:53 but it’s somehow tiling it one dimension at a time.
00:13:57 Because if let’s say you have a two layer network,
00:13:59 if in that hidden layer, you make a transition
00:14:03 from active to inactive or the other way around,
00:14:06 that is essentially one axis, but not axis aligned,
00:14:09 but one direction that you change.
00:14:12 And so you have this kind of very gradual tiling
00:14:14 of the space where you have a lot of sharing
00:14:16 between the linear controllers that tile the space.
00:14:19 And that was always my intuition as to why
00:14:21 to expect that this might work pretty well.
00:14:24 It’s essentially leveraging the fact
00:14:26 that linear feedback control is so good,
00:14:28 but of course not enough.
00:14:29 And this is a gradual tiling of the space
00:14:31 with linear feedback controls
00:14:33 that share a lot of expertise across them.
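A small illustrative sketch of the "tiled linear controllers" picture: a ReLU network policy is piecewise linear in the state, so around any given state it acts like a linear feedback controller u = Kx + b. The dimensions and the way the local gain is extracted here are assumptions made for illustration.

```python
# Extract the local linear controller a ReLU policy implements at one state.
import torch
import torch.nn as nn

policy = nn.Sequential(nn.Linear(6, 32), nn.ReLU(), nn.Linear(32, 2))

def local_linear_controller(state):
    """Return the local gain matrix K and offset b of the policy at `state`."""
    state = state.clone().requires_grad_(True)
    action = policy(state)
    # Jacobian of the action w.r.t. the state: the "K" of the local linear
    # controller active in this region of the piecewise-linear tiling.
    K = torch.stack([
        torch.autograd.grad(action[i], state, retain_graph=True)[0]
        for i in range(action.shape[0])
    ])
    b = (action - K @ state).detach()
    return K.detach(), b
```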
00:14:36 So that’s really nice intuition,
00:14:39 but do you think that scales to the more
00:14:41 and more general problems, when you start going up
00:14:44 in the number of dimensions, when you start
00:14:49 going down in terms of how often
00:14:52 you get a clean reward signal?
00:14:55 Does that intuition carry forward to those crazier,
00:14:58 weirder worlds that we think of as the real world?
00:15:03 So I think where things get really tricky
00:15:08 in the real world compared to the things
00:15:09 we’ve looked at so far with great success
00:15:11 in reinforcement learning is the time scales,
00:15:17 which takes us to an extreme.
00:15:18 So when you think about the real world,
00:15:21 I mean, I don’t know, maybe some student
00:15:24 decided to do a PhD here, right?
00:15:26 Okay, that’s a decision.
00:15:28 That’s a very high level decision.
00:15:30 But if you think about their lives,
00:15:32 I mean, any person’s life,
00:15:34 it’s a sequence of muscle fiber contractions
00:15:37 and relaxations, and that’s how you interact with the world.
00:15:40 And that’s a very high frequency control thing,
00:15:42 but it’s ultimately what you do
00:15:44 and how you affect the world,
00:15:46 until I guess we have brain readings
00:15:48 and you can maybe do it slightly differently.
00:15:49 But typically that’s how you affect the world.
00:15:52 And the decision of doing a PhD is so abstract
00:15:56 relative to what you’re actually doing in the world.
00:15:59 And I think that’s where credit assignment
00:16:01 becomes just completely beyond
00:16:04 what any current RL algorithm can do.
00:16:06 And we need hierarchical reasoning
00:16:09 at a level that is just not available at all yet.
00:16:12 Where do you think we can pick up hierarchical reasoning?
00:16:14 By which mechanisms?
00:16:16 Yeah, so maybe let me highlight
00:16:18 what I think the limitations are
00:16:20 of what already was done 20, 30 years ago.
00:16:26 In fact, you’ll find reasoning systems
00:16:27 that reason over relatively long horizons,
00:16:30 but the problem is that they were not grounded
00:16:32 in the real world.
00:16:34 So people would have to hand design
00:16:39 some kind of logical, dynamical descriptions of the world
00:16:43 and that didn’t tie into perception.
00:16:46 And so it didn’t tie into real objects and so forth.
00:16:49 And so that was a big gap.
00:16:51 Now with deep learning, we start having the ability
00:16:53 to really see with sensors, process that
00:16:59 and understand what’s in the world.
00:17:01 And so it’s a good time to try
00:17:02 to bring these things together.
00:17:04 I see a few ways of getting there.
00:17:06 One way to get there would be to say
00:17:08 deep learning can get bolted on somehow
00:17:10 to some of these more traditional approaches.
00:17:12 Now bolted on would probably mean
00:17:14 you need to do some kind of end to end training
00:17:16 where you say my deep learning processing
00:17:18 somehow leads to a representation
00:17:20 that in turn uses some kind of traditional
00:17:24 underlying dynamical systems that can be used for planning.
00:17:29 And that’s, for example, the direction Aviv Tamar
00:17:32 and Thanard Kurutach here have been pushing
00:17:34 with Causal InfoGAN, and of course other people too.
00:17:36 That’s one way.
00:17:38 Can we somehow force it into the form factor
00:17:41 that is amenable to reasoning?
00:17:43 Another direction we’ve been thinking about
00:17:46 for a long time and didn’t make any progress on
00:17:50 was more information theoretic approaches.
00:17:53 So the idea there was that what it means
00:17:56 to take a high level action is to choose
00:17:59 a latent variable now
00:18:02 that tells you a lot about what’s gonna be the case
00:18:04 in the future.
00:18:05 Because that’s what it means to take a high level action.
00:18:09 I say okay, I decide I’m gonna navigate
00:18:13 to the gas station because I need to get gas for my car.
00:18:15 Well, that’ll now take five minutes to get there.
00:18:17 But the fact that I get there,
00:18:19 I could already tell that from the high level action
00:18:22 I took much earlier.
00:18:24 That we had a very hard time getting success with.
00:18:28 Not saying it’s a dead end necessarily,
00:18:30 but we had a lot of trouble getting that to work.
00:18:33 And then we started revisiting the notion
00:18:34 of what are we really trying to achieve?
00:18:37 What we’re trying to achieve is not necessarily hierarchy
00:18:40 per se, but you could think about
00:18:41 what does hierarchy give us?
00:18:44 What we hope it would give us is better credit assignment.
00:18:49 What is better credit assignment?
00:18:51 It’s giving us, it gives us faster learning, right?
00:18:55 And so faster learning is ultimately maybe what we’re after.
00:18:59 And so that’s where we ended up with the RL squared paper
00:19:03 on learning to reinforcement learn,
00:19:06 which at the time Rocky Duan led.
00:19:08 And that’s exactly the meta learning approach
00:19:11 where you say, okay, we don’t know how to design hierarchy.
00:19:14 We know what we want to get from it.
00:19:15 Let’s just end-to-end optimize for what we want to get
00:19:18 from it and see if it might emerge.
00:19:20 And we saw things emerge.
00:19:21 The maze navigation had consistent motion down hallways,
00:19:26 which is what you want.
00:19:27 A hierarchical control should say,
00:19:28 I want to go down this hallway.
00:19:29 And then when there is an option to take a turn,
00:19:31 I can decide whether to take a turn or not and repeat.
00:19:33 It even had the notion of whether you’d been somewhere before or not,
00:19:37 so as to not revisit places you’ve been before.
00:19:39 It still didn’t scale yet
00:19:42 to the real world kind of scenarios I think you had in mind,
00:19:46 but it was some sign of life
00:19:47 that maybe you can meta learn these hierarchical concepts.
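A loose sketch of the RL squared, "learning to reinforcement learn" setup described here: a recurrent policy that also sees its previous action and reward, and whose hidden state persists across episodes of the same task, so fast adaptation (and possibly hierarchy-like behavior) can emerge from the outer optimization. The class, interfaces, and dimensions below are illustrative, not the paper's code.

```python
import torch
import torch.nn as nn

class RL2Policy(nn.Module):
    """Recurrent policy whose hidden state carries information across episodes
    of one task; the outer RL algorithm (not shown) trains the weights across
    many sampled tasks."""
    def __init__(self, obs_dim, act_dim, hidden_dim=64):
        super().__init__()
        # Feeding in the previous action and reward lets the RNN implement
        # something like a learning algorithm inside its recurrent dynamics.
        self.rnn = nn.GRUCell(obs_dim + act_dim + 1, hidden_dim)
        self.head = nn.Linear(hidden_dim, act_dim)

    def forward(self, obs, prev_action, prev_reward, h):
        x = torch.cat([obs, prev_action, prev_reward], dim=-1)
        h = self.rnn(x, h)
        return self.head(h), h

# Usage sketch: keep `h` across episode boundaries within a trial,
# and only reset it when a new task is sampled.
policy = RL2Policy(obs_dim=10, act_dim=4)
h = torch.zeros(1, 64)
obs = torch.zeros(1, 10)
prev_action, prev_reward = torch.zeros(1, 4), torch.zeros(1, 1)
logits, h = policy(obs, prev_action, prev_reward, h)
```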
00:19:51 I mean, it seems like through these meta learning concepts,
00:19:56 you get at what I think is one of the hardest
00:19:59 and most important problems of AI,
00:20:02 which is transfer learning.
00:20:04 So it’s generalization.
00:20:06 How far along this journey
00:20:08 towards building general systems are we?
00:20:11 Being able to do transfer learning well.
00:20:13 So there’s some signs that you can generalize a little bit,
00:20:17 but do you think we’re on the right path
00:20:19 or it’s totally different breakthroughs are needed
00:20:23 to be able to transfer knowledge
00:20:26 between different learned models?
00:20:31 Yeah, I’m pretty torn on this in that
00:20:33 I think there are some very impressive.
00:20:35 Well, there’s just some very impressive results already.
00:20:40 I mean, I would say when,
00:20:44 even with the initial kind of big breakthrough in 2012
00:20:47 with AlexNet, the initial thing is okay, great.
00:20:52 This does better on ImageNet, hence image recognition.
00:20:55 But then immediately thereafter,
00:20:57 there was of course the notion that,
00:21:00 wow, what was learned on ImageNet
00:21:03 and you now wanna solve a new task,
00:21:05 you can fine tune AlexNet for new tasks.
00:21:09 And that was often found to be the even bigger deal
00:21:12 that you learn something that was reusable,
00:21:14 which was not often the case before.
00:21:16 Usually machine learning, you learn something
00:21:17 for one scenario and that was it.
00:21:19 And that’s really exciting.
00:21:20 I mean, that’s a huge application.
00:21:22 That’s probably the biggest success
00:21:23 of transfer learning today in terms of scope and impact.
00:21:27 That was a huge breakthrough.
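A hedged sketch of the fine-tuning pattern being described: reuse an ImageNet-pretrained backbone, freeze its features, and train only a new head on the new task. The specific model, the `weights` argument (whose syntax varies across torchvision versions), and the 10-class head are assumptions for illustration, not the original AlexNet recipe.

```python
# Reuse ImageNet features for a new task by fine-tuning only a new head.
import torch
import torch.nn as nn
from torchvision import models

model = models.resnet18(weights="IMAGENET1K_V1")     # any ImageNet-pretrained backbone
for param in model.parameters():
    param.requires_grad = False                       # keep the learned features as-is
model.fc = nn.Linear(model.fc.in_features, 10)        # new head for a 10-class task

optimizer = torch.optim.Adam(model.fc.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

def finetune_step(images, labels):
    """One supervised step on a batch from the new, smaller dataset."""
    logits = model(images)
    loss = loss_fn(logits, labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```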
00:21:29 And then recently, I feel like similar kind of,
00:21:33 by scaling things up, it seems like
00:21:34 this has been expanded upon.
00:21:36 Like people training even bigger networks,
00:21:37 they might transfer even better.
00:21:39 If you looked at, for example,
00:21:41 some of the OpenAI results on language models
00:21:43 and some of the recent Google results on language models,
00:21:47 they’re learned for just prediction
00:21:51 and then they get reused for other tasks.
00:21:54 And so I think there is something there
00:21:56 where somehow if you train a big enough model
00:21:58 on enough things, it seems to transfer.
00:22:01 Some DeepMind results that I thought were very impressive were
00:22:03 the UNREAL results, where it learned to navigate mazes
00:22:09 in ways where it wasn’t just doing reinforcement learning,
00:22:11 but it had other objectives it was optimizing for.
00:22:14 So I think there’s a lot of interesting results already.
00:22:17 I think maybe where it’s hard to wrap my head around this,
00:22:22 to which extent or when do we call something generalization?
00:22:26 Or the levels of generalization in the real world,
00:22:29 or the levels of generalization involved
00:22:31 in these different tasks, right?
00:22:36 You’ve drawn this distinction, by the way, just to frame things.
00:22:39 I’ve heard you say somewhere, it’s the difference
00:22:41 between learning to master versus learning to generalize,
00:22:44 that it’s a nice line to think about.
00:22:47 And I guess you’re saying that it’s a gray area
00:22:50 of what learning to master and learning to generalize,
00:22:53 where one starts.
00:22:54 I think I might have heard this.
00:22:56 I might have heard it somewhere else.
00:22:57 And I think it might’ve been one of your interviews,
00:23:00 maybe the one with Yoshua Bengio, I’m not 100% sure.
00:23:03 But I liked the example, I’m not sure who it was,
00:23:08 but the example was essentially,
00:23:10 if you use current deep learning techniques,
00:23:13 what we’re doing to predict, let’s say,
00:23:17 the relative motion of our planets, it would do pretty well.
00:23:22 But then now if a massive new mass enters our solar system,
00:23:28 it would probably not predict what will happen, right?
00:23:32 And that’s a different kind of generalization.
00:23:33 That’s a generalization that relies
00:23:34 on the ultimate simplest, simplest explanation
00:23:38 that we have available today
00:23:40 to explain the motion of planets,
00:23:41 whereas just pattern recognition could predict
00:23:43 our current solar system motion pretty well, no problem.
00:23:47 And so I think that’s an example
00:23:48 of a kind of generalization that is a little different
00:23:52 from what we’ve achieved so far.
00:23:54 And it’s not clear if just regularizing more
00:23:59 and forcing it to come up with a simpler and simpler
00:24:01 explanation would be enough.
00:24:03 But that’s what physics researchers do, right?
00:24:05 They say, can I make this even simpler?
00:24:08 How simple can I get this?
00:24:09 What’s the simplest equation that can explain everything?
00:24:12 The master equation for the entire dynamics of the universe,
00:24:15 we haven’t really pushed that direction as hard
00:24:17 in deep learning, I would say.
00:24:20 Not sure if it should be pushed,
00:24:22 but it seems a kind of generalization you get from that
00:24:24 that you don’t get in our current methods so far.
00:24:27 So I just talked to Vladimir Vapnik, for example,
00:24:30 who’s one of the founders of statistical learning theory,
00:24:34 and he kind of dreams of creating
00:24:37 the E equals MC squared for learning, right?
00:24:41 The general theory of learning.
00:24:42 Do you think that’s a fruitless pursuit
00:24:44 in the near term, within the next several decades?
00:24:51 I think that’s a really interesting pursuit
00:24:53 in the following sense, in that there is a lot of evidence
00:24:58 that the brain is pretty modular.
00:25:03 And so I wouldn’t maybe think of it as the theory,
00:25:05 maybe the underlying theory, but more kind of the principle
00:25:10 where there have been findings where
00:25:12 people who are blind will use the part of the brain
00:25:16 usually used for vision for other functions.
00:25:21 And even after some kind of,
00:25:24 if people get rewired in some way,
00:25:26 they might be able to reuse parts of their brain
00:25:28 for other functions.
00:25:30 And so what that suggests is some kind of modularity.
00:25:35 And I think it is a pretty natural thing to strive for
00:25:39 to see, can we find that modularity?
00:25:41 Can we find this thing?
00:25:43 Of course, every part of the brain is not exactly the same.
00:25:45 Not everything can be rewired arbitrarily.
00:25:48 But if you think of things like the neocortex,
00:25:50 which is a pretty big part of the brain,
00:25:52 that seems fairly modular from what the findings so far.
00:25:56 Can you design something equally modular?
00:25:59 And if you can just grow it,
00:26:00 it becomes more capable probably.
00:26:02 I think that would be the kind of interesting
00:26:04 underlying principle to shoot for that is not unrealistic.
00:26:09 Do you think you prefer math or empirical trial and error
00:26:15 for the discovery of the essence of what it means
00:26:17 to do something intelligent?
00:26:19 So reinforcement learning embodies both camps, right?
00:26:22 To prove that something converges, prove the bounds.
00:26:26 And then at the same time, a lot of those successes are,
00:26:29 well, let’s try this and see if it works.
00:26:31 So which do you gravitate towards?
00:26:33 How do you think of those two parts of your brain?
00:26:39 Maybe I would prefer we could make the progress
00:26:44 with mathematics.
00:26:45 And the reason maybe I would prefer that is because often
00:26:48 if you have something you can mathematically formalize,
00:26:52 you can leapfrog a lot of experimentation.
00:26:55 And experimentation takes a long time to get through.
00:26:58 And a lot of trial and error,
00:27:01 kind of reinforcement learning, your research process,
00:27:04 but you need to do a lot of trial and error
00:27:05 before you get to a success.
00:27:06 So if you can leapfrog that, to my mind,
00:27:08 that’s what the math is about.
00:27:10 And hopefully once you do a bunch of experiments,
00:27:13 you start seeing a pattern.
00:27:14 You can do some derivations that leapfrog some experiments.
00:27:18 But I agree with you.
00:27:19 I mean, in practice, a lot of the progress has been such
00:27:21 that we have not been able to find the math
00:27:23 that allows you to leapfrog ahead.
00:27:25 And we are kind of making gradual progress
00:27:28 one step at a time, a new experiment here,
00:27:30 a new experiment there that gives us new insights
00:27:32 and gradually building up,
00:27:34 but not getting to something yet where we’re just,
00:27:36 okay, here’s an equation that now explains
00:27:39 what would have been, you know,
00:27:40 two years of experimentation to get there,
00:27:42 but this tells us what the result’s going to be.
00:27:45 Unfortunately, not so much yet.
00:27:47 Not so much yet, but your hope is there.
00:27:50 In trying to teach robots or systems
00:27:53 to do everyday tasks or even in simulation,
00:27:58 what do you think you’re more excited about?
00:28:02 Imitation learning or self play?
00:28:04 So letting robots learn from humans
00:28:08 or letting robots plan their own
00:28:11 to try to figure out in their own way
00:28:13 and eventually play, eventually interact with humans
00:28:18 or solve whatever the problem is.
00:28:20 What’s the more exciting to you?
00:28:21 What’s more promising you think as a research direction?
00:28:24 So when we look at self play,
00:28:32 what’s so beautiful about it is goes back
00:28:34 to kind of the challenges in reinforcement learning.
00:28:37 So the challenge of reinforcement learning
00:28:38 is getting signal.
00:28:40 And if you never succeed, you don’t get any signal.
00:28:43 In self play, you’re on both sides.
00:28:46 So one of you succeeds.
00:28:48 And the beauty is also one of you fails.
00:28:49 And so you see the contrast.
00:28:51 You see the one version of me that did better
00:28:53 than the other version.
00:28:54 So every time you play yourself, you get signal.
00:28:57 And so whenever you can turn something into self play,
00:29:00 you’re in a beautiful situation
00:29:02 where you can naturally learn much more quickly
00:29:04 than in most other reinforcement learning environments.
00:29:07 So I think if somehow we can turn more
00:29:12 reinforcement learning problems
00:29:13 into self play formulations,
00:29:15 that would go really, really far.
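A minimal sketch of the self-play idea just described: every game yields both a winner and a loser, so there is learning signal on every episode even when the only reward is win or lose. The `env.play` and `agent.update` interfaces are hypothetical stand-ins, not a real library API.

```python
# Self-play training loop: the agent keeps playing a frozen copy of itself.
import copy

def self_play_training(agent, env, num_games=1000):
    opponent = copy.deepcopy(agent)                        # frozen copy of the current policy
    for game in range(num_games):
        trajectory, winner = env.play(agent, opponent)     # hypothetical two-player rollout
        # Both outcomes carry information: reinforce the winning side's moves,
        # discourage the losing side's.
        agent.update(trajectory, reward=+1 if winner == "agent" else -1)
        # Periodically refresh the opponent so the agent keeps facing a
        # comparable (slightly older) version of itself.
        if game % 100 == 0:
            opponent = copy.deepcopy(agent)
    return agent
```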
00:29:17 So far, self play has been largely around games
00:29:20 where there is natural opponents.
00:29:22 But if we could do self play for other things,
00:29:24 and let’s say, I don’t know,
00:29:25 a robot learns to build a house.
00:29:26 I mean, that’s a pretty advanced thing
00:29:28 to try to do for a robot,
00:29:29 but maybe it tries to build a hut or something.
00:29:31 If that can be done through self play,
00:29:34 it would learn a lot more quickly
00:29:35 if somebody can figure that out.
00:29:36 And I think that would be something
00:29:37 where it goes closer to kind of the mathematical leapfrogging
00:29:41 where somebody figures out a formalism to say,
00:29:43 okay, any RL problem by playing this and this idea,
00:29:47 you can turn it into a self play problem
00:29:48 where you get signal a lot more easily.
00:29:50 Reality is, many problems we don’t know
00:29:52 how to turn into self play.
00:29:53 And so either we need to provide detailed reward.
00:29:56 That doesn’t just reward for achieving a goal,
00:29:58 but rewards for making progress,
00:30:00 and that becomes time consuming.
00:30:02 And once you’re starting to do that,
00:30:03 let’s say you want a robot to do something,
00:30:05 you need to give all this detailed reward.
00:30:07 Well, why not just give a demonstration?
00:30:09 Because why not just show the robot?
00:30:11 And now the question is, how do you show the robot?
00:30:14 One way to show it is to teleoperate the robot,
00:30:16 and then the robot really experiences things.
00:30:19 And that’s nice, because that’s really high signal
00:30:21 to noise ratio data, and we’ve done a lot of that.
00:30:23 And you can teach your robot skills: in just 10 minutes,
00:30:26 you can teach your robot a new basic skill,
00:30:27 like okay, pick up the bottle, place it somewhere else.
00:30:30 That’s a skill, no matter where the bottle starts,
00:30:32 maybe it always goes onto a target or something.
00:30:34 That’s fairly easy to teach your robot with teleop.
00:30:38 Now, what’s even more interesting
00:30:40 if you can now teach your robot
00:30:41 through third person learning,
00:30:43 where the robot watches you do something
00:30:45 and doesn’t experience it, but just kind of watches you.
00:30:48 It doesn’t experience it, but just watches it
00:30:49 and says, okay, well, if you’re showing me that,
00:30:52 that means I should be doing this.
00:30:53 And I’m not gonna be using your hand,
00:30:55 because I don’t get to control your hand,
00:30:57 but I’m gonna use my hand, I do that mapping.
00:30:59 And so that’s where I think one of the big breakthroughs
00:31:02 has happened this year.
00:31:03 This was led by Chelsea Finn here.
00:31:06 It’s almost like learning a machine translation
00:31:08 for demonstrations, where you have a human demonstration,
00:31:11 and the robot learns to translate it
00:31:12 into what it means for the robot to do it.
00:31:15 And that was a meta learning formulation,
00:31:17 learn from one to get the other.
00:31:20 And that, I think, opens up a lot of opportunities
00:31:23 to learn a lot more quickly.
00:31:24 So my focus is on autonomous vehicles.
00:31:26 Do you think this approach of third person watching,
00:31:29 the autonomous driving is amenable
00:31:31 to this kind of approach?
00:31:33 So for autonomous driving,
00:31:36 I would say third person is slightly easier.
00:31:41 And the reason I’m gonna say it’s slightly easier
00:31:43 to do with third person is because
00:31:46 the car dynamics are very well understood.
00:31:49 So the…
00:31:51 Easier than first person, you mean?
00:31:53 Or easier than…
00:31:55 So I think the distinction between third person
00:31:57 and first person is not a very important distinction
00:32:00 for autonomous driving.
00:32:01 They’re very similar.
00:32:03 Because the distinction is really about
00:32:06 who turns the steering wheel.
00:32:09 Or maybe, let me put it differently.
00:32:12 How to get from a point where you are now
00:32:14 to a point, let’s say, a couple meters in front of you.
00:32:17 And that’s a problem that’s very well understood.
00:32:19 And that’s the only distinction
00:32:20 between third and first person there.
00:32:21 Whereas with the robot manipulation,
00:32:23 interaction forces are very complex.
00:32:25 And it’s still a very different thing.
00:32:27 For autonomous driving,
00:32:29 I think there is still the question,
00:32:31 imitation versus RL.
00:32:34 So imitation gives you a lot more signal.
00:32:36 I think where imitation is lacking
00:32:38 and needs some extra machinery is,
00:32:42 it doesn’t, in its normal format,
00:32:45 doesn’t think about goals or objectives.
00:32:48 And of course, there are versions of imitation learning,
00:32:51 inverse reinforcement learning type imitation learning,
00:32:52 which also think about goals.
00:32:54 I think then we’re getting much closer.
00:32:57 But I think it’s very hard to think of a
00:32:59 fully reactive car generalizing well,
00:33:04 if it really doesn’t have a notion of objectives,
00:33:05 to the kind of generality
00:33:08 that you would want.
00:33:09 You’d want more than just that reactivity
00:33:12 that you get from just behavioral cloning
00:33:13 slash supervised learning.
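A bare-bones sketch of behavioral cloning, the purely reactive baseline contrasted here with goal-aware imitation: a supervised map from observations to the expert's actions, with no explicit notion of objectives. The input and output dimensions, dataset format, and loss are illustrative assumptions.

```python
# Behavioral cloning: supervised learning on (observation, expert action) pairs.
import torch
import torch.nn as nn

policy = nn.Sequential(nn.Linear(64, 128), nn.ReLU(), nn.Linear(128, 2))  # e.g. steer, throttle
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-4)

def clone_step(observations, expert_actions):
    """One supervised step on a batch of human driving demonstrations."""
    pred = policy(observations)
    loss = nn.functional.mse_loss(pred, expert_actions)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```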
00:33:17 So a lot of the work,
00:33:19 whether it’s self play or even imitation learning,
00:33:22 would benefit significantly from simulation,
00:33:24 from effective simulation.
00:33:26 And you’re doing a lot of stuff
00:33:27 in the physical world and in simulation.
00:33:29 Do you have hope for greater and greater
00:33:33 power of simulation being boundless eventually
00:33:38 to where most of what we need to operate
00:33:40 in the physical world could be simulated
00:33:43 to a degree that’s directly transferable
00:33:46 to the physical world?
00:33:47 Or are we still very far away from that?
00:33:51 So I think we could even rephrase that question
00:33:57 in some sense.
00:33:58 Please.
00:34:00 And so the power of simulation, right?
00:34:04 As simulators get better and better,
00:34:06 of course, becomes stronger
00:34:08 and we can learn more in simulation.
00:34:11 But there’s also another version
00:34:12 which is where you say the simulator
00:34:13 doesn’t even have to be that precise.
00:34:15 As long as it’s somewhat representative
00:34:18 and instead of trying to get one simulator
00:34:21 that is sufficiently precise to learn in
00:34:23 and transfer really well to the real world,
00:34:25 I’m gonna build many simulators.
00:34:27 Ensemble of simulators?
00:34:28 Ensemble of simulators.
00:34:29 Not any single one of them is sufficiently representative
00:34:33 of the real world such that it would work
00:34:36 if you train in there.
00:34:37 But if you train in all of them,
00:34:40 then there is something that’s good in all of them.
00:34:43 The real world will just be another one of them
00:34:47 that’s not identical to any one of them
00:34:49 but just another one of them.
00:34:50 Another sample from the distribution of simulators.
00:34:53 Exactly.
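A rough sketch of the ensemble-of-simulators (domain randomization) idea: train one policy across many simulators with perturbed physics, so the real world looks like just another draw from the same distribution. All setter names, parameter ranges, and the rollout API below are hypothetical.

```python
# Train a single policy across an ensemble of randomized simulators.
import random

def make_randomized_sim(base_env_factory):
    """Build one simulator instance with randomly perturbed physics."""
    env = base_env_factory()
    env.set_friction(random.uniform(0.5, 1.5))        # hypothetical setters
    env.set_mass_scale(random.uniform(0.8, 1.2))
    env.set_actuation_delay(random.randint(0, 3))
    return env

def train_over_ensemble(agent, base_env_factory, num_iterations=10000):
    for _ in range(num_iterations):
        env = make_randomized_sim(base_env_factory)    # a fresh simulator each episode
        trajectory = env.rollout(agent)                # hypothetical rollout API
        agent.update(trajectory)                       # the policy must do well across all draws
    return agent
```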
00:34:54 We do live in a simulation,
00:34:54 so this is just one other one.
00:34:57 I’m not sure about that, but yeah.
00:35:01 It’s definitely a very advanced simulator if it is.
00:35:03 Yeah, it’s a pretty good one.
00:35:05 I’ve talked to Stuart Russell.
00:35:07 It’s something you think about a little bit too.
00:35:09 Of course, you’re really trying to build these systems,
00:35:12 but do you think about the future of AI?
00:35:13 A lot of people have concern about safety.
00:35:16 How do you think about AI safety?
00:35:18 As you build robots that are operating in the physical world,
00:35:21 what is, yeah, how do you approach this problem
00:35:25 in an engineering kind of way, in a systematic way?
00:35:29 So when a robot is doing things,
00:35:32 you kind of have a few notions of safety to worry about.
00:35:36 One is that the robot is physically strong
00:35:39 and of course could do a lot of damage.
00:35:42 Same for cars, which we can think of as robots too
00:35:44 in some way.
00:35:46 And this could be completely unintentional.
00:35:48 So it could be not the kind of longterm AI safety concerns
00:35:51 that, okay, AI is smarter than us and now what do we do?
00:35:54 But it could be just very practical.
00:35:55 Okay, this robot, if it makes a mistake,
00:35:58 what are the results going to be?
00:36:00 Of course, simulation comes in a lot there
00:36:02 to test in simulation. It’s a difficult question.
00:36:07 And I’m always wondering, like, I always wonder,
00:36:09 let’s say you look at, let’s go back to driving
00:36:12 because a lot of people know driving well, of course.
00:36:15 What do we do to test somebody for driving, right?
00:36:18 Get a driver’s license. What do they really do?
00:36:21 I mean, you fill out some tests and then you drive.
00:36:26 And I mean, it’s suburban California.
00:36:29 That driving test is just you drive around the block,
00:36:32 pull over, you do a stop sign successfully,
00:36:36 and then you pull over again and you’re pretty much done.
00:36:40 And you’re like, okay, if a self driving car did that,
00:36:44 would you trust it that it can drive?
00:36:46 And I’d be like, no, that’s not enough for me to trust it.
00:36:48 But somehow for humans, we’ve figured out
00:36:51 that somebody being able to do that is representative
00:36:55 of them being able to do a lot of other things.
00:36:57 And so I think somehow for humans,
00:36:59 we figured out representative tests
00:37:02 of what it means if you can do this, what you can really do.
00:37:05 Of course, testing humans,
00:37:07 humans don’t wanna be tested at all times.
00:37:09 Self driving cars or robots
00:37:10 could be tested more often probably.
00:37:11 You can have replicas that get tested
00:37:13 that are known to be identical
00:37:14 because they use the same neural net and so forth.
00:37:17 But still, I feel like we don’t have these kinds of unit tests
00:37:21 or proper tests for robots.
00:37:24 And I think there’s something very interesting
00:37:25 to be thought about there,
00:37:26 especially as you update things.
00:37:28 Your software improves,
00:37:29 you have a better self driving car suite, you update it.
00:37:32 How do you know it’s indeed more capable on everything
00:37:35 than what you had before,
00:37:37 that you didn’t have any bad things creep into it?
00:37:41 So I think that’s a very interesting direction of research
00:37:43 that there is no real solution yet,
00:37:46 except that somehow for humans we do.
00:37:47 Because we say, okay, you have a driving test, you passed,
00:37:50 you can go on the road now,
00:37:51 and humans have accidents every like a million
00:37:54 or 10 million miles, something pretty phenomenal
00:37:57 compared to that short test that is being done.
00:38:01 So let me ask, you’ve mentioned that Andrew Ng by example
00:38:06 showed you the value of kindness.
00:38:10 Do you think the space of policies,
00:38:14 good policies for humans and for AI
00:38:17 is populated by policies that with kindness
00:38:22 or ones that are the opposite, exploitation, even evil?
00:38:28 So if you just look at the sea of policies
00:38:30 we operate under as human beings,
00:38:32 or if AI system had to operate in this real world,
00:38:35 do you think it’s really easy to find policies
00:38:38 that are full of kindness,
00:38:39 like we naturally fall into them?
00:38:41 Or is it like a very hard optimization problem?
00:38:48 I mean, there is kind of two optimizations
00:38:50 happening for humans, right?
00:38:52 So for humans, there’s kind of the very long term
00:38:54 optimization which evolution has done for us
00:38:56 and we’re kind of predisposed to like certain things.
00:39:00 And that’s in some sense what makes our learning easier
00:39:02 because I mean, we know things like pain
00:39:05 and hunger and thirst.
00:39:08 And the fact that we know about those
00:39:10 is not something that we were taught, that’s kind of innate.
00:39:12 When we’re hungry, we’re unhappy.
00:39:14 When we’re thirsty, we’re unhappy.
00:39:16 When we have pain, we’re unhappy.
00:39:18 And ultimately evolution built that into us
00:39:21 to think about those things.
00:39:22 And so I think there is a notion that
00:39:24 it seems somehow humans evolved in general
00:39:28 to prefer to get along in some ways,
00:39:32 but at the same time also to be very territorial
00:39:36 and kind of centric to their own tribe.
00:39:41 Like it seems like that’s the kind of space
00:39:43 we converged onto.
00:39:44 I mean, I’m not an expert in anthropology,
00:39:46 but it seems like we’re very kind of good
00:39:49 within our own tribe, but need to be taught
00:39:52 to be nice to other tribes.
00:39:54 Well, if you look at Steven Pinker,
00:39:56 he highlights this pretty nicely in
00:40:00 The Better Angels of Our Nature,
00:40:02 where he talks about violence decreasing over time
00:40:04 consistently.
00:40:05 So whatever tension, whatever teams we pick,
00:40:08 it seems that the long arc of history
00:40:11 goes towards us getting along more and more.
00:40:14 So. I hope so.
00:40:15 So do you think that, do you think it’s possible
00:40:20 to teach RL based robots this kind of kindness,
00:40:26 this kind of ability to interact with humans,
00:40:28 this kind of policy, even to, let me ask a fun one.
00:40:32 Do you think it’s possible to teach RL based robot
00:40:35 to love a human being and to inspire that human
00:40:38 to love the robot back?
00:40:40 So, like an RL based algorithm that leads to a happy marriage.
00:40:47 That’s an interesting question.
00:40:48 Maybe I’ll answer it with another question, right?
00:40:52 Because I mean, but I’ll come back to it.
00:40:56 So another question you can have is okay.
00:40:58 I mean, how much happiness do some people get
00:41:03 from interacting with just a really nice dog?
00:41:07 Like, I mean, dogs, you come home,
00:41:09 that’s what dogs do.
00:41:10 They greet you, they’re excited,
00:41:12 makes you happy when you come home to your dog.
00:41:14 You’re just like, okay, this is exciting.
00:41:16 They’re always happy when I’m here.
00:41:18 And if they don’t greet you, cause maybe whatever,
00:41:21 your partner took them on a trip or something,
00:41:23 you might not be nearly as happy when you get home, right?
00:41:26 And so the kind of, it seems like the level of reasoning
00:41:30 a dog has is pretty sophisticated,
00:41:32 but then it’s still not yet at the level of human reasoning.
00:41:35 And so it seems like we don’t even need to achieve
00:41:37 human level reasoning to get like very strong affection
00:41:40 with humans.
00:41:41 And so my thinking is why not, right?
00:41:44 Why couldn’t, with an AI, couldn’t we achieve
00:41:47 the kind of level of affection that humans feel
00:41:51 among each other or with friendly animals and so forth?
00:41:57 So question, is it a good thing for us or not?
00:41:59 That’s another thing, right?
00:42:01 Because I mean, but I don’t see why not.
00:42:05 Why not, yeah, so Elon Musk says love is the answer.
00:42:09 Maybe he should say love is the objective function
00:42:12 and then RL is the answer, right?
00:42:14 Well, maybe.
00:42:17 Oh, Pieter, thank you so much.
00:42:18 I don’t want to take up more of your time.
00:42:20 Thank you so much for talking today.
00:42:21 Well, thanks for coming by.
00:42:23 Great to have you visit.