Sergey Levine: Robotics and Machine Learning #108

Transcript

00:00:00 The following is a conversation with Sergey Levine, a professor at Berkeley and a world

00:00:05 class researcher in deep learning, reinforcement learning, robotics, and computer vision, including

00:00:10 the development of algorithms for end to end training of neural network policies that combine

00:00:15 perception and control, scalable algorithms for inverse reinforcement learning, and, in

00:00:21 general, deep RL algorithms.

00:00:24 Quick summary of the ads.

00:00:25 Two sponsors, Cash App and ExpressVPN.

00:00:28 Please consider supporting the podcast by downloading Cash App and using code LexPodcast

00:00:34 and signing up at expressvpn.com slash lexpod.

00:00:38 Click the links, buy the stuff, it’s the best way to support this podcast and, in general,

00:00:44 the journey I’m on.

00:00:45 If you enjoy this thing, subscribe on YouTube, review it with 5 stars on Apple Podcasts, follow

00:00:51 on Spotify, support it on Patreon, or connect with me on Twitter at lexfridman.

00:00:57 As usual, I’ll do a few minutes of ads now and never any ads in the middle that can break

00:01:01 the flow of the conversation.

00:01:04 This show is presented by Cash App, the number one finance app in the App Store.

00:01:08 When you get it, use code lexpodcast.

00:01:11 Cash App lets you send money to friends, buy Bitcoin, and invest in the stock market with

00:01:15 as little as one dollar.

00:01:18 Since Cash App does fractional share trading, let me mention that the order execution algorithm

00:01:23 that works behind the scenes to create the abstraction of fractional orders is an algorithmic

00:01:29 marvel.

00:01:30 So, big props to the Cash App engineers for taking a step up to the next layer of abstraction

00:01:34 over the stock market, making trading more accessible for new investors and diversification

00:01:40 much easier.

00:01:41 So, again, if you get Cash App from the App Store or Google Play and use the code lexpodcast,

00:01:48 you get $10, and Cash App will also donate $10 to FIRST, an organization that is helping

00:01:54 to advance robotics and STEM education for young people around the world.

00:01:59 This show is also sponsored by ExpressVPN.

00:02:04 Get it at expressvpn.com slash lexpod to support this podcast and to get an extra three months

00:02:11 free on a one year package.

00:02:14 I’ve been using ExpressVPN for many years.

00:02:17 I love it.

00:02:18 I think ExpressVPN is the best VPN out there.

00:02:22 They told me to say it, but it happens to be true in my humble opinion.

00:02:26 It doesn’t log your data, it’s crazy fast, and it’s easy to use literally just one big

00:02:31 power on button.

00:02:32 Again, it’s probably obvious to you, but I should say it again, it’s really important

00:02:37 that they don’t log your data.

00:02:40 It works on Linux and every other operating system, but Linux, of course, is the best

00:02:45 operating system.

00:02:46 Shout out to my favorite flavor, Ubuntu MATE 20.04.

00:02:50 Once again, get it at expressvpn.com slash lexpod to support this podcast and to get

00:02:56 an extra three months free on a one year package.

00:03:00 And now, here’s my conversation with Sergey Levine.

00:03:05 What’s the difference between a state of the art human, such as you and I, well, I don’t

00:03:10 know if we qualify as state of the art humans, but a state of the art human and a state of

00:03:14 the art robot?

00:03:16 That’s a very interesting question.

00:03:19 Robot capability is, it’s kind of a, I think it’s a very tricky thing to understand because

00:03:26 there are some things that are difficult that we wouldn’t think are difficult and some things

00:03:29 that are easy that we wouldn’t think are easy.

00:03:33 And there’s also a really big gap between capabilities of robots in terms of hardware

00:03:37 and their physical capability and capabilities of robots in terms of what they can do autonomously.

00:03:43 There is a little video that I think robotics researchers really like to show, especially

00:03:47 robotics learning researchers like myself, from 2004 from Stanford, which demonstrates

00:03:53 a prototype robot called the PR1, and the PR1 was a robot that was designed as a home

00:03:58 assistance robot.

00:03:59 And there’s this beautiful video showing the PR1 tidying up a living room, putting away

00:04:03 toys and at the end bringing a beer to the person sitting on the couch, which looks really

00:04:10 amazing.

00:04:11 And then the punchline is that this robot is entirely controlled by a person.

00:04:16 So in some ways the gap between a state of the art human and state of the art robot,

00:04:20 if the robot has a human brain, is actually not that large.

00:04:23 Now obviously like human bodies are sophisticated and very robust and resilient in many ways,

00:04:28 but on the whole, if we’re willing to like spend a bit of money and do a bit of engineering,

00:04:32 we can kind of close the hardware gap almost.

00:04:35 But the intelligence gap, that one is very wide.

00:04:40 And when you say hardware, you’re referring to the physical, sort of the actuators, the

00:04:43 actual body of the robot, as opposed to the hardware on which the cognition runs, the hardware

00:04:49 of the nervous system.

00:04:50 Yes, exactly.

00:04:51 I’m referring to the body rather than the mind.

00:04:54 So that means that the work is kind of cut out for us.

00:04:56 Like while we can still make the body better, we kind of know that the big bottleneck right

00:05:00 now is really the mind.

00:05:02 And how big is that gap?

00:05:03 How big is the difference in your sense of ability to learn, ability to reason, ability

00:05:11 to perceive the world between humans and our best robots?

00:05:16 The gap is very large and the gap becomes larger the more unexpected events can happen

00:05:23 in the world.

00:05:24 So essentially the spectrum along which you can measure the size of that gap is the spectrum

00:05:30 of how open the world is.

00:05:32 If you control everything in the world very tightly, if you put the robot in like a factory

00:05:36 and you tell it where everything is and you rigidly program its motion, then it can do

00:05:41 things, you know, one might even say in a superhuman way.

00:05:43 It can move faster, it’s stronger, it can lift up a car and things like that.

00:05:47 But as soon as anything starts to vary in the environment, now it’ll trip up.

00:05:51 And if many, many things vary like they would like in your kitchen, for example, then things

00:05:55 are pretty much like wide open.

00:05:57 Now, again, we’re going to stick a bit on the philosophical questions, but how much

00:06:03 on the human side of the cognitive abilities in your sense is nature versus nurture?

00:06:11 So how much of it is a product of evolution and how much of it is something we’ll learn

00:06:18 from sort of scratch from the day we’re born?

00:06:22 I’m going to read into your question as asking about the implications of this for AI.

00:06:26 Because I’m not a biologist, I can’t really like speak authoritatively.

00:06:30 So let’s go with that. If it’s all about learning, then there’s more hope

00:06:36 for AI.

00:06:38 So the way that I look at this is that, you know, well, first, of course, biology is very

00:06:44 messy.

00:06:45 And if you ask the question, how does a person do something or how does a person’s mind

00:06:49 do something, you can come up with a bunch of hypotheses and oftentimes you can find

00:06:54 support for many different, often conflicting hypotheses.

00:06:58 One way that we can approach the question of what the implications of this for AI are

00:07:03 is we can think about what’s sufficient.

00:07:05 So you know, maybe a person is from birth very, very good at some things like, for example,

00:07:11 recognizing faces.

00:07:12 There’s a very strong evolutionary pressure to do that.

00:07:13 If you can recognize your mother’s face, then you’re more likely to survive and therefore

00:07:18 people are good at this.

00:07:20 But we can also ask like, what’s the minimum sufficient thing?

00:07:23 And one of the ways that we can study the minimal sufficient thing is we could, for

00:07:27 example, see what people do in unusual situations.

00:07:29 If you present them with things that evolution couldn’t have prepared them for, you know,

00:07:33 our daily lives actually do this to us all the time.

00:07:36 We didn’t evolve to deal with, you know, automobiles and space flight and whatever.

00:07:41 So there are all these situations that we can find ourselves in and we do very well

00:07:45 there.

00:07:46 Like I can give you a joystick to control a robotic arm, which you’ve never used before

00:07:50 and you might be pretty bad for the first couple of seconds.

00:07:52 But if I tell you like your life depends on using this robotic arm to like open this door,

00:07:58 you’ll probably manage it.

00:07:59 Even though you’ve never seen this device before, you’ve never used this kind of joystick

00:08:03 controller, and you’ll kind of muddle through it.

00:08:04 And that’s not your evolved natural ability.

00:08:08 That’s your flexibility, your adaptability.

00:08:11 And that’s exactly where our current robotic systems really kind of fall flat.

00:08:14 But I wonder how much general knowledge, almost what we think of as common sense, is pre-trained

00:08:22 underneath all of that.

00:08:24 So that ability to adapt to a joystick requires you to have a kind of, you know,

00:08:32 I’m human.

00:08:33 So it’s hard for me to introspect all the knowledge I have about the world, but it seems

00:08:37 like there might be an iceberg underneath of the amount of knowledge we actually bring

00:08:42 to the table.

00:08:43 That’s kind of the open question.

00:08:45 There’s absolutely an iceberg of knowledge that we bring to the table, but I think it’s

00:08:48 very likely that iceberg of knowledge is actually built up over our lifetimes.

00:08:54 Because we have, you know, we have a lot of prior experience to draw on.

00:08:58 And it kind of makes sense that the right way for us to, you know, to optimize our,

00:09:05 our efficiency, our evolutionary fitness and so on is to utilize all of that experience

00:09:10 to build up the best iceberg we can get.

00:09:13 And, you know, while that sounds an awful lot like what machine

00:09:16 learning actually does, I think that for modern machine learning, it’s actually a really big

00:09:20 challenge to take this unstructured mass of experience and distill out something that

00:09:25 looks like a common sense understanding of the world.

00:09:28 And perhaps part of that is not because something about machine learning itself is

00:09:32 broken or hard, but because we’ve been a little too rigid in subscribing to a very

00:09:38 supervised, very rigid notion of learning, you know, kind of the input output, X’s go

00:09:42 to Y’s sort of model.

00:09:43 And maybe what we really need to do is to view the world more as like a mass of experience

00:09:51 that is not necessarily providing any rigid supervision, but sort of providing many, many

00:09:55 instances of things that could be.

00:09:56 And then you take that and you distill it into some sort of common sense understanding.

00:10:00 I see. You’re painting an optimistic, beautiful picture, especially from the robotics

00:10:06 perspective because that means we just need to invest and build better learning algorithms,

00:10:12 figure out how we can get access to more and more data for those learning algorithms to

00:10:17 extract signal from, and then accumulate that iceberg of knowledge.

00:10:22 It’s a beautiful picture.

00:10:23 It’s a hopeful one.

00:10:25 I think it’s potentially a little bit more than just that.

00:10:29 And this is, this is where we perhaps reach the limits of our current understanding.

00:10:32 But one thing that I think that the research community hasn’t really resolved in a satisfactory

00:10:37 way is how much it matters where that experience comes from, like, you know, do you just like

00:10:43 download everything on the internet and cram it into essentially the 21st century analog

00:10:48 of the giant language model and then see what happens or does it actually matter whether

00:10:54 your machine physically experiences the world or in the sense that it actually attempts

00:10:59 things, observes the outcome of its actions and kind of augments its experience that way.

00:11:03 And it chooses which parts of the world it gets to interact with and observe and learn

00:11:09 from.

00:11:10 Right.

00:11:11 It may be that the world is so complex that simply obtaining a large mass of sort of

00:11:16 IID samples of the world is a very difficult way to go.

00:11:21 But if you are actually interacting with the world and essentially performing this sort

00:11:25 of hard negative mining by attempting what you think might work, observing the sometimes

00:11:30 happy and sometimes sad outcomes of that and augmenting your understanding using that experience

00:11:35 and you’re just doing this continually for many years, maybe that sort of data in some

00:11:40 sense is actually much more favorable to obtaining a common sense understanding.

00:11:44 One reason we might think that this is true is that, you know, what we associate with

00:11:49 common sense or lack of common sense is often characterized by the ability to reason about

00:11:55 kind of counterfactual questions like, you know, here’s this bottle of water

00:12:01 sitting on the table, everything is fine. If I were to knock it over, which I’m not going

00:12:04 to do.

00:12:05 But if I were to do that, what would happen?

00:12:07 And I know that nothing good would happen from that.

00:12:10 But if I have a bad understanding of the world, I might think that that’s a good way for me

00:12:14 to like, you know, gain more utility.

00:12:16 If I actually go about my daily life doing the things that my current understanding of

00:12:22 the world suggests will give me high utility, in some ways, I’ll get exactly the right supervision

00:12:28 to tell me not to do those bad things and to keep doing the good things.

00:12:33 So there’s a spectrum between IID, a random walk through the space of data, and then there’s

00:12:39 what we humans do. I don’t even know if we do it optimally, but that might be beyond.

00:12:45 So for this open question that you raised, where do you think intelligent systems

00:12:52 that would be able to deal with this world fall?

00:12:56 Can we do pretty well by reading all of Wikipedia, sort of randomly sampling it like language

00:13:02 models do?

00:13:03 Or do we have to be exceptionally selective and intelligent about which aspects of the

00:13:09 world we interact with?

00:13:12 So I think this is first an open scientific problem, and I don’t have like a clear answer,

00:13:15 but I can speculate a little bit.

00:13:18 And what I would speculate is that you don’t need to be super, super careful.

00:13:23 I think it’s less about like, being careful to avoid the useless stuff, and more about

00:13:28 making sure that you hit on the really important stuff.

00:13:31 So perhaps it’s okay, if you spend part of your day, just, you know, guided by your curiosity,

00:13:37 reading interesting regions of your state space, but it’s important for you to, you

00:13:42 know, every once in a while, make sure that you really try out the solutions that your

00:13:47 current model of the world suggests might be effective, and observe whether those solutions

00:13:51 are working as you expect or not.

00:13:53 And perhaps some of that is really essential to have kind of a perpetual improvement loop.

00:13:59 This perpetual improvement loop is really like, that’s really the key, the key that’s

00:14:03 going to potentially distinguish the best current methods from the best methods of tomorrow

00:14:07 in a sense.
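To make that perpetual improvement loop concrete, here is a minimal sketch in Python. It assumes a gym-style environment interface and hypothetical choose_action and update_model helpers; it illustrates the loop being described, not any specific algorithm from the conversation.

```python
# Minimal sketch of a perpetual improvement loop: act on what the current
# model thinks is best, observe the (sometimes happy, sometimes sad) outcome,
# and fold that experience back into the model. The env follows a gym-style
# interface and choose_action / update_model are hypothetical placeholders.

def perpetual_improvement_loop(env, model, choose_action, update_model, num_steps=1000):
    dataset = []
    obs = env.reset()
    for _ in range(num_steps):
        action = choose_action(model, obs)               # try what currently looks best
        next_obs, reward, done, info = env.step(action)  # observe the real outcome
        dataset.append((obs, action, reward, next_obs))
        model = update_model(model, dataset)             # refine the model with that experience
        obs = env.reset() if done else next_obs
    return model
```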

00:14:08 How important do you think exploration is, or totally out of the box exploration,

00:14:15 in this space, where you jump to totally different domains?

00:14:19 So you kind of mentioned there’s an optimization problem, you kind of explore the specifics

00:14:24 of a particular strategy, whatever the thing is you’re trying to solve.

00:14:27 How important is it to explore totally outside of the strategies that have been working for

00:14:33 you so far?

00:14:34 What’s your intuition there?

00:14:35 Yeah, I think it’s a very problem dependent kind of question.

00:14:38 And I think that that’s actually, you know, in some ways that question gets at one of

00:14:45 the big differences between sort of the classic formulation of a reinforcement learning problem

00:14:51 and some of the sort of more open ended reformulations of that problem that have been explored in

00:14:57 recent years.

00:14:58 So classically reinforcement learning is framed as a problem of maximizing utility, like any

00:15:02 kind of rational AI agent, and then anything you do is in service to maximizing that utility.

00:15:08 But a very interesting kind of way to look at, I’m not necessarily saying this is the

00:15:15 best way to look at it, but an interesting alternative way to look at these problems

00:15:17 is as something where you first get to explore the world, however you please, and then afterwards

00:15:24 you will be tasked with doing something.

00:15:26 And that might suggest a somewhat different solution.

00:15:28 So if you don’t know what you’re going to be tasked with doing, and you just want to

00:15:31 prepare yourself optimally for whatever your uncertain future holds, maybe then you will

00:15:35 choose to attain some sort of coverage, build up sort of an arsenal of cognitive tools,

00:15:41 if you will, such that later on when someone tells you, now your job is to fetch the coffee

00:15:46 for me, you will be well prepared to undertake that task.

00:15:49 And do you see that as the modern formulation of the reinforcement learning problem, as

00:15:54 a kind of the more multitask, the general intelligence kind of formulation.

00:16:00 I think that’s one possible vision of where things might be headed.

00:16:04 I don’t think that’s by any means the mainstream or standard way of doing things, and it’s

00:16:08 not like if I had to…

00:16:09 But I like it.

00:16:10 It’s a beautiful vision.

00:16:11 So maybe you actually take a step back.

00:16:14 What is the goal of robotics?

00:16:16 What’s the general problem of robotics we’re trying to solve?

00:16:18 You actually kind of painted two pictures here.

00:16:21 One of sort of the narrow, one of the general.

00:16:23 What in your view is the big problem of robotics?

00:16:26 And a ridiculously philosophical, high level question.

00:16:31 I think that maybe there are two ways I can answer this question.

00:16:34 One is there’s a very pragmatic problem, which is like what would make robots, what would

00:16:41 sort of maximize the usefulness of robots?

00:16:44 And there the answer might be something like a system that can perform whatever

00:16:53 task a human user sets for it, within the physical constraints, of course.

00:16:59 If you tell it to teleport to another planet, it probably can’t do that.

00:17:02 But if you ask it to do something that’s within its physical capability, then potentially

00:17:06 with a little bit of additional training or a little bit of additional trial and error,

00:17:10 it ought to be able to figure it out in much the same way as like a human teleoperator

00:17:14 ought to figure out how to drive the robot to do that.

00:17:16 That’s kind of the very pragmatic view of what it would take to kind of solve the robotics

00:17:22 problem, if you will.

00:17:24 But I think that there is a second answer, and that answer is a lot closer to why I want

00:17:29 to work on robotics, which is that I think it’s less about what it would take to do a

00:17:34 really good job in the world of robotics, but more the other way around, what robotics

00:17:39 can bring to the table to help us understand artificial intelligence.

00:17:44 So your dream fundamentally is to understand intelligence?

00:17:48 Yes.

00:17:49 And I think that’s the dream for many people who actually work in this space.

00:17:53 I think that there’s something very pragmatic and very useful about studying robotics, but

00:17:58 I do think that a lot of people that go into this field actually, you know, the things

00:18:02 that they draw inspiration from are the potential for robots to like help us learn about intelligence

00:18:09 and about ourselves.

00:18:10 So that’s fascinating that robotics is basically the space by which you can get closer to understanding

00:18:18 the fundamentals of artificial intelligence.

00:18:20 So what is it about robotics that’s different from some of the other approaches?

00:18:25 So if we look at some of the early breakthroughs in deep learning or in the computer vision

00:18:30 space and the natural language processing, there’s really nice clean benchmarks that

00:18:34 a lot of people competed on and thereby came up with a lot of brilliant ideas.

00:18:38 What’s the fundamental difference to you between computer vision purely defined and ImageNet

00:18:43 and kind of the bigger robotics problem?

00:18:46 So there are a couple of things.

00:18:48 One is that with robotics, you kind of have to take away many of the crutches.

00:18:55 So you have to deal with both the particular problems of perception control and so on,

00:19:01 but you also have to deal with the integration of those things.

00:19:04 And you know, classically, we’ve always thought of the integration as kind of a separate problem.

00:19:08 So a classic kind of modular engineering approach is that we solve the individual subproblems

00:19:12 and then wire them together and then the whole thing works.

00:19:16 And one of the things that we’ve been seeing over the last couple of decades is that, well,

00:19:19 maybe studying the thing as a whole might lead to just like very different solutions

00:19:24 than if we were to study the parts and wire them together.

00:19:26 So the integrative nature of robotics research helps us see, you know, the different perspectives

00:19:32 on the problem.

00:19:34 Another part of the answer is that with robotics, it casts a certain paradox into very sharp

00:19:40 relief.

00:19:41 This is sometimes referred to as Moravec’s paradox, the idea that in artificial intelligence,

00:19:48 things that are very hard for people can be very easy for machines and vice versa.

00:19:52 Things that are very easy for people can be very hard for machines.

00:19:54 So you know, integral and differential calculus is pretty difficult to learn for people.

00:20:02 But if you program a computer to do it, it can derive derivatives and integrals for you all

00:20:06 day long without any trouble.

00:20:08 Whereas some things like, you know, drinking from a cup of water, very easy for a person

00:20:13 to do, very hard for a robot to deal with.

00:20:16 And sometimes when we see such blatant discrepancies, that gives us a really strong hint that we’re

00:20:21 missing something important.

00:20:23 So if we really try to zero in on those discrepancies, we might find that little bit that we’re missing.

00:20:28 And it’s not that we need to make machines better or worse at math and better at drinking

00:20:32 water, but just that by studying those discrepancies, we might find some new insight.

00:20:37 So that could be in any space, it doesn’t have to be robotics.

00:20:41 But you’re saying, I mean, it’s kind of interesting that robotics seems to have a lot of those

00:20:48 discrepancies.

00:20:49 So the Hans Moravec paradox is probably referring to the space of the physical interaction,

00:20:56 like you said, object manipulation, walking, all the kind of stuff we do in the physical

00:21:00 world.

00:21:01 How do you make sense of it? If you were to try to disentangle the Moravec paradox, why is

00:21:13 there such a gap in our intuition about it?

00:21:17 Why do you think manipulating objects is so hard from everything you’ve learned from applying

00:21:23 reinforcement learning in this space?

00:21:25 Yeah, I think that one reason is maybe that for many of the other problems that we’ve

00:21:33 studied in AI and computer science and so on, the notion of input output and supervision

00:21:41 is much, much cleaner.

00:21:42 So computer vision, for example, deals with very complex inputs.

00:21:45 But it’s comparatively a bit easier, at least up to some level of abstraction, to cast it

00:21:52 as a very tightly supervised problem.

00:21:54 It’s comparatively much, much harder to cast robotic manipulation as a very tightly supervised

00:21:59 problem.

00:22:00 You can do it, it just doesn’t seem to work all that well.

00:22:03 So you could say that, well, maybe we get a labeled data set where we know exactly which

00:22:06 motor commands to send, and then we train on that.

00:22:09 But for various reasons, that’s not actually such a great solution.

00:22:13 And it also doesn’t seem to be even remotely similar to how people and animals learn to

00:22:17 do things, because we’re not told by our parents, here’s how you fire your muscles in order

00:22:22 to walk.

00:22:24 So we do get some guidance, but the really low level detailed stuff we figure out mostly

00:22:29 on our own.

00:22:30 And that’s what you mean by tightly supervised, that every single little sub action gets a

00:22:34 supervision signal of whether it’s a good one or not.

00:22:37 Right.

00:22:38 So while in computer vision, you could sort of imagine up to a level of abstraction that

00:22:41 maybe somebody told you this is a car and this is a cat and this is a dog, in motor

00:22:45 control, it’s very clear that that was not the case.

00:22:49 If we look at sort of the sub spaces of robotics, that, again, as you said, robotics integrates

00:22:57 all of them together, and we get to see how this beautiful mess interplays.

00:23:00 But so there’s nevertheless still perception.

00:23:04 So it’s the computer vision problem, broadly speaking, understanding the environment.

00:23:09 And there’s also, maybe you can correct me on this kind of categorization of the space,

00:23:14 prediction, trying to anticipate what things are going to do in the future

00:23:20 in order for you to be able to act in that world.

00:23:24 And then there’s also this game theoretic aspect of how your actions will change the

00:23:31 behavior of others.

00:23:34 In this kind of space, what, and this is bigger than reinforcement learning, this is just

00:23:38 broadly looking at the problem of robotics, what’s the hardest problem here?

00:23:42 Or is what you said true, that when you start to look at all of them together,

00:23:52 that’s a whole other thing, like you can’t even say which one individually is harder

00:23:57 because you should only be looking at them all together.

00:24:01 I think when you look at them all together, some things actually become easier.

00:24:05 And I think that’s actually pretty important.

00:24:07 So back in 2014, we had some work, basically our first work on end to end reinforcement

00:24:16 learning for robotic manipulation skills from vision, which at the time was something that

00:24:21 seemed a little inflammatory and controversial in the robotics world.

00:24:25 But other than the inflammatory and controversial part of it, the point that we were actually

00:24:30 trying to make in that work is that for the particular case of combining perception and

00:24:35 control, you could actually do better if you treat them together than if you try to separate

00:24:39 them.

00:24:40 And the way that we tried to demonstrate this is we picked a fairly simple motor control

00:24:43 task where a robot had to insert a little red trapezoid into a trapezoidal hole.

00:24:49 And we had our separated solution, which involved first detecting the hole using a pose detector

00:24:54 and then actuating the arm to put it in.

00:24:57 And then our end to end solution, which just mapped pixels to the torques.

00:25:01 And one of the things we observed is that if you use the end to end solution, essentially

00:25:05 the pressure on the perception part of the model is actually lower.

00:25:08 Like it doesn’t have to figure out exactly where the thing is in 3D space.

00:25:11 It just needs to figure out where it is, you know, distributing the errors in such a way

00:25:15 that the horizontal difference matters more than the vertical difference because vertically

00:25:19 it just pushes it down all the way until it can’t go any further.

00:25:22 And there, perceptual errors are a lot less harmful, whereas perpendicular to the direction

00:25:26 of motion, perceptual errors are much more harmful.

00:25:29 So the point is that if you combine these two things, you can trade off errors between

00:25:33 the components optimally to best accomplish the task.

00:25:38 And the components can actually be weaker while still leading to better overall performance.
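A schematic sketch of the two pipelines being contrasted, purely as an illustration rather than the actual 2014 system: a separated pipeline estimates the hole pose and hands it to a controller tuned in isolation, while an end to end pipeline is a single trained function from pixels to torques, so perceptual errors can be traded off wherever they hurt the task least. All function names here are placeholders.

```python
# Schematic contrast between the two approaches; every function here is a
# placeholder, not the actual 2014 system.

def separated_pipeline(image, pose_detector, controller):
    """Modular approach: solve perception and control as separate problems."""
    hole_pose = pose_detector(image)  # full 3D pose estimate, optimized in isolation
    return controller(hole_pose)      # torques computed from that estimate

def end_to_end_pipeline(image, policy_network):
    """End to end approach: one trained mapping from pixels to torques, so
    perception errors can be distributed wherever they hurt the task least,
    e.g. tolerating more error along the insertion direction."""
    return policy_network(image)
```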

00:25:41 It’s a profound idea.

00:25:44 I mean, in the space of pegs and things like that, it’s quite simple.

00:25:48 It almost is tempting to overlook, but that seems to be at least intuitively an idea that

00:25:55 should generalize to basically all aspects of perception and control, that one strengthens

00:26:01 the other.

00:26:02 Yeah.

00:26:03 And we, you know, people who have studied sort of perceptual heuristics in humans and

00:26:07 animals find things like that all the time.

00:26:08 So one very well known example of this is something called the gaze heuristic, which

00:26:12 is a little trick that you can use to intercept a flying object.

00:26:17 So if you want to catch a ball, for instance, you could try to localize it in 3D space,

00:26:21 estimate its velocity, estimate the effect of wind resistance, solve a complex system

00:26:25 of differential equations in your head.

00:26:27 Or you can adjust your running speed so that the object stays in the same position in

00:26:33 your field of view.

00:26:34 So if it dips a little bit, you speed up.

00:26:35 If it rises a little bit, you slow down.

00:26:38 And if you follow the simple rule, you’ll actually arrive at exactly the place where

00:26:40 the object lands and you’ll catch it.

00:26:43 And humans use it when they play baseball, human pilots use it when they fly airplanes

00:26:46 to figure out if they’re about to collide with somebody, frogs use this to catch insects

00:26:50 and so on and so on.

00:26:51 So this is something that actually happens in nature.

00:26:53 And I’m sure this is just one instance of it that we were able to identify just because

00:26:57 it’s so prevalent, but there are probably

00:27:00 many others.
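As a rough sketch, the gaze heuristic can be written as a simple feedback rule. The tracker that supplies the ball's apparent elevation angle and the gain value are assumptions for illustration, not a claim about how people or frogs implement it.

```python
def gaze_heuristic_speed(current_speed, elevation_angle, prev_elevation_angle, gain=2.0):
    """Adjust running speed so the object's apparent elevation stays roughly constant.

    If the ball appears to dip in the field of view, speed up; if it appears to
    rise, slow down. No 3D localization, velocity estimate, or physics model is
    needed; the elevation angles are assumed to come from some visual tracker.
    """
    angle_change = elevation_angle - prev_elevation_angle
    # dip (negative change) -> accelerate; rise (positive change) -> decelerate
    return max(0.0, current_speed - gain * angle_change)
```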

00:27:01 Do you have a, just so we can zoom in as we talk about robotics, do you have a canonical

00:27:06 problem, sort of a simple, clean, beautiful representative problem in robotics that you

00:27:12 think about when you’re thinking about some of these problems?

00:27:16 We talked about robotic manipulation, to me that seems intuitively, at least the robotics

00:27:23 community has converged towards that as a space that’s the canonical problem.

00:27:28 If you agree, then maybe do you zoom in in some particular aspect of that problem that

00:27:33 you just like?

00:27:34 Like if we solve that problem perfectly, it’ll unlock a major step towards human level intelligence.

00:27:44 I don’t think I have like a really great answer to that.

00:27:46 And I think partly the reason I don’t have a great answer kind of has to do with the,

00:27:53 it has to do with the fact that the difficulty is really in the flexibility and adaptability

00:27:57 rather than in doing a particular thing really, really well.

00:28:01 So it’s hard to just say like, oh, if you can, I don’t know, like shuffle a deck of

00:28:06 cards as fast as like a Vegas casino dealer, then you’ll be very proficient.

00:28:12 It’s really the ability to quickly figure out how to do some arbitrary new thing well

00:28:21 enough to like, you know, to move on to the next arbitrary thing.

00:28:26 But for the source of newness and uncertainty, have you found problems in which it’s easy

00:28:33 to generate newness?

00:28:38 New types of newness.

00:28:40 Yeah.

00:28:41 So a few years ago, so if you had asked me this question around like 2016, maybe I would

00:28:46 have probably said that robotic grasping is a really great example of that because it’s

00:28:51 a task with great real world utility.

00:28:54 Like you will get a lot of money if you can do it well.

00:28:57 What is robotic grasping?

00:28:58 Picking up any object with a robotic hand.

00:29:02 Exactly.

00:29:03 So you will get a lot of money if you do it well, because lots of people want to run warehouses

00:29:06 with robots and it’s highly non trivial because very different objects will require very different

00:29:13 grasping strategies.

00:29:15 But actually since then, people have gotten really good at building systems to solve this

00:29:19 problem to the point where I’m not actually sure how much more progress we can make with

00:29:25 that as like the main guiding thing.

00:29:29 But it’s kind of interesting to see the kind of methods that have actually worked well

00:29:32 in that space because robotic grasping classically used to be regarded very much as kind of almost

00:29:39 like a geometry problem.

00:29:41 So people who have studied the history of computer vision will find this very familiar

00:29:46 that it’s kind of in the same way that in the early days of computer vision, people

00:29:49 thought of it very much as like an inverse graphics thing.

00:29:52 In robotic grasping, people thought of it as an inverse physics problem essentially.

00:29:57 You look at what’s in front of you, figure out the shapes, then use your best estimate

00:30:01 of the laws of physics to figure out where to put your fingers, and then you pick up the thing.

00:30:05 And it turns out that what works really well for robotic grasping, instantiated in many different

00:30:10 recent works, including our own but also ones from many other labs, is to use learning

00:30:15 methods with some combination of either exhaustive simulation or like actual real world trial

00:30:21 and error.

00:30:22 And it turns out that those things actually work really well and then you don’t have to

00:30:24 worry about solving geometry problems or physics problems.

00:30:29 What are, just by the way, in the grasping, what are the difficulties that have been worked

00:30:35 on?

00:30:36 So one is like the materials of things, maybe occlusions on the perception side.

00:30:41 Why is it such a difficult, why is picking stuff up such a difficult problem?

00:30:45 Yeah, it’s a difficult problem because the number of things that you might have to deal

00:30:50 with or the variety of things that you have to deal with is extremely large.

00:30:54 And oftentimes things that work for one class of objects won’t work for other classes of

00:30:59 objects.

00:31:00 So if you, if you get really good at picking up boxes and now you have to pick up plastic

00:31:05 bags, you know, you just need to employ a very different strategy.

00:31:09 And there are many properties of objects beyond just their geometry that have

00:31:15 to do with, you know, the bits that are easier to pick up, the bits that are hard to pick

00:31:19 up, the bits that are more flexible, the bits that will cause the thing to pivot and bend

00:31:23 and drop out of your hand versus the bits that result in a nice secure grasp.

00:31:28 Things that are flexible, things that if you pick them up the wrong way, they’ll fall upside

00:31:31 down and the contents will spill out.

00:31:33 So there are all these little details that come up, but the task can still kind of be characterized

00:31:38 as one task.

00:31:39 Like there’s a very clear notion of you did it or you didn’t do it.

00:31:43 So in terms of spilling things, there creeps in this notion that starts to sound and feel

00:31:50 like common sense reasoning.

00:31:53 Do you think solving the general problem of robotics requires common sense reasoning,

00:32:01 requires general intelligence, this kind of human level capability of, you know, like

00:32:09 you said, be robust and deal with uncertainty, but also be able to sort of reason and assimilate

00:32:14 different pieces of knowledge that you have?

00:32:17 Yeah.

00:32:18 What are your thoughts on the need

00:32:23 for common sense reasoning in the space of the general robotics problem?

00:32:28 So I’m going to slightly dodge that question and say that I think maybe actually it’s the

00:32:32 other way around: studying robotics can help us understand how to put common sense

00:32:38 into our AI systems.

00:32:40 One way to think about common sense, and about why our current systems might lack common

00:32:45 sense, is that common sense is an emergent property of actually having to interact with

00:32:51 a particular world, a particular universe, and get things done in that universe.

00:32:56 So you might think that, for instance, like an image captioning system, maybe it looks

00:33:01 at pictures of the world and it types out English sentences.

00:33:05 So it kind of deals with our world.

00:33:09 And then you can easily construct situations where image captioning systems do things that

00:33:12 defy common sense, like give it a picture of a person wearing a fur coat and it’ll say

00:33:16 it’s a teddy bear.

00:33:18 But I think what’s really happening in those settings is that the system doesn’t actually

00:33:22 live in our world.

00:33:24 It lives in its own world that consists of pixels and English sentences and doesn’t actually

00:33:28 consist of having to put on a fur coat in the winter so you don’t get cold.

00:33:33 So perhaps the reason for the disconnect is that the systems that we have now simply inhabit

00:33:39 a different universe.

00:33:40 And if we build AI systems that are forced to deal with all of the messiness and complexity

00:33:45 of our universe, maybe they will have to acquire common sense to essentially maximize their

00:33:50 utility.

00:33:51 Whereas the systems we’re building now don’t have to do that.

00:33:53 They can take some shortcuts.

00:33:56 That’s fascinating.

00:33:57 You’ve a couple of times already sort of reframed the role of robotics in this whole thing.

00:34:02 And for some reason, I don’t know if my way of thinking is common, but I thought like

00:34:08 we need to understand and solve intelligence in order to solve robotics.

00:34:13 And you’re kind of framing it as, no, robotics is one of the best ways to just study artificial

00:34:18 intelligence and build sort of like, robotics is like the right space in which you get to

00:34:24 explore some of the fundamental learning mechanisms, fundamental sort of multimodal multitask aggregation

00:34:33 of knowledge mechanisms that are required for general intelligence.

00:34:36 It’s a really interesting way to think about it, but let me ask about learning.

00:34:41 Can the general sort of robotics, the epitome of the robotics problem be solved purely through

00:34:47 learning, perhaps end to end learning, sort of learning from scratch as opposed to injecting

00:34:55 human expertise and rules and heuristics and so on?

00:35:00 I think that in terms of the spirit of the question, I would say yes.

00:35:04 I mean, I think that though in some ways it’s maybe like an overly sharp dichotomy, I think

00:35:12 that in some ways when we build algorithms, at some point a person does something, a person

00:35:20 turned on the computer, a person implemented TensorFlow.

00:35:26 But yeah, I think that in terms of the point that you’re getting at, I do think the answer

00:35:29 is yes.

00:35:30 I think that we can solve many problems that have previously required meticulous manual

00:35:36 engineering through automated optimization techniques.

00:35:40 And actually one thing I will say on this topic is I don’t think this is actually a

00:35:43 very radical or very new idea.

00:35:45 I think people have been thinking about automated optimization techniques as a way to do control

00:35:51 for a very, very long time.

00:35:53 And in some ways what’s changed is really more the name.

00:35:58 So today we would say that, oh, my robot does machine learning, it does reinforcement learning.

00:36:03 Maybe in the 1960s you’d say, oh, my robot is doing optimal control.

00:36:08 And maybe the difference between typing out a system of differential equations and doing

00:36:12 feedback linearization versus training a neural net, maybe it’s not such a large difference.

00:36:17 It’s just pushing the optimization deeper and deeper into the thing.

00:36:21 Well, it’s interesting you think that way, but especially with deep learning, the

00:36:28 accumulation of sort of experiences in data form into deep representations starts to

00:36:35 feel like knowledge as opposed to optimal control.

00:36:38 So this feels like there’s an accumulation of knowledge through the learning process.

00:36:42 Yes.

00:36:43 Yeah.

00:36:44 So I think that is a good point.

00:36:45 That one big difference between learning based systems and classic optimal control systems

00:36:49 is that learning based systems in principle should get better and better the more they

00:36:53 do something.

00:36:54 Right.

00:36:55 And I do think that that’s actually a very, very powerful difference.

00:36:58 So if we look back at the world of expert systems and symbolic AI and so on of using

00:37:04 logic to accumulate expertise, human expertise, human encoded expertise, do you think that

00:37:11 will have a role at some point?

00:37:13 The deep learning, machine learning, reinforcement learning has shown incredible results and

00:37:20 breakthroughs and just inspired thousands, maybe millions of researchers.

00:37:26 But there’s this less popular now, but it used to be popular idea of symbolic AI.

00:37:32 Do you think that will have a role?

00:37:35 I think in some ways the descendants of symbolic AI actually already have a role.

00:37:44 So this is the highly biased history from my perspective.

00:37:49 You say that, well, initially we thought that rational decision making involves logical

00:37:53 manipulation.

00:37:54 So you have some model of the world expressed in terms of logic.

00:37:59 You have some query, like what action do I take in order for X to be true?

00:38:04 And then you manipulate your logical symbolic representation to get an answer.

00:38:08 What that turned into somewhere in the 1990s is, well, instead of building kind of predicates

00:38:14 and statements that have true or false values, we’ll build probabilistic systems where things

00:38:20 have probabilities associated and probabilities of being true and false.

00:38:23 And that turned into Bayes nets.

00:38:25 And that provided sort of a boost to what were really still essentially logical inference

00:38:30 systems, just probabilistic logical inference systems.

00:38:33 And then people said, well, let’s actually learn the individual probabilities inside

00:38:37 these models.

00:38:39 And then people said, well, let’s not even specify the nodes in the models, let’s just

00:38:43 put a big neural net in there.

00:38:45 But in many ways, I see these as actually kind of descendants from the same idea.

00:38:48 It’s essentially instantiating rational decision making by means of some inference process

00:38:54 and learning by means of an optimization process.

00:38:57 So in a sense, I would say, yes, that it has a place.

00:39:00 And in many ways that place is, it already holds that place.

00:39:04 It’s already in there.

00:39:05 Yeah.

00:39:06 It’s just quite different.

00:39:07 It looks slightly different than it was before.

00:39:09 Yeah.

00:39:10 But there are some things that we can think about that make this a little bit more obvious.

00:39:13 Like if I train a big neural net model to predict what will happen in response to my

00:39:17 robot’s actions, and then I run probabilistic inference, meaning I invert that model to

00:39:22 figure out the actions that lead to some plausible outcome, like to me, that seems like a kind

00:39:26 of logic.

00:39:27 You have a model of the world that just happens to be expressed by a neural net, and you are

00:39:32 doing some inference procedure, some sort of manipulation on that model to figure out

00:39:37 the answer to a query that you have.
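Here is a toy sketch of that kind of inference by inverting a model: sample candidate action sequences, roll each one through a learned dynamics model, and keep the sequence whose predicted outcome best matches the desired one. The random-shooting search and the linear toy model are illustrative assumptions, not the procedure from any particular paper.

```python
import numpy as np

def plan_by_model_inversion(dynamics_model, start_state, goal_state,
                            horizon=10, num_candidates=1000, action_dim=2, seed=0):
    """Search for an action sequence whose predicted final state is near the goal."""
    rng = np.random.default_rng(seed)
    goal = np.asarray(goal_state, dtype=float)
    candidates = rng.uniform(-1.0, 1.0, size=(num_candidates, horizon, action_dim))
    best_actions, best_dist = None, np.inf
    for actions in candidates:
        state = np.asarray(start_state, dtype=float)
        for action in actions:
            state = dynamics_model(state, action)  # what the model says will happen
        dist = np.linalg.norm(state - goal)        # how close the predicted outcome is
        if dist < best_dist:
            best_actions, best_dist = actions, dist
    return best_actions

# Toy usage with a linear stand-in for a learned model: next_state = state + action.
toy_model = lambda state, action: state + action
plan = plan_by_model_inversion(toy_model, start_state=[0.0, 0.0], goal_state=[3.0, -1.0])
print(plan.sum(axis=0))  # total planned displacement, which should land near (3, -1)
```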

00:39:39 It’s the interpretability.

00:39:41 It’s the explainability, though, that seems to be lacking more so because the nice thing

00:39:46 about sort of expert systems is you can follow the reasoning of the system that to us mere

00:39:52 humans is somehow compelling.

00:39:56 It’s just, I don’t know what to make of this fact that there’s a human desire for intelligent

00:40:04 systems to be able to convey to us in a poetic way why they made the decisions they did, like

00:40:12 tell a convincing story.

00:40:15 And perhaps that’s like a silly human thing, like we shouldn’t expect that of intelligent

00:40:22 systems.

00:40:23 I’m super happy that there are intelligent systems out there.

00:40:27 But if I were to sort of psychoanalyze the researchers at the time, I would say expert

00:40:33 systems connected to that part, that desire of AI researchers for systems to be explainable.

00:40:40 I mean, maybe on that topic, do you have a hope that sort of inferences of learning based

00:40:48 systems will be as explainable as the dream was with expert systems, for example?

00:40:55 I think it’s a very complicated question because I think that in some ways the question of

00:40:59 explainability is kind of very closely tied to the question of like performance, like,

00:41:07 you know, why do you want your system to explain itself so that when it screws up, you can

00:41:11 kind of figure out why it did it.

00:41:14 But in some ways that’s a much bigger problem, actually.

00:41:17 Like your system might screw up and then it might screw up in how it explains itself.

00:41:22 Or you might have some bug somewhere so that it’s not actually doing what it was supposed

00:41:26 to do.

00:41:27 So, you know, maybe a good way to view that problem is really as a problem, as a bigger

00:41:32 problem of verification and validation, of which explainability is sort of one component.

00:41:38 I see.

00:41:39 I just see it differently.

00:41:41 I see explainability, you put it beautifully, I think you actually summarized the field of

00:41:45 explainability.

00:41:46 But to me, there’s another aspect of explainability, which is like storytelling, that has nothing

00:41:52 to do with errors, or rather, it uses errors as elements of its story as opposed to a fundamental

00:42:05 need to be explainable when errors occur.

00:42:08 It’s just that for other intelligent systems to be in our world, we seem to want to tell

00:42:12 each other stories.

00:42:14 And that’s true in the political world, that’s true in the academic world.

00:42:19 And that, you know, neural networks are less capable of doing that, or perhaps they’re

00:42:24 equally capable of storytelling.

00:42:26 Maybe it doesn’t matter what the fundamentals of the system are.

00:42:30 You just need to be a good storyteller.

00:42:32 Maybe one specific story I can tell you about in that space is actually about some work

00:42:38 that was done by my former collaborator, who’s now a professor at MIT named Jacob Andreas.

00:42:43 Jacob actually works in natural language processing, but he had this idea to do a little bit of

00:42:47 work in reinforcement learning on how natural language can basically structure the internals

00:42:53 of policies trained with RL.

00:42:55 And one of the things he did is he set up a model that attempts to perform some task

00:43:01 that’s defined by a reward function, but the model reads in a natural language instruction.

00:43:06 So this is a pretty common thing to do in instruction following.

00:43:08 So you tell it like, you know, go to the red house and then it’s supposed to go to the red house.

00:43:13 But then one of the things that Jacob did is he treated that sentence, not as a command

00:43:18 from a person, but as a representation of the internal kind of a state of the mind of

00:43:25 this policy, essentially.

00:43:26 So that when it was faced with a new task, what it would do is it would basically try

00:43:30 to think of possible language descriptions, attempt to do them and see if they led to

00:43:34 the right outcome.

00:43:35 So it would kind of think out loud, like, you know, I’m faced with this new task.

00:43:38 What am I going to do?

00:43:39 Let me go to the red house.

00:43:40 Oh, that didn’t work.

00:43:41 Let me go to the blue room or something.

00:43:43 Let me go to the green plant.

00:43:45 And once it got some reward, it would say, oh, go to the green plant.

00:43:47 That’s what’s working.

00:43:48 I’m going to go to the green plant.

00:43:49 And then you could look at the string that it came up with, and that was a description

00:43:51 of how it thought it should solve the problem.

00:43:54 So you could basically incorporate language as internal state and you can start

00:43:58 getting some handle on these kinds of things.
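A hedged sketch of that adaptation loop, with a gym-style environment and an instruction-conditioned policy as placeholders; this illustrates the idea described above, not the actual system from that work.

```python
# Hedged sketch of the adaptation loop: try candidate language instructions
# through an instruction-conditioned policy and keep whichever string earns
# reward on the new task. The env is a gym-style placeholder.

def adapt_by_language_search(env, instruction_policy, candidate_instructions,
                             episodes_per_candidate=3):
    best_instruction, best_return = None, float("-inf")
    for instruction in candidate_instructions:
        total_return = 0.0
        for _ in range(episodes_per_candidate):
            obs, done = env.reset(), False
            while not done:
                action = instruction_policy(obs, instruction)  # e.g. "go to the red house"
                obs, reward, done, info = env.step(action)
                total_return += reward
        if total_return > best_return:
            best_instruction, best_return = instruction, total_return
    # The winning string doubles as a readable description of the strategy.
    return best_instruction
```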

00:44:01 And then what I was kind of trying to get at is, what if you also add to the reward

00:44:05 function the convincingness of that story?

00:44:10 So I have another reward signal of like people who review that story, how much they like

00:44:15 it.

00:44:16 So that, you know, initially that could be a hyperparameter sort of hard coded heuristic

00:44:22 type of thing, but it’s an interesting notion of the convincingness of the story becoming

00:44:30 part of the reward function, the objective function of the explainability.

00:44:34 In the world of sort of Twitter and fake news, that might be a scary notion, that

00:44:40 the nature of truth may not be as important as how convincing

00:44:45 you are in telling the story around the facts.

00:44:49 Well, let me ask the basic question.

00:44:55 You’re one of the world class researchers in reinforcement learning, deep reinforcement

00:44:58 learning, certainly in the robotic space.

00:45:01 What is reinforcement learning?

00:45:04 I think that what reinforcement learning refers to today is really just the kind of the modern

00:45:09 incarnation of learning based control.

00:45:13 So classically reinforcement learning has a much more narrow definition, which is that

00:45:16 it’s literally learning from reinforcement, like the thing does something and then it

00:45:20 gets a reward or punishment.

00:45:22 But really I think the way the term is used today is it’s used to refer more broadly to

00:45:26 learning based control.

00:45:28 So some kind of system that’s supposed to be controlling something and it uses data

00:45:33 to get better.

00:45:34 And what does control mean?

00:45:35 So this action is the fundamental element there.

00:45:38 It means making rational decisions.

00:45:41 And rational decisions are decisions that maximize a measure of utility.

00:45:44 And sequentially, so you make decisions time and time and time again.

00:45:48 Now like it’s easier to see that kind of idea in the space of maybe games and the space

00:45:54 of robotics.

00:45:55 Do you see it bigger than that?

00:45:58 Is it applicable?

00:45:59 Like where are the limits of the applicability of reinforcement learning?

00:46:04 Yeah, so rational decision making is essentially the encapsulation of the AI problem viewed

00:46:12 through a particular lens.

00:46:13 So any problem that we would want a machine to do, an intelligent machine, can likely

00:46:18 be represented as a decision making problem.

00:46:20 Classifying images is a decision making problem, although not a sequential one typically.

00:46:26 Controlling a chemical plant is a decision making problem.

00:46:30 Deciding what videos to recommend on YouTube is a decision making problem.

00:46:34 And one of the really appealing things about reinforcement learning is if it does encapsulate

00:46:39 the range of all these decision making problems, perhaps working on reinforcement learning

00:46:43 is one of the ways to reach a very broad swath of AI problems.

00:46:50 What is the fundamental difference between reinforcement learning and maybe supervised

00:46:55 machine learning?

00:46:57 So reinforcement learning can be viewed as a generalization of supervised machine learning.

00:47:02 You can certainly cast supervised learning as a reinforcement learning problem.

00:47:05 You can just say your loss function is the negative of your reward.

00:47:09 But you have stronger assumptions.

00:47:10 You have the assumption that someone actually told you what the correct answer was, that

00:47:14 your data was IID and so on.

00:47:16 So you could view reinforcement learning as essentially relaxing some of those assumptions.

00:47:20 Now that’s not always a very productive way to look at it because if you actually have

00:47:22 a supervised learning problem, you’ll probably solve it much more effectively by using supervised

00:47:26 learning methods because it’s easier.

00:47:29 But you can view reinforcement learning as a generalization of that.
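A toy illustration of that reduction: treat the input as the state, the prediction as the action, and define the reward as the negative of the loss. The names and the squared-error example are made up for illustration.

```python
def supervised_as_rl_reward(predict, loss_fn, x, y_true):
    """One supervised example cast as a one-step decision: the input is the
    state, the prediction is the action, and the reward is the negative loss."""
    y_pred = predict(x)
    return -loss_fn(y_pred, y_true)

# Example with squared error as the loss function.
predict = lambda x: 2.0 * x
squared_error = lambda y_hat, y: (y_hat - y) ** 2
print(supervised_as_rl_reward(predict, squared_error, x=1.5, y_true=2.0))  # -1.0
```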

00:47:32 No, for sure.

00:47:33 But they’re fundamentally different.

00:47:36 That’s a mathematical statement.

00:47:37 That’s absolutely correct.

00:47:38 But it seems that with reinforcement learning, the kind of tools we bring to the table

00:47:43 today are different.

00:47:44 So maybe down the line, everything will be a reinforcement learning problem.

00:47:49 Just like you said, image classification should be mapped to a reinforcement learning problem.

00:47:53 But today, the tools and ideas, the way we think about them are different, sort of supervised

00:48:01 learning has been used very effectively to solve basic narrow AI problems.

00:48:07 Reinforcement learning kind of represents the dream of AI.

00:48:11 It’s very much so in the research space now in sort of captivating the imagination of

00:48:17 people of what we can do with intelligent systems, but it hasn’t yet had as wide of

00:48:22 an impact as the supervised learning approaches.

00:48:25 So my question comes from the more practical sense, like what do you see as the gap between

00:48:32 the more general reinforcement learning and the very specific, yes, it’s decision

00:48:38 making with one step in the sequence, supervised learning?

00:48:43 So from a practical standpoint, I think that one thing that is potentially a little tough

00:48:49 now, and this is I think something that we’ll see, this is a gap that we might see closing

00:48:53 over the next couple of years, is the ability of reinforcement learning algorithms to effectively

00:48:57 utilize large amounts of prior data.

00:49:00 So one of the reasons why it’s a bit difficult today to use reinforcement learning for all

00:49:05 the things that we might want to use it for is that in most of the settings where we want

00:49:10 to do rational decision making, it’s a little bit tough to just deploy some policy that

00:49:15 does crazy stuff and learns purely through trial and error.

00:49:18 It’s much easier to collect a lot of data, a lot of logs of some other policy that you’ve

00:49:23 got, and then maybe if you can get a good policy out of that, then you deploy it and

00:49:28 let it kind of fine tune a little bit.

00:49:30 But algorithmically, it’s quite difficult to do that.

00:49:33 So I think that once we figure out how to get reinforcement learning to bootstrap effectively

00:49:37 from large data sets, then we’ll see very, very rapid growth in applications of these

00:49:44 technologies.

00:49:45 So this is what’s referred to as off policy reinforcement learning or offline RL or batch

00:49:48 RL.

00:49:50 And I think we’re seeing a lot of research right now that’s bringing us closer and closer

00:49:53 to that.

00:49:54 Can you maybe paint the picture of the different methods?

00:49:57 So you said off policy, what’s value based reinforcement learning?

00:50:02 What’s policy based?

00:50:03 What’s model based?

00:50:04 What’s off policy, on policy?

00:50:05 What are the different categories of reinforcement learning?

00:50:07 Okay.

00:50:08 So one way we can think about reinforcement learning is that it’s, in some very fundamental

00:50:14 way, it’s about learning models that can answer kind of what if questions.

00:50:20 So what would happen if I take this action that I hadn’t taken before?

00:50:24 And you do that, of course, from experience, from data.

00:50:26 And oftentimes you do it in a loop.

00:50:28 So you build a model that answers these what if questions, use it to figure out the best

00:50:32 action you can take, and then go and try taking that and see if the outcome agrees with what

00:50:36 you predicted.

00:50:38 So the different kinds of techniques basically refer to different ways of doing it.

00:50:43 So model based methods answer a question of what state you would get, basically what would

00:50:48 happen to the world if you were to take a certain action.

00:50:50 Value based methods, they answer the question of what value you would get, meaning what

00:50:55 utility you would get.

00:50:57 But in a sense, they’re not really all that different because they’re both really just

00:51:00 answering these what if questions.
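
A rough illustration of that distinction (a hand-written sketch; the linear models and names here are placeholders, not any particular algorithm). A model-based method learns to answer "what state would I get?", while a value-based method learns to answer "what utility would I get?":

    import numpy as np

    def dynamics_model(state, action, W):
        # Model-based "what if": predict the next state for a candidate action.
        return W @ np.concatenate([state, action])

    def q_function(state, action, w):
        # Value-based "what if": predict the total utility of taking the action.
        return float(w @ np.concatenate([state, action]))

    state = np.array([0.1, -0.3])
    action = np.array([1.0])
    W = np.random.randn(2, 3) * 0.1   # stand-in for a learned dynamics model
    w = np.random.randn(3) * 0.1      # stand-in for a learned Q-function

    predicted_next_state = dynamics_model(state, action, W)
    predicted_value = q_function(state, action, w)

Both are fit from experience and then queried in a loop to pick actions, as described above.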

00:51:03 Now unfortunately for us, with current machine learning methods, answering what if questions

00:51:07 can be really hard because they are really questions about things that didn’t happen.

00:51:12 If you wanted to answer what if questions about things that did happen, you wouldn’t

00:51:14 need a learned model.

00:51:15 You would just repeat the thing that worked before.

00:51:19 And that’s really a big part of why RL is a little bit tough.

00:51:23 So if you have a purely on policy kind of online process, then you ask these what if

00:51:28 questions, you make some mistakes, then you go and try doing those mistaken things.

00:51:33 And then you observe kind of the counter examples that will teach you not to do those things

00:51:36 again.

00:51:37 If you have a bunch of off policy data and you just want to synthesize the best policy

00:51:42 you can out of that data, then you really have to deal with the challenges of making

00:51:46 these counterfactual predictions.

00:51:47 First of all, what’s a policy?

00:51:50 A policy is a model or some kind of function that maps from observations of the world to

00:51:59 actions.

00:52:00 So in reinforcement learning, we often refer to the current configuration of the world

00:52:05 as the state.

00:52:06 So we say the state kind of encompasses everything you need to fully define where the world is

00:52:10 at the moment.

00:52:11 And depending on how we formulate the problem, we might say you either get to see the state

00:52:15 or you get to see an observation, which is some snapshot, some piece of the state.

00:52:19 So the policy just includes everything it needs in order to be able to act in this world.

00:52:25 Yes.
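
In code form, that mapping is all a policy is; a tiny illustrative sketch (not from the conversation):

    import numpy as np

    def policy(observation, weights):
        # Map an observation (a snapshot or piece of the state) to a
        # distribution over discrete actions, then sample one.
        logits = weights @ observation
        probs = np.exp(logits - logits.max())
        probs /= probs.sum()
        return np.random.choice(len(probs), p=probs)

    observation = np.array([0.2, -0.1, 0.7])  # e.g., a few sensor readings
    weights = np.random.randn(4, 3) * 0.1     # 4 actions, 3 observation dimensions
    action = policy(observation, weights)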

00:52:26 And so what does off policy mean?

00:52:29 Yeah, so the terms on policy and off policy refer to how you get your data.

00:52:33 So if you get your data from somebody else who was doing some other stuff, maybe you

00:52:37 get your data from some manually programmed system that was just running in the world

00:52:43 before that’s referred to as off policy data.

00:52:46 But if you got the data by actually acting in the world based on what your current policy

00:52:50 thinks is good, we call that on policy data.

00:52:53 And obviously on policy data is more useful to you because if your current policy makes

00:52:58 some bad decisions, you will actually see that those decisions are bad.

00:53:01 Off policy data, however, might be much easier to obtain because maybe that’s all the logged

00:53:06 data that you have from before.

00:53:08 So we talked about offline, we talked about autonomous vehicles, so you can envision off policy kinds

00:53:14 of approaches in the robotics space, where there’s already a ton of robots out there, but they

00:53:19 don’t get the luxury of being able to explore based on a reinforcement learning framework.

00:53:26 So how do we make, again, open question, but how do we make off policy methods work?

00:53:32 Yeah.

00:53:33 So this is something that has been kind of a big open problem for a while.

00:53:37 And in the last few years, people have made a little bit of progress on that.

00:53:41 You know, I can tell you about, and it’s not by any means solved yet, but I can tell you

00:53:44 some of the things that, for example, we’ve done to try to address some of the challenges.

00:53:49 It turns out that one really big challenge with off policy reinforcement learning is

00:53:53 that you can’t really trust your models to give accurate predictions for any possible

00:53:59 action.

00:54:00 So if I’ve never tried to, if in my data set I never saw somebody steering the car off

00:54:05 the road onto the sidewalk, my value function or my model is probably not going to predict

00:54:11 the right thing if I ask what would happen if I were to steer the car off the road onto

00:54:14 the sidewalk.

00:54:15 So one of the important things you have to do to get off policy RL to work is you have

00:54:20 to be able to figure out whether a given action will result in a trustworthy prediction or

00:54:24 not.

00:54:25 And you can use distribution estimation methods, density estimation methods,

00:54:31 to try to figure that out.

00:54:32 So you could figure out that, well, this action, my model is telling me that it’s great, but

00:54:35 it looks totally different from any action I’ve taken before, so my model is probably

00:54:38 not correct.

00:54:39 And you can incorporate regularization terms into your learning objective that will essentially

00:54:45 tell you not to ask those questions that your model is unable to answer.
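
One simple way to instantiate that idea (a sketch in the spirit of density-regularized offline RL; the kernel density estimate and all constants here are illustrative, not Levine's actual method): estimate how likely a candidate action is under the data you have, and penalize the value of actions the data cannot vouch for.

    import numpy as np

    def behavior_log_density(action, logged_actions, bandwidth=0.2):
        # Crude kernel density estimate of how likely this action is under the
        # policy that collected the data; a real system might use a learned model.
        diffs = logged_actions - action
        kernels = np.exp(-0.5 * np.sum(diffs ** 2, axis=1) / bandwidth ** 2)
        return np.log(kernels.mean() + 1e-12)

    def regularized_value(q_value, action, logged_actions, weight=1.0, threshold=-5.0):
        # Trust the Q-value only for actions that look like the data; otherwise
        # subtract a penalty so the learner stops asking questions its model
        # cannot answer.
        log_density = behavior_log_density(action, logged_actions)
        penalty = weight * max(0.0, threshold - log_density)
        return q_value - penalty

    logged_actions = np.random.randn(1000, 2) * 0.1       # actions seen in the logs
    familiar = np.array([0.05, -0.02])                    # steering like the data
    unfamiliar = np.array([3.0, 3.0])                     # "off the road onto the sidewalk"
    print(regularized_value(1.0, familiar, logged_actions))    # roughly the raw Q-value
    print(regularized_value(1.0, unfamiliar, logged_actions))  # heavily penalized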

00:54:50 What would lead to breakthroughs in this space, do you think?

00:54:54 Like what’s needed?

00:54:55 Is this a data set question?

00:54:57 Do we need to collect big benchmark data sets that allow us to explore the space?

00:55:03 Is it a new kinds of methodologies?

00:55:08 Like what’s your sense?

00:55:09 Or maybe coming together in a space of robotics and defining the right problem to be working

00:55:14 on?

00:55:15 I think for off policy reinforcement learning in particular, it’s very much an algorithms

00:55:18 question right now.

00:55:19 And this is something that I think is great because an algorithms question is one that

00:55:25 just takes some very smart people to get together and think about it really hard, whereas if

00:55:29 it was like a data problem or a hardware problem, that would take some serious engineering.

00:55:34 So that’s why I’m pretty excited about that problem because I think that we’re in a position

00:55:38 where we can make some real progress on it just by coming up with the right algorithms.

00:55:42 In terms of what those algorithms could be, the problems at their core are very related

00:55:47 to problems in things like causal inference.

00:55:51 Because what you’re really dealing with is situations where you have a model, a statistical

00:55:55 model, that’s trying to make predictions about things that it hadn’t seen before.

00:56:00 And if it’s a model that’s generalizing properly, that’ll make good predictions.

00:56:04 If it’s a model that picks up on spurious correlations, that will not generalize properly.

00:56:09 And then you have an arsenal of tools you can use.

00:56:11 You could, for example, figure out what are the regions where it’s trustworthy, or on

00:56:15 the other hand, you could try to make it generalize better somehow, or some combination of the

00:56:18 two.

00:56:20 Is there room for mixing where most of it, like 90, 95% is off policy, you already have

00:56:30 the data set, and then you get to send the robot out to do a little exploration?

00:56:36 What’s that role of mixing them together?

00:56:38 Yeah, absolutely.

00:56:39 I think that this is something that you actually described very well at the beginning of our

00:56:45 discussion when you talked about the iceberg.

00:56:47 This is the iceberg.

00:56:48 The 99% of your prior experience, that’s your iceberg.

00:56:51 You’d use that for off policy reinforcement learning.

00:56:54 And then, of course, if you’ve never opened that particular kind of door with that particular

00:56:59 lock before, then you have to go out and fiddle with it a little bit.

00:57:02 And that’s that additional 1% to help you figure out a new task.

00:57:05 And I think that’s actually a pretty good recipe going forward.
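
Schematically, that recipe is just two phases (a structural sketch only; every function below is a stand-in, not a real update rule): bootstrap from the large logged dataset, then spend a small budget of real interaction on the new task.

    import random

    def offline_update(params, batch):
        # Stand-in for an off-policy RL update computed purely from logged data.
        return params

    def online_update(params, transition):
        # Stand-in for an update from one freshly collected real-world transition.
        return params

    def collect_transition(params):
        # Stand-in for running the current policy on the robot for one step.
        return (0, 0, 0.0, 0)  # (state, action, reward, next_state)

    logged_dataset = [(0, 0, 0.0, 0)] * 10_000  # the iceberg: 99% prior experience

    params = {"weights": 0.0}
    for _ in range(1000):                        # phase 1: learn from the logs
        params = offline_update(params, random.sample(logged_dataset, k=32))
    for _ in range(50):                          # phase 2: the extra 1% of fiddling
        params = online_update(params, collect_transition(params))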

00:57:08 Is this, to you, the most exciting space of reinforcement learning now?

00:57:12 Or, maybe taking a step back, not just now, but what, to you, is

00:57:18 the most beautiful idea, apologies for the romanticized question, but the most beautiful idea

00:57:23 or concept in reinforcement learning?

00:57:27 In general, I actually think that one of the things that is a very beautiful idea in reinforcement

00:57:32 learning is just the idea that you can obtain a near optimal controller or a near optimal policy

00:57:41 without actually having a complete model of the world.

00:57:45 This is, you know, it’s something that feels perhaps kind of obvious if you just hear the

00:57:53 term reinforcement learning or you think about trial and error learning.

00:57:55 But from a controls perspective, it’s a very weird thing because classically, you know,

00:58:01 we think about engineered systems and controlling engineered systems as the problem of writing

00:58:07 down some equations and then figuring out given these equations, you know, basically

00:58:11 solve for X, figure out the thing that maximizes its performance.

00:58:16 And the theory of reinforcement learning actually gives us a mathematically principled framework

00:58:21 to think, to reason about, you know, optimizing some quantity when you don’t actually know

00:58:27 the equations that govern that system.

00:58:28 And to me, that actually seems kind of, you know, very elegant, not something

00:58:35 that sort of becomes immediately obvious, at least in the mathematical sense.

00:58:40 Does it make sense to you that it works at all?

00:58:42 Well, I think it makes sense when you take some time to think about it, but it is a little

00:58:48 surprising.

00:58:49 Well, then taking a step into the deeper representations, which is also very surprising,

00:58:56 sort of the richness of the state space, the space of environments that this kind of

00:59:04 approach can operate in, can you maybe say what is deep reinforcement learning?

00:59:10 Well, deep reinforcement learning simply refers to taking reinforcement learning algorithms

00:59:16 and combining them with high capacity neural net representations.

00:59:20 Which is, you know, kind of, it might at first seem like a pretty arbitrary thing, just take

00:59:24 these two components and stick them together.

00:59:26 But the reason that it’s something that has become so important in recent years is that

00:59:32 reinforcement learning, it kind of faces an exacerbated version of a problem that has

00:59:38 faced many other machine learning techniques.

00:59:40 So if we go back to like, you know, the early two thousands or the late nineties, we’ll

00:59:45 see a lot of research on machine learning methods that have some very appealing mathematical

00:59:50 properties, like they reduce to convex optimization problems, for instance, but they require very

00:59:56 special inputs.

00:59:57 They require a representation of the input that is clean in some way.

01:00:01 Like for example, clean in the sense that the classes in your multi class classification

01:00:06 problems separate linearly.

01:00:07 So they have some kind of good representation and we call this a feature representation.

01:00:12 And for a long time, people were very worried about features in the world of supervised

01:00:15 learning because somebody had to actually build those features so you couldn’t just

01:00:18 take an image and plug it into your logistic regression or your SVM or something.

01:00:22 You had to take that image and process it using some handwritten code.

01:00:26 And then neural nets came along and they could actually learn the features and suddenly we

01:00:30 could apply learning directly to the raw inputs, which was great for images, but it was even

01:00:35 more great for all the other fields where people hadn’t come up with good features yet.

01:00:40 And one of those fields was actually reinforcement learning, because in reinforcement learning,

01:00:43 the notion of features, if you don’t use neural nets and you have to design your own features

01:00:46 is very, very opaque.

01:00:48 Like it’s very hard to imagine, let’s say I’m playing chess or go.

01:00:53 What is a feature with which I can represent the value function for go or even the optimal

01:00:58 policy for go linearly?

01:00:59 Like I don’t even know how to start thinking about it.

01:01:03 And people tried all sorts of things; they would write down, you know, an expert chess

01:01:06 player looks for whether the knight is in the middle of the board or not.

01:01:09 So that’s a feature: knight in middle of board.

01:01:11 And they would write these like long lists of kind of arbitrary made up stuff.

01:01:15 And that was really kind of getting us nowhere.

01:01:17 And that’s a little, chess is a little more accessible than the robotics problem.

01:01:21 Absolutely.

01:01:22 Right.

01:01:23 There are at least experts in the different features for chess, but still the neural

01:01:30 network there, to me, I mean, you put it eloquently and almost made it seem

01:01:35 like a natural step to add neural networks, but the fact that neural networks are able

01:01:41 to discover features in the control problem, it’s very interesting.

01:01:45 It’s hopeful.

01:01:46 I’m not sure what to think about it, but it feels hopeful that the control problem has

01:01:51 features to be learned.

01:01:54 Like I guess my question is, is it surprising to you how far the deep side of deep reinforcement

01:02:02 learning was able to go, what the space of problems it has been able to tackle, especially

01:02:07 in games with AlphaStar and AlphaZero, and just the representation power there and in

01:02:17 the robotics space, and what is your sense of the limits of this representation power

01:02:23 in the control context?

01:02:26 I think that in regard to the limits here, one thing that makes it

01:02:32 a little hard to fully answer this question is that in settings where we would like

01:02:39 to push these things to the limit, we encounter other bottlenecks.

01:02:44 So like the reason that I can’t get my robot to learn how to, I don’t know, do the

01:02:51 dishes in the kitchen, it’s not because its neural net is not big enough.

01:02:56 It’s because when you try to actually do trial and error learning, reinforcement learning,

01:03:02 directly in the real world where you have the potential to gather these large, highly

01:03:07 varied and complex data sets, you start running into other problems.

01:03:11 Like one problem you run into very quickly, it’ll first sound like a very pragmatic problem,

01:03:16 but it actually turns out to be a pretty deep scientific problem.

01:03:19 Take the robot, put it in your kitchen, have it try to learn to do the dishes with trial

01:03:22 and error.

01:03:23 It’ll break all your dishes and then we’ll have no more dishes to clean.

01:03:27 Now you might think this is a very practical issue, but there’s something to this, which

01:03:30 is that if you have a person trying to do this, a person will have some degree of common

01:03:33 sense.

01:03:34 They’ll break one dish, they’ll be a little more careful with the next one, and if they

01:03:37 break all of them, they’re going to go and get more or something like that.

01:03:41 So there’s all sorts of scaffolding that comes very naturally to us for our learning process.

01:03:46 Like if I have to learn something through trial and error, I have the common sense to

01:03:50 know that I have to try multiple times.

01:03:53 If I screw something up, I ask for help or I reset things or something like that.

01:03:57 And all of that is kind of outside of the classic reinforcement learning problem formulation.

01:04:02 There are other things that can also be categorized as kind of scaffolding, but are very important.

01:04:07 Like for example, where do you get your reward function?

01:04:09 If I want to learn how to pour a cup of water, well, how do I know if I’ve done it correctly?

01:04:15 Now that probably requires an entire computer vision system to be built just to determine

01:04:18 that, and that seems a little bit inelegant.

01:04:21 So there are all sorts of things like this that start to come up when we think through

01:04:24 what we really need to get reinforcement learning to happen at scale in the real world.

01:04:28 And many of these things actually suggest a little bit of a shortcoming in the problem

01:04:32 formulation and a few deeper questions that we have to resolve.

01:04:36 That’s really interesting.

01:04:37 I talked to David Silver about AlphaZero, and it seems like there’s no, again, we haven’t

01:04:45 hit the limit at all in the context where there’s no broken dishes.

01:04:50 So in the case of Go, it’s really about just scaling compute.

01:04:55 So again, the bottleneck is the amount of money you’re willing to invest in compute,

01:05:00 and then maybe the scaffolding around how difficult it is to scale compute,

01:05:06 but there’s no limit there.

01:05:09 And it’s interesting, now we move to the real world and there are the broken dishes

01:05:12 and the reward function, like you mentioned; that’s really nice.

01:05:17 So how do we push forward there?

01:05:19 Do you think there’s this kind of a sample efficiency question that people bring

01:05:25 up of, you know, not having to break a hundred thousand dishes?

01:05:30 Is this an algorithm question?

01:05:33 Is this a data selection kind of question?

01:05:37 What do you think?

01:05:38 How do we, how do we not break too many dishes?

01:05:41 Yeah.

01:05:42 Well, one way we can think about that is that maybe we need to be better at reusing

01:05:51 our data, building that iceberg.

01:05:54 So perhaps it’s too much to hope that a machine that’s in isolation,

01:06:02 in a vacuum without anything else, can just master complex tasks in minutes the

01:06:07 way that people do, but perhaps it also doesn’t have to; perhaps what it really needs to do

01:06:10 is have an existence, a lifetime where it does many things, and the previous things that

01:06:16 it has done prepare it to do new things more efficiently.

01:06:20 And you know, the study of these kinds of questions typically falls under categories

01:06:24 like multitask learning or meta learning, but they all fundamentally deal with the same

01:06:29 general theme, which is use experience for doing other things to learn to do new things

01:06:35 efficiently and quickly.

01:06:37 So what do you think about if we just look at the one particular case study of Tesla

01:06:41 Autopilot, which is quickly approaching a million vehicles on the road, where some

01:06:48 percentage of the time, 30, 40% of the time, it is driven using the computer vision, multitask

01:06:54 HydraNet, right?

01:06:57 That’s what they call it, HydraNet.

01:07:03 The other percent is human controlled.

01:07:06 In the human side, how can we use that data?

01:07:09 What’s your sense?

01:07:12 What’s the signal?

01:07:13 Do you have ideas in this autonomous vehicle space when people can lose their lives?

01:07:17 You know, it’s a safety critical environment.

01:07:21 So how do we use that data?

01:07:23 So I think that actually the kind of problems that come up when we want systems that are

01:07:33 reliable and that can kind of understand the limits of their capabilities, they’re actually

01:07:37 very similar to the kind of problems that come up when we’re doing off policy reinforcement

01:07:40 learning.

01:07:41 So as I mentioned before, in off policy reinforcement learning, the big problem is you need to know

01:07:46 when you can trust the predictions of your model, because if you’re trying to evaluate

01:07:50 some pattern of behavior for which your model doesn’t give you an accurate prediction, then

01:07:54 you shouldn’t use that to modify your policy.

01:07:57 It’s actually very similar to the problem that we’re faced when we actually then deploy

01:08:00 that thing and we want to decide whether we trust it in the moment or not.

01:08:05 So perhaps we just need to do a better job of figuring out that part, and that’s a very

01:08:08 deep research question, of course, but it’s also a question that a lot of people are working

01:08:11 on.

01:08:12 So I’m pretty optimistic that we can make some progress on that over the next few years.

01:08:15 What’s the role of simulation in reinforcement learning, deep reinforcement

01:08:20 learning?

01:08:21 Like how essential is it?

01:08:23 It’s been essential for some of the interesting breakthroughs so far.

01:08:28 Do you think it’s a crutch that we rely on?

01:08:31 I mean, again, this connects to our off policy discussion, but do you think we can ever get

01:08:37 rid of simulation or do you think simulation will actually take over?

01:08:40 We’ll create more and more realistic simulations that will allow us to solve actual real world

01:08:46 problems, like transfer the models we learn in simulation to real world problems.

01:08:49 I think that simulation is a very pragmatic tool that we can use to get a lot of useful

01:08:54 stuff to work right now, but I think that in the long run, we will need to build machines

01:09:00 that can learn from real data because that’s the only way that we’ll get them to improve

01:09:03 perpetually because if we can’t have our machines learn from real data, if they have to rely

01:09:08 on simulated data, eventually the simulator becomes the bottleneck.

01:09:11 In fact, this is a general thing.

01:09:13 If your machine has any bottleneck that is built by humans and that doesn’t improve from

01:09:19 data, it will eventually be the thing that holds it back.

01:09:23 And if you’re entirely reliant on your simulator, that’ll be the bottleneck.

01:09:25 If you’re entirely reliant on a manually designed controller, that’s going to be the bottleneck.

01:09:30 So simulation is very useful.

01:09:32 It’s very pragmatic, but it’s not a substitute for being able to utilize real experience.

01:09:39 And by the way, this is something that I think is quite relevant now, especially in the context

01:09:44 of some of the things we’ve discussed, because some of these kind of scaffolding issues that

01:09:48 I mentioned, things like the broken dishes and the unknown reward function, like these

01:09:52 are not problems that you would ever stumble on when working in a purely simulated kind

01:09:57 of environment, but they become very apparent when we try to actually run these things in

01:10:01 the real world.

01:10:02 To throw a brief wrench into our discussion, let me ask, do you think we’re living in a

01:10:07 simulation?

01:10:08 Oh, I have no idea.

01:10:09 Do you think that’s a useful thing to even think about, about the fundamental physics

01:10:15 nature of reality?

01:10:18 Or another perspective, the reason I think the simulation hypothesis is interesting is

01:10:24 to think about how difficult is it to create sort of a virtual reality game type situation

01:10:33 that will be sufficiently convincing to us humans or sufficiently enjoyable that we wouldn’t

01:10:38 want to leave.

01:10:39 I mean, that’s actually a practical engineering challenge.

01:10:43 And I personally really enjoy virtual reality, but it’s quite far away.

01:10:47 I kind of think about what would it take for me to want to spend more time in virtual reality

01:10:52 versus the real world.

01:10:55 And that’s a sort of a nice clean question because at that point, if I want to live in

01:11:03 a virtual reality, that means we’re just a few years away where a majority of the population

01:11:08 lives in a virtual reality.

01:11:09 And that’s how we create the simulation, right?

01:11:11 You don’t need to actually simulate quantum gravity and every aspect of the universe.

01:11:19 And that’s an interesting question for reinforcement learning too: if we want to make sufficiently

01:11:24 realistic simulations that blur the difference between the real world and the simulation,

01:11:32 then some of the things we’ve been talking about, kind of the problems, go away

01:11:37 if we can create actually interesting, rich simulations.

01:11:40 It’s an interesting question.

01:11:41 And it actually, I think your question casts your previous question in a very interesting

01:11:46 light, because in some ways asking whether we can, well, the more kind of practical version

01:11:53 is like, you know, can we build simulators that are good enough to train essentially

01:11:57 AI systems that will work in the world?

01:12:02 And it’s kind of interesting to think about this, about what this implies, if true, it

01:12:06 kind of implies that it’s easier to create the universe than it is to create a brain.

01:12:11 And that seems like, put this way, it seems kind of weird.

01:12:14 The aspect of the simulation most interesting to me is the simulation of other humans.

01:12:21 That seems to be a complexity that makes the robotics problem harder.

01:12:27 Now I don’t know if every robotics person agrees with that notion.

01:12:32 Just as a quick aside, what are your thoughts about when the human enters the picture of

01:12:38 the robotics problem?

01:12:39 How does that change the reinforcement learning problem, the learning problem in general?

01:12:44 Yeah, I think that’s a, it’s a kind of a complex question.

01:12:48 And I guess my hope for a while had been that if we build these robotic learning systems

01:12:56 that are multitask, that utilize lots of prior data and that learn from their own experience,

01:13:03 the bit where they have to interact with people will be perhaps handled in much the same way

01:13:07 as all the other bits.

01:13:08 So if they have prior experience of interacting with people and they can learn from their

01:13:12 own experience of interacting with people for this new task, maybe that’ll be enough.

01:13:16 Now, of course, if it’s not enough, there are many other things we can do and there’s

01:13:20 quite a bit of research in that area.

01:13:22 But I think it’s worth a shot to see whether the multi agent interaction, the ability to

01:13:29 understand that other beings in the world have their own goals and intentions and thoughts

01:13:35 and so on, whether that kind of understanding can emerge automatically from simply learning

01:13:41 to do things and maximize utility.

01:13:44 That information arises from the data.

01:13:46 You’ve said something about gravity, that you don’t need to explicitly inject anything

01:13:53 into the system.

01:13:54 They can be learned from the data.

01:13:55 And gravity is an example of something that could be learned from data, so like the physics

01:13:59 of the world.

01:14:05 What are the limits of what we can learn from data?

01:14:08 Do you really think we can?

01:14:10 So a very simple, clean way to ask that is, do you really think we can learn gravity from

01:14:15 just data, the idea, the laws of gravity?

01:14:19 So something that I think is a common kind of pitfall when thinking about prior knowledge

01:14:25 and learning is to assume that just because we know something, it’s better to

01:14:33 tell the machine about that rather than have it figure it out on its own.

01:14:36 In many cases, things that are important that affect many of the events that the machine

01:14:44 will experience are actually pretty easy to learn.

01:14:48 If every time you drop something, it falls down, yeah, you might get Newton’s version,

01:14:54 not Einstein’s version, but it’ll be pretty good and it will probably be sufficient for

01:14:58 you to act rationally in the world because you see the phenomenon all the time.

01:15:03 So things that are readily apparent from the data, we might not need to specify those by

01:15:07 hand.

01:15:08 It might actually be easier to let the machine figure them out.

01:15:10 It just feels like there might be a space of many local minima in terms of theories

01:15:17 of this world that we would discover and get stuck on, that Newtonian mechanics is not necessarily

01:15:25 easy to come by.

01:15:27 Yeah.

01:15:28 And in fact, in some fields of science, for example, human civilization is itself full

01:15:33 of these local optima.

01:15:34 So for example, if you think about how people tried to figure out biology and medicine for

01:15:40 the longest time, the kind of rules, the kind of principles that serve us very well in our

01:15:45 day to day lives actually serve us very poorly in understanding medicine and biology.

01:15:50 We had kind of very superstitious and weird ideas about how the body worked until the

01:15:55 advent of the modern scientific method.

01:15:58 So that does seem to be a failing of this approach, but it’s also a failing of human

01:16:02 intelligence arguably.

01:16:04 Maybe a small aside, but, you know, the idea of self play is fascinating in reinforcement

01:16:09 learning, sort of creating a competitive context in which agents can

01:16:14 play against each other at the same skill level and thereby increase each

01:16:20 other’s skill level.

01:16:21 It seems this kind of self improving mechanism is exceptionally powerful in the

01:16:26 contexts where it can be applied.

01:16:29 First of all, is it beautiful to you that this mechanism works as well as it does?

01:16:34 And also can we generalize to other contexts like in the robotic space or anything that’s

01:16:41 applicable to the real world?

01:16:43 I think that it’s a very interesting idea, but I suspect that the bottleneck to actually

01:16:51 generalizing it to the robotic setting is actually going to be the same as the bottleneck

01:16:56 for everything else that we need to be able to build machines that can get better and

01:17:01 better through natural interaction with the world.

01:17:04 And once we can do that, then they can go out and play with, they can play with each

01:17:08 other, they can play with people, they can play with the natural environment.

01:17:13 But before we get there, we’ve got all these other problems we have to get out

01:17:16 of the way.

01:17:17 So there’s no shortcut around that.

01:17:18 You have to interact with the natural environment.

01:17:21 Well, because in a self play setting, you still need a mediating mechanism.

01:17:24 So the reason that, you know, self play works for a board game is because the rules

01:17:30 of that board game mediate the interaction between the agents.

01:17:33 So the kind of intelligent behavior that will emerge depends very heavily on the nature

01:17:37 of that mediating mechanism.

01:17:39 So on the side of reward functions, coming up with good reward functions seems

01:17:44 to be the thing that we associate with general intelligence, like human beings seem to value

01:17:50 the idea of developing our own reward functions, of, you know, arriving at meaning and so

01:17:57 on.

01:17:58 And yet for reinforcement learning, we often kind of specify that as a given.

01:18:02 What’s your sense of how we develop, you know, good reward functions?

01:18:08 Yeah, I think that’s a very complicated and very deep question.

01:18:12 And you’re completely right that classically in reinforcement learning, this question,

01:18:16 I guess, has kind of been treated as a non-issue; you sort of treat the reward as this

01:18:21 external thing that comes from some other bit of your biology and you kind of don’t

01:18:27 worry about it.

01:18:28 And I do think that that’s actually, you know, a little bit of a mistake, and that we should worry

01:18:32 about it.

01:18:33 And we can approach it in a few different ways.

01:18:34 We can approach it, for instance, by thinking of rewards as a communication medium.

01:18:39 We can say, well, how does a person communicate to a robot what its objective is?

01:18:43 You can approach it also as a sort of more of an intrinsic motivation medium.

01:18:47 You could say, can we write down kind of a general objective that leads to good capability?

01:18:55 Like for example, can you write down some objectives such that even in the absence of

01:18:58 any other task, if you maximize that objective, you’ll sort of learn useful things.

01:19:02 This is something that has sometimes been called unsupervised reinforcement learning,

01:19:07 which I think is a really fascinating area of research, especially today.

01:19:11 We’ve done a bit of work on that recently.

01:19:13 One of the things we’ve studied is whether we can have some notion of unsupervised reinforcement

01:19:19 learning by means of, you know, information theoretic quantities, like for instance, minimizing

01:19:25 a Bayesian measure of surprise.

01:19:26 This is an idea that was, you know, pioneered actually in the computational neuroscience

01:19:30 community by folks like Carl Friston.

01:19:32 And we’ve done some work recently that shows that you can actually learn pretty interesting

01:19:35 skills by essentially behaving in a way that allows you to make accurate predictions about

01:19:41 the world.

01:19:42 Like do the things that will lead to you getting the right answer for prediction.

01:19:48 But you can, you know, by doing this, you can sort of discover stable niches in the

01:19:52 world.

01:19:53 You can discover that if you’re playing Tetris, then, you know, correctly clearing the rows

01:19:57 will let you play Tetris for longer and keep the board nice and clean, which sort of satisfies

01:20:01 some desire for order in the world.

01:20:04 And as a result, get some degree of leverage over your domain.

01:20:07 So we’re exploring that pretty actively.
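
A heavily simplified sketch of that surprise-minimization idea (illustrative only; the diagonal-Gaussian model and the numbers are made up, not the actual method): the intrinsic reward is just the log-probability of the current state under the agent's own model of where it usually is, so predictable, orderly states score high and surprising ones score low.

    import numpy as np

    def log_prob_under_model(state, visited_states):
        # Fit a diagonal Gaussian to previously visited states and score the new one.
        mean = visited_states.mean(axis=0)
        var = visited_states.var(axis=0) + 1e-6
        return float(np.sum(-0.5 * (state - mean) ** 2 / var
                            - 0.5 * np.log(2 * np.pi * var)))

    def surprise_minimizing_reward(state, visited_states):
        # Intrinsic reward: higher for unsurprising states, which pushes the
        # agent toward stable niches (keeping the Tetris board clean).
        return log_prob_under_model(state, visited_states)

    visited = np.random.randn(500, 2) * 0.1          # states the agent has seen
    print(surprise_minimizing_reward(np.array([0.0, 0.05]), visited))  # familiar: high
    print(surprise_minimizing_reward(np.array([5.0, -4.0]), visited))  # surprising: very low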

01:20:08 Is there a role for a human notion of curiosity in itself being the reward, sort of discovering

01:20:15 new things about the world?

01:20:19 So one of the things that I’m pretty interested in is actually whether discovering new things

01:20:26 can actually be an emergent property of some other objective that quantifies capability.

01:20:30 So new things for the sake of new things might not by itself be the right

01:20:36 answer, but perhaps we can figure out an objective for which discovering new things is actually

01:20:42 the natural consequence.

01:20:44 That’s something we’re working on right now, but I don’t have a clear answer for you there

01:20:47 yet that’s still a work in progress.

01:20:49 You mean just that it’s a curious observation to see sort of creative patterns of curiosity

01:20:57 on the way to optimize for a particular task?

01:21:00 On the way to optimize for a particular measure of capability.

01:21:05 Is there ways to understand or anticipate unexpected unintended consequences of particular

01:21:15 reward functions, sort of anticipate the kind of strategies that might be developed and

01:21:22 try to avoid highly detrimental strategies?

01:21:27 So classically, this is something that has been pretty hard in reinforcement learning

01:21:30 because it’s difficult for a designer to have good intuition about, you know, what a learning

01:21:35 algorithm will come up with when they give it some objective.

01:21:38 There are ways to mitigate that.

01:21:40 One way to mitigate it is to actually define an objective that says like, don’t do weird

01:21:45 stuff.

01:21:46 You can actually quantify it.

01:21:47 You can say just like, don’t enter situations that have low probability under the distribution

01:21:52 of states you’ve seen before.

01:21:54 It turns out that that’s actually one very good way to do off policy reinforcement learning

01:21:57 actually.

01:21:59 So we can do some things like that.
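
A toy version of that "don't do weird stuff" term (illustrative; the binning, counts, and constants are made up): shape the reward so the agent is penalized for entering states that have low probability under the empirical distribution of states it has seen before.

    import numpy as np
    from collections import Counter

    def discretize(state, bin_size=0.5):
        # Coarse bins over a continuous state so visits can be counted.
        return tuple(np.floor(np.asarray(state) / bin_size).astype(int))

    def shaped_reward(task_reward, state, visit_counts, total, weight=10.0, floor=1e-3):
        # Subtract a penalty when the current state is rare in prior experience.
        prob = visit_counts[discretize(state)] / max(total, 1)
        penalty = weight * max(0.0, np.log(floor) - np.log(prob + 1e-12))
        return task_reward - penalty

    seen = np.random.randn(2000, 2) * 0.3              # states from prior experience
    counts = Counter(discretize(s) for s in seen)

    print(shaped_reward(1.0, np.array([0.1, 0.0]), counts, len(seen)))  # familiar: ~1.0
    print(shaped_reward(1.0, np.array([4.0, 4.0]), counts, len(seen)))  # weird: large penalty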

01:22:02 If we slowly venture in speaking about reward functions into greater and greater levels

01:22:08 of intelligence, there’s, I mean, Stuart Russell thinks about this, the alignment of AI systems

01:22:16 with us humans.

01:22:18 So how do we ensure that AGI systems align with us humans?

01:22:23 It’s kind of a reward function question of specifying the behavior of AI systems such

01:22:32 that their success aligns with this, with the broader intended success interest of human

01:22:39 beings.

01:22:40 Do you have thoughts on this?

01:22:41 Do you have kind of concerns of where reinforcement learning fits into this, or are you really

01:22:45 focused on the current moment of us being quite far away and trying to solve the robotics

01:22:50 problem?

01:22:51 I don’t have a great answer to this, but, you know, I do think that this is a problem

01:22:56 that’s important to figure out.

01:22:59 For my part, I’m actually a bit more concerned about the other side of this equation,

01:23:04 that, you know, maybe rather than unintended consequences for objectives that are specified

01:23:11 too well, I’m actually more worried right now about unintended consequences for objectives

01:23:15 that are not optimized well enough, which might become a very pressing problem when

01:23:21 we, for instance, try to use these techniques for safety critical systems like cars and

01:23:26 aircraft and so on.

01:23:28 I think at some point we’ll face the issue of objectives being optimized too well, but

01:23:32 right now I think we’re, we’re more likely to face the issue of them not being optimized

01:23:36 well enough.

01:23:37 But you don’t think unintended consequences can arise even when you’re far from optimality,

01:23:41 sort of like on the path to it?

01:23:43 Oh no, I think unintended consequences can absolutely arise.

01:23:46 It’s just, I think right now the bottleneck for improving reliability, safety and things

01:23:52 like that is more with systems that like need to work better, that need to optimize their

01:23:57 objectives better.

01:23:58 Do you have thoughts, concerns about existential threats of human level intelligence?

01:24:05 If we put on our hat of looking 10, 20, 100, 500 years from now, do you have concerns

01:24:11 about existential threats of AI systems?

01:24:15 I think there are absolutely existential threats for AI systems, just like there are for any

01:24:19 powerful technology.

01:24:22 But I think that these kinds of problems can take many forms, and some of those

01:24:28 forms will come down to, you know, people with nefarious intent.

01:24:34 Some of them will come down to AI systems that have some fatal flaws.

01:24:38 And some of them will, of course, come down to AI systems that are too capable in

01:24:42 some way.

01:24:44 But among this set of potential concerns, I would actually be much more concerned about

01:24:50 the first two right now than I am about the others, and principally the one with nefarious humans, because, you know,

01:24:55 through all of human history, it’s actually the nefarious humans that have been the

01:24:57 problem, not the nefarious machines.

01:25:01 And I think that right now the best that I can do to make sure things go well is to build

01:25:07 the best technology I can and also hopefully promote responsible use of that technology.

01:25:13 Do you think RL systems have something to teach us humans?

01:25:19 You said nefarious humans getting us in trouble.

01:25:21 I mean, machine learning systems have in some ways revealed to us the ethical flaws

01:25:26 in our data.

01:25:27 In that same kind of way, can reinforcement learning teach us about ourselves?

01:25:32 Has it taught something?

01:25:34 What have you learned about yourself from trying to build robots and reinforcement learning

01:25:40 systems?

01:25:42 I’m not sure what I’ve learned about myself, but maybe part of the answer to your question

01:25:49 might become a little bit more apparent once we see more widespread deployment of reinforcement

01:25:55 learning for decision making support in domains like healthcare, education, social media,

01:26:02 etc.

01:26:03 And I think we will see some interesting stuff emerge there.

01:26:06 We will see, for instance, what kind of behaviors these systems come up with in situations where

01:26:12 there is interaction with humans and where they have a possibility of influencing human

01:26:17 behavior.

01:26:18 I think we’re not quite there yet, but maybe in the next few years we’ll see some interesting

01:26:22 stuff come out in that area.

01:26:23 I hope outside the research space, because the exciting space where this could be observed

01:26:28 is sort of large companies that deal with large data, and I hope there’s some transparency.

01:26:35 One of the things that’s unclear when I look at social networks and just online is why

01:26:40 an algorithm did something or whether even an algorithm was involved.

01:26:45 And that’d be interesting from a research perspective, just to observe the results of

01:26:52 algorithms, to open up that data, or to at least be sufficiently transparent about the

01:26:58 behavior of these AI systems in the real world.

01:27:02 What’s your sense?

01:27:03 I don’t know if you looked at the blog post, The Bitter Lesson, by Rich Sutton, where he argues

01:27:08 that sort of the big lesson of research in AI and reinforcement learning is that simple

01:27:16 methods, general methods that leverage computation seem to work well.

01:27:21 So basically don’t try to do any kind of fancy algorithms, just wait for computation to get

01:27:26 fast.

01:27:28 Do you share this kind of intuition?

01:27:31 I think the high level idea makes a lot of sense.

01:27:34 I’m not sure that my takeaway would be that we don’t need to work on algorithms.

01:27:37 I think that my takeaway would be that we should work on general algorithms.

01:27:43 And actually, I think that this idea of needing to better automate the acquisition of experience

01:27:52 in the real world actually follows pretty naturally from Rich Sutton’s conclusion.

01:27:58 So if the claim is that automated general methods plus data leads to good results, then

01:28:06 it makes sense that we should build general methods and we should build the kind of methods

01:28:09 that we can deploy and get them to go out there and collect their experience autonomously.

01:28:14 I think that one place where I think that the current state of things falls a little

01:28:19 bit short of that is actually the going out there and collecting the data autonomously,

01:28:23 which is easy to do in a simulated board game, but very hard to do in the real world.

01:28:27 Yeah, it keeps coming back to this one problem, right?

01:28:31 Your mind is focused there now in this real world.

01:28:35 It just seems scary, the step of collecting the data, and it seems unclear to me how we

01:28:43 can do it effectively.

01:28:44 Well, you know, seven billion people in the world, each of them had to do that at some

01:28:49 point in their lives.

01:28:51 And we should leverage that experience that they’ve all had.

01:28:54 We should be able to try to collect that kind of data.

01:28:58 Okay, big questions.

01:29:02 Maybe stepping back through your life, what book or books, technical or fiction or philosophical,

01:29:10 had a big impact on the way you saw the world, on the way you thought about in the world,

01:29:15 your life in general?

01:29:19 And maybe what books, if it’s different, would you recommend people consider reading on their

01:29:24 own intellectual journey?

01:29:26 It could be within reinforcement learning, but it could be very much bigger.

01:29:30 I don’t know if this is like a scientifically, like, particularly meaningful answer.

01:29:39 But like, the honest answer is that I actually found a lot of the work by Isaac Asimov to

01:29:45 be very inspiring when I was younger.

01:29:47 I don’t know if that has anything to do with AI necessarily.

01:29:50 You don’t think it had a ripple effect in your life?

01:29:53 Maybe it did.

01:29:56 But yeah, I think that it was a vision of a future where, well, first of all, artificial

01:30:06 intelligence systems, artificial robotic systems, have, you know, kind of a

01:30:10 big place, a big role in society, and where we try to imagine the sort of limiting

01:30:18 case of technological advancement and how that might play out in our future history.

01:30:25 But yeah, I think that that was in some way influential.

01:30:30 I don’t really know how.

01:30:33 I would recommend it.

01:30:34 I mean, if nothing else, you’d be well entertained.

01:30:37 When did you first yourself like fall in love with the idea of artificial intelligence,

01:30:41 get captivated by this field?

01:30:45 So my honest answer here is actually that I only really started to think about it as

01:30:52 something that I might want to do actually in graduate school pretty late.

01:30:56 And a big part of that was that until, you know, somewhere around 2009, 2010, it just

01:31:02 wasn’t really high on my priority list because I didn’t think that it was something where

01:31:06 we’re going to see very substantial advances in my lifetime.

01:31:11 And you know, maybe in terms of my career, the time when I really decided I wanted to

01:31:18 work on this was when I actually took a seminar course that was taught by Professor Andrew

01:31:23 Ng.

01:31:24 And, you know, at that point, I, of course, had like a decent understanding of the technical

01:31:29 things involved.

01:31:30 But one of the things that really resonated with me was when he said in the opening lecture

01:31:33 something to the effect of like, well, he used to have graduate students come to him

01:31:37 and talk about how they want to work on AI, and he would kind of chuckle and give them

01:31:40 some math problem to deal with.

01:31:42 But now he’s actually thinking that this is an area where we might see like substantial

01:31:45 advances in our lifetime.

01:31:47 And that kind of got me thinking because, you know, in some abstract sense, yeah, like

01:31:52 you can kind of imagine that, but in a very real sense, when someone who had been working

01:31:56 on that kind of stuff their whole career suddenly says that, yeah, like that had some effect

01:32:02 on me.

01:32:03 Yeah, this might be a special moment in the history of the field.

01:32:08 That this is where we might see some interesting breakthroughs.

01:32:14 So in the space of advice, somebody who’s interested in getting started in machine learning

01:32:19 or reinforcement learning, what advice would you give to maybe an undergraduate student

01:32:23 or maybe even younger, how, what are the first steps to take and further on what are the

01:32:30 steps to take on that journey?

01:32:32 So something that I think is important to do is to not be afraid to like spend time

01:32:43 imagining the kind of outcome that you might like to see.

01:32:46 So you know, one outcome might be a successful career, a large paycheck or something, or

01:32:51 state of the art results on some benchmark, but hopefully that’s not the thing that’s

01:32:54 like the main driving force for somebody.

01:32:57 But I think that if someone who is a student considering a career in AI like takes a little

01:33:04 while, sits down and thinks like, what do I really want to see?

01:33:07 What do I want to see a machine do?

01:33:09 What do I want to see a robot do?

01:33:10 What do I want to do?

01:33:11 What do I want to see a natural language system do? Which is like, imagine, you know, imagine

01:33:15 it almost like a commercial for a future product or something, like something that

01:33:19 you’d like to see in the world, and then actually sit down and think about the steps that are

01:33:23 necessary to get there.

01:33:25 And hopefully that thing is not a better number on ImageNet classification.

01:33:29 It’s like, it’s probably like an actual thing that we can’t do today that would be really

01:33:32 awesome.

01:33:33 Whether it’s a robot butler or, you know, a really awesome healthcare decision making

01:33:38 support system, whatever it is that you find inspiring.

01:33:41 And I think that thinking about that and then backtracking from there and imagining the

01:33:45 steps needed to get there will actually lead to much better research.

01:33:48 It’ll lead to rethinking the assumptions.

01:33:50 It’ll lead to working on the bottlenecks that other people aren’t working on.

01:33:55 And then naturally to turn that to you, we’ve talked about reward functions and you just gave

01:34:01 advice on looking forward to what you’d like to see, what kind of change you would like

01:34:05 to make in the world.

01:34:06 What do you think, ridiculous, big question, what do you think is the meaning of life?

01:34:11 What is the meaning of your life?

01:34:13 What gives you fulfillment, purpose, happiness and meaning?

01:34:20 That’s a very big question.

01:34:24 What’s the reward function under which you are operating?

01:34:27 Yeah.

01:34:28 I think one thing that does give, you know, if not meaning, at least satisfaction is some

01:34:33 degree of confidence that I’m working on a problem that really matters.

01:34:37 I feel like it’s less important to me to actually solve a problem, but it’s quite nice

01:34:42 to spend my time on things that I believe really matter.

01:34:49 And I try pretty hard to look for that.

01:34:53 I don’t know if it’s easy to answer this, but if you’re successful, what does that look

01:34:59 like?

01:35:00 What’s the big dream?

01:35:01 Now, of course, success is built on top of success and you keep going forever, but what

01:35:09 is the dream?

01:35:10 Yeah.

01:35:11 So one very concrete thing or maybe as concrete as it’s going to get here is to see machines

01:35:18 that actually get better and better the longer they exist in the world.

01:35:23 And that kind of seems like on the surface, one might even think that that’s something

01:35:26 that we have today, but I think we really don’t.

01:35:28 I think that there is an unending complexity in the universe and to date, all of the machines

01:35:38 that we’ve been able to build don’t sort of improve up to the limit of that complexity.

01:35:44 They hit a wall somewhere.

01:35:45 Maybe they hit a wall because they’re in a simulator that is only a very limited,

01:35:50 very pale imitation of the real world, or they hit a wall because they rely on a labeled

01:35:54 data set, but they never hit the wall of, like, running out of stuff to see.

01:36:00 So I’d like to build a machine that can go as far as possible.

01:36:04 Runs up against the ceiling of the complexity of the universe.

01:36:08 Yes.

01:36:09 Well, I don’t think there’s a better way to end it, Sergey.

01:36:12 Thank you so much.

01:36:13 It’s a huge honor.

01:36:14 I can’t wait to see the amazing work that you have yet to publish, and in the education space,

01:36:20 in terms of reinforcement learning.

01:36:21 Thank you for inspiring the world.

01:36:23 Thank you for the great research you do.

01:36:24 Thank you.

01:36:25 Thanks for listening to this conversation with Sergey Levine and thank you to our sponsors,

01:36:31 Cash App and ExpressVPN.

01:36:33 Please consider supporting this podcast by downloading Cash App and using code LexPodcast

01:36:40 and signing up at expressvpn.com slash LexPod.

01:36:44 Click all the links, buy all the stuff, it’s the best way to support this podcast and the

01:36:50 journey I’m on.

01:36:51 If you enjoy this thing, subscribe on YouTube, review it with five stars on Apple Podcast,

01:36:57 support it on Patreon, or connect with me on Twitter at Lex Friedman, spelled somehow

01:37:02 if you can figure out how without using the letter E, just F R I D M A N.

01:37:08 And now let me leave you with some words from Salvador Dali.

01:37:14 Intelligence without ambition is a bird without wings.

01:37:18 Thank you for listening and hope to see you next time.