Transcript
00:00:00 The following is a conversation with Sergey Levine, a professor at Berkeley and a world
00:00:05 class researcher in deep learning, reinforcement learning, robotics, and computer vision, including
00:00:10 the development of algorithms for end to end training of neural network policies that combine
00:00:15 perception and control, scalable algorithms for inverse reinforcement learning, and, in
00:00:21 general, deep RL algorithms.
00:00:24 Quick summary of the ads.
00:00:25 Two sponsors, Cash App and ExpressVPN.
00:00:28 Please consider supporting the podcast by downloading Cash App and using code LexPodcast
00:00:34 and signing up at expressvpn.com slash lexpod.
00:00:38 Click the links, buy the stuff, it’s the best way to support this podcast and, in general,
00:00:44 the journey I’m on.
00:00:45 If you enjoy this thing, subscribe on YouTube, review it with 5 stars on Apple Podcasts, follow
00:00:51 on Spotify, support it on Patreon, or connect with me on Twitter at lexfridman.
00:00:57 As usual, I’ll do a few minutes of ads now and never any ads in the middle that can break
00:01:01 the flow of the conversation.
00:01:04 This show is presented by Cash App, the number one finance app in the App Store.
00:01:08 When you get it, use code lexpodcast.
00:01:11 Cash App lets you send money to friends, buy Bitcoin, and invest in the stock market with
00:01:15 as little as one dollar.
00:01:18 Since Cash App does fractional share trading, let me mention that the order execution algorithm
00:01:23 that works behind the scenes to create the abstraction of fractional orders is an algorithmic
00:01:29 marvel.
00:01:30 So, big props to the Cash App engineers for taking a step up to the next layer of abstraction
00:01:34 over the stock market, making trading more accessible for new investors and diversification
00:01:40 much easier.
00:01:41 So, again, if you get Cash App from the App Store or Google Play and use the code lexpodcast,
00:01:48 you get $10, and Cash App will also donate $10 to FIRST, an organization that is helping
00:01:54 to advance robotics and STEM education for young people around the world.
00:01:59 This show is also sponsored by ExpressVPN.
00:02:04 Get it at expressvpn.com slash lexpod to support this podcast and to get an extra three months
00:02:11 free on a one year package.
00:02:14 I’ve been using ExpressVPN for many years.
00:02:17 I love it.
00:02:18 I think ExpressVPN is the best VPN out there.
00:02:22 They told me to say it, but it happens to be true in my humble opinion.
00:02:26 It doesn’t log your data, it’s crazy fast, and it’s easy to use literally just one big
00:02:31 power on button.
00:02:32 Again, it’s probably obvious to you, but I should say it again, it’s really important
00:02:37 that they don’t log your data.
00:02:40 It works on Linux and every other operating system, but Linux, of course, is the best
00:02:45 operating system.
00:02:46 Shout out to my favorite flavor, Ubuntu MATE 20.04.
00:02:50 Once again, get it at expressvpn.com slash lexpod to support this podcast and to get
00:02:56 an extra three months free on a one year package.
00:03:00 And now, here’s my conversation with Sergey Levine.
00:03:05 What’s the difference between a state of the art human, such as you and I, well, I don’t
00:03:10 know if we qualify as state of the art humans, but a state of the art human and a state of
00:03:14 the art robot?
00:03:16 That’s a very interesting question.
00:03:19 Robot capability is, it’s kind of a, I think it’s a very tricky thing to understand because
00:03:26 there are some things that are difficult that we wouldn’t think are difficult and some things
00:03:29 that are easy that we wouldn’t think are easy.
00:03:33 And there’s also a really big gap between capabilities of robots in terms of hardware
00:03:37 and their physical capability and capabilities of robots in terms of what they can do autonomously.
00:03:43 There is a little video that I think robotics researchers really like to show, especially
00:03:47 robotics learning researchers like myself, from 2004 from Stanford, which demonstrates
00:03:53 a prototype robot called the PR1, and the PR1 was a robot that was designed as a home
00:03:58 assistance robot.
00:03:59 And there’s this beautiful video showing the PR1 tidying up a living room, putting away
00:04:03 toys and at the end bringing a beer to the person sitting on the couch, which looks really
00:04:10 amazing.
00:04:11 And then the punchline is that this robot is entirely controlled by a person.
00:04:16 So in some ways the gap between a state of the art human and state of the art robot,
00:04:20 if the robot has a human brain, is actually not that large.
00:04:23 Now obviously like human bodies are sophisticated and very robust and resilient in many ways,
00:04:28 but on the whole, if we’re willing to like spend a bit of money and do a bit of engineering,
00:04:32 we can kind of close the hardware gap almost.
00:04:35 But the intelligence gap, that one is very wide.
00:04:40 And when you say hardware, you’re referring to the physical, sort of the actuators, the
00:04:43 actual body of the robot, as opposed to the hardware on which the cognition runs, the hardware
00:04:49 of the nervous system.
00:04:50 Yes, exactly.
00:04:51 I’m referring to the body rather than the mind.
00:04:54 So that means that, kind of, the work is cut out for us.
00:04:56 Like while we can still make the body better, we kind of know that the big bottleneck right
00:05:00 now is really the mind.
00:05:02 And how big is that gap?
00:05:03 How big is the difference in your sense of ability to learn, ability to reason, ability
00:05:11 to perceive the world between humans and our best robots?
00:05:16 The gap is very large and the gap becomes larger the more unexpected events can happen
00:05:23 in the world.
00:05:24 So essentially the spectrum along which you can measure the size of that gap is the spectrum
00:05:30 of how open the world is.
00:05:32 If you control everything in the world very tightly, if you put the robot in like a factory
00:05:36 and you tell it where everything is and you rigidly program its motion, then it can do
00:05:41 things, you know, one might even say in a superhuman way.
00:05:43 It can move faster, it’s stronger, it can lift up a car and things like that.
00:05:47 But as soon as anything starts to vary in the environment, now it’ll trip up.
00:05:51 And if many, many things vary like they would like in your kitchen, for example, then things
00:05:55 are pretty much like wide open.
00:05:57 Now, again, we’re going to stick a bit on the philosophical questions, but how much
00:06:03 on the human side of the cognitive abilities in your sense is nature versus nurture?
00:06:11 So how much of it is a product of evolution and how much of it is something we’ll learn
00:06:18 from sort of scratch from the day we’re born?
00:06:22 I’m going to read into your question as asking about the implications of this for AI.
00:06:26 Because I’m not a biologist, I can’t really like speak authoritatively.
00:06:30 So before we go on, if it's all about learning, then there's more hope
00:06:36 for AI.
00:06:38 So the way that I look at this is that, you know, well, first, of course, biology is very
00:06:44 messy.
00:06:45 And if you ask the question, how does a person do something or how does a person's mind
00:06:49 do something, you can come up with a bunch of hypotheses and oftentimes you can find
00:06:54 support for many different, often conflicting hypotheses.
00:06:58 One way that we can approach the question of what the implications of this for AI are
00:07:03 is we can think about what’s sufficient.
00:07:05 So you know, maybe a person is from birth very, very good at some things like, for example,
00:07:11 recognizing faces.
00:07:12 There’s a very strong evolutionary pressure to do that.
00:07:13 If you can recognize your mother’s face, then you’re more likely to survive and therefore
00:07:18 people are good at this.
00:07:20 But we can also ask like, what’s the minimum sufficient thing?
00:07:23 And one of the ways that we can study the minimal sufficient thing is we could, for
00:07:27 example, see what people do in unusual situations.
00:07:29 If you present them with things that evolution couldn’t have prepared them for, you know,
00:07:33 our daily lives actually do this to us all the time.
00:07:36 We didn’t evolve to deal with, you know, automobiles and space flight and whatever.
00:07:41 So there are all these situations that we can find ourselves in and we do very well
00:07:45 there.
00:07:46 Like I can give you a joystick to control a robotic arm, which you’ve never used before
00:07:50 and you might be pretty bad for the first couple of seconds.
00:07:52 But if I tell you like your life depends on using this robotic arm to like open this door,
00:07:58 you’ll probably manage it.
00:07:59 Even though you’ve never seen this device before, you’ve never used the joystick control
00:08:03 us and you’ll kind of muddle through it.
00:08:04 And that’s not your evolved natural ability.
00:08:08 That’s your, your flexibility or your adaptability.
00:08:11 And that’s exactly where our current robotic systems really kind of fall flat.
00:08:14 But I wonder how much general knowledge, almost what we think of as common sense, is pre-trained
00:08:22 underneath all of that.
00:08:24 So that ability to adapt to a joystick is, requires you to have a kind of, you know,
00:08:32 I’m human.
00:08:33 So it’s hard for me to introspect all the knowledge I have about the world, but it seems
00:08:37 like there might be an iceberg underneath of the amount of knowledge we actually bring
00:08:42 to the table.
00:08:43 That’s kind of the open question.
00:08:45 There’s absolutely an iceberg of knowledge that we bring to the table, but I think it’s
00:08:48 very likely that iceberg of knowledge is actually built up over our lifetimes.
00:08:54 Because we have, you know, we have a lot of prior experience to draw on.
00:08:58 And it kind of makes sense that the right way for us to, you know, to optimize our,
00:09:05 our efficiency, our evolutionary fitness and so on is to utilize all of that experience
00:09:10 to build up the best iceberg we can get.
00:09:13 And that’s actually one of the, you know, while that sounds an awful lot like what machine
00:09:16 learning actually does, I think that for modern machine learning, it’s actually a really big
00:09:20 challenge to take this unstructured mass of experience and distill out something that
00:09:25 looks like a common sense understanding of the world.
00:09:28 And perhaps part of that is that it's not because something about machine learning itself
00:09:32 is broken or hard, but because we've been a little too rigid in subscribing to a very
00:09:38 supervised, very rigid notion of learning, you know, kind of the input output, X’s go
00:09:42 to Y’s sort of model.
00:09:43 And maybe what we really need to do is to view the world more as like a mass of experience
00:09:51 that is not necessarily providing any rigid supervision, but sort of providing many, many
00:09:55 instances of things that could be.
00:09:56 And then you take that and you distill it into some sort of common sense understanding.
00:10:00 I see what you’re, you’re painting an optimistic, beautiful picture, especially from the robotics
00:10:06 perspective because that means we just need to invest and build better learning algorithms,
00:10:12 figure out how we can get access to more and more data for those learning algorithms to
00:10:17 extract signal from, and then accumulate that iceberg of knowledge.
00:10:22 It’s a beautiful picture.
00:10:23 It’s a hopeful one.
00:10:25 I think it’s potentially a little bit more than just that.
00:10:29 And this is, this is where we perhaps reach the limits of our current understanding.
00:10:32 But one thing that I think that the research community hasn’t really resolved in a satisfactory
00:10:37 way is how much it matters where that experience comes from, like, you know, do you just like
00:10:43 download everything on the internet and cram it into essentially the 21st century analog
00:10:48 of the giant language model and then see what happens or does it actually matter whether
00:10:54 your machine physically experiences the world or in the sense that it actually attempts
00:10:59 things, observes the outcome of its actions and kind of augments its experience that way.
00:11:03 And it chooses which parts of the world it gets to interact with and observe and learn
00:11:09 from.
00:11:10 Right.
00:11:11 It may be that the world is so complex that simply obtaining a large mass of sort of
00:11:16 IID samples of the world is a very difficult way to go.
00:11:21 But if you are actually interacting with the world and essentially performing this sort
00:11:25 of hard negative mining by attempting what you think might work, observing the sometimes
00:11:30 happy and sometimes sad outcomes of that and augmenting your understanding using that experience
00:11:35 and you’re just doing this continually for many years, maybe that sort of data in some
00:11:40 sense is actually much more favorable to obtaining a common sense understanding.
00:11:44 One reason we might think that this is true is that, you know, what we associate with
00:11:49 common sense or lack of common sense is often characterized by the ability to reason about
00:11:55 kind of counterfactual questions like, you know, here is this bottle of water
00:12:01 sitting on the table, everything is fine; if I were to knock it over, which I'm not going
00:12:04 to do.
00:12:05 But if I were to do that, what would happen?
00:12:07 And I know that nothing good would happen from that.
00:12:10 But if I have a bad understanding of the world, I might think that that’s a good way for me
00:12:14 to like, you know, gain more utility.
00:12:16 If I actually go about my daily life doing the things that my current understanding of
00:12:22 the world suggests will give me high utility, in some ways, I’ll get exactly the right supervision
00:12:28 to tell me not to do those bad things and to keep doing the good things.
00:12:33 So there's a spectrum between IID, a random walk through the space of data, and then there's
00:12:39 what we humans do, and I don't even know if we do it optimally, but that might be beyond.
00:12:45 So this open question that you raised, where do you think systems, intelligent systems
00:12:52 that would be able to deal with this world fall?
00:12:56 Can we do pretty well by reading all of Wikipedia, sort of randomly sampling it like language
00:13:02 models do?
00:13:03 Or do we have to be exceptionally selective and intelligent about which aspects of the
00:13:09 world we interact with?
00:13:12 So I think this is first an open scientific problem, and I don’t have like a clear answer,
00:13:15 but I can speculate a little bit.
00:13:18 And what I would speculate is that you don’t need to be super, super careful.
00:13:23 I think it’s less about like, being careful to avoid the useless stuff, and more about
00:13:28 making sure that you hit on the really important stuff.
00:13:31 So perhaps it’s okay, if you spend part of your day, just, you know, guided by your curiosity,
00:13:37 reading interesting regions of your state space, but it’s important for you to, you
00:13:42 know, every once in a while, make sure that you really try out the solutions that your
00:13:47 current model of the world suggests might be effective, and observe whether those solutions
00:13:51 are working as you expect or not.
00:13:53 And perhaps some of that is really essential to have kind of a perpetual improvement loop.
00:13:59 This perpetual improvement loop is really like, that’s really the key, the key that’s
00:14:03 going to potentially distinguish the best current methods from the best methods of tomorrow
00:14:07 in a sense.
00:14:08 How important do you think is exploration or total out of the box thinking exploration
00:14:15 in this space as you jump to totally different domains?
00:14:19 So you kind of mentioned there's an optimization problem, you kind of explore the specifics
00:14:24 of a particular strategy, whatever the thing you’re trying to solve.
00:14:27 How important is it to explore totally outside of the strategies that have been working for
00:14:33 you so far?
00:14:34 What’s your intuition there?
00:14:35 Yeah, I think it’s a very problem dependent kind of question.
00:14:38 And I think that that’s actually, you know, in some ways that question gets at one of
00:14:45 the big differences between sort of the classic formulation of a reinforcement learning problem
00:14:51 and some of the sort of more open ended reformulations of that problem that have been explored in
00:14:57 recent years.
00:14:58 So classically reinforcement learning is framed as a problem of maximizing utility, like any
00:15:02 kind of rational AI agent, and then anything you do is in service to maximizing that utility.
00:15:08 But a very interesting kind of way to look at, I’m not necessarily saying this is the
00:15:15 best way to look at it, but an interesting alternative way to look at these problems
00:15:17 is as something where you first get to explore the world, however you please, and then afterwards
00:15:24 you will be tasked with doing something.
00:15:26 And that might suggest a somewhat different solution.
00:15:28 So if you don’t know what you’re going to be tasked with doing, and you just want to
00:15:31 prepare yourself optimally for whatever your uncertain future holds, maybe then you will
00:15:35 choose to attain some sort of coverage, build up sort of an arsenal of cognitive tools,
00:15:41 if you will, such that later on when someone tells you, now your job is to fetch the coffee
00:15:46 for me, you will be well prepared to undertake that task.
00:15:49 And that you see that as the modern formulation of the reinforcement learning problem, as
00:15:54 a kind of the more multitask, the general intelligence kind of formulation.
00:16:00 I think that’s one possible vision of where things might be headed.
00:16:04 I don’t think that’s by any means the mainstream or standard way of doing things, and it’s
00:16:08 not like if I had to…
00:16:09 But I like it.
00:16:10 It’s a beautiful vision.
00:16:11 So maybe you actually take a step back.
00:16:14 What is the goal of robotics?
00:16:16 What’s the general problem of robotics we’re trying to solve?
00:16:18 You actually kind of painted two pictures here.
00:16:21 One of sort of the narrow, one of the general.
00:16:23 What in your view is the big problem of robotics?
00:16:26 And ridiculously philosophical high level questions.
00:16:31 I think that maybe there are two ways I can answer this question.
00:16:34 One is there’s a very pragmatic problem, which is like what would make robots, what would
00:16:41 sort of maximize the usefulness of robots?
00:16:44 And there the answer might be something like a system that can perform whatever
00:16:53 task a human user sets for it, within the physical constraints, of course.
00:16:59 If you tell it to teleport to another planet, it probably can’t do that.
00:17:02 But if you ask it to do something that’s within its physical capability, then potentially
00:17:06 with a little bit of additional training or a little bit of additional trial and error,
00:17:10 it ought to be able to figure it out in much the same way as like a human teleoperator
00:17:14 ought to figure out how to drive the robot to do that.
00:17:16 That’s kind of the very pragmatic view of what it would take to kind of solve the robotics
00:17:22 problem, if you will.
00:17:24 But I think that there is a second answer, and that answer is a lot closer to why I want
00:17:29 to work on robotics, which is that I think it’s less about what it would take to do a
00:17:34 really good job in the world of robotics, but more the other way around, what robotics
00:17:39 can bring to the table to help us understand artificial intelligence.
00:17:44 So your dream fundamentally is to understand intelligence?
00:17:48 Yes.
00:17:49 And I think that’s the dream for many people who actually work in this space.
00:17:53 I think that there’s something very pragmatic and very useful about studying robotics, but
00:17:58 I do think that a lot of people that go into this field actually, you know, the things
00:18:02 that they draw inspiration from are the potential for robots to like help us learn about intelligence
00:18:09 and about ourselves.
00:18:10 So that’s fascinating that robotics is basically the space by which you can get closer to understanding
00:18:18 the fundamentals of artificial intelligence.
00:18:20 So what is it about robotics that’s different from some of the other approaches?
00:18:25 So if we look at some of the early breakthroughs in deep learning or in the computer vision
00:18:30 space and the natural language processing, there’s really nice clean benchmarks that
00:18:34 a lot of people competed on and thereby came up with a lot of brilliant ideas.
00:18:38 What’s the fundamental difference to you between computer vision purely defined and ImageNet
00:18:43 and kind of the bigger robotics problem?
00:18:46 So there are a couple of things.
00:18:48 One is that with robotics, you kind of have to take away many of the crutches.
00:18:55 So you have to deal with both the particular problems of perception control and so on,
00:19:01 but you also have to deal with the integration of those things.
00:19:04 And you know, classically, we’ve always thought of the integration as kind of a separate problem.
00:19:08 So a classic kind of modular engineering approach is that we solve the individual subproblems
00:19:12 and then wire them together and then the whole thing works.
00:19:16 And one of the things that we’ve been seeing over the last couple of decades is that, well,
00:19:19 maybe studying the thing as a whole might lead to just like very different solutions
00:19:24 than if we were to study the parts and wire them together.
00:19:26 So the integrative nature of robotics research helps us see, you know, the different perspectives
00:19:32 on the problem.
00:19:34 Another part of the answer is that with robotics, it casts a certain paradox into very sharp
00:19:40 relief.
00:19:41 This is sometimes referred to as Moravec’s paradox, the idea that in artificial intelligence,
00:19:48 things that are very hard for people can be very easy for machines and vice versa.
00:19:52 Things that are very easy for people can be very hard for machines.
00:19:54 So you know, integral and differential calculus is pretty difficult to learn for people.
00:20:02 But if you program a computer to do it, it can derive derivatives and integrals for you all
00:20:06 day long without any trouble.
00:20:08 Whereas some things like, you know, drinking from a cup of water, very easy for a person
00:20:13 to do, very hard for a robot to deal with.
00:20:16 And sometimes when we see such blatant discrepancies, that gives us a really strong hint that we’re
00:20:21 missing something important.
00:20:23 So if we really try to zero in on those discrepancies, we might find that little bit that we’re missing.
00:20:28 And it's not that we need to make machines worse at math or better at drinking
00:20:32 water, but just that by studying those discrepancies, we might find some new insight.
00:20:37 So that could be in any space, it doesn’t have to be robotics.
00:20:41 But you’re saying, I mean, it’s kind of interesting that robotics seems to have a lot of those
00:20:48 discrepancies.
00:20:49 So the Hans Moravec paradox is probably referring to the space of the physical interaction,
00:20:56 like you said, object manipulation, walking, all the kind of stuff we do in the physical
00:21:00 world.
00:21:01 How do you make sense, if you were to try to disentangle the Moravec paradox, like why is
00:21:13 there such a gap in our intuition about it?
00:21:17 Why do you think manipulating objects is so hard from everything you’ve learned from applying
00:21:23 reinforcement learning in this space?
00:21:25 Yeah, I think that one reason is maybe that for many of the other problems that we’ve
00:21:33 studied in AI and computer science and so on, the notion of input output and supervision
00:21:41 is much, much cleaner.
00:21:42 So computer vision, for example, deals with very complex inputs.
00:21:45 But it’s comparatively a bit easier, at least up to some level of abstraction, to cast it
00:21:52 as a very tightly supervised problem.
00:21:54 It’s comparatively much, much harder to cast robotic manipulation as a very tightly supervised
00:21:59 problem.
00:22:00 You can do it, it just doesn’t seem to work all that well.
00:22:03 So you could say that, well, maybe we get a labeled data set where we know exactly which
00:22:06 motor commands to send, and then we train on that.
00:22:09 But for various reasons, that’s not actually such a great solution.
00:22:13 And it also doesn’t seem to be even remotely similar to how people and animals learn to
00:22:17 do things, because we’re not told by our parents, here’s how you fire your muscles in order
00:22:22 to walk.
00:22:24 So we do get some guidance, but the really low level detailed stuff we figure out mostly
00:22:29 on our own.
00:22:30 And that’s what you mean by tightly coupled, that every single little sub action gets a
00:22:34 supervised signal of whether it’s a good one or not.
00:22:37 Right.
00:22:38 So while in computer vision, you could sort of imagine up to a level of abstraction that
00:22:41 maybe somebody told you this is a car and this is a cat and this is a dog, in motor
00:22:45 control, it’s very clear that that was not the case.
00:22:49 If we look at sort of the sub spaces of robotics, that, again, as you said, robotics integrates
00:22:57 all of them together, and we get to see how this beautiful mess interplays.
00:23:00 But so there’s nevertheless still perception.
00:23:04 So it’s the computer vision problem, broadly speaking, understanding the environment.
00:23:09 And there’s also maybe you can correct me on this kind of categorization of the space,
00:23:14 and there’s prediction in trying to anticipate what things are going to do into the future
00:23:20 in order for you to be able to act in that world.
00:23:24 And then there’s also this game theoretic aspect of how your actions will change the
00:23:31 behavior of others.
00:23:34 In this kind of space, what, and this is bigger than reinforcement learning, this is just
00:23:38 broadly looking at the problem of robotics, what’s the hardest problem here?
00:23:42 Or is there, or is what you said true that when you start to look at all of them together,
00:23:52 that’s a whole nother thing, like you can’t even say which one individually is harder
00:23:57 because all of them together, you should only be looking at them all together.
00:24:01 I think when you look at them all together, some things actually become easier.
00:24:05 And I think that’s actually pretty important.
00:24:07 So we had back in 2014, we had some work, basically our first work on end to end reinforcement
00:24:16 learning for robotic manipulation skills from vision, which at the time was something that
00:24:21 seemed a little inflammatory and controversial in the robotics world.
00:24:25 But other than the inflammatory and controversial part of it, the point that we were actually
00:24:30 trying to make in that work is that for the particular case of combining perception and
00:24:35 control, you could actually do better if you treat them together than if you try to separate
00:24:39 them.
00:24:40 And the way that we tried to demonstrate this is we picked a fairly simple motor control
00:24:43 task where a robot had to insert a little red trapezoid into a trapezoidal hole.
00:24:49 And we had our separated solution, which involved first detecting the hole using a pose detector
00:24:54 and then actuating the arm to put it in.
00:24:57 And then our end-to-end solution, which just mapped pixels to the torques.
00:25:01 And one of the things we observed is that if you use the end-to-end solution, essentially
00:25:05 the pressure on the perception part of the model is actually lower.
00:25:08 Like it doesn’t have to figure out exactly where the thing is in 3D space.
00:25:11 It just needs to figure out where it is, you know, distributing the errors in such a way
00:25:15 that the horizontal difference matters more than the vertical difference because vertically
00:25:19 it just pushes it down all the way until it can’t go any further.
00:25:22 And there, perceptual errors are a lot less harmful, whereas perpendicular to the direction
00:25:26 of motion, perceptual errors are much more harmful.
00:25:29 So the point is that if you combine these two things, you can trade off errors between
00:25:33 the components optimally to best accomplish the task.
00:25:38 And the components can actually be weaker while still leading to better overall performance.
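To make the trade-off concrete, here is a minimal sketch of the two designs being contrasted, not the actual 2014 system; all function and network names are illustrative placeholders.

```python
# Minimal sketch of the two designs being contrasted (not the actual 2014 system).
# All function and network names are illustrative placeholders.

def separated_controller(image, pose_detector, motion_planner):
    """Modular pipeline: perception and control are built and tuned independently."""
    estimated_pose = pose_detector(image)      # perception must be accurate in every direction
    torques = motion_planner(estimated_pose)   # control simply trusts the pose estimate
    return torques

def end_to_end_controller(image, policy_net):
    """End-to-end pipeline: pixels map directly to torques, so training can push
    perceptual errors into directions where the task is least sensitive."""
    return policy_net(image)
```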
00:25:41 It’s a profound idea.
00:25:44 I mean, in the space of pegs and things like that, it’s quite simple.
00:25:48 It almost is tempting to overlook, but that seems to be at least intuitively an idea that
00:25:55 should generalize to basically all aspects of perception and control, that one strengthens
00:26:01 the other.
00:26:02 Yeah.
00:26:03 And we, you know, people who have studied sort of perceptual heuristics in humans and
00:26:07 animals find things like that all the time.
00:26:08 So one very well known example of this is something called the gaze heuristic, which
00:26:12 is a little trick that you can use to intercept a flying object.
00:26:17 So if you want to catch a ball, for instance, you could try to localize it in 3D space,
00:26:21 estimate its velocity, estimate the effect of wind resistance, solve a complex system
00:26:25 of differential equations in your head.
00:26:27 Or you can maintain a running speed so that the object stays in the same position in
00:26:33 your field of view.
00:26:34 So if it dips a little bit, you speed up.
00:26:35 If it rises a little bit, you slow down.
00:26:38 And if you follow the simple rule, you’ll actually arrive at exactly the place where
00:26:40 the object lands and you’ll catch it.
00:26:43 And humans use it when they play baseball, human pilots use it when they fly airplanes
00:26:46 to figure out if they’re about to collide with somebody, frogs use this to catch insects
00:26:50 and so on and so on.
00:26:51 So this is something that actually happens in nature.
00:26:53 And I'm sure this is just one instance of it that we were able to identify, one that
00:26:57 scientists were able to identify because it's so prevalent, but there are probably
00:27:00 many others.
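A minimal sketch of the gaze heuristic as described here; the proportional speed update and the gain value are assumptions for illustration, not a claim about what people actually compute.

```python
# Minimal sketch of the gaze heuristic described above: rather than solving the
# projectile dynamics, the catcher adjusts running speed so that the ball's
# apparent elevation angle in the visual field stays roughly constant.
# The proportional update and the gain value are illustrative assumptions.

def gaze_heuristic_step(angle_now, angle_before, speed, gain=0.5):
    """Update running speed from two successive gaze angles to the ball."""
    if angle_now < angle_before:      # ball appears to dip -> speed up
        speed += gain * (angle_before - angle_now)
    elif angle_now > angle_before:    # ball appears to rise -> slow down
        speed -= gain * (angle_now - angle_before)
    return speed                      # an unchanged angle means you're on track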
00:27:01 Do you have a, just so we can zoom in as we talk about robotics, do you have a canonical
00:27:06 problem, sort of a simple, clean, beautiful representative problem in robotics that you
00:27:12 think about when you’re thinking about some of these problems?
00:27:16 We talked about robotic manipulation, to me that seems intuitively, at least the robotics
00:27:23 community has converged towards that as a space that’s the canonical problem.
00:27:28 If you agree, then maybe do you zoom in in some particular aspect of that problem that
00:27:33 you just like?
00:27:34 Like if we solve that problem perfectly, it’ll unlock a major step towards human level intelligence.
00:27:44 I don’t think I have like a really great answer to that.
00:27:46 And I think partly the reason I don’t have a great answer kind of has to do with the,
00:27:53 it has to do with the fact that the difficulty is really in the flexibility and adaptability
00:27:57 rather than in doing a particular thing really, really well.
00:28:01 So it’s hard to just say like, oh, if you can, I don’t know, like shuffle a deck of
00:28:06 cards as fast as like a Vegas casino dealer, then you’ll be very proficient.
00:28:12 It’s really the ability to quickly figure out how to do some arbitrary new thing well
00:28:21 enough to like, you know, to move on to the next arbitrary thing.
00:28:26 But the source of newness and uncertainty, have you found problems in which it’s easy
00:28:33 to generate new newnesses?
00:28:38 New types of newness.
00:28:40 Yeah.
00:28:41 So a few years ago, so if you had asked me this question around like 2016, maybe I would
00:28:46 have probably said that robotic grasping is a really great example of that because it’s
00:28:51 a task with great real world utility.
00:28:54 Like you will get a lot of money if you can do it well.
00:28:57 What is robotic grasping?
00:28:58 Picking up any object with a robotic hand.
00:29:02 Exactly.
00:29:03 So you will get a lot of money if you do it well, because lots of people want to run warehouses
00:29:06 with robots and it’s highly non trivial because very different objects will require very different
00:29:13 grasping strategies.
00:29:15 But actually since then, people have gotten really good at building systems to solve this
00:29:19 problem to the point where I’m not actually sure how much more progress we can make with
00:29:25 that as like the main guiding thing.
00:29:29 But it’s kind of interesting to see the kind of methods that have actually worked well
00:29:32 in that space because robotic grasping classically used to be regarded very much as kind of almost
00:29:39 like a geometry problem.
00:29:41 So people who have studied the history of computer vision will find this very familiar
00:29:46 that it’s kind of in the same way that in the early days of computer vision, people
00:29:49 thought of it very much as like an inverse graphics thing.
00:29:52 In robotic grasping, people thought of it as an inverse physics problem essentially.
00:29:57 You look at what’s in front of you, figure out the shapes, then use your best estimate
00:30:01 of the laws of physics to figure out where to put your fingers, and you pick up the thing.
00:30:05 And it turns out that what works really well for robotic grasping, instantiated in many different
00:30:10 recent works, including our own but also ones from many other labs, is to use learning
00:30:15 methods with some combination of either exhaustive simulation or like actual real world trial
00:30:21 and error.
00:30:22 And it turns out that those things actually work really well and then you don’t have to
00:30:24 worry about solving geometry problems or physics problems.
00:30:29 What are, just by the way, in the grasping, what are the difficulties that have been worked
00:30:35 on?
00:30:36 So one is like the materials of things, maybe occlusions on the perception side.
00:30:41 Why is it such a difficult, why is picking stuff up such a difficult problem?
00:30:45 Yeah, it’s a difficult problem because the number of things that you might have to deal
00:30:50 with or the variety of things that you have to deal with is extremely large.
00:30:54 And oftentimes things that work for one class of objects won’t work for other classes of
00:30:59 objects.
00:31:00 So if you, if you get really good at picking up boxes and now you have to pick up plastic
00:31:05 bags, you know, you just need to employ a very different strategy.
00:31:09 And there are many properties of objects that are more than just their geometry that has
00:31:15 to do with, you know, the bits that are easier to pick up, the bits that are hard to pick
00:31:19 up, the bits that are more flexible, the bits that will cause the thing to pivot and bend
00:31:23 and drop out of your hand versus the bits that result in a nice secure grasp.
00:31:28 Things that are flexible, things that if you pick them up the wrong way, they’ll fall upside
00:31:31 down and the contents will spill out.
00:31:33 So there’s all these little details that come up, but the task is still kind of can be characterized
00:31:38 as one task.
00:31:39 Like there’s a very clear notion of you did it or you didn’t do it.
00:31:43 So in terms of spilling things, there creeps in this notion that starts to sound and feel
00:31:50 like common sense reasoning.
00:31:53 Do you think solving the general problem of robotics requires common sense reasoning,
00:32:01 requires general intelligence, this kind of human level capability of, you know, like
00:32:09 you said, be robust and deal with uncertainty, but also be able to sort of reason and assimilate
00:32:14 different pieces of knowledge that you have?
00:32:17 Yeah.
00:32:18 What are your thoughts on the need
00:32:23 for common sense reasoning in the space of the general robotics problem?
00:32:28 So I’m going to slightly dodge that question and say that I think maybe actually it’s the
00:32:32 other way around is that studying robotics can help us understand how to put common sense
00:32:38 into our AI systems.
00:32:40 One way to think about common sense is that, and why our current systems might lack common
00:32:45 sense is that common sense is an emergent property of actually having to interact with
00:32:51 a particular world, a particular universe, and get things done in that universe.
00:32:56 So you might think that, for instance, like an image captioning system, maybe it looks
00:33:01 at pictures of the world and it types out English sentences.
00:33:05 So it kind of deals with our world.
00:33:09 And then you can easily construct situations where image captioning systems do things that
00:33:12 defy common sense, like give it a picture of a person wearing a fur coat and we’ll say
00:33:16 it’s a teddy bear.
00:33:18 But I think what’s really happening in those settings is that the system doesn’t actually
00:33:22 live in our world.
00:33:24 It lives in its own world that consists of pixels and English sentences and doesn’t actually
00:33:28 consist of having to put on a fur coat in the winter so you don’t get cold.
00:33:33 So perhaps the reason for the disconnect is that the systems that we have now simply inhabit
00:33:39 a different universe.
00:33:40 And if we build AI systems that are forced to deal with all of the messiness and complexity
00:33:45 of our universe, maybe they will have to acquire common sense to essentially maximize their
00:33:50 utility.
00:33:51 Whereas the systems we’re building now don’t have to do that.
00:33:53 They can take some shortcuts.
00:33:56 That’s fascinating.
00:33:57 You’ve a couple of times already sort of reframed the role of robotics in this whole thing.
00:34:02 And for some reason, I don’t know if my way of thinking is common, but I thought like
00:34:08 we need to understand and solve intelligence in order to solve robotics.
00:34:13 And you’re kind of framing it as, no, robotics is one of the best ways to just study artificial
00:34:18 intelligence and build sort of like, robotics is like the right space in which you get to
00:34:24 explore some of the fundamental learning mechanisms, fundamental sort of multimodal multitask aggregation
00:34:33 of knowledge mechanisms that are required for general intelligence.
00:34:36 It’s really interesting way to think about it, but let me ask about learning.
00:34:41 Can the general sort of robotics, the epitome of the robotics problem be solved purely through
00:34:47 learning, perhaps end to end learning, sort of learning from scratch as opposed to injecting
00:34:55 human expertise and rules and heuristics and so on?
00:35:00 I think that in terms of the spirit of the question, I would say yes.
00:35:04 I mean, I think that though in some ways it’s maybe like an overly sharp dichotomy, I think
00:35:12 that in some ways when we build algorithms, at some point a person does something, a person
00:35:20 turned on the computer, a person implemented a TensorFlow.
00:35:26 But yeah, I think that in terms of the point that you’re getting at, I do think the answer
00:35:29 is yes.
00:35:30 I think that we can solve many problems that have previously required meticulous manual
00:35:36 engineering through automated optimization techniques.
00:35:40 And actually one thing I will say on this topic is I don’t think this is actually a
00:35:43 very radical or very new idea.
00:35:45 I think people have been thinking about automated optimization techniques as a way to do control
00:35:51 for a very, very long time.
00:35:53 And in some ways what’s changed is really more the name.
00:35:58 So today we would say that, oh, my robot does machine learning, it does reinforcement learning.
00:36:03 Maybe in the 1960s you’d say, oh, my robot is doing optimal control.
00:36:08 And maybe the difference between typing out a system of differential equations and doing
00:36:12 feedback linearization versus training a neural net, maybe it’s not such a large difference.
00:36:17 It’s just pushing the optimization deeper and deeper into the thing.
00:36:21 Well, it’s interesting you think that way, but especially with deep learning that the
00:36:28 accumulation of sort of experiences in data form to form deep representations starts to
00:36:35 feel like knowledge as opposed to optimal control.
00:36:38 So this feels like there’s an accumulation of knowledge through the learning process.
00:36:42 Yes.
00:36:43 Yeah.
00:36:44 So I think that is a good point.
00:36:45 That one big difference between learning based systems and classic optimal control systems
00:36:49 is that learning based systems in principle should get better and better the more they
00:36:53 do something.
00:36:54 Right.
00:36:55 And I do think that that’s actually a very, very powerful difference.
00:36:58 So if we look back at the world of expert systems and symbolic AI and so on of using
00:37:04 logic to accumulate expertise, human expertise, human encoded expertise, do you think that
00:37:11 will have a role at some point?
00:37:13 The deep learning, machine learning, reinforcement learning has shown incredible results and
00:37:20 breakthroughs and just inspired thousands, maybe millions of researchers.
00:37:26 But there’s this less popular now, but it used to be popular idea of symbolic AI.
00:37:32 Do you think that will have a role?
00:37:35 I think in some ways the descendants of symbolic AI actually already have a role.
00:37:44 So this is the highly biased history from my perspective.
00:37:49 You could say that, well, initially we thought that rational decision making involves logical
00:37:53 manipulation.
00:37:54 So you have some model of the world expressed in terms of logic.
00:37:59 You have some query, like what action do I take in order for X to be true?
00:38:04 And then you manipulate your logical symbolic representation to get an answer.
00:38:08 What that turned into somewhere in the 1990s is, well, instead of building kind of predicates
00:38:14 and statements that have true or false values, we’ll build probabilistic systems where things
00:38:20 have probabilities associated and probabilities of being true and false.
00:38:23 And that turned into Bayes nets.
00:38:25 And that provided sort of a boost to what were really still essentially logical inference
00:38:30 systems, just probabilistic logical inference systems.
00:38:33 And then people said, well, let’s actually learn the individual probabilities inside
00:38:37 these models.
00:38:39 And then people said, well, let’s not even specify the nodes in the models, let’s just
00:38:43 put a big neural net in there.
00:38:45 But in many ways, I see these as actually kind of descendants from the same idea.
00:38:48 It’s essentially instantiating rational decision making by means of some inference process
00:38:54 and learning by means of an optimization process.
00:38:57 So in a sense, I would say, yes, that it has a place.
00:39:00 And in many ways that place is, it already holds that place.
00:39:04 It’s already in there.
00:39:05 Yeah.
00:39:06 It’s just quite different.
00:39:07 It looks slightly different than it was before.
00:39:09 Yeah.
00:39:10 But there are some things that we can think about that make this a little bit more obvious.
00:39:13 Like if I train a big neural net model to predict what will happen in response to my
00:39:17 robot’s actions, and then I run probabilistic inference, meaning I invert that model to
00:39:22 figure out the actions that lead to some plausible outcome, like to me, that seems like a kind
00:39:26 of logic.
00:39:27 You have a model of the world that just happens to be expressed by a neural net, and you are
00:39:32 doing some inference procedure, some sort of manipulation on that model to figure out
00:39:37 the answer to a query that you have.
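A minimal sketch of that kind of model inversion, assuming a learned forward model with the signature model(state, action) and a small set of candidate actions; the exhaustive search is purely illustrative.

```python
import numpy as np

# Minimal sketch of the inference described above: a learned forward model answers
# "what if I took this action?", and a crude search inverts it by picking the action
# whose predicted outcome lands closest to a desired goal state. The model interface
# and the exhaustive search over candidates are illustrative assumptions.

def plan_with_learned_model(model, state, goal, candidate_actions):
    """Return the candidate action whose predicted next state is nearest the goal."""
    best_action, best_distance = None, np.inf
    for action in candidate_actions:
        predicted_next = model(state, action)              # what-if query to the neural net
        distance = np.linalg.norm(predicted_next - goal)   # how close is the predicted outcome
        if distance < best_distance:
            best_action, best_distance = action, distance
    return best_action
```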
00:39:39 It’s the interpretability.
00:39:41 It’s the explainability, though, that seems to be lacking more so because the nice thing
00:39:46 about sort of expert systems is you can follow the reasoning of the system that to us mere
00:39:52 humans is somehow compelling.
00:39:56 It's just I don't know what to make of this fact that there's a human desire for intelligent
00:40:04 systems to be able to convey in a poetic way to us why they made the decisions they did, like
00:40:12 tell a convincing story.
00:40:15 And perhaps that's like a silly human thing, like we shouldn't expect that of intelligent
00:40:22 systems.
00:40:23 I'm super happy that there are intelligent systems out there.
00:40:27 But if I were to sort of psychoanalyze the researchers at the time, I would say expert
00:40:33 systems connected to that part, that desire of AI researchers for systems to be explainable.
00:40:40 I mean, maybe on that topic, do you have a hope that sort of inferences of learning based
00:40:48 systems will be as explainable as the dream was with expert systems, for example?
00:40:55 I think it’s a very complicated question because I think that in some ways the question of
00:40:59 explainability is kind of very closely tied to the question of like performance, like,
00:41:07 you know, why do you want your system to explain itself so that when it screws up, you can
00:41:11 kind of figure out why it did it.
00:41:14 But in some ways that’s a much bigger problem, actually.
00:41:17 Like your system might screw up and then it might screw up in how it explains itself.
00:41:22 Or you might have some bug somewhere so that it’s not actually doing what it was supposed
00:41:26 to do.
00:41:27 So, you know, maybe a good way to view that problem is really as a problem, as a bigger
00:41:32 problem of verification and validation, of which explainability is sort of one component.
00:41:38 I see.
00:41:39 I just see it differently.
00:41:41 I see explainability, you put it beautifully, I think you actually summarize the field of
00:41:45 explainability.
00:41:46 But to me, there’s another aspect of explainability, which is like storytelling that has nothing
00:41:52 to do with errors or with, like, it uses errors as elements of its story as opposed to a fundamental
00:42:05 need to be explainable when errors occur.
00:42:08 It’s just that for other intelligent systems to be in our world, we seem to want to tell
00:42:12 each other stories.
00:42:14 And that’s true in the political world, that’s true in the academic world.
00:42:19 And that, you know, neural networks are less capable of doing that, or perhaps they’re
00:42:24 equally capable of storytelling.
00:42:26 And in storytelling, maybe it doesn't matter what the fundamentals of the system are.
00:42:30 You just need to be a good storyteller.
00:42:32 Maybe one specific story I can tell you about in that space is actually about some work
00:42:38 that was done by my former collaborator, who’s now a professor at MIT named Jacob Andreas.
00:42:43 Jacob actually works in natural language processing, but he had this idea to do a little bit of
00:42:47 work in reinforcement learning on how natural language can basically structure the internals
00:42:53 of policies trained with RL.
00:42:55 And one of the things he did is he set up a model that attempts to perform some task
00:43:01 that’s defined by a reward function, but the model reads in a natural language instruction.
00:43:06 So this is a pretty common thing to do in instruction following.
00:43:08 So you tell it like, you know, go to the red house and then it’s supposed to go to the red house.
00:43:13 But then one of the things that Jacob did is he treated that sentence, not as a command
00:43:18 from a person, but as a representation of the internal kind of a state of the mind of
00:43:25 this policy, essentially.
00:43:26 So that when it was faced with a new task, what it would do is it would basically try
00:43:30 to think of possible language descriptions, attempt to do them and see if they led to
00:43:34 the right outcome.
00:43:35 So it would kind of think out loud, like, you know, I’m faced with this new task.
00:43:38 What am I going to do?
00:43:39 Let me go to the red house.
00:43:40 Oh, that didn’t work.
00:43:41 Let me go to the blue room or something.
00:43:43 Let me go to the green plant.
00:43:45 And once it got some reward, it would say, oh, go to the green plant.
00:43:47 That’s what’s working.
00:43:48 I’m going to go to the green plant.
00:43:49 And then you could look at the string that it came up with, and that was a description
00:43:51 of how it thought it should solve the problem.
00:43:54 So you could do, you could basically incorporate language as internal state and you can start
00:43:58 getting some handle on these kinds of things.
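A schematic sketch of that loop, loosely following the description here rather than the actual implementation from Jacob Andreas's work; the environment interface (reset/step returning observation, reward, done) and the helper names are assumptions.

```python
# Schematic sketch of the loop described above, loosely following the description
# rather than the actual implementation from Jacob Andreas's work. The environment
# interface (reset/step returning obs, reward, done) and helper names are assumptions.

def run_episode(env, policy, max_steps=100):
    """Roll out a policy and return the total reward collected."""
    obs, total_reward = env.reset(), 0.0
    for _ in range(max_steps):
        obs, reward, done = env.step(policy(obs))
        total_reward += reward
        if done:
            break
    return total_reward

def solve_new_task(candidate_descriptions, instruction_policy, env):
    """'Think out loud': try candidate language descriptions as the policy's
    internal state and keep whichever one earns reward."""
    for description in candidate_descriptions:          # e.g. "go to the red house"
        score = run_episode(env, lambda obs: instruction_policy(description, obs))
        if score > 0:                                    # this description worked
            return description
    return None                                          # nothing worked; keep exploring
```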
00:44:01 And then what I was kind of trying to get to is that also, if you add to the reward
00:44:05 function, the convincingness of that story.
00:44:10 So I have another reward signal of like people who review that story, how much they like
00:44:15 it.
00:44:16 So that, you know, initially that could be a hyperparameter sort of hard coded heuristic
00:44:22 type of thing, but it’s an interesting notion of the convincingness of the story becoming
00:44:30 part of the reward function, the objective function of the explainability.
00:44:34 That’s in the world of sort of Twitter and fake news, that might be a scary notion that
00:44:40 the nature of truth may not be as important as the convincingness of the, how convincing
00:44:45 you are in telling the story around the facts.
00:44:49 Well, let me ask the basic question.
00:44:55 You’re one of the world class researchers in reinforcement learning, deep reinforcement
00:44:58 learning, certainly in the robotic space.
00:45:01 What is reinforcement learning?
00:45:04 I think that what reinforcement learning refers to today is really just the kind of the modern
00:45:09 incarnation of learning based control.
00:45:13 So classically reinforcement learning has a much more narrow definition, which is that
00:45:16 it’s literally learning from reinforcement, like the thing does something and then it
00:45:20 gets a reward or punishment.
00:45:22 But really I think the way the term is used today is it’s used to refer more broadly to
00:45:26 learning based control.
00:45:28 So some kind of system that’s supposed to be controlling something and it uses data
00:45:33 to get better.
00:45:34 And what does control mean?
00:45:35 So this action is the fundamental element there.
00:45:38 It means making rational decisions.
00:45:41 And rational decisions are decisions that maximize a measure of utility.
00:45:44 And sequentially, so you made decisions time and time and time again.
00:45:48 Now like it’s easier to see that kind of idea in the space of maybe games and the space
00:45:54 of robotics.
00:45:55 Do you see it bigger than that?
00:45:58 Is it applicable?
00:45:59 Like where are the limits of the applicability of reinforcement learning?
00:46:04 Yeah, so rational decision making is essentially the encapsulation of the AI problem viewed
00:46:12 through a particular lens.
00:46:13 So any problem that we would want a machine to do, an intelligent machine, can likely
00:46:18 be represented as a decision making problem.
00:46:20 Labeling images is a decision making problem, although not a sequential one typically.
00:46:26 Controlling a chemical plant is a decision making problem.
00:46:30 Deciding what videos to recommend on YouTube is a decision making problem.
00:46:34 And one of the really appealing things about reinforcement learning is if it does encapsulate
00:46:39 the range of all these decision making problems, perhaps working on reinforcement learning
00:46:43 is one of the ways to reach a very broad swath of AI problems.
00:46:50 What is the fundamental difference between reinforcement learning and maybe supervised
00:46:55 machine learning?
00:46:57 So reinforcement learning can be viewed as a generalization of supervised machine learning.
00:47:02 You can certainly cast supervised learning as a reinforcement learning problem.
00:47:05 You can just say your loss function is the negative of your reward.
00:47:09 But you have stronger assumptions.
00:47:10 You have the assumption that someone actually told you what the correct answer was, that
00:47:14 your data was IID and so on.
00:47:16 So you could view reinforcement learning as essentially relaxing some of those assumptions.
00:47:20 Now that’s not always a very productive way to look at it because if you actually have
00:47:22 a supervised learning problem, you’ll probably solve it much more effectively by using supervised
00:47:26 learning methods because it’s easier.
00:47:29 But you can view reinforcement learning as a generalization of that.
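As a minimal illustration of that reduction, a labeled example can be wrapped as a one-step decision problem whose reward is the negative of the loss; all names here are illustrative.

```python
# Minimal sketch of the reduction described above: a supervised example cast as a
# one-step decision problem whose reward is the negative of the loss. The stronger
# assumptions show up as the given label and the IID draw of x. Names are illustrative.

def supervised_example_as_rl(model, x, y_true, loss_fn):
    """One-step 'episode': observe x, act by predicting, receive reward = -loss."""
    observation = x                      # the state/observation is just the input
    action = model(observation)          # the 'action' is the model's prediction
    reward = -loss_fn(action, y_true)    # supervision: the correct answer is provided
    return reward
```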
00:47:32 No, for sure.
00:47:33 But they’re fundamentally different.
00:47:36 That’s a mathematical statement.
00:47:37 That’s absolutely correct.
00:47:38 But it seems that with reinforcement learning, the kind of tools we bring to the table
00:47:43 today are different.
00:47:44 So maybe down the line, everything will be a reinforcement learning problem.
00:47:49 Just like you said, image classification should be mapped to a reinforcement learning problem.
00:47:53 But today, the tools and ideas, the way we think about them are different, sort of supervised
00:48:01 learning has been used very effectively to solve basic narrow AI problems.
00:48:07 Reinforcement learning kind of represents the dream of AI.
00:48:11 It’s very much so in the research space now in sort of captivating the imagination of
00:48:17 people of what we can do with intelligent systems, but it hasn’t yet had as wide of
00:48:22 an impact as the supervised learning approaches.
00:48:25 So my question comes from the more practical sense, like what do you see as the gap between
00:48:32 the more general reinforcement learning and the very specific, yes, it's sequential decision
00:48:38 making but with just one step in the sequence, of supervised learning?
00:48:43 So from a practical standpoint, I think that one thing that is potentially a little tough
00:48:49 now, and this is I think something that we’ll see, this is a gap that we might see closing
00:48:53 over the next couple of years, is the ability of reinforcement learning algorithms to effectively
00:48:57 utilize large amounts of prior data.
00:49:00 So one of the reasons why it’s a bit difficult today to use reinforcement learning for all
00:49:05 the things that we might want to use it for is that in most of the settings where we want
00:49:10 to do rational decision making, it’s a little bit tough to just deploy some policy that
00:49:15 does crazy stuff and learns purely through trial and error.
00:49:18 It’s much easier to collect a lot of data, a lot of logs of some other policy that you’ve
00:49:23 got, and then maybe if you can get a good policy out of that, then you deploy it and
00:49:28 let it kind of fine tune a little bit.
00:49:30 But algorithmically, it’s quite difficult to do that.
00:49:33 So I think that once we figure out how to get reinforcement learning to bootstrap effectively
00:49:37 from large data sets, then we’ll see very, very rapid growth in applications of these
00:49:44 technologies.
00:49:45 So this is what’s referred to as off policy reinforcement learning or offline RL or batch
00:49:48 RL.
00:49:50 And I think we’re seeing a lot of research right now that’s bringing us closer and closer
00:49:53 to that.
00:49:54 Can you maybe paint the picture of the different methods?
00:49:57 So you said off policy, what’s value based reinforcement learning?
00:50:02 What’s policy based?
00:50:03 What’s model based?
00:50:04 What’s off policy, on policy?
00:50:05 What are the different categories of reinforcement learning?
00:50:07 Okay.
00:50:08 So one way we can think about reinforcement learning is that it’s, in some very fundamental
00:50:14 way, it’s about learning models that can answer kind of what if questions.
00:50:20 So what would happen if I take this action that I hadn’t taken before?
00:50:24 And you do that, of course, from experience, from data.
00:50:26 And oftentimes you do it in a loop.
00:50:28 So you build a model that answers these what if questions, use it to figure out the best
00:50:32 action you can take, and then go and try taking that and see if the outcome agrees with what
00:50:36 you predicted.
00:50:38 So the different kinds of techniques basically refer to different ways of doing it.
00:50:43 So model based methods answer a question of what state you would get, basically what would
00:50:48 happen to the world if you were to take a certain action.
00:50:50 Value based methods, they answer the question of what value you would get, meaning what
00:50:55 utility you would get.
00:50:57 But in a sense, they’re not really all that different because they’re both really just
00:51:00 answering these what if questions.
00:51:03 Now unfortunately for us, with current machine learning methods, answering what if questions
00:51:07 can be really hard because they are really questions about things that didn’t happen.
00:51:12 If you wanted to answer what if questions about things that did happen, you wouldn’t
00:51:14 need a learned model.
00:51:15 You would just like repeat the thing that worked before.
00:51:19 And that’s really a big part of why RL is a little bit tough.
00:51:23 So if you have a purely on policy kind of online process, then you ask these what if
00:51:28 questions, you make some mistakes, then you go and try doing those mistaken things.
00:51:33 And then you observe kind of the counterexamples that will teach you not to do those things
00:51:36 again.
00:51:37 If you have a bunch of off policy data and you just want to synthesize the best policy
00:51:42 you can out of that data, then you really have to deal with the challenges of making
00:51:46 these counterfactual predictions.
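As a concrete illustration of the two kinds of "what if" models described above, here is a minimal sketch in Python; dynamics, q_value, and reward are hypothetical learned components standing in for whatever you have fit to data, not any particular library's API.

```python
import numpy as np

# Hypothetical learned components (illustrative placeholders, not a real API):
#   dynamics(state, action) -> predicted next state   (model-based "what if")
#   q_value(state, action)  -> predicted return        (value-based "what if")
#   reward(state)           -> how good a state is

def choose_action_model_based(state, candidate_actions, dynamics, reward):
    # Ask "what would happen to the world if I took this action?" and score
    # the imagined next state.
    scores = [reward(dynamics(state, a)) for a in candidate_actions]
    return candidate_actions[int(np.argmax(scores))]

def choose_action_value_based(state, candidate_actions, q_value):
    # Ask "what utility would I get if I took this action?" directly.
    scores = [q_value(state, a) for a in candidate_actions]
    return candidate_actions[int(np.argmax(scores))]
```

Both would sit inside the loop described above: fit the model on past experience, pick the action with the best predicted outcome, go try it, and check whether the prediction held.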
00:51:47 First of all, what’s a policy?
00:51:50 A policy is a model or some kind of function that maps from observations of the world to
00:51:59 actions.
00:52:00 So in reinforcement learning, we often refer to the current configuration of the world
00:52:05 as the state.
00:52:06 So we say the state kind of encompasses everything you need to fully define where the world is
00:52:10 at the moment.
00:52:11 And depending on how we formulate the problem, we might say you either get to see the state
00:52:15 or you get to see an observation, which is some snapshot, some piece of the state.
00:52:19 So the policy just takes in everything it needs in order to be able to act in this world.
00:52:25 Yes.
00:52:26 And so what does off policy mean?
00:52:29 Yeah, so the terms on policy and off policy refer to how you get your data.
00:52:33 So if you get your data from somebody else who was doing some other stuff, maybe you
00:52:37 get your data from some manually programmed system that was just running in the world
00:52:43 before that’s referred to as off policy data.
00:52:46 But if you got the data by actually acting in the world based on what your current policy
00:52:50 thinks is good, we call that on policy data.
00:52:53 And obviously on policy data is more useful to you because if your current policy makes
00:52:58 some bad decisions, you will actually see that those decisions are bad.
00:53:01 Off policy data, however, might be much easier to obtain because maybe that’s all the logged
00:53:06 data that you have from before.
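To make the distinction concrete, here is a small sketch in Python of a policy as a function from observations to actions, together with the two ways of getting data. The env object with reset() and step() is a hypothetical stand-in, and the toy linear policy is purely illustrative.

```python
import numpy as np

def policy(observation, weights):
    # A policy: a function from observations to actions. Here a toy
    # linear policy with a bit of exploration noise, purely illustrative.
    mean_action = weights @ observation
    return mean_action + 0.1 * np.random.randn(*np.shape(mean_action))

def collect_on_policy(env, weights, n_steps):
    # On-policy data: gathered by acting with the current policy, so the
    # policy's own bad decisions show up in the data and can be corrected.
    # `env` is a hypothetical environment whose step(action) returns
    # (next_obs, reward, done).
    data, obs = [], env.reset()
    for _ in range(n_steps):
        act = policy(obs, weights)
        next_obs, rew, done = env.step(act)
        data.append((obs, act, rew, next_obs))
        obs = env.reset() if done else next_obs
    return data

def load_off_policy(log_path):
    # Off-policy data: logs produced earlier by some other controller
    # (a human operator, a hand-coded system, an older policy).
    return np.load(log_path, allow_pickle=True)
```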
00:53:08 So we talked about offline, talked about autonomous vehicles, so you can envision off policy kinds
00:53:14 of approaches in robotics spaces where there are already a ton of robots out there, but they
00:53:19 don't get the luxury of being able to explore based on a reinforcement learning framework.
00:53:26 So how do we make, again, open question, but how do we make off policy methods work?
00:53:32 Yeah.
00:53:33 So this is something that has been kind of a big open problem for a while.
00:53:37 And in the last few years, people have made a little bit of progress on that.
00:53:41 You know, I can tell you about, and it’s not by any means solved yet, but I can tell you
00:53:44 some of the things that, for example, we’ve done to try to address some of the challenges.
00:53:49 It turns out that one really big challenge with off policy reinforcement learning is
00:53:53 that you can’t really trust your models to give accurate predictions for any possible
00:53:59 action.
00:54:00 So if I’ve never tried to, if in my data set I never saw somebody steering the car off
00:54:05 the road onto the sidewalk, my value function or my model is probably not going to predict
00:54:11 the right thing if I ask what would happen if I were to steer the car off the road onto
00:54:14 the sidewalk.
00:54:15 So one of the important things you have to do to get off policy RL to work is you have
00:54:20 to be able to figure out whether a given action will result in a trustworthy prediction or
00:54:24 not.
00:54:25 And you can use kinds of distribution estimation methods, density estimation methods,
00:54:31 to try to figure that out.
00:54:32 So you could figure out that, well, this action, my model is telling me that it’s great, but
00:54:35 it looks totally different from any action I’ve taken before, so my model is probably
00:54:38 not correct.
00:54:39 And you can incorporate regularization terms into your learning objective that will essentially
00:54:45 tell you not to ask those questions that your model is unable to answer.
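One simple way to picture the idea he describes, trusting value estimates only for actions that look like the logged data, is sketched below. The Gaussian density over logged actions and the penalty threshold are illustrative assumptions; actual methods typically use richer density models or constrained objectives.

```python
import numpy as np

def fit_behavior_density(logged_actions):
    # Crude Gaussian density over the actions seen in the logged data.
    mean = logged_actions.mean(axis=0)
    cov = np.cov(logged_actions, rowvar=False) + 1e-6 * np.eye(logged_actions.shape[1])
    return mean, cov

def log_density(action, mean, cov):
    diff = action - mean
    k = len(mean)
    return -0.5 * (diff @ np.linalg.solve(cov, diff)
                   + np.log(np.linalg.det(cov)) + k * np.log(2.0 * np.pi))

def conservative_value(q_estimate, action, mean, cov, alpha=1.0, threshold=-10.0):
    # If a candidate action looks nothing like the logged actions, don't trust
    # the value model's optimistic estimate: penalize it, so the learner is
    # discouraged from asking questions the model cannot answer.
    ld = log_density(action, mean, cov)
    return q_estimate - alpha * max(0.0, threshold - ld)
```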
00:54:50 What would lead to breakthroughs in this space, do you think?
00:54:54 Like what’s needed?
00:54:55 Is this a data set question?
00:54:57 Do we need to collect big benchmark data sets that allow us to explore the space?
00:55:03 Is it a new kinds of methodologies?
00:55:08 Like what’s your sense?
00:55:09 Or maybe coming together in a space of robotics and defining the right problem to be working
00:55:14 on?
00:55:15 I think for off policy reinforcement learning in particular, it’s very much an algorithms
00:55:18 question right now.
00:55:19 And this is something that I think is great because an algorithms question is one that
00:55:25 just takes some very smart people to get together and think about it really hard, whereas if
00:55:29 it was like a data problem or a hardware problem, that would take some serious engineering.
00:55:34 So that’s why I’m pretty excited about that problem because I think that we’re in a position
00:55:38 where we can make some real progress on it just by coming up with the right algorithms.
00:55:42 In terms of which algorithms they could be, the problems at their core are very related
00:55:47 to problems in things like causal inference.
00:55:51 Because what you’re really dealing with is situations where you have a model, a statistical
00:55:55 model, that’s trying to make predictions about things that it hadn’t seen before.
00:56:00 And if it’s a model that’s generalizing properly, that’ll make good predictions.
00:56:04 If it’s a model that picks up on spurious correlations, that will not generalize properly.
00:56:09 And then you have an arsenal of tools you can use.
00:56:11 You could, for example, figure out what are the regions where it’s trustworthy, or on
00:56:15 the other hand, you could try to make it generalize better somehow, or some combination of the
00:56:18 two.
00:56:20 Is there room for mixing where most of it, like 90, 95% is off policy, you already have
00:56:30 the data set, and then you get to send the robot out to do a little exploration?
00:56:36 What’s that role of mixing them together?
00:56:38 Yeah, absolutely.
00:56:39 I think that this is something that you actually described very well at the beginning of our
00:56:45 discussion when you talked about the iceberg.
00:56:47 This is the iceberg.
00:56:48 The 99% of your prior experience, that’s your iceberg.
00:56:51 You’d use that for off policy reinforcement learning.
00:56:54 And then, of course, if you’ve never opened that particular kind of door with that particular
00:56:59 lock before, then you have to go out and fiddle with it a little bit.
00:57:02 And that’s that additional 1% to help you figure out a new task.
00:57:05 And I think that’s actually a pretty good recipe going forward.
00:57:08 Is this, to you, the most exciting space of reinforcement learning now?
00:57:12 Or is there, what’s, and maybe taking a step back, not just now, but what’s, to you, is
00:57:18 the most beautiful idea, apologize for the romanticized question, but the beautiful idea
00:57:23 or concept in reinforcement learning?
00:57:27 In general, I actually think that one of the things that is a very beautiful idea in reinforcement
00:57:32 learning is just the idea that you can obtain a near optimal control or near optimal policy
00:57:41 without actually having a complete model of the world.
00:57:45 This is, you know, it’s something that feels perhaps kind of obvious if you just hear the
00:57:53 term reinforcement learning or you think about trial and error learning.
00:57:55 But from a controls perspective, it’s a very weird thing because classically, you know,
00:58:01 we think about engineered systems and controlling engineered systems as the problem of writing
00:58:07 down some equations and then figuring out given these equations, you know, basically
00:58:11 solve for X, figure out the thing that maximizes its performance.
00:58:16 And the theory of reinforcement learning actually gives us a mathematically principled framework
00:58:21 to think, to reason about, you know, optimizing some quantity when you don’t actually know
00:58:27 the equations that govern that system.
00:58:28 And I don’t, to me, that’s actually seems kind of, you know, very elegant, not something
00:58:35 that sort of becomes immediately obvious, at least in the mathematical sense.
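In standard textbook notation, added here only for reference, the point can be written as optimizing the expected return under the trajectory distribution induced by the policy:

```latex
\max_{\pi}\; J(\pi) \;=\; \mathbb{E}_{\tau \sim p_{\pi}(\tau)}\Big[\sum_{t=0}^{T} \gamma^{t}\, r(s_t, a_t)\Big],
\qquad
p_{\pi}(\tau) \;=\; p(s_0)\,\prod_{t=0}^{T} \pi(a_t \mid s_t)\, p(s_{t+1} \mid s_t, a_t)
```

The dynamics p(s_{t+1} | s_t, a_t) are never written down; the algorithm only ever draws samples from them by acting, which is exactly the "you don't know the equations" setting being described.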
00:58:40 Does it make sense to you that it works at all?
00:58:42 Well, I think it makes sense when you take some time to think about it, but it is a little
00:58:48 surprising.
00:58:49 Well, then taking a step into the more deeper representations, which is also very surprising
00:58:56 of sort of the richness of the state space, the space of environments that this kind of
00:59:04 approach can operate in, can you maybe say what is deep reinforcement learning?
00:59:10 Well, deep reinforcement learning simply refers to taking reinforcement learning algorithms
00:59:16 and combining them with high capacity neural net representations.
00:59:20 Which is, you know, kind of, it might at first seem like a pretty arbitrary thing, just take
00:59:24 these two components and stick them together.
00:59:26 But the reason that it’s something that has become so important in recent years is that
00:59:32 reinforcement learning, it kind of faces an exacerbated version of a problem that has
00:59:38 faced many other machine learning techniques.
00:59:40 So if we go back to like, you know, the early two thousands or the late nineties, we’ll
00:59:45 see a lot of research on machine learning methods that have some very appealing mathematical
00:59:50 properties, like they reduce to convex optimization problems, for instance, but they require very
00:59:56 special inputs.
00:59:57 They require a representation of the input that is clean in some way.
01:00:01 Like for example, clean in the sense that the classes in your multi class classification
01:00:06 problems separate linearly.
01:00:07 So they have some kind of good representation and we call this a feature representation.
01:00:12 And for a long time, people were very worried about features in the world of supervised
01:00:15 learning because somebody had to actually build those features so you couldn’t just
01:00:18 take an image and plug it into your logistic regression or your SVM or something.
01:00:22 You had to take that image and process it using some handwritten code.
01:00:26 And then neural nets came along and they could actually learn the features and suddenly we
01:00:30 could apply learning directly to the raw inputs, which was great for images, but it was even
01:00:35 more great for all the other fields where people hadn’t come up with good features yet.
01:00:40 And one of those fields was actually reinforcement learning, because in reinforcement learning,
01:00:43 the notion of features, if you don’t use neural nets and you have to design your own features
01:00:46 is very, very opaque.
01:00:48 Like it’s very hard to imagine, let’s say I’m playing chess or go.
01:00:53 What is a feature with which I can represent the value function for go or even the optimal
01:00:58 policy for go linearly?
01:00:59 Like I don’t even know how to start thinking about it.
01:01:03 And people tried all sorts of things; they would write down, you know, an expert chess
01:01:06 player looks for whether the knight is in the middle of the board or not.
01:01:09 So that's a feature: is the knight in the middle of the board.
01:01:11 And they would write these like long lists of kind of arbitrary made up stuff.
01:01:15 And that was really kind of getting us nowhere.
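To contrast the hand-designed-feature approach with the deep one, here is a minimal PyTorch-style sketch of a Q-network that consumes a raw state and learns its own internal features; the sizes and names are made up purely for illustration.

```python
import torch
import torch.nn as nn

# Instead of hand-written features ("is the knight in the middle of the board?"),
# a deep Q-network consumes the raw state and learns its own internal features.
class QNetwork(nn.Module):
    def __init__(self, state_dim=64, num_actions=10, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, num_actions),  # one value estimate per action
        )

    def forward(self, state):
        return self.net(state)

q = QNetwork()
raw_state = torch.randn(1, 64)        # e.g., a flattened board, no feature engineering
q_values = q(raw_state)               # the learned "features" live in the hidden layers
best_action = q_values.argmax(dim=-1)
```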
01:01:17 And that’s a little, chess is a little more accessible than the robotics problem.
01:01:21 Absolutely.
01:01:22 Right.
01:01:23 There’s at least experts in the different features for chess, but still like the neural
01:01:30 network there, to me, that’s, I mean, you put it eloquently and almost made it seem
01:01:35 like a natural step to add neural networks, but the fact that neural networks are able
01:01:41 to discover features in the control problem, it’s very interesting.
01:01:45 It’s hopeful.
01:01:46 I’m not sure what to think about it, but it feels hopeful that the control problem has
01:01:51 features to be learned.
01:01:54 Like I guess my question is, is it surprising to you how far the deep side of deep reinforcement
01:02:02 learning has been able to go, like the space of problems it has been able to tackle, especially
01:02:07 in games with AlphaStar and AlphaZero, and just the representation power there and in
01:02:17 the robotics space, and what is your sense of the limits of this representation power
01:02:23 in the control context?
01:02:26 I think that in regard to the limits here, one thing that makes it
01:02:32 a little hard to fully answer this question is that in settings where we would like
01:02:39 to push these things to the limit, we encounter other bottlenecks.
01:02:44 So like the reason that I can't get my robot to learn how to, like, I don't know, do the
01:02:51 dishes in the kitchen, it's not because its neural net is not big enough.
01:02:56 It’s because when you try to actually do trial and error learning, reinforcement learning,
01:03:02 directly in the real world where you have the potential to gather these large, highly
01:03:07 varied and complex data sets, you start running into other problems.
01:03:11 Like one problem you run into very quickly, it’ll first sound like a very pragmatic problem,
01:03:16 but it actually turns out to be a pretty deep scientific problem.
01:03:19 Take the robot, put it in your kitchen, have it try to learn to do the dishes with trial
01:03:22 and error.
01:03:23 It’ll break all your dishes and then we’ll have no more dishes to clean.
01:03:27 Now you might think this is a very practical issue, but there’s something to this, which
01:03:30 is that if you have a person trying to do this, a person will have some degree of common
01:03:33 sense.
01:03:34 They’ll break one dish, they’ll be a little more careful with the next one, and if they
01:03:37 break all of them, they’re going to go and get more or something like that.
01:03:41 So there’s all sorts of scaffolding that comes very naturally to us for our learning process.
01:03:46 Like if I have to learn something through trial and error, I have the common sense to
01:03:50 know that I have to try multiple times.
01:03:53 If I screw something up, I ask for help or I reset things or something like that.
01:03:57 And all of that is kind of outside of the classic reinforcement learning problem formulation.
01:04:02 There are other things that can also be categorized as kind of scaffolding, but are very important.
01:04:07 Like for example, where do you get your reward function?
01:04:09 If I want to learn how to pour a cup of water, well, how do I know if I’ve done it correctly?
01:04:15 Now that probably requires an entire computer vision system to be built just to determine
01:04:18 that, and that seems a little bit inelegant.
01:04:21 So there are all sorts of things like this that start to come up when we think through
01:04:24 what we really need to get reinforcement learning to happen at scale in the real world.
01:04:28 And many of these things actually suggest a little bit of a shortcoming in the problem
01:04:32 formulation and a few deeper questions that we have to resolve.
01:04:36 That’s really interesting.
01:04:37 I talked to David Silver about AlphaZero, and it seems like there’s no, again, we haven’t
01:04:45 hit the limit at all in the context where there’s no broken dishes.
01:04:50 So in the case of Go, it's really about just scaling compute.
01:04:55 So again, like the bottleneck is the amount of money you're willing to invest in compute,
01:05:00 and then maybe the scaffolding around how difficult it is to scale compute,
01:05:06 but there, there's no limit.
01:05:09 And it’s interesting, now we’ll move to the real world and there’s the broken dishes,
01:05:12 there’s all the, and the reward function, like you mentioned, that’s really nice.
01:05:17 So what, how do we push forward there?
01:05:19 Do you think there’s, there’s this kind of a sample efficiency question that people bring
01:05:25 up of, you know, not having to break a hundred thousand dishes.
01:05:30 Is this an algorithm question?
01:05:33 Is this a data selection like question?
01:05:37 What do you think?
01:05:38 How do we, how do we not break too many dishes?
01:05:41 Yeah.
01:05:42 Well, one way we can think about that is that maybe we need to be better at reusing
01:05:51 our data, building that iceberg.
01:05:54 So perhaps it's too much to hope that a machine in isolation,
01:06:02 in a vacuum without anything else, can just master complex tasks in, like, minutes the
01:06:07 way that people do, but perhaps it also doesn't have to, perhaps what it really needs to do
01:06:10 is have an existence, a lifetime where it does many things and the previous things that
01:06:16 it has done, prepare it to do new things more efficiently.
01:06:20 And you know, the study of these kinds of questions typically falls under categories
01:06:24 like multitask learning or meta learning, but they all fundamentally deal with the same
01:06:29 general theme, which is use experience for doing other things to learn to do new things
01:06:35 efficiently and quickly.
01:06:37 So what do you think about, if we just look at the one particular case study of Tesla
01:06:41 Autopilot, which is quickly approaching a million vehicles on the road, where some
01:06:48 percentage of the time, 30, 40% of the time, it is driven using the computer vision multitask
01:06:54 HydraNet, right?
01:06:57 That's what they call it, HydraNet.
01:07:03 And the other percent is human controlled.
01:07:06 In the human side, how can we use that data?
01:07:09 What’s your sense?
01:07:12 What’s the signal?
01:07:13 Do you have ideas in this autonomous vehicle space when people can lose their lives?
01:07:17 You know, it’s a safety critical environment.
01:07:21 So how do we use that data?
01:07:23 So I think that actually the kind of problems that come up when we want systems that are
01:07:33 reliable and that can kind of understand the limits of their capabilities, they’re actually
01:07:37 very similar to the kind of problems that come up when we’re doing off policy reinforcement
01:07:40 learning.
01:07:41 So as I mentioned before, in off policy reinforcement learning, the big problem is you need to know
01:07:46 when you can trust the predictions of your model, because if you’re trying to evaluate
01:07:50 some pattern of behavior for which your model doesn’t give you an accurate prediction, then
01:07:54 you shouldn’t use that to modify your policy.
01:07:57 It’s actually very similar to the problem that we’re faced when we actually then deploy
01:08:00 that thing and we want to decide whether we trust it in the moment or not.
01:08:05 So perhaps we just need to do a better job of figuring out that part, and that’s a very
01:08:08 deep research question, of course, but it’s also a question that a lot of people are working
01:08:11 on.
01:08:12 So I’m pretty optimistic that we can make some progress on that over the next few years.
01:08:15 What’s the role of simulation in reinforcement learning, deep reinforcement learning, reinforcement
01:08:20 learning?
01:08:21 Like how essential is it?
01:08:23 It’s been essential for the breakthroughs so far for some interesting breakthroughs.
01:08:28 Do you think it’s a crutch that we rely on?
01:08:31 I mean, again, this connects to our off policy discussion, but do you think we can ever get
01:08:37 rid of simulation or do you think simulation will actually take over?
01:08:40 We’ll create more and more realistic simulations that will allow us to solve actual real world
01:08:46 problems, like transfer the models we learn in simulation to real world problems.
01:08:49 I think that simulation is a very pragmatic tool that we can use to get a lot of useful
01:08:54 stuff to work right now, but I think that in the long run, we will need to build machines
01:09:00 that can learn from real data because that’s the only way that we’ll get them to improve
01:09:03 perpetually because if we can’t have our machines learn from real data, if they have to rely
01:09:08 on simulated data, eventually the simulator becomes the bottleneck.
01:09:11 In fact, this is a general thing.
01:09:13 If your machine has any bottleneck that is built by humans and that doesn’t improve from
01:09:19 data, it will eventually be the thing that holds it back.
01:09:23 And if you’re entirely reliant on your simulator, that’ll be the bottleneck.
01:09:25 If you’re entirely reliant on a manually designed controller, that’s going to be the bottleneck.
01:09:30 So simulation is very useful.
01:09:32 It’s very pragmatic, but it’s not a substitute for being able to utilize real experience.
01:09:39 And by the way, this is something that I think is quite relevant now, especially in the context
01:09:44 of some of the things we’ve discussed, because some of these kind of scaffolding issues that
01:09:48 I mentioned, things like the broken dishes and the unknown reward function, like these
01:09:52 are not problems that you would ever stumble on when working in a purely simulated kind
01:09:57 of environment, but they become very apparent when we try to actually run these things in
01:10:01 the real world.
01:10:02 To throw a brief wrench into our discussion, let me ask, do you think we’re living in a
01:10:07 simulation?
01:10:08 Oh, I have no idea.
01:10:09 Do you think that’s a useful thing to even think about, about the fundamental physics
01:10:15 nature of reality?
01:10:18 Or another perspective, the reason I think the simulation hypothesis is interesting is
01:10:24 to think about how difficult is it to create sort of a virtual reality game type situation
01:10:33 that will be sufficiently convincing to us humans or sufficiently enjoyable that we wouldn’t
01:10:38 want to leave.
01:10:39 I mean, that’s actually a practical engineering challenge.
01:10:43 And I personally really enjoy virtual reality, but it’s quite far away.
01:10:47 I kind of think about what would it take for me to want to spend more time in virtual reality
01:10:52 versus the real world.
01:10:55 And that’s a sort of a nice clean question because at that point, if I want to live in
01:11:03 a virtual reality, that means we’re just a few years away where a majority of the population
01:11:08 lives in a virtual reality.
01:11:09 And that’s how we create the simulation, right?
01:11:11 You don’t need to actually simulate the quantum gravity and just every aspect of the universe.
01:11:19 And that’s an interesting question for reinforcement learning too, is if we want to make sufficiently
01:11:24 realistic simulations that may blend the difference between sort of the real world and the simulation,
01:11:32 thereby just some of the things we’ve been talking about, kind of the problems go away
01:11:37 if we can create actually interesting, rich simulations.
01:11:40 It’s an interesting question.
01:11:41 And it actually, I think your question casts your previous question in a very interesting
01:11:46 light, because in some ways asking whether we can, well, the more kind of practical version
01:11:53 is like, you know, can we build simulators that are good enough to train essentially
01:11:57 AI systems that will work in the world?
01:12:02 And it’s kind of interesting to think about this, about what this implies, if true, it
01:12:06 kind of implies that it’s easier to create the universe than it is to create a brain.
01:12:11 And that seems like, put this way, it seems kind of weird.
01:12:14 The aspect of the simulation most interesting to me is the simulation of other humans.
01:12:21 That seems to be a complexity that makes the robotics problem harder.
01:12:27 Now I don’t know if every robotics person agrees with that notion.
01:12:32 Just as a quick aside, what are your thoughts about when the human enters the picture of
01:12:38 the robotics problem?
01:12:39 How does that change the reinforcement learning problem, the learning problem in general?
01:12:44 Yeah, I think that’s a, it’s a kind of a complex question.
01:12:48 And I guess my hope for a while had been that if we build these robotic learning systems
01:12:56 that are multitask, that utilize lots of prior data and that learn from their own experience,
01:13:03 the bit where they have to interact with people will be perhaps handled in much the same way
01:13:07 as all the other bits.
01:13:08 So if they have prior experience of interacting with people and they can learn from their
01:13:12 own experience of interacting with people for this new task, maybe that’ll be enough.
01:13:16 Now, of course, if it’s not enough, there are many other things we can do and there’s
01:13:20 quite a bit of research in that area.
01:13:22 But I think it’s worth a shot to see whether the multi agent interaction, the ability to
01:13:29 understand that other beings in the world have their own goals and tensions and thoughts
01:13:35 and so on, whether that kind of understanding can emerge automatically from simply learning
01:13:41 to do things with and maximize utility.
01:13:44 That information arises from the data.
01:13:46 You’ve said something about gravity, that you don’t need to explicitly inject anything
01:13:53 into the system.
01:13:54 It can be learned from the data.
01:13:55 And gravity is an example of something that could be learned from data, so like the physics
01:13:59 of the world.
01:14:05 What are the limits of what we can learn from data?
01:14:08 Do you really think we can?
01:14:10 So a very simple, clean way to ask that is, do you really think we can learn gravity from
01:14:15 just data, the idea, the laws of gravity?
01:14:19 So something that I think is a common kind of pitfall when thinking about prior knowledge
01:14:25 and learning is to assume that just because we know something, it's better to
01:14:33 tell the machine about that rather than have it figure it out on its own.
01:14:36 In many cases, things that are important that affect many of the events that the machine
01:14:44 will experience are actually pretty easy to learn.
01:14:48 If every time you drop something, it falls down, yeah, you might get Newton's version,
01:14:54 not Einstein’s version, but it’ll be pretty good and it will probably be sufficient for
01:14:58 you to act rationally in the world because you see the phenomenon all the time.
01:15:03 So things that are readily apparent from the data, we might not need to specify those by
01:15:07 hand.
01:15:08 It might actually be easier to let the machine figure them out.
01:15:10 It just feels like there might be a space of many local minima in terms of theories
01:15:17 of this world that we would discover and get stuck on, that Newtonian mechanics is not necessarily
01:15:25 easy to come by.
01:15:27 Yeah.
01:15:28 And in fact, in some fields of science, for example, human civilization is itself full
01:15:33 of these local optima.
01:15:34 So for example, if you think about how people tried to figure out biology and medicine for
01:15:40 the longest time, the kind of rules, the kind of principles that serve us very well in our
01:15:45 day to day lives actually serve us very poorly in understanding medicine and biology.
01:15:50 We had kind of very superstitious and weird ideas about how the body worked until the
01:15:55 advent of the modern scientific method.
01:15:58 So that does seem to be a failing of this approach, but it’s also a failing of human
01:16:02 intelligence arguably.
01:16:04 Maybe a small aside, but, you know, the idea of self play is fascinating in reinforcement
01:16:09 learning, sort of creating a competitive context in which agents can
01:16:14 play against each other at, sort of, the same skill level and thereby increase each
01:16:20 other's skill level.
01:16:21 It seems this kind of self improving mechanism is exceptionally powerful in the
01:16:26 contexts where it can be applied.
01:16:29 First of all, is it beautiful to you that this mechanism works as well as it does?
01:16:34 And also can we generalize to other contexts like in the robotic space or anything that’s
01:16:41 applicable to the real world?
01:16:43 I think that it’s a very interesting idea, but I suspect that the bottleneck to actually
01:16:51 generalizing it to the robotic setting is actually going to be the same as the bottleneck
01:16:56 for everything else that we need to be able to build machines that can get better and
01:17:01 better through natural interaction with the world.
01:17:04 And once we can do that, then they can go out and play with, they can play with each
01:17:08 other, they can play with people, they can play with the natural environment.
01:17:13 But before we get there, we've got all these other problems that we have to get out
01:17:16 of the way.
01:17:17 So there's no shortcut around that.
01:17:18 You have to interact with the natural environment.
01:17:21 Well, because in a self play setting, you still need a mediating mechanism.
01:17:24 So the reason that, you know, self play works for a board game is because the rules
01:17:30 of that board game mediate the interaction between the agents.
01:17:33 So the kind of intelligent behavior that will emerge depends very heavily on the nature
01:17:37 of that mediating mechanism.
01:17:39 So on the side of reward functions, coming up with good reward functions seems
01:17:44 to be the thing that we associate with general intelligence, like human beings seem to value
01:17:50 the idea of developing our own reward functions, of, you know, arriving at meaning and so
01:17:57 on.
01:17:58 And yet for reinforcement learning, we often kind of specify that as the given.
01:18:02 What’s your sense of how we develop reward, you know, good reward functions?
01:18:08 Yeah, I think that’s a very complicated and very deep question.
01:18:12 And you’re completely right that classically in reinforcement learning, this question,
01:18:16 I guess, kind of been treated as an on issue that you sort of treat the reward as this
01:18:21 external thing that comes from some other bit of your biology and you kind of don’t
01:18:27 worry about it.
01:18:28 And I do think that that’s actually, you know, a little bit of a mistake that we should worry
01:18:32 about it.
01:18:33 And we can approach it in a few different ways.
01:18:34 We can approach it, for instance, by thinking of rewards as a communication medium.
01:18:39 We can say, well, how does a person communicate to a robot what its objective is?
01:18:43 You can approach it also as a sort of more of an intrinsic motivation medium.
01:18:47 You could say, can we write down kind of a general objective that leads to good capability?
01:18:55 Like for example, can you write down some objectives such that even in the absence of
01:18:58 any other task, if you maximize that objective, you’ll sort of learn useful things.
01:19:02 This is something that has sometimes been called unsupervised reinforcement learning,
01:19:07 which I think is a really fascinating area of research, especially today.
01:19:11 We’ve done a bit of work on that recently.
01:19:13 One of the things we’ve studied is whether we can have some notion of unsupervised reinforcement
01:19:19 learning by means of, you know, information theoretic quantities, like for instance, minimizing
01:19:25 a Bayesian measure of surprise.
01:19:26 This is an idea that was, you know, pioneered actually in the computational neuroscience
01:19:30 community by folks like Carl Friston.
01:19:32 And we’ve done some work recently that shows that you can actually learn pretty interesting
01:19:35 skills by essentially behaving in a way that allows you to make accurate predictions about
01:19:41 the world.
01:19:42 Like do the things that will lead to you getting the right answer for prediction.
01:19:48 But you can, you know, by doing this, you can sort of discover stable niches in the
01:19:52 world.
01:19:53 You can discover that if you’re playing Tetris, then correctly, you know, clearing the rows
01:19:57 will let you play Tetris for longer and keep the board nice and clean, which sort of satisfies
01:20:01 some desire for order in the world.
01:20:04 And as a result, get some degree of leverage over your domain.
01:20:07 So we’re exploring that pretty actively.
01:20:08 Is there a role for a human notion of curiosity in itself being the reward, sort of discovering
01:20:15 new things about the world?
01:20:19 So one of the things that I’m pretty interested in is actually whether discovering new things
01:20:26 can actually be an emergent property of some other objective that quantifies capability.
01:20:30 So new things for the sake of new things might not by itself be the right
01:20:36 answer, but perhaps we can figure out an objective for which discovering new things is actually
01:20:42 the natural consequence.
01:20:44 That’s something we’re working on right now, but I don’t have a clear answer for you there
01:20:47 yet that’s still a work in progress.
01:20:49 You mean just that it’s a curious observation to see sort of creative patterns of curiosity
01:20:57 on the way to optimize for a particular task?
01:21:00 On the way to optimize for a particular measure of capability.
01:21:05 Is there ways to understand or anticipate unexpected unintended consequences of particular
01:21:15 reward functions, sort of anticipate the kind of strategies that might be developed and
01:21:22 try to avoid highly detrimental strategies?
01:21:27 So classically, this is something that has been pretty hard in reinforcement learning
01:21:30 because it’s difficult for a designer to have good intuition about, you know, what a learning
01:21:35 algorithm will come up with when they give it some objective.
01:21:38 There are ways to mitigate that.
01:21:40 One way to mitigate it is to actually define an objective that says like, don’t do weird
01:21:45 stuff.
01:21:46 You can actually quantify it.
01:21:47 You can say just like, don’t enter situations that have low probability under the distribution
01:21:52 of states you’ve seen before.
01:21:54 It turns out that that’s actually one very good way to do off policy reinforcement learning
01:21:57 actually.
01:21:59 So we can do some things like that.
01:22:02 If we slowly venture in speaking about reward functions into greater and greater levels
01:22:08 of intelligence, there’s, I mean, Stuart Russell thinks about this, the alignment of AI systems
01:22:16 with us humans.
01:22:18 So how do we ensure that AGI systems align with us humans?
01:22:23 It’s kind of a reward function question of specifying the behavior of AI systems such
01:22:32 that their success aligns with this, with the broader intended success interest of human
01:22:39 beings.
01:22:40 Do you have thoughts on this?
01:22:41 Do you have kind of concerns of where reinforcement learning fits into this, or are you really
01:22:45 focused on the current moment of us being quite far away and trying to solve the robotics
01:22:50 problem?
01:22:51 I don’t have a great answer to this, but, you know, and I do think that this is a problem
01:22:56 that’s important to figure out.
01:22:59 For my part, I'm actually a bit more concerned about the other side of this equation,
01:23:04 that, you know, maybe rather than unintended consequences for objectives that are specified
01:23:11 too well, I’m actually more worried right now about unintended consequences for objectives
01:23:15 that are not optimized well enough, which might become a very pressing problem when
01:23:21 we, for instance, try to use these techniques for safety critical systems like cars and
01:23:26 aircraft and so on.
01:23:28 I think at some point we’ll face the issue of objectives being optimized too well, but
01:23:32 right now I think we’re, we’re more likely to face the issue of them not being optimized
01:23:36 well enough.
01:23:37 But you don’t think unintended consequences can arise even when you’re far from optimality,
01:23:41 sort of like on the path to it?
01:23:43 Oh no, I think unintended consequences can absolutely arise.
01:23:46 It’s just, I think right now the bottleneck for improving reliability, safety and things
01:23:52 like that is more with systems that like need to work better, that need to optimize their
01:23:57 objectives better.
01:23:58 Do you have thoughts or concerns about existential threats of human level intelligence?
01:24:05 If we put on our hat of looking 10, 20, 100, 500 years from now, do you have concerns
01:24:11 about existential threats of AI systems?
01:24:15 I think there are absolutely existential threats for AI systems, just like there are for any
01:24:19 powerful technology.
01:24:22 But I think that these kinds of problems can take many forms, and some of those
01:24:28 forms will come down to, you know, people with nefarious intent.
01:24:34 Some of them will come down to AI systems that have some fatal flaws.
01:24:38 And some of them will of course come down to AI systems that are too capable in
01:24:42 some way.
01:24:44 But among this set of potential concerns, I would actually be much more concerned about
01:24:50 the first two right now than I am about the others, and principally the one with nefarious humans, because, you know,
01:24:55 just through all of human history, it's actually the nefarious humans that have been the
01:24:57 problem, not the nefarious machines.
01:25:01 And I think that right now the best that I can do to make sure things go well is to build
01:25:07 the best technology I can and also hopefully promote responsible use of that technology.
01:25:13 Do you think RL systems have something to teach us humans?
01:25:19 You said nefarious humans getting us in trouble.
01:25:21 I mean, machine learning systems have in some ways revealed to us the ethical flaws
01:25:26 in our data.
01:25:27 In that same kind of way, can reinforcement learning teach us about ourselves?
01:25:32 Has it taught something?
01:25:34 What have you learned about yourself from trying to build robots and reinforcement learning
01:25:40 systems?
01:25:42 I’m not sure what I’ve learned about myself, but maybe part of the answer to your question
01:25:49 might become a little bit more apparent once we see more widespread deployment of reinforcement
01:25:55 learning for decision making support in domains like healthcare, education, social media,
01:26:02 etc.
01:26:03 And I think we will see some interesting stuff emerge there.
01:26:06 We will see, for instance, what kind of behaviors these systems come up with in situations where
01:26:12 there is interaction with humans and where they have a possibility of influencing human
01:26:17 behavior.
01:26:18 I think we’re not quite there yet, but maybe in the next few years we’ll see some interesting
01:26:22 stuff come out in that area.
01:26:23 I hope outside the research space, because the exciting space where this could be observed
01:26:28 is sort of large companies that deal with large data, and I hope there’s some transparency.
01:26:35 One of the things that's unclear when I look at social networks and just online is why
01:26:40 an algorithm did something or whether an algorithm was even involved.
01:26:45 And that’d be interesting from a research perspective, just to observe the results of
01:26:52 algorithms, to open up that data, or to at least be sufficiently transparent about the
01:26:58 behavior of these AI systems in the real world.
01:27:02 What’s your sense?
01:27:03 I don’t know if you looked at the blog post, Bitter Lesson, by Rich Sutton, where it looks
01:27:08 at sort of the big lesson of researching AI and reinforcement learning is that simple
01:27:16 methods, general methods that leverage computation seem to work well.
01:27:21 So basically don’t try to do any kind of fancy algorithms, just wait for computation to get
01:27:26 fast.
01:27:28 Do you share this kind of intuition?
01:27:31 I think the high level idea makes a lot of sense.
01:27:34 I’m not sure that my takeaway would be that we don’t need to work on algorithms.
01:27:37 I think that my takeaway would be that we should work on general algorithms.
01:27:43 And actually, I think that this idea of needing to better automate the acquisition of experience
01:27:52 in the real world actually follows pretty naturally from Rich Sutton’s conclusion.
01:27:58 So if the claim is that automated general methods plus data leads to good results, then
01:28:06 it makes sense that we should build general methods and we should build the kind of methods
01:28:09 that we can deploy and get them to go out there and collect their experience autonomously.
01:28:14 I think that one place where the current state of things falls a little
01:28:19 bit short of that is actually going out there and collecting the data autonomously,
01:28:23 which is easy to do in a simulated board game, but very hard to do in the real world.
01:28:27 Yeah, it keeps coming back to this one problem, right?
01:28:31 Your mind is focused there now in this real world.
01:28:35 It just seems scary, the step of collecting the data, and it seems unclear to me how we
01:28:43 can do it effectively.
01:28:44 Well, you know, seven billion people in the world, each of them had to do that at some
01:28:49 point in their lives.
01:28:51 And we should leverage that experience that they've all had.
01:28:54 We should be able to try to collect that kind of data.
01:28:58 Okay, big questions.
01:29:02 Maybe stepping back through your life, what book or books, technical or fiction or philosophical,
01:29:10 had a big impact on the way you saw the world, on the way you thought about in the world,
01:29:15 your life in general?
01:29:19 And maybe what books, if it’s different, would you recommend people consider reading on their
01:29:24 own intellectual journey?
01:29:26 It could be within reinforcement learning, but it could be very much bigger.
01:29:30 I don’t know if this is like a scientifically, like, particularly meaningful answer.
01:29:39 But like, the honest answer is that I actually found a lot of the work by Isaac Asimov to
01:29:45 be very inspiring when I was younger.
01:29:47 I don’t know if that has anything to do with AI necessarily.
01:29:50 You don’t think it had a ripple effect in your life?
01:29:53 Maybe it did.
01:29:56 But yeah, I think that a vision of a future where, well, first of all,
01:30:06 artificial intelligence systems, artificial robotic systems have, you know, kind of a
01:30:10 big place, a big role in society, and where we try to imagine the sort of limiting
01:30:18 case of technological advancement and how that might play out in our future history.
01:30:25 But yeah, I think that that was in some way influential.
01:30:30 I don’t really know how.
01:30:33 I would recommend it.
01:30:34 I mean, if nothing else, you’d be well entertained.
01:30:37 When did you first yourself like fall in love with the idea of artificial intelligence,
01:30:41 get captivated by this field?
01:30:45 So my honest answer here is actually that I only really started to think about it as
01:30:52 something that I might want to do actually in graduate school pretty late.
01:30:56 And a big part of that was that until, you know, somewhere around 2009, 2010, it just
01:31:02 wasn’t really high on my priority list because I didn’t think that it was something where
01:31:06 we’re going to see very substantial advances in my lifetime.
01:31:11 And you know, maybe in terms of my career, the time when I really decided I wanted to
01:31:18 work on this was when I actually took a seminar course that was taught by Professor Andrew
01:31:23 Ng.
01:31:24 And, you know, at that point, I, of course, had like a decent understanding of the technical
01:31:29 things involved.
01:31:30 But one of the things that really resonated with me was when he said in the opening lecture
01:31:33 something to the effect of like, well, he used to have graduate students come to him
01:31:37 and talk about how they want to work on AI, and he would kind of chuckle and give them
01:31:40 some math problem to deal with.
01:31:42 But now he’s actually thinking that this is an area where we might see like substantial
01:31:45 advances in our lifetime.
01:31:47 And that kind of got me thinking because, you know, in some abstract sense, yeah, like
01:31:52 you can kind of imagine that, but in a very real sense, when someone who had been working
01:31:56 on that kind of stuff their whole career suddenly says that, yeah, like that had some effect
01:32:02 on me.
01:32:03 Yeah, this might be a special moment in the history of the field.
01:32:08 That this is where we might see some interesting breakthroughs.
01:32:14 So in the space of advice, somebody who’s interested in getting started in machine learning
01:32:19 or reinforcement learning, what advice would you give to maybe an undergraduate student
01:32:23 or maybe even younger, how, what are the first steps to take and further on what are the
01:32:30 steps to take on that journey?
01:32:32 So something that I think is important to do is to not be afraid to like spend time
01:32:43 imagining the kind of outcome that you might like to see.
01:32:46 So you know, one outcome might be a successful career, a large paycheck or something, or
01:32:51 state of the art results on some benchmark, but hopefully that’s not the thing that’s
01:32:54 like the main driving force for somebody.
01:32:57 But I think that if someone who is a student considering a career in AI like takes a little
01:33:04 while, sits down and thinks like, what do I really want to see?
01:33:07 What I want to see a machine do?
01:33:09 What do I want to see a robot do?
01:33:10 What do I want to do?
01:33:11 What do I want to see a natural language system do? Imagine, you know, imagine
01:33:15 it almost like a commercial for a future product or something, like something that
01:33:19 you'd like to see in the world, and then actually sit down and think about the steps that are
01:33:23 necessary to get there.
01:33:25 And hopefully that thing is not a better number on image net classification.
01:33:29 It’s like, it’s probably like an actual thing that we can’t do today that would be really
01:33:32 awesome.
01:33:33 Whether it’s a robot Butler or a, you know, a really awesome healthcare decision making
01:33:38 support system, whatever it is that you find inspiring.
01:33:41 And I think that thinking about that and then backtracking from there and imagining the
01:33:45 steps needed to get there will actually lead to much better research.
01:33:48 It’ll lead to rethinking the assumptions.
01:33:50 It’ll lead to working on the bottlenecks that other people aren’t working on.
01:33:55 And then naturally to turn to you, we’ve talked about reward functions and you just give an
01:34:01 advice on looking forward, how you’d like to see, what kind of change you would like
01:34:05 to make in the world.
01:34:06 What do you think, ridiculous, big question, what do you think is the meaning of life?
01:34:11 What is the meaning of your life?
01:34:13 What gives you fulfillment, purpose, happiness and meaning?
01:34:20 That’s a very big question.
01:34:24 What’s the reward function under which you are operating?
01:34:27 Yeah.
01:34:28 I think one thing that does give, you know, if not meaning, at least satisfaction is some
01:34:33 degree of confidence that I’m working on a problem that really matters.
01:34:37 I feel like it’s less important to me to like actually solve a problem, but it’s quite nice
01:34:42 to take things to spend my time on that I believe really matter.
01:34:49 And I try pretty hard to look for that.
01:34:53 I don’t know if it’s easy to answer this, but if you’re successful, what does that look
01:34:59 like?
01:35:00 What’s the big dream?
01:35:01 Now, of course, success is built on top of success and you keep going forever, but what
01:35:09 is the dream?
01:35:10 Yeah.
01:35:11 So one very concrete thing or maybe as concrete as it’s going to get here is to see machines
01:35:18 that actually get better and better the longer they exist in the world.
01:35:23 And that kind of seems like on the surface, one might even think that that’s something
01:35:26 that we have today, but I think we really don’t.
01:35:28 I think that there is an unending complexity in the universe and to date, all of the machines
01:35:38 that we’ve been able to build don’t sort of improve up to the limit of that complexity.
01:35:44 They hit a wall somewhere.
01:35:45 Maybe they hit a wall because they're in a simulator that is only a very limited,
01:35:50 very pale imitation of the real world, or they hit a wall because they rely on a labeled
01:35:54 data set, but they never hit the wall of, like, running out of stuff to see.
01:36:00 So I’d like to build a machine that can go as far as possible.
01:36:04 Runs up against the ceiling of the complexity of the universe.
01:36:08 Yes.
01:36:09 Well, I don’t think there’s a better way to end it, Sergey.
01:36:12 Thank you so much.
01:36:13 It’s a huge honor.
01:36:14 I can’t wait to see the amazing work that you have to publish and in education space
01:36:20 in terms of reinforcement learning.
01:36:21 Thank you for inspiring the world.
01:36:23 Thank you for the great research you do.
01:36:24 Thank you.
01:36:25 Thanks for listening to this conversation with Sergey Levine and thank you to our sponsors,
01:36:31 Cash App and ExpressVPN.
01:36:33 Please consider supporting this podcast by downloading Cash App and using code LexPodcast
01:36:40 and signing up at expressvpn.com slash LexPod.
01:36:44 Click all the links, buy all the stuff, it’s the best way to support this podcast and the
01:36:50 journey I’m on.
01:36:51 If you enjoy this thing, subscribe on YouTube, review it with five stars on Apple Podcast,
01:36:57 support it on Patreon, or connect with me on Twitter at Lex Friedman, spelled somehow
01:37:02 if you can figure out how without using the letter E, just F R I D M A N.
01:37:08 And now let me leave you with some words from Salvador Dali.
01:37:14 Intelligence without ambition is a bird without wings.
01:37:18 Thank you for listening and hope to see you next time.