Yann LeCun: Deep Learning, ConvNets, and Self-Supervised Learning #36

Transcript

00:00:00 The following is a conversation with Yann LeCun.

00:00:03 He’s considered to be one of the fathers of deep learning,

00:00:06 which, if you’ve been hiding under a rock,

00:00:09 is the recent revolution in AI that has captivated the world

00:00:12 with the possibility of what machines can learn from data.

00:00:16 He’s a professor at New York University,

00:00:18 a vice president and chief AI scientist at Facebook,

00:00:21 and a co-recipient of the Turing Award

00:00:24 for his work on deep learning.

00:00:26 He’s probably best known as the founding father

00:00:28 of convolutional neural networks,

00:00:30 in particular their application

00:00:32 to optical character recognition

00:00:34 and the famed MNIST dataset.

00:00:37 He is also an outspoken personality,

00:00:40 unafraid to speak his mind in a distinctive French accent

00:00:43 and explore provocative ideas,

00:00:45 both in the rigorous medium of academic research

00:00:48 and the somewhat less rigorous medium

00:00:51 of Twitter and Facebook.

00:00:52 This is the Artificial Intelligence Podcast.

00:00:55 If you enjoy it, subscribe on YouTube,

00:00:57 give it five stars on iTunes, support it on Patreon,

00:01:00 or simply connect with me on Twitter at Lex Fridman,

00:01:03 spelled F R I D M A N.

00:01:06 And now, here’s my conversation with Yann LeCun.

00:01:11 You said that 2001: A Space Odyssey

00:01:13 is one of your favorite movies.

00:01:16 HAL 9000 decides to get rid of the astronauts

00:01:20 for people who haven’t seen the movie, spoiler alert,

00:01:23 because he, it, she believes that the astronauts,

00:01:29 they will interfere with the mission.

00:01:31 Do you see HAL as flawed in some fundamental way

00:01:34 or even evil, or did he do the right thing?

00:01:38 Neither.

00:01:39 There’s no notion of evil in that context,

00:01:43 other than the fact that people die,

00:01:44 but it was an example of what people call

00:01:48 value misalignment, right?

00:01:50 You give an objective to a machine,

00:01:52 and the machine strives to achieve this objective.

00:01:55 And if you don’t put any constraints on this objective,

00:01:58 like don’t kill people and don’t do things like this,

00:02:02 the machine, given the power, will do stupid things

00:02:06 just to achieve this objective,

00:02:08 or damaging things to achieve this objective.

00:02:10 It’s a little bit like, I mean, we’re used to this

00:02:12 in the context of human society.

00:02:15 We put in place laws to prevent people

00:02:20 from doing bad things, because spontaneously,

00:02:22 they would do those bad things, right?

00:02:24 So we have to shape their cost function,

00:02:28 their objective function, if you want,

00:02:29 through laws to kind of correct,

00:02:31 and education, obviously, to sort of correct for those.

00:02:36 So maybe just pushing a little further on that point,

00:02:41 how, you know, there’s a mission,

00:02:44 there’s this fuzziness around,

00:02:46 the ambiguity around what the actual mission is,

00:02:49 but, you know, do you think that there will be a time,

00:02:55 from a utilitarian perspective,

00:02:56 where an AI system, where it is not misalignment,

00:02:59 where it is alignment, for the greater good of society,

00:03:02 that an AI system will make decisions that are difficult?

00:03:05 Well, that’s the trick.

00:03:06 I mean, eventually we’ll have to figure out how to do this.

00:03:10 And again, we’re not starting from scratch,

00:03:12 because we’ve been doing this with humans for millennia.

00:03:16 So designing objective functions for people

00:03:19 is something that we know how to do.

00:03:20 And we don’t do it by, you know, programming things,

00:03:24 although the legal code is called code.

00:03:29 So that tells you something.

00:03:30 And it’s actually the design of an objective function.

00:03:33 That’s really what legal code is, right?

00:03:34 It tells you, here is what you can do,

00:03:36 here is what you can’t do.

00:03:37 If you do it, you pay that much,

00:03:39 that’s an objective function.

00:03:41 So there is this idea somehow that it’s a new thing

00:03:44 for people to try to design objective functions

00:03:46 that are aligned with the common good.

00:03:47 But no, we’ve been writing laws for millennia

00:03:49 and that’s exactly what it is.

00:03:52 So that’s where, you know, the science of lawmaking

00:03:57 and computer science will.

00:04:00 Come together.

00:04:01 Will come together.

00:04:02 So there’s nothing special about HAL or AI systems,

00:04:06 it’s just the continuation of tools used

00:04:09 to make some of these difficult ethical judgments

00:04:11 that laws make.

00:04:13 Yeah, and we have systems like this already

00:04:15 that make many decisions for ourselves in society

00:04:19 that need to be designed in a way that they,

00:04:22 like rules about things that sometimes have bad side effects

00:04:27 and we have to be flexible enough about those rules

00:04:29 so that they can be broken when it’s obvious

00:04:31 that they shouldn’t be applied.

00:04:34 So you don’t see this on the camera here,

00:04:35 but all the decoration in this room

00:04:36 is all pictures from 2001: A Space Odyssey.

00:04:41 Wow, is that by accident or is there a lot?

00:04:43 No, not by accident, it’s by design.

00:04:47 Oh, wow.

00:04:48 So if you were to build HAL 10,000,

00:04:52 so an improvement of HAL 9,000, what would you improve?

00:04:57 Well, first of all, I wouldn’t ask it to hold secrets

00:05:00 and tell lies because that’s really what breaks it

00:05:03 in the end, that’s the fact that it’s asking itself

00:05:06 questions about the purpose of the mission

00:05:08 and it’s, you know, pieces things together that it’s heard,

00:05:11 you know, all the secrecy of the preparation of the mission

00:05:14 and the fact that it was the discovery

00:05:16 on the lunar surface that really was kept secret

00:05:19 and one part of HAL’s memory knows this

00:05:22 and the other part does not know it

00:05:24 and is supposed to not tell anyone

00:05:26 and that creates internal conflict.

00:05:28 So you think there never should be a set of things

00:05:32 that an AI system should not be allowed,

00:05:36 like a set of facts that should not be shared

00:05:39 with the human operators?

00:05:42 Well, I think, no, I think it should be a bit like

00:05:46 in the design of autonomous AI systems,

00:05:52 there should be the equivalent of, you know,

00:05:54 the Hippocratic oath

00:05:59 that doctors sign up to, right?

00:06:02 So there’s certain things, certain rules

00:06:04 that you have to abide by and we can sort of hardwire this

00:06:07 into our machines to kind of make sure they don’t go.

00:06:11 So I’m not, you know, an advocate of the three laws

00:06:14 of robotics, you know, the Asimov kind of thing

00:06:17 because I don’t think it’s practical,

00:06:18 but, you know, some level of limits.

00:06:23 But to be clear, these are not questions

00:06:27 that are kind of really worth asking today

00:06:32 because we just don’t have the technology to do this.

00:06:34 We don’t have autonomous intelligent machines,

00:06:36 we have intelligent machines.

00:06:37 somewhat intelligent machines that are very specialized,

00:06:41 but they don’t really sort of satisfy an objective.

00:06:43 They’re just, you know, kind of trained to do one thing.

00:06:46 So until we have some idea for design

00:06:50 of a full fledged autonomous intelligent system,

00:06:53 asking the question of how we design this objective,

00:06:55 I think is a little too abstract.

00:06:58 It’s a little too abstract.

00:06:59 There’s useful elements to it in that it helps us understand

00:07:04 our own ethical codes as humans.

00:07:07 So even just as a thought experiment,

00:07:10 if you imagine that an AGI system is here today,

00:07:14 how would we program it is a kind of nice thought experiment

00:07:17 of constructing how should we have a law,

00:07:21 have a system of laws for us humans.

00:07:24 It’s just a nice practical tool.

00:07:26 And I think there’s echoes of that idea too

00:07:29 in the AI systems we have today

00:07:32 that don’t have to be that intelligent.

00:07:33 Yeah.

00:07:34 Like autonomous vehicles.

00:07:35 These things start creeping in that are worth thinking about,

00:07:39 but certainly they shouldn’t be framed as HAL.

00:07:42 Yeah.

00:07:43 Looking back, what is the most,

00:07:46 I’m sorry if it’s a silly question,

00:07:49 but what is the most beautiful

00:07:51 or surprising idea in deep learning

00:07:53 or AI in general that you’ve ever come across?

00:07:56 Sort of personally, when you sat back

00:08:00 and just had this kind of,

00:08:01 oh, that’s pretty cool moment.

00:08:03 That’s nice.

00:08:04 That’s surprising.

00:08:05 I don’t know if it’s an idea

00:08:06 rather than a sort of empirical fact.

00:08:12 The fact that you can build gigantic neural nets,

00:08:16 train them on relatively small amounts of data

00:08:23 with stochastic gradient descent

00:08:24 and that it actually works,

00:08:26 breaks everything you read in every textbook, right?

00:08:29 Every pre deep learning textbook that told you,

00:08:32 you need to have fewer parameters

00:08:33 than you have data samples.

00:08:37 If you have a non convex objective function,

00:08:38 you have no guarantee of convergence.

00:08:40 All those things that you read in textbooks

00:08:42 and they tell you to stay away from this

00:08:43 and they’re all wrong.

00:08:45 The huge number of parameters, non convex,

00:08:48 and somehow, with an amount of data which is very small

00:08:50 relative to the number of parameters,

00:08:53 it’s able to learn anything.

00:08:54 Right.

00:08:55 Does that still surprise you today?

00:08:57 Well, it was kind of obvious to me

00:09:00 before I knew anything that this is a good idea.

00:09:04 And then it became surprising that it worked

00:09:06 because I started reading those textbooks.

00:09:09 Okay.

00:09:10 Okay.

00:09:10 So can you talk through the intuition

00:09:12 of why it was obvious to you if you remember?

00:09:14 Well, okay.

00:09:15 So the intuition was it’s sort of like,

00:09:17 those people in the late 19th century

00:09:19 who proved that heavier than air flight was impossible.

00:09:25 And of course you have birds, right?

00:09:26 They do fly.

00:09:28 And so on the face of it,

00:09:30 it’s obviously wrong as an empirical question, right?

00:09:33 And so we have the same kind of thing

00:09:34 that we know that the brain works.

00:09:38 We don’t know how, but we know it works.

00:09:39 And we know it’s a large network of neurons and interaction

00:09:43 and that learning takes place by changing the connection.

00:09:45 So kind of getting this level of inspiration

00:09:48 without copying the details,

00:09:49 but sort of trying to derive basic principles,

00:09:52 and that kind of gives you a clue

00:09:56 as to which direction to go.

00:09:58 There’s also the idea somehow that I’ve been convinced of

00:10:01 since I was an undergrad that, even before,

00:10:04 that intelligence is inseparable from learning.

00:10:06 So the idea somehow that you can create

00:10:10 an intelligent machine by basically programming,

00:10:14 for me it was a non starter from the start.

00:10:17 Every intelligent entity that we know about

00:10:20 arrives at this intelligence through learning.

00:10:24 So machine learning was a completely obvious path.

00:10:29 Also because I’m lazy, so, you know, the idea is

00:10:32 to automate basically everything,

00:10:35 and learning is the automation of intelligence.

00:10:37 So do you think, so what is learning then?

00:10:42 What falls under learning?

00:10:44 Because do you think of reasoning as learning?

00:10:48 Well, reasoning is certainly a consequence

00:10:51 of learning as well, just like other functions of the brain.

00:10:56 The big question about reasoning is,

00:10:58 how do you make reasoning compatible

00:11:00 with gradient based learning?

00:11:02 Do you think neural networks can be made to reason?

00:11:04 Yes, there is no question about that.

00:11:07 Again, we have a good example, right?

00:11:10 The question is how?

00:11:11 So the question is how much prior structure

00:11:14 do you have to put in the neural net

00:11:15 so that something like human reasoning

00:11:17 will emerge from it, you know, from learning?

00:11:20 Another question is that all of our kind of models

00:11:24 of what reasoning is that are based on logic

00:11:27 are discrete and are therefore incompatible

00:11:31 with gradient based learning.

00:11:32 And I’m a very strong believer

00:11:34 in this idea of gradient based learning.

00:11:35 I don’t believe in other types of learning

00:11:39 that don’t use kind of gradient information, if you want.

00:11:41 So you don’t like discrete mathematics?

00:11:43 You don’t like anything discrete?

00:11:45 Well, that’s, it’s not that I don’t like it,

00:11:46 it’s just that it’s incompatible with learning

00:11:49 and I’m a big fan of learning, right?

00:11:51 So in fact, that’s perhaps one reason

00:11:53 why deep learning has been kind of looked at

00:11:57 with suspicion by a lot of computer scientists

00:11:58 because the math is very different.

00:11:59 The math that you use for deep learning,

00:12:02 you know, it kind of has more to do with,

00:12:05 you know, cybernetics, the kind of math you do

00:12:08 in electrical engineering than the kind of math

00:12:10 you do in computer science.

00:12:12 And, you know, nothing in machine learning is exact, right?

00:12:15 Computer science is all about sort of, you know,

00:12:18 obsessive-compulsive attention to details of, like,

00:12:21 you know, every index has to be right.

00:12:23 And you can prove that an algorithm is correct, right?

00:12:26 Machine learning is the science of sloppiness, really.

00:12:30 That’s beautiful.

00:12:32 So, okay, maybe let’s feel around in the dark

00:12:38 of what is a neural network that reasons

00:12:41 or a system that works with continuous functions

00:12:47 that’s able to do, build knowledge,

00:12:52 however we think about reasoning,

00:12:54 build on previous knowledge, build on extra knowledge,

00:12:57 create new knowledge,

00:12:59 generalize outside of any training set ever built.

00:13:03 What does that look like?

00:13:04 If, yeah, maybe give inklings of thoughts

00:13:08 of what that might look like.

00:13:10 Yeah, I mean, yes and no.

00:13:12 If I had precise ideas about this,

00:13:14 I think, you know, we’d be building it right now.

00:13:17 And there are people working on this

00:13:19 whose main research interest is actually exactly that, right?

00:13:22 So what you need to have is a working memory.

00:13:25 So you need to have some device, if you want,

00:13:29 some subsystem that can store a relatively large number

00:13:34 of factual episodic information for, you know,

00:13:39 a reasonable amount of time.

00:13:40 So, you know, in the brain, for example,

00:13:43 there are kind of three main types of memory.

00:13:45 One is the sort of memory of the state of your cortex.

00:13:53 And that sort of disappears within 20 seconds.

00:13:55 You can’t remember things for more than about 20 seconds

00:13:58 or a minute if you don’t have any other form of memory.

00:14:02 The second type of memory, which is longer term,

00:14:04 is still short term, is the hippocampus.

00:14:06 So you can, you know, you came into this building,

00:14:08 you remember where the exit is, where the elevators are.

00:14:14 You have some map of that building

00:14:15 that’s stored in your hippocampus.

00:14:17 You might remember something about what I said,

00:14:20 you know, a few minutes ago.

00:14:21 I forgot it all already.

00:14:22 Of course, it’s been erased, but, you know,

00:14:24 but that would be in your hippocampus.

00:14:27 And then the longer term memory is in the synapse,

00:14:30 the synapses, right?

00:14:32 So what you need if you want a system

00:14:34 that’s capable of reasoning

00:14:35 is that you want the hippocampus like thing, right?

00:14:40 And that’s what people have tried to do

00:14:41 with memory networks and, you know,

00:14:43 Neural Turing Machines and stuff like that, right?

00:14:45 And now with transformers,

00:14:47 which have sort of a memory in there,

00:14:50 kind of self attention system.

00:14:51 You can think of it this way.

00:14:55 So that’s one element you need.

00:14:57 Another thing you need is some sort of network

00:14:59 that can access this memory,

00:15:03 get an information back and then kind of crunch on it

00:15:08 and then do this iteratively multiple times

00:15:10 because a chain of reasoning is a process

00:15:15 by which you update your knowledge

00:15:19 about the state of the world,

00:15:20 about, you know, what’s going to happen, et cetera.

00:15:22 And that has to be this sort of

00:15:25 recurrent operation basically.

00:15:27 And you think that kind of,

00:15:29 if we think about a transformer,

00:15:31 so that seems to be too small

00:15:32 to contain the knowledge that’s,

00:15:36 to represent the knowledge

00:15:37 that’s contained in Wikipedia, for example.

00:15:39 Well, a transformer doesn’t have this idea of recurrence.

00:15:42 It’s got a fixed number of layers

00:15:43 and that’s the number of steps that, you know,

00:15:44 limits basically its representation.

00:15:47 But recurrence would build on the knowledge somehow.

00:15:51 I mean, it would evolve the knowledge

00:15:54 and expand the amount of information perhaps

00:15:58 or useful information within that knowledge.

00:16:00 But is this something that just can emerge with size?

00:16:04 Because it seems like everything we have now is too small.

00:16:06 Not just size, no, it’s not clear.

00:16:09 I mean, how you access and write

00:16:11 into an associative memory in an efficient way.

00:16:13 I mean, sort of the original memory network

00:16:15 maybe had something like the right architecture,

00:16:17 but if you try to scale up a memory network

00:16:20 so that the memory contains all the Wikipedia,

00:16:22 it doesn’t quite work.

00:16:24 Right.

00:16:25 So there’s a need for new ideas there, okay.

00:16:28 But it’s not the only form of reasoning.

00:16:30 So there’s another form of reasoning,

00:16:31 which is true, which is very classical also

00:16:34 in some types of AI.

00:16:36 And it’s based on, let’s call it energy minimization.

00:16:40 Okay, so you have some sort of objective,

00:16:44 some energy function that represents

00:16:47 the quality or the negative quality, okay.

00:16:53 Energy goes up when things get bad

00:16:54 and it gets low when things get good.

00:16:57 So let’s say you want to figure out,

00:17:00 what gestures do I need to do

00:17:03 to grab an object or walk out the door.

00:17:08 If you have a good model of your own body,

00:17:10 a good model of the environment,

00:17:12 using this kind of energy minimization,

00:17:14 you can do planning.

00:17:16 And in optimal control,

00:17:19 it’s called model predictive control.

00:17:22 You have a model of what’s gonna happen in the world

00:17:24 as a consequence of your actions.

00:17:25 And that allows you to, by energy minimization,

00:17:28 figure out the sequence of action

00:17:29 that optimizes a particular objective function,

00:17:32 which measures, minimizes the number of times

00:17:34 you’re gonna hit something

00:17:35 and the energy you’re gonna spend

00:17:36 doing the gesture and et cetera.

00:17:39 So that’s a form of reasoning.

00:17:42 Planning is a form of reasoning.

00:17:43 And perhaps what led to the ability of humans to reason

00:17:48 is the fact that we, or species that appeared before us,

00:17:53 had to do some sort of planning

00:17:55 to be able to hunt and survive

00:17:56 and survive the winter in particular.

00:17:59 And so it’s the same capacity that you need to have.
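
To make the energy-minimization planning he describes concrete, here is a minimal sketch, assuming a known, differentiable toy dynamics model; the names (toy_dynamics, energy, plan) and the quadratic costs are illustrative, not the control systems being discussed.

import torch

def toy_dynamics(state, action):
    # Assumed toy world model: the next state is the current state plus the action.
    return state + action

def energy(states, actions, goal):
    # Energy goes up when the final state is far from the goal and when actions are large.
    goal_cost = ((states[-1] - goal) ** 2).sum()
    effort_cost = sum((a ** 2).sum() for a in actions)
    return goal_cost + 0.1 * effort_cost

def plan(start, goal, horizon=10, steps=200, lr=0.1):
    # Optimize the action sequence itself by gradient descent on the energy.
    actions = [torch.zeros(2, requires_grad=True) for _ in range(horizon)]
    opt = torch.optim.SGD(actions, lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        states, s = [start], start
        for a in actions:                 # roll the model forward "in your head"
            s = toy_dynamics(s, a)
            states.append(s)
        e = energy(states, actions, goal)
        e.backward()                      # gradients of the energy w.r.t. the actions
        opt.step()                        # adjust the planned actions to lower the energy
    return [a.detach() for a in actions]

planned = plan(start=torch.zeros(2), goal=torch.tensor([3.0, -1.0]))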

00:18:03 So in your intuition is,

00:18:07 if we look at expert systems

00:18:09 and encoding knowledge as logic systems,

00:18:13 as graphs, in this kind of way,

00:18:16 is not a useful way to think about knowledge?

00:18:20 Graphs are a little brittle or logic representation.

00:18:23 So basically, variables that have values

00:18:27 and then constraint between them

00:18:29 that are represented by rules,

00:18:31 is a little too rigid and too brittle, right?

00:18:32 So some of the early efforts in that respect

00:18:38 were to put probabilities on them.

00:18:41 So a rule, if you have this and that symptom,

00:18:44 you have this disease with that probability

00:18:47 and you should prescribe that antibiotic

00:18:49 with that probability, right?

00:18:50 That’s the MYCIN system from the ’70s.

00:18:54 And that’s what that branch of AI led to,

00:18:58 Bayesian networks and graphical models

00:19:00 and causal inference and variational methods.

00:19:04 So there is certainly a lot of interesting

00:19:10 work going on in this area.

00:19:11 The main issue with this is knowledge acquisition.

00:19:13 How do you reduce a bunch of data to a graph of this type?

00:19:18 Yeah, it relies on the expert, on the human being,

00:19:22 to encode, to add knowledge.

00:19:24 And that’s essentially impractical.

00:19:27 Yeah, it’s not scalable.

00:19:29 That’s a big question.

00:19:30 The second question is,

00:19:31 do you want to represent knowledge as symbols

00:19:34 and do you want to manipulate them with logic?

00:19:37 And again, that’s incompatible with learning.

00:19:39 So one suggestion, which Geoff Hinton

00:19:43 has been advocating for many decades,

00:19:45 is replace symbols by vectors.

00:19:49 Think of it as pattern of activities

00:19:50 in a bunch of neurons or units

00:19:53 or whatever you want to call them.

00:19:55 And replace logic by continuous functions.

00:19:59 Okay, and that becomes now compatible.

00:20:01 There’s a very good set of ideas

00:20:04 written in a paper about 10 years ago

00:20:07 by Léon Bottou, who is here at Facebook.

00:20:13 The title of the paper is,

00:20:14 From Machine Learning to Machine Reasoning.

00:20:15 And his idea is that a learning system

00:20:19 should be able to manipulate objects

00:20:20 that are in a space

00:20:23 and then put the result back in the same space.

00:20:24 So it’s this idea of working memory, basically.

00:20:28 And it’s very enlightening.

00:20:30 And in a sense, that might learn something

00:20:33 like the simple expert systems.

00:20:37 I mean, you can learn basic logic operations there.

00:20:42 Yeah, quite possibly.

00:20:43 There’s a big debate on sort of how much prior structure

00:20:46 you have to put in for this kind of stuff to emerge.

00:20:49 That’s the debate I have with Gary Marcus

00:20:50 and people like that.

00:20:51 Yeah, yeah, so, and the other person,

00:20:55 so I just talked to Judea Pearl,

00:20:57 from the causal inference world you mentioned.

00:21:00 So his worry is that the current neural networks

00:21:04 are not able to learn what causes

00:21:09 what, causal inference between things.

00:21:12 So I think he’s right and wrong about this.

00:21:15 If he’s talking about the sort of classic

00:21:20 type of neural nets,

00:21:21 people sort of didn’t worry too much about this.

00:21:23 But there’s a lot of people now working on causal inference.

00:21:26 And there’s a paper that just came out last week

00:21:27 by Léon Bottou, among others,

00:21:29 David Lopez-Paz, and a bunch of other people,

00:21:32 exactly on that problem of how do you kind of

00:21:36 get a neural net to sort of pay attention

00:21:39 to real causal relationships,

00:21:41 which may also solve issues of bias in data

00:21:46 and things like this, so.

00:21:48 I’d like to read that paper

00:21:49 because ultimately the challenge

00:21:51 also seems to fall back on the human expert

00:21:56 to decide causality between things.

00:22:01 People are not very good

00:22:02 at establishing causality, first of all.

00:22:04 So first of all, you talk to physicists

00:22:06 and physicists actually don’t believe in causality

00:22:08 because look at all the basic laws of microphysics

00:22:12 are time reversible, so there’s no causality.

00:22:15 The arrow of time is not real, yeah.

00:22:17 It’s as soon as you start looking at macroscopic systems

00:22:20 where there is unpredictable randomness,

00:22:22 where there is clearly an arrow of time,

00:22:25 but it’s a big mystery in physics, actually,

00:22:27 how that emerges.

00:22:29 Is it emergent or is it part of

00:22:31 the fundamental fabric of reality?

00:22:34 Or is it a bias of intelligent systems

00:22:36 that because of the second law of thermodynamics,

00:22:39 we perceive a particular arrow of time,

00:22:41 but in fact, it’s kind of arbitrary, right?

00:22:45 So yeah, physicists, mathematicians,

00:22:47 they don’t care about, I mean,

00:22:48 the math doesn’t care about the flow of time.

00:22:51 Well, certainly, macrophysics doesn’t.

00:22:54 People themselves are not very good

00:22:55 at establishing causal relationships.

00:22:58 If you ask, I think it was in one of Seymour Papert’s books

00:23:02 on children learning.

00:23:06 He studied with Jean Piaget.

00:23:08 He’s the guy who coauthored the book Perceptrons

00:23:11 with Marvin Minsky that kind of killed

00:23:12 the first wave of neural nets,

00:23:14 but he was actually a learning person,

00:23:17 in the sense of studying learning in humans

00:23:21 and machines; that’s why he got interested in the perceptron.

00:23:24 And he wrote that if you ask a little kid

00:23:29 about what is the cause of the wind,

00:23:33 a lot of kids will say, they will think for a while

00:23:35 and they’ll say, oh, it’s the branches in the trees,

00:23:38 they move and that creates wind, right?

00:23:40 So they get the causal relationship backwards.

00:23:42 And it’s because their understanding of the world

00:23:44 and intuitive physics is not that great, right?

00:23:46 I mean, these are like, you know, four or five year old kids.

00:23:49 You know, it gets better,

00:23:50 and then you understand that this can’t be, right?

00:23:54 But there are many things which we can,

00:23:57 because of our common sense understanding of things,

00:24:00 what people call common sense,

00:24:03 and our understanding of physics,

00:24:05 we can, there’s a lot of stuff

00:24:07 that we can figure out causality.

00:24:08 Even with diseases, we can figure out

00:24:10 what’s not causing what, often.

00:24:14 There’s a lot of mystery, of course,

00:24:16 but the idea is that you should be able

00:24:18 to encode that into systems,

00:24:20 because it seems unlikely they’d be able

00:24:21 to figure that out themselves.

00:24:22 Well, whenever we can do intervention,

00:24:24 but you know, all of humanity has been completely deluded

00:24:27 for millennia, probably since its existence,

00:24:30 about a very, very wrong causal relationship,

00:24:33 where whatever you can explain, you attribute it to,

00:24:35 you know, some deity, some divinity, right?

00:24:39 And that’s a cop out, that’s a way of saying like,

00:24:41 I don’t know the cause, so you know, God did it, right?

00:24:43 So you mentioned Marvin Minsky,

00:24:46 and the irony of, you know,

00:24:51 maybe causing the first AI winter.

00:24:54 You were there in the 90s, you were there in the 80s,

00:24:56 of course.

00:24:58 In the 90s, why do you think people lost faith

00:25:00 in deep learning, in the 90s, and found it again,

00:25:04 a decade later, over a decade later?

00:25:06 Yeah, it wasn’t called deep learning yet,

00:25:07 it was just called neural nets, but yeah,

00:25:11 they lost interest.

00:25:13 I mean, I think I would put that around 1995,

00:25:16 at least the machine learning community,

00:25:18 there was always a neural net community,

00:25:19 but it became kind of disconnected

00:25:23 from sort of mainstream machine learning, if you want.

00:25:26 There were, it was basically electrical engineering

00:25:30 that kept at it, and computer science gave up on neural nets.

00:25:38 I don’t know, you know, I was too close to it

00:25:40 to really sort of analyze it with sort of an unbiased eye,

00:25:46 if you want, but I would make a few guesses.

00:25:50 So the first one is, at the time, neural nets were,

00:25:55 it was very hard to make them work,

00:25:57 in the sense that you would implement backprop

00:26:02 in your favorite language, and that favorite language

00:26:06 was not Python, it was not MATLAB,

00:26:08 it was not any of those things,

00:26:09 because they didn’t exist, right?

00:26:10 You had to write it in Fortran or C,

00:26:13 or something like this, right?

00:26:16 So you would experiment with it,

00:26:18 you would probably make some very basic mistakes,

00:26:21 like, you know, badly initialize your weights,

00:26:23 make the network too small,

00:26:24 because you read in the textbook, you know,

00:26:25 you don’t want too many parameters, right?

00:26:27 And of course, you know, and you would train on XOR,

00:26:29 because you didn’t have any other data set to train on.

00:26:32 And of course, you know, it works half the time.

00:26:33 So you would say, I give up.

00:26:36 Also, you would train it with batch gradient,

00:26:37 which, you know, isn’t that efficient.

00:26:40 So there’s a lot of, there’s a bag of tricks

00:26:42 that you had to know to make those things work,

00:26:44 or you had to reinvent, and a lot of people just didn’t,

00:26:48 and they just couldn’t make it work.

00:26:51 So that’s one thing.

00:26:52 The investment in software platform

00:26:54 to be able to kind of, you know, display things,

00:26:58 figure out why things don’t work,

00:26:59 kind of get a good intuition for how to get them to work,

00:27:02 have enough flexibility so you can create, you know,

00:27:04 network architectures like convolutional nets

00:27:06 and stuff like that.

00:27:08 It was hard.

00:27:09 I mean, you had to write everything from scratch.

00:27:10 And again, you didn’t have any Python

00:27:11 or MATLAB or anything, right?

00:27:14 I read that, sorry to interrupt,

00:27:15 but I read that you wrote in Lisp

00:27:17 the first versions of LeNet with convolutional networks,

00:27:22 which by the way, one of my favorite languages.

00:27:25 That’s how I knew you were legit.

00:27:27 Turing award, whatever.

00:27:29 You programmed in Lisp, that’s…

00:27:30 It’s still my favorite language,

00:27:31 but it’s not that we programmed in Lisp,

00:27:34 it’s that we had to write our Lisp interpreter, okay?

00:27:38 Because it’s not like we used one that existed.

00:27:40 So we wrote a Lisp interpreter that we hooked up to,

00:27:43 you know, a backend library that we wrote also

00:27:46 for sort of neural net computation.

00:27:48 And then after a few years around 1991,

00:27:50 we invented this idea of basically having modules

00:27:54 that know how to forward propagate

00:27:56 and back propagate gradients,

00:27:57 and then interconnecting those modules in a graph.

00:28:01 Léon Bottou had made proposals on this,

00:28:03 about this in the late eighties,

00:28:04 and we were able to implement this using our Lisp system.
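
A minimal sketch of that module idea in Python rather than the original Lisp system (the class names here are illustrative): each module knows how to forward propagate values and back propagate gradients, and a network is a graph of such modules, here just a chain.

import numpy as np

class Linear:
    def __init__(self, n_in, n_out):
        self.W = np.random.randn(n_out, n_in) * 0.1
    def forward(self, x):
        self.x = x                                  # remember the input for the backward pass
        return self.W @ x
    def backward(self, grad_out):
        self.dW = np.outer(grad_out, self.x)        # gradient with respect to the weights
        return self.W.T @ grad_out                  # gradient passed to the previous module

class Tanh:
    def forward(self, x):
        self.y = np.tanh(x)
        return self.y
    def backward(self, grad_out):
        return grad_out * (1.0 - self.y ** 2)

net = [Linear(4, 8), Tanh(), Linear(8, 1)]          # the "graph" here is a simple chain
x = np.random.randn(4)
for m in net:                                       # forward propagation
    x = m.forward(x)
grad = np.ones(1)                                   # pretend dLoss/dOutput = 1
for m in reversed(net):                             # back propagation through the modules
    grad = m.backward(grad)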

00:28:08 Eventually we wanted to use that system

00:28:09 to build production code for character recognition

00:28:13 at Bell Labs.

00:28:14 So we actually wrote a compiler for that Lisp interpreter

00:28:16 so that Patrice Simard, who is now at Microsoft,

00:28:19 kind of did the bulk of it with Leon and me.

00:28:22 And so we could write our system in Lisp

00:28:24 and then compile to C,

00:28:26 and then we’ll have a self contained complete system

00:28:29 that could kind of do the entire thing.

00:28:33 Neither PyTorch nor TensorFlow can do this today.

00:28:36 Yeah, okay, it’s coming.

00:28:37 Yeah.

00:28:40 I mean, there’s something like that in PyTorch

00:28:42 called TorchScript.

00:28:44 And so, you know, we had to write our Lisp interpreter,

00:28:46 we had to write our Lisp compiler,

00:28:48 we had to invest a huge amount of effort to do this.

00:28:50 And not everybody,

00:28:52 if you don’t completely believe in the concept,

00:28:55 you’re not going to invest the time to do this.

00:28:57 Now at the time also, you know,

00:28:59 or today, this would turn into Torch or PyTorch

00:29:02 or TensorFlow or whatever,

00:29:03 we’d put it in open source, everybody would use it

00:29:05 and, you know, realize it’s good.

00:29:07 Back before 1995, working at AT&T,

00:29:11 there’s no way the lawyers would let you

00:29:13 release anything in open source of this nature.

00:29:17 And so we could not distribute our code really.

00:29:20 And on that point,

00:29:21 and sorry to go on a million tangents,

00:29:23 but on that point, I also read that there was some,

00:29:26 almost like a patent on convolutional neural networks

00:29:30 at Bell Labs.

00:29:32 So that, first of all, I mean, just.

00:29:35 There’s two actually.

00:29:38 They ran out.

00:29:39 Thankfully, in 2007.

00:29:41 In 2007.

00:29:42 So I’m gonna, what,

00:29:46 can we just talk about that for a second?

00:29:48 I know you’re a Facebook, but you’re also at NYU.

00:29:51 And what does it mean to patent ideas

00:29:55 like these software ideas, essentially?

00:29:58 Or are they mathematical ideas?

00:30:02 Or what are they?

00:30:03 Okay, so they’re not mathematical ideas.

00:30:05 They are, you know, algorithms.

00:30:07 And there was a period where the US Patent Office

00:30:11 would allow the patent of software

00:30:14 as long as it was embodied.

00:30:16 The Europeans are very different.

00:30:18 They don’t quite accept that.

00:30:20 They have a different concept.

00:30:21 But, you know, I don’t, I no longer,

00:30:24 I mean, I never actually strongly believed in this,

00:30:26 but I don’t believe in this kind of patent.

00:30:28 Facebook basically doesn’t believe in this kind of patent.

00:30:34 Google files patents because they’ve been burned by Apple.

00:30:39 And so now they do this for defensive purpose,

00:30:41 but usually they say,

00:30:42 we’re not gonna sue you if you infringe.

00:30:44 Facebook has a similar policy.

00:30:47 They say, you know, we file patents on certain things

00:30:49 for defensive purpose.

00:30:50 We’re not gonna sue you if you infringe,

00:30:52 unless you sue us.

00:30:54 So the industry does not believe in patents.

00:30:59 They are there because of, you know,

00:31:00 the legal landscape and various things.

00:31:03 But I don’t really believe in patents

00:31:06 for this kind of stuff.

00:31:07 So that’s a great thing.

00:31:09 So I…

00:31:10 I’ll tell you a worse story, actually.

00:31:11 So what happens was the first patent about convolutional net

00:31:15 was about kind of the early version of convolutional net

00:31:18 that didn’t have separate pooling layers.

00:31:19 It had convolutional layers

00:31:22 with stride more than one, if you want, right?

00:31:25 And then there was a second one on convolutional nets

00:31:28 with separate pooling layers, trained with backprop.

00:31:31 And they were filed in 1989 and 1990

00:31:35 or something like this.

00:31:36 At the time, the life of a patent was 17 years.

00:31:40 So here’s what happened over the next few years

00:31:42 is that we started developing character recognition

00:31:45 technology around convolutional nets.

00:31:48 And in 1994,

00:31:52 a check reading system was deployed in ATM machines.

00:31:56 In 1995, it was for large check reading machines

00:31:59 in back offices, et cetera.

00:32:00 And those systems were developed by an engineering group

00:32:04 that we were collaborating with at AT&T.

00:32:07 And they were commercialized by NCR,

00:32:08 which at the time was a subsidiary of AT&T.

00:32:11 Now AT&T split up in 1996,

00:32:17 early 1996.

00:32:18 And the lawyers just looked at all the patents

00:32:20 and they distributed the patents among the various companies.

00:32:23 They gave the convolutional net patent to NCR

00:32:26 because they were actually selling products that used it.

00:32:29 But nobody at NCR had any idea what a convolutional net was.

00:32:32 Yeah.

00:32:33 Okay.

00:32:34 So between 1996 and 2007,

00:32:38 so there’s a whole period until 2002

00:32:39 where I didn’t actually work on machine learning

00:32:42 or convolutional net.

00:32:42 I resumed working on this around 2002.

00:32:45 And between 2002 and 2007,

00:32:47 I was working on them, crossing my finger

00:32:49 that nobody at NCR would notice.

00:32:51 Nobody noticed.

00:32:52 Yeah, and I hope that this kind of somewhat,

00:32:55 as you said, lawyers aside,

00:32:58 relative openness of the community now will continue.

00:33:02 It accelerates the entire progress of the industry.

00:33:05 And the problems that Facebook and Google

00:33:11 and others are facing today

00:33:13 is not whether Facebook or Google or Microsoft or IBM

00:33:16 or whoever is ahead of the other.

00:33:18 It’s that we don’t have the technology

00:33:19 to build the things we want to build.

00:33:21 We want to build intelligent virtual assistants

00:33:23 that have common sense.

00:33:24 We don’t have monopoly on good ideas for this.

00:33:26 We don’t believe we do.

00:33:27 Maybe others believe they do, but we don’t.

00:33:30 Okay.

00:33:31 If a startup tells you they have the secret

00:33:33 to human level intelligence and common sense,

00:33:36 don’t believe them, they don’t.

00:33:38 And it’s gonna take the entire work

00:33:42 of the world research community for a while

00:33:45 to get to the point where you can go off

00:33:47 and each of those companies

00:33:49 kind of start to build things on this.

00:33:50 We’re not there yet.

00:33:51 It’s absolutely true, and this speaks to the gap

00:33:54 between the space of ideas

00:33:57 and the rigorous testing of those ideas

00:34:00 of practical application that you often speak to.

00:34:03 You’ve written advice saying don’t get fooled

00:34:06 by people who claim to have a solution

00:34:08 to artificial general intelligence,

00:34:10 who claim to have an AI system

00:34:11 that works just like the human brain

00:34:14 or who claim to have figured out how the brain works.

00:34:17 Ask them what the error rate they get

00:34:20 on MNIST or ImageNet.

00:34:23 So this is a little dated by the way.

00:34:25 2000, I mean five years, who’s counting?

00:34:28 Okay, but I think your opinion is still,

00:34:30 MNIST and ImageNet, yes, may be dated,

00:34:34 there may be new benchmarks, right?

00:34:36 But I think that philosophy is one you still

00:34:39 somewhat hold, that benchmarks

00:34:43 and the practical testing, the practical application

00:34:45 is where you really get to test the ideas.

00:34:48 Well, it may not be completely practical.

00:34:49 Like for example, it could be a toy data set,

00:34:52 but it has to be some sort of task

00:34:54 that the community as a whole has accepted

00:34:57 as some sort of standard kind of benchmark if you want.

00:35:00 It doesn’t need to be real.

00:35:01 So for example, many years ago here at FAIR,

00:35:05 people, Jason Weston and Antoine Bordes

00:35:07 and a few others proposed the bAbI tasks,

00:35:09 which were kind of a toy problem to test

00:35:12 the ability of machines to reason actually

00:35:14 to access working memory and things like this.

00:35:16 And it was very useful even though it wasn’t a real task.

00:35:20 MNIST is kind of halfway real task.

00:35:23 So toy problems can be very useful.

00:35:26 It’s just that I was really struck by the fact

00:35:29 that a lot of people, particularly a lot of people

00:35:31 with money to invest would be fooled by people telling them,

00:35:34 oh, we have the algorithm of the cortex

00:35:37 and you should give us 50 million.

00:35:39 Yes, absolutely.

00:35:40 So there’s a lot of people who try to take advantage

00:35:45 of the hype for business reasons and so on.

00:35:48 But let me sort of talk to this idea

00:35:50 that sort of new ideas, the ideas that push the field

00:35:55 forward may not yet have a benchmark

00:35:58 or it may be very difficult to establish a benchmark.

00:36:00 I agree.

00:36:01 That’s part of the process.

00:36:02 Establishing benchmarks is part of the process.

00:36:04 So what are your thoughts about,

00:36:07 so we have these benchmarks on around stuff we can do

00:36:10 with images from classification to captioning

00:36:14 to just every kind of information you can pull off

00:36:16 from images and the surface level.

00:36:18 There’s audio data sets, there’s some video.

00:36:22 What can we start, natural language, what kind of stuff,

00:36:27 what kind of benchmarks do you see that start creeping

00:36:30 on to more something like intelligence, like reasoning,

00:36:34 like maybe you don’t like the term,

00:36:37 but AGI echoes of that kind of formulation.

00:36:41 A lot of people are working on interactive environments

00:36:44 in which you can train and test intelligence systems.

00:36:48 So there, for example, it’s the classical paradigm

00:36:54 of supervised learning is that you have a data set,

00:36:57 you partition it into a training set, validation set,

00:37:00 test set, and there’s a clear protocol, right?

00:37:03 But that assumes that the samples

00:37:06 are statistically independent, you can exchange them,

00:37:10 the order in which you see them shouldn’t matter,

00:37:12 things like that.

00:37:13 But what if the answer you give determines

00:37:16 the next sample you see, which is the case, for example,

00:37:18 in robotics, right?

00:37:19 You robot does something and then it gets exposed

00:37:22 to a new room, and depending on where it goes,

00:37:25 the room would be different.

00:37:26 So that creates the exploration problem.

00:37:30 And it also creates a dependency

00:37:34 between samples, right?

00:37:35 If you can only move in space,

00:37:39 the next sample you’re gonna see is gonna be probably

00:37:41 in the same building, most likely, right?

00:37:44 So all the assumptions about the validity

00:37:47 of this training set, test set hypothesis break.

00:37:51 Whenever a machine can take an action

00:37:53 that has an influence on the world

00:37:54 and on what it’s gonna see.

00:37:56 So people are setting up artificial environments

00:38:00 where that takes place, right?

00:38:02 The robot runs around a 3D model of a house

00:38:05 and can interact with objects and things like this.

00:38:08 So you do robotics based simulation,

00:38:10 you have those OpenAI Gym type things

00:38:14 or MuJoCo kind of simulated robots

00:38:18 and you have games, things like that.

00:38:21 So that’s where the field is going really,

00:38:23 this kind of environment.

00:38:25 Now, back to the question of AGI.

00:38:28 I don’t like the term AGI because it implies

00:38:33 that human intelligence is general

00:38:35 and human intelligence is nothing like general.

00:38:38 It’s very, very specialized.

00:38:40 We think it’s general.

00:38:41 We’d like to think of ourselves

00:38:42 as having general intelligence.

00:38:43 We don’t, we’re very specialized.

00:38:46 We’re only slightly more general than.

00:38:47 Why does it feel general?

00:38:48 So you kind of, the term general.

00:38:52 I think what’s impressive about humans is ability to learn,

00:38:56 as we were talking about learning,

00:38:58 to learn in just so many different domains.

00:39:01 It’s perhaps not arbitrarily general,

00:39:04 but just you can learn in many domains

00:39:06 and integrate that knowledge somehow.

00:39:08 Okay.

00:39:09 The knowledge persists.

00:39:09 So let me take a very specific example.

00:39:11 Yes.

00:39:12 It’s not an example.

00:39:13 It’s more like a quasi mathematical demonstration.

00:39:17 So you have about 1 million fibers

00:39:18 coming out of one of your eyes.

00:39:20 Okay, 2 million total,

00:39:21 but let’s talk about just one of them.

00:39:23 It’s 1 million nerve fibers, your optical nerve.

00:39:27 Let’s imagine that they are binary.

00:39:28 So they can be active or inactive, right?

00:39:30 So the input to your visual cortex is 1 million bits.

00:39:34 Mm hmm.

00:39:36 Now they’re connected to your brain in a particular way,

00:39:39 and your brain has connections

00:39:41 that are kind of a little bit like a convolutional net,

00:39:44 they’re kind of local, you know, in space

00:39:46 and things like this.

00:39:47 Now, imagine I play a trick on you.

00:39:50 It’s a pretty nasty trick, I admit.

00:39:53 I cut your optical nerve,

00:39:55 and I put a device that makes a random permutation

00:39:58 of all the nerve fibers.

00:40:01 So now what comes to your brain

00:40:04 is a fixed but random permutation of all the pixels.

00:40:09 There’s no way in hell that your visual cortex,

00:40:11 even if I do this to you in infancy,

00:40:14 will actually learn vision

00:40:16 to the same level of quality that you can.

00:40:20 Got it, and you’re saying there’s no way you’d learn that?

00:40:22 No, because now two pixels that are nearby in the world

00:40:25 will end up in very different places in your visual cortex,

00:40:29 and your neurons there have no connections with each other

00:40:31 because they’re only connected locally.

00:40:33 So this whole, our entire, the hardware is built

00:40:36 in many ways to support?

00:40:38 The locality of the real world.

00:40:40 Yes, that’s specialization.

00:40:42 Yeah, but it’s still pretty damn impressive,

00:40:44 so it’s not perfect generalization, it’s not even close.

00:40:46 No, no, it’s not that it’s not even close, it’s not at all.

00:40:50 Yeah, it’s not, it’s specialized, yeah.

00:40:52 So how many Boolean functions?

00:40:54 So let’s imagine you want to train your visual system

00:40:58 to recognize particular patterns of those one million bits.

00:41:03 Okay, so that’s a Boolean function, right?

00:41:05 Either the pattern is here or not here,

00:41:07 this is a two way classification

00:41:09 with one million binary inputs.

00:41:13 How many such Boolean functions are there?

00:41:16 Okay, you have two to the one million

00:41:19 combinations of inputs,

00:41:21 for each of those you have an output bit,

00:41:24 and so you have two to the two to the one million

00:41:27 Boolean functions of this type, okay?

00:41:30 Which is an unimaginably large number.

00:41:33 How many of those functions can actually be computed

00:41:35 by your visual cortex?

00:41:37 And the answer is a tiny, tiny, tiny, tiny, tiny, tiny sliver.

00:41:41 Like an enormously tiny sliver.

00:41:43 Yeah, yeah.

00:41:44 So we are ridiculously specialized.
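
A rough worked count behind that argument, assuming (as in the thought experiment) that each of the n = 10^6 fibers carries exactly one bit:

\[
\text{input patterns} \;=\; 2^{n} \;=\; 2^{10^{6}},
\qquad
\text{Boolean functions} \;=\; 2^{\,2^{n}} \;=\; 2^{\,2^{10^{6}}},
\]

and a locally connected architecture like the visual cortex can realize only a vanishing fraction of them.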

00:41:48 Okay.

00:41:49 But, okay, that’s an argument against the word general.

00:41:54 I think there’s a, I agree with your intuition,

00:41:59 but I’m not sure it’s, it seems the brain is impressively

00:42:06 capable of adjusting to things, so.

00:42:09 It’s because we can’t imagine tasks

00:42:13 that are outside of our comprehension, right?

00:42:16 So we think we’re general because we’re general

00:42:18 of all the things that we can apprehend.

00:42:20 But there is a huge world out there

00:42:23 of things that we have no idea.

00:42:24 We call that heat, by the way.

00:42:26 Heat.

00:42:27 Heat.

00:42:28 So, at least physicists call that heat,

00:42:30 or they call it entropy, which is kind of.

00:42:33 You have a thing full of gas, right?

00:42:39 Closed system for gas.

00:42:40 Right?

00:42:41 Closed or not closed.

00:42:42 It has pressure, it has temperature, it has, you know,

00:42:47 and you can write equations, PV = nRT,

00:42:50 you know, things like that, right?

00:42:52 When you reduce the volume, the temperature goes up,

00:42:54 the pressure goes up, you know, things like that, right?

00:42:57 For perfect gas, at least.

00:42:59 Those are the things you can know about that system.

00:43:02 And it’s a tiny, tiny number of bits

00:43:04 compared to the complete information

00:43:06 of the state of the entire system.

00:43:08 Because the state of the entire system

00:43:09 will give you the position and momentum

00:43:11 of every molecule of the gas.

00:43:14 And what you don’t know about it is the entropy,

00:43:17 and you interpret it as heat.

00:43:20 The energy contained in that thing is what we call heat.

00:43:24 Now, it’s very possible that, in fact,

00:43:28 there is some very strong structure

00:43:30 in how those molecules are moving.

00:43:31 It’s just that they are in a way

00:43:33 that we are just not wired to perceive.

00:43:35 Yeah, we’re ignorant to it.

00:43:36 And there’s, in your infinite amount of things,

00:43:40 we’re not wired to perceive.

00:43:41 And you’re right, that’s a nice way to put it.

00:43:44 We’re general to all the things we can imagine,

00:43:47 which is a very tiny subset of all things that are possible.

00:43:51 So it’s like Kolmogorov complexity,

00:43:53 or the Kolmogorov-Chaitin-Solomonoff complexity.

00:43:55 Yeah.

00:43:56 You know, every bit string or every integer is random,

00:44:02 except for all the ones that you can actually write down.

00:44:05 Yeah.

00:44:10 Yeah, okay.

00:44:12 So beautifully put.

00:44:13 But, you know, so we can just call it artificial intelligence.

00:44:15 We don’t need to have a general.

00:44:17 Or human level.

00:44:18 Human level intelligence is good.

00:44:20 You know, you’ll start, anytime you touch human,

00:44:24 it gets interesting because, you know,

00:44:30 it’s because we attach ourselves to human

00:44:33 and it’s difficult to define what human intelligence is.

00:44:36 Yeah.

00:44:37 Nevertheless, my definition is maybe damn impressive

00:44:42 intelligence, okay?

00:44:43 Damn impressive demonstration of intelligence, whatever.

00:44:46 And so on that topic, most successes in deep learning

00:44:51 have been in supervised learning.

00:44:53 What is your view on unsupervised learning?

00:44:57 Is there a hope to reduce involvement of human input

00:45:03 and still have successful systems

00:45:05 that have practical use?

00:45:08 Yeah, I mean, there’s definitely a hope.

00:45:09 It’s more than a hope, actually.

00:45:11 There’s mounting evidence for it.

00:45:13 And that’s basically all I do.

00:45:16 Like, the only thing I’m interested in at the moment is,

00:45:19 I call it self supervised learning, not unsupervised.

00:45:21 Because unsupervised learning is a loaded term.

00:45:25 People who know something about machine learning,

00:45:27 you know, tell you, so you’re doing clustering or PCA,

00:45:30 which is not the case.

00:45:31 And the wider public, you know,

00:45:32 when you say unsupervised learning,

00:45:33 oh my God, machines are gonna learn by themselves

00:45:35 without supervision.

00:45:37 You know, they see this as…

00:45:39 Where’s the parents?

00:45:40 Yeah, so I call it self supervised learning

00:45:42 because, in fact, the underlying algorithms that are used

00:45:46 are the same algorithms as the supervised learning

00:45:48 algorithms, except that what we train them to do

00:45:52 is not predict a particular set of variables,

00:45:55 like the category of an image,

00:46:00 and not to predict a set of variables

00:46:02 that have been provided by human labelers.

00:46:06 But what you’re trying the machine to do

00:46:07 is basically reconstruct a piece of its input

00:46:10 that is being masked out, essentially.

00:46:14 You can think of it this way, right?

00:46:15 So show a piece of video to a machine

00:46:18 and ask it to predict what’s gonna happen next.

00:46:20 And of course, after a while, you can show what happens

00:46:23 and the machine will kind of train itself

00:46:26 to do better at that task.

00:46:28 You can do like all the latest, most successful models

00:46:32 in natural language processing,

00:46:33 use self supervised learning.

00:46:36 You know, sort of BERT style systems, for example, right?

00:46:38 You show it a window of a dozen words on a text corpus,

00:46:43 you take out 15% of the words,

00:46:46 and then you train the machine to predict the words

00:46:49 that are missing. That’s self supervised learning.

00:46:52 It’s not predicting the future,

00:46:53 it’s just predicting things in the middle,

00:46:56 but you could have it predict the future,

00:46:57 that’s what language models do.
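
To make the masking recipe he describes concrete, here is a minimal sketch in Python, assuming the text has already been turned into token ids; the helper name mask_tokens is illustrative, not BERT's actual implementation.

import random
import torch

def mask_tokens(token_ids, mask_id, mask_prob=0.15):
    # Hide roughly 15% of the tokens; the model only sees the masked input
    # and is trained to recover the hidden tokens at those positions.
    inputs, targets = list(token_ids), [-100] * len(token_ids)   # -100 = "ignore this position"
    for i in range(len(token_ids)):
        if random.random() < mask_prob:
            targets[i] = token_ids[i]
            inputs[i] = mask_id
    return torch.tensor(inputs), torch.tensor(targets)

# Training then minimizes cross-entropy between the model's probability vector
# over the lexicon at each masked position and the hidden token, e.g.:
# loss = torch.nn.functional.cross_entropy(
#     logits.view(-1, vocab_size), targets.view(-1), ignore_index=-100)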

00:46:59 So you construct, so in an unsupervised way,

00:47:01 you construct a model of language.

00:47:03 Do you think…

00:47:05 Or video or the physical world or whatever, right?

00:47:09 How far do you think that can take us?

00:47:12 Do you think BERT understands anything?

00:47:18 To some level, it has a shallow understanding of text,

00:47:23 but it needs to, I mean,

00:47:24 to have kind of true human level intelligence,

00:47:26 I think you need to ground language in reality.

00:47:29 So some people are attempting to do this, right?

00:47:32 Having systems that kind of have some visual representation

00:47:35 of what is being talked about,

00:47:37 which is one reason you need

00:47:38 those interactive environments actually.

00:47:41 But this is like a huge technical problem

00:47:43 that is not solved,

00:47:45 and that explains why self supervised learning

00:47:47 works in the context of natural language,

00:47:49 but does not work in the context, or at least not well,

00:47:52 in the context of image recognition and video,

00:47:55 although it’s making progress quickly.

00:47:57 And the reason, that reason is the fact that

00:48:01 it’s much easier to represent uncertainty in the prediction

00:48:05 in a context of natural language

00:48:06 than it is in the context of things like video and images.

00:48:10 So for example, if I ask you to predict

00:48:12 what words are missing,

00:48:14 15% of the words that I’ve taken out.

00:48:17 The possibilities are small.

00:48:19 That means… It’s small, right?

00:48:20 There are 100,000 words in the lexicon,

00:48:23 and what the machine spits out

00:48:24 is a big probability vector, right?

00:48:27 It’s a bunch of numbers between zero and one

00:48:29 that sum to one.

00:48:30 And we know how to do this with computers.

00:48:34 So there, representing uncertainty in the prediction

00:48:36 is relatively easy, and that’s, in my opinion,

00:48:39 why those techniques work for NLP.

00:48:42 For images, if you ask…

00:48:45 If you block a piece of an image,

00:48:46 and you ask the system,

00:48:47 reconstruct that piece of the image,

00:48:49 there are many possible answers.

00:48:51 They are all perfectly legit, right?

00:48:54 And how do you represent this set of possible answers?

00:48:58 You can’t train a system to make one prediction.

00:49:00 You can’t train a neural net to say,

00:49:02 here it is, that’s the image,

00:49:04 because there’s a whole set of things

00:49:06 that are compatible with it.

00:49:07 So how do you get the machine to represent

00:49:08 not a single output, but a whole set of outputs?

00:49:13 And similarly with video prediction,

00:49:17 there’s a lot of things that can happen

00:49:19 in the future of video.

00:49:20 You’re looking at me right now.

00:49:21 I’m not moving my head very much,

00:49:22 but I might turn my head to the left or to the right.

00:49:26 If you don’t have a system that can predict this,

00:49:30 and you train it with least squares

00:49:31 to minimize the error between the prediction

00:49:33 and what I’m doing,

00:49:34 what you get is a blurry image of myself

00:49:36 in all possible future positions that I might be in,

00:49:39 which is not a good prediction.
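
One line of algebra behind the blurriness he describes, under a plain squared-error training objective:

\[
\hat{y}^{*}(x) \;=\; \arg\min_{\hat{y}} \; \mathbb{E}\!\left[\,\lVert y - \hat{y}\rVert^{2} \,\middle|\, x\,\right]
\;=\; \mathbb{E}\!\left[\, y \,\middle|\, x \,\right],
\]

so when several sharp futures are possible (head turned left, head turned right), the single best prediction under this loss is their average, which is exactly the blurry image.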

00:49:41 So there might be other ways

00:49:43 to do the self supervision for visual scenes.

00:49:48 Like what?

00:49:48 I mean, if I knew, I wouldn’t tell you,

00:49:52 publish it first, I don’t know.

00:49:55 No, there might be.

00:49:57 So I mean, these are kind of,

00:50:00 there might be artificial ways of like self play in games,

00:50:03 the way you can simulate part of the environment.

00:50:05 Oh, that doesn’t solve the problem.

00:50:06 It’s just a way of generating data.

00:50:10 But because you have more of a control,

00:50:12 like maybe you can control,

00:50:14 yeah, it’s a way to generate data.

00:50:16 That’s right.

00:50:16 And because you can do huge amounts of data generation,

00:50:20 that doesn’t, you’re right.

00:50:21 Well, it creeps up on the problem from the side of data,

00:50:26 and you don’t think that’s the right way to creep up.

00:50:27 It doesn’t solve this problem

00:50:28 of handling uncertainty in the world, right?

00:50:30 So if you have a machine learn a predictive model

00:50:35 of the world in a game that is deterministic

00:50:38 or quasi deterministic, it’s easy, right?

00:50:42 Just give a few frames of the game to a ConvNet,

00:50:45 put a bunch of layers,

00:50:47 and then have it generate the next few frames.

00:50:49 And if the game is deterministic, it works fine.

00:50:54 And that includes feeding the system with the action

00:50:59 that your little character is gonna take.
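
A minimal sketch of that kind of forward model, assuming PyTorch; the frame sizes, the number of actions, and feeding the action in as extra channels are illustrative choices, not a prescription from the conversation.

```python
# A sketch (PyTorch assumed): a ConvNet forward model for a (quasi-)deterministic
# game. Input: a stack of past frames plus the chosen action; output: next frame.
import torch
import torch.nn as nn

N_PAST, N_ACTIONS, H, W = 4, 6, 64, 64   # illustrative sizes

class FramePredictor(nn.Module):
    def __init__(self):
        super().__init__()
        # The action is broadcast into extra input channels, so the prediction
        # is conditioned on what the little character is about to do.
        self.net = nn.Sequential(
            nn.Conv2d(N_PAST + N_ACTIONS, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 1, 3, padding=1),
        )

    def forward(self, frames, action):
        # frames: (B, N_PAST, H, W); action: (B,) integer action ids
        a = nn.functional.one_hot(action, N_ACTIONS).float()
        a_planes = a[:, :, None, None].expand(-1, -1, H, W)
        return self.net(torch.cat([frames, a_planes], dim=1))

model = FramePredictor()
frames = torch.rand(8, N_PAST, H, W)
action = torch.randint(0, N_ACTIONS, (8,))
next_frame = torch.rand(8, 1, H, W)
loss = nn.functional.mse_loss(model(frames, action), next_frame)
loss.backward()   # works well when the game is deterministic; blurs when it isn't
```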

00:51:03 The problem comes from the fact that the real world

00:51:06 and most games are not entirely predictable.

00:51:09 And so there you get those blurry predictions

00:51:11 and you can’t do planning with blurry predictions, right?

00:51:14 So if you have a perfect model of the world,

00:51:17 you can, in your head, run this model

00:51:20 with a hypothesis for a sequence of actions,

00:51:24 and you’re going to predict the outcome

00:51:25 of that sequence of actions.

00:51:28 But if your model is imperfect, how can you plan?

00:51:32 Yeah, it quickly explodes.

00:51:34 What are your thoughts on the extension of this,

00:51:37 which topic I’m super excited about,

00:51:39 it’s connected to something you were talking about

00:51:41 in terms of robotics, is active learning.

00:51:44 So as opposed to sort of completely unsupervised

00:51:47 or self supervised learning,

00:51:51 you ask the system for human help

00:51:54 for selecting parts you want annotated next.

00:51:58 So if you think about a robot exploring a space

00:52:00 or a baby exploring a space

00:52:02 or a system exploring a data set,

00:52:05 every once in a while asking for human input,

00:52:07 do you see value in that kind of work?

00:52:12 I don’t see transformative value.

00:52:14 It’s going to make things that we can already do

00:52:18 more efficient or they will learn slightly more efficiently,

00:52:20 but it’s not going to make machines

00:52:21 sort of significantly more intelligent.

00:52:23 I think, and by the way, there is no opposition,

00:52:29 there’s no conflict between self supervised learning,

00:52:34 reinforcement learning and supervised learning

00:52:35 or imitation learning or active learning.

00:52:39 I see self supervised learning

00:52:40 as a preliminary to all of the above.

00:52:43 Yes.

00:52:44 So the example I use very often is how is it that,

00:52:50 so if you use classical reinforcement learning,

00:52:54 deep reinforcement learning, if you want,

00:52:57 the best methods today,

00:53:01 so called model free reinforcement learning

00:53:03 to learn to play Atari games,

00:53:04 take about 80 hours of training to reach the level

00:53:07 that any human can reach in about 15 minutes.

00:53:11 They get better than humans, but it takes them a long time.

00:53:16 AlphaStar, okay, you know,

00:53:20 Oriol Vinyals and his team's

00:53:22 system to play StarCraft, plays,

00:53:27 you know, a single map, a single type of player,

00:53:32 and can reach better than human level

00:53:38 with about the equivalent of 200 years of training

00:53:43 playing against itself.

00:53:45 It’s 200 years, right?

00:53:46 It’s not something that no human can ever do.

00:53:50 I mean, I’m not sure what lesson to take away from that.

00:53:52 Okay, now take those algorithms,

00:53:54 the best algorithms we have today

00:53:57 to train a car to drive itself.

00:54:00 It would probably have to drive millions of hours.

00:54:02 It will have to kill thousands of pedestrians.

00:54:04 It will have to run into thousands of trees.

00:54:06 It will have to run off cliffs.

00:54:08 And it would have to run off cliffs multiple times

00:54:10 before it figures out that it’s a bad idea, first of all.

00:54:14 And second of all, before it figures out how not to do it.

00:54:17 And so, I mean, this type of learning obviously

00:54:19 does not reflect the kind of learning

00:54:21 that animals and humans do.

00:54:23 There is something missing

00:54:24 that’s really, really important there.

00:54:26 And my hypothesis, which I’ve been advocating

00:54:28 for like five years now,

00:54:30 is that we have predictive models of the world

00:54:34 that include the ability to predict under uncertainty.

00:54:38 And what allows us to not run off a cliff

00:54:43 when we learn to drive,

00:54:44 most of us can learn to drive in about 20 or 30 hours

00:54:47 of training without ever crashing or causing any accident.

00:54:50 And if we drive next to a cliff,

00:54:53 we know that if we turn the wheel to the right,

00:54:55 the car is gonna run off the cliff

00:54:57 and nothing good is gonna come out of this.

00:54:58 Because we have a pretty good model of intuitive physics

00:55:00 that tells us the car is gonna fall.

00:55:02 We know about gravity.

00:55:04 Babies learn this around the age of eight or nine months

00:55:07 that objects don’t float, they fall.

00:55:11 And we have a pretty good idea of the effect

00:55:13 of turning the wheel on the car

00:55:15 and we know we need to stay on the road.

00:55:16 So there’s a lot of things that we bring to the table,

00:55:19 which is basically our predictive model of the world.

00:55:22 And that model allows us to not do stupid things.

00:55:25 And to basically stay within the context

00:55:28 of things we need to do.

00:55:29 We still face unpredictable situations

00:55:32 and that’s how we learn.

00:55:34 But that allows us to learn really, really, really quickly.

00:55:37 So that’s called model based reinforcement learning.

00:55:41 There’s some imitation and supervised learning

00:55:43 because we have a driving instructor

00:55:44 that tells us occasionally what to do.

00:55:47 But most of the learning is learning the model,

00:55:52 learning physics that we’ve done since we were babies.

00:55:55 That’s where all, almost all the learning.

00:55:56 And the physics is somewhat transferable from,

00:56:00 it’s transferable from scene to scene.

00:56:01 Stupid things are the same everywhere.

00:56:04 Yeah, I mean, if you have experience of the world,

00:56:07 you don’t need to be from a particularly intelligent species

00:56:11 to know that if you spill water from a container,

00:56:16 the rest is gonna get wet.

00:56:18 You might get wet.

00:56:20 So cats know this, right?

00:56:22 Yeah.

00:56:23 Right, so the main problem we need to solve

00:56:27 is how do we learn models of the world?

00:56:29 That’s what I’m interested in.

00:56:31 That’s what self supervised learning is all about.

00:56:34 If you were to try to construct a benchmark for,

00:56:39 let’s look at MNIST.

00:56:41 I love that data set.

00:56:44 Do you think it’s useful, interesting, slash possible

00:56:48 to perform well on MNIST with just one example

00:56:52 of each digit and how would we solve that problem?

00:56:58 The answer is probably yes.

00:56:59 The question is what other type of learning

00:57:02 are you allowed to do?

00:57:03 So if what you’re allowed to do is train

00:57:04 on some gigantic data set of labeled digits,

00:57:07 that’s called transfer learning.

00:57:08 And we know that works, okay?

00:57:11 We do this at Facebook, like in production, right?

00:57:13 We train large convolutional nets to predict hashtags

00:57:17 that people type on Instagram

00:57:18 and we train on billions of images, literally billions.

00:57:20 And then we chop off the last layer

00:57:22 and fine tune on whatever task we want.

00:57:24 That works really well.

00:57:26 You can beat the ImageNet record with this.

00:57:28 We actually open sourced the whole thing

00:57:30 like a few weeks ago.

00:57:31 Yeah, that’s still pretty cool.
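
The "chop off the last layer and fine tune" recipe, sketched with a generic torchvision ResNet standing in for the hashtag-pretrained network; the particular pretrained weights and the downstream task here are assumptions for illustration.

```python
# A sketch (PyTorch/torchvision assumed): transfer learning by replacing the
# final layer of a pretrained ConvNet and fine-tuning on a new task.
import torch.nn as nn
import torch.optim as optim
from torchvision import models

NUM_TARGET_CLASSES = 10   # whatever downstream task you care about

# Stand-in for a network pretrained on a huge weakly labeled corpus
# (e.g. hashtag prediction); here we simply load ImageNet weights.
backbone = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)

# "Chop off the last layer": swap the classifier head for a fresh one.
backbone.fc = nn.Linear(backbone.fc.in_features, NUM_TARGET_CLASSES)

# Option A: fine-tune the whole network with a small learning rate.
optimizer = optim.SGD(backbone.parameters(), lr=1e-3, momentum=0.9)

# Option B: freeze the pretrained features and train only the new head.
# for name, p in backbone.named_parameters():
#     p.requires_grad = name.startswith("fc.")
```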

00:57:33 But yeah, so what would be impressive?

00:57:36 What’s useful and impressive?

00:57:38 What kind of transfer learning

00:57:39 would be useful and impressive?

00:57:40 Is it Wikipedia, that kind of thing?

00:57:42 No, no, so I don’t think transfer learning

00:57:44 is really where we should focus.

00:57:46 We should try to do,

00:57:48 you know, have a kind of scenario for a benchmark

00:57:51 where you have unlabeled data

00:57:53 and it’s a very large amount of unlabeled data.

00:57:58 It could be video clips.

00:58:00 It could be where you do, you know, frame prediction.

00:58:03 It could be images where you could choose to,

00:58:06 you know, mask a piece of it, could be whatever,

00:58:10 but they’re unlabeled and you’re not allowed to label them.

00:58:13 So you do some training on this,

00:58:18 and then you train on a particular supervised task,

00:58:24 ImageNet or MNIST,

00:58:26 and you measure how your test error decreases

00:58:30 or validation error decreases

00:58:31 as you increase the number of labeled training samples.

00:58:35 Okay, and what you’d like to see is that,

00:58:40 you know, your error decreases much faster

00:58:43 than if you train from scratch from random weights.

00:58:46 So to reach the same level of performance

00:58:48 that a completely supervised, purely supervised system

00:58:52 would reach, you would need way fewer samples.

00:58:54 So that’s the crucial question

00:58:55 because it will answer the question to like, you know,

00:58:58 people interested in medical image analysis.

00:59:01 Okay, you know, if I want to get to a particular level

00:59:05 of error rate for this task,

00:59:07 I know I need a million samples.

00:59:10 Can I do, you know, self supervised pre training

00:59:13 to reduce this to about 100 or something?
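
A sketch of that evaluation protocol: pretrain on data without using its labels, then fine-tune on increasing numbers of labeled examples and compare the error curve against training from scratch. PCA on scikit-learn's digits is used below as a deliberately crude stand-in for self supervised pretraining; the point is the shape of the experiment, not the particular features.

```python
# A sketch of the benchmark: how fast does test error fall with more labels,
# with vs. without unsupervised pretraining? PCA on unlabeled digits is a
# deliberately crude stand-in for self supervised feature learning.
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# "Pretrain" on the training images without touching any labels.
encoder = PCA(n_components=32).fit(X_train)

for n_labels in (20, 50, 100, 500, len(X_train)):
    Xs, ys = X_train[:n_labels], y_train[:n_labels]
    scratch = LogisticRegression(max_iter=2000).fit(Xs, ys)
    pretrained = LogisticRegression(max_iter=2000).fit(encoder.transform(Xs), ys)
    print(n_labels,
          "from scratch err %.3f" % (1 - scratch.score(X_test, y_test)),
          "pretrained err %.3f" % (1 - pretrained.score(encoder.transform(X_test), y_test)))
```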

00:59:15 And you think the answer there

00:59:16 is self supervised pre training?

00:59:18 Yeah, some form, some form of it.

00:59:23 I’m telling you, active learning, but you disagree.

00:59:26 No, it’s not useless.

00:59:28 It’s just not gonna lead to a quantum leap.

00:59:30 It’s just gonna make things that we already do more efficient.

00:59:32 So you’re way smarter than me.

00:59:33 I just disagree with you.

00:59:35 But I don’t have anything to back that.

00:59:37 It’s just intuition.

00:59:38 So I’ve worked a lot with large scale data sets

00:59:40 and there’s something that might be magic

00:59:43 in active learning, but okay.

00:59:45 And at least I said it publicly.

00:59:48 At least I’m being an idiot publicly.

00:59:50 Okay.

00:59:51 It’s not being an idiot.

00:59:52 It’s, you know, working with the data you have.

00:59:54 I mean, I mean, certainly people are doing things like,

00:59:56 okay, I have 3000 hours of, you know,

00:59:59 imitation learning for a self driving car,

01:00:01 but most of those are incredibly boring.

01:00:03 What I’d like is to select, you know, 10% of them

01:00:05 that are kind of the most informative.

01:00:07 And with just that, I would probably reach the same performance.

01:00:10 So it’s a weak form of active learning if you want.

01:00:14 Yes, but there might be a much stronger version.

01:00:18 Yeah, that’s right.

01:00:18 That’s what, and that’s an open question, whether it exists.

01:00:21 The question is how much stronger can you get?

01:00:24 Elon Musk is confident.

01:00:26 Talked to him recently.

01:00:28 He’s confident that large scale data and deep learning

01:00:30 can solve the autonomous driving problem.

01:00:33 What are your thoughts on the limits,

01:00:36 possibilities of deep learning in this space?

01:00:38 It’s obviously part of the solution.

01:00:40 I mean, I don’t think we’ll ever have a self driving system

01:00:43 or at least not in the foreseeable future

01:00:45 that does not use deep learning.

01:00:47 Let me put it this way.

01:00:48 Now, how much of it?

01:00:49 So in the history of sort of engineering,

01:00:54 particularly sort of AI like systems,

01:00:58 there’s generally a first phase where everything is built by hand.

01:01:01 Then there is a second phase.

01:01:02 And that was the case for autonomous driving 20, 30 years ago.

01:01:06 There’s a phase where a little bit of learning is used,

01:01:09 but there’s a lot of engineering that’s involved in kind of

01:01:12 taking care of corner cases and putting limits, et cetera,

01:01:16 because the learning system is not perfect.

01:01:18 And then as technology progresses,

01:01:21 we end up relying more and more on learning.

01:01:23 That’s the history of character recognition,

01:01:27 it’s the history of speech recognition,

01:01:29 now computer vision, natural language processing.

01:01:31 And I think the same is going to happen with autonomous driving

01:01:36 that currently the methods that are closest

01:01:40 to providing some level of autonomy,

01:01:43 some decent level of autonomy

01:01:44 where you don’t expect a driver to kind of do anything

01:01:48 is where you constrain the world.

01:01:50 So you only run within 100 square kilometers

01:01:53 or square miles in Phoenix where the weather is nice

01:01:56 and the roads are wide, which is what Waymo is doing.

01:02:00 You completely overengineer the car with tons of LIDARs

01:02:04 and sophisticated sensors that are too expensive

01:02:08 for consumer cars,

01:02:09 but they’re fine if you just run a fleet.

01:02:13 And you engineer the hell out of everything else.

01:02:16 You map the entire world.

01:02:17 So you have complete 3D model of everything.

01:02:20 So the only thing that the perception system

01:02:22 has to take care of is moving objects

01:02:24 and construction and sort of things that weren’t in your map.

01:02:30 And you can engineer a good SLAM system and all that stuff.

01:02:34 So that’s kind of the current approach

01:02:35 that’s closest to some level of autonomy.

01:02:37 But I think eventually the longterm solution

01:02:39 is going to rely more and more on learning

01:02:43 and possibly using a combination

01:02:45 of self supervised learning and model based reinforcement learning

01:02:49 or something like that.

01:02:50 But ultimately learning will be not just at the core,

01:02:54 but really the fundamental part of the system.

01:02:57 Yeah, it already is, but it will become more and more.

01:03:00 What do you think it takes to build a system

01:03:02 with human level intelligence?

01:03:04 You talked about the AI system in the movie Her

01:03:07 being way out of our current reach.

01:03:10 This might be outdated as well, but.

01:03:12 It’s still way out of reach.

01:03:13 It’s still way out of reach.

01:03:15 What would it take to build Her?

01:03:18 Do you think?

01:03:19 So I can tell you the first two obstacles

01:03:21 that we have to clear,

01:03:22 but I don’t know how many obstacles there are after this.

01:03:24 So the image I usually use is that

01:03:26 there is a bunch of mountains that we have to climb

01:03:28 and we can see the first one,

01:03:29 but we don’t know if there are 50 mountains behind it or not.

01:03:33 And this might be a good sort of metaphor

01:03:34 for why AI researchers in the past

01:03:38 have been overly optimistic about the result of AI.

01:03:43 You know, for example,

01:03:45 Newell and Simon wrote the General Problem Solver

01:03:49 and they called it the general problem solver.

01:03:51 General problem solver.

01:03:52 And of course, the first thing you realize

01:03:54 is that all the problems you want to solve are exponential.

01:03:56 And so you can’t actually use it for anything useful,

01:03:59 but you know.

01:04:00 Yeah, so yeah, all you see is the first peak.

01:04:02 So in general, what are the first couple of peaks for Her?

01:04:05 So the first peak, which is precisely what I’m working on

01:04:08 is self supervised learning.

01:04:10 How do we get machines to learn models of the world

01:04:12 by observation, kind of like babies and like young animals?

01:04:15 So we’ve been working with, you know, cognitive scientists.

01:04:21 So Emmanuel Dupoux, who’s at FAIR in Paris

01:04:24 half time, and is also a researcher at a French university.

01:04:30 And he has this chart that shows

01:04:36 at how many months of life baby humans

01:04:38 kind of learn different concepts.

01:04:40 And you can measure this in sort of various ways.

01:04:44 So things like distinguishing animate objects

01:04:49 from inanimate objects,

01:04:50 you can tell the difference at age two, three months.

01:04:54 Whether an object is going to stay stable,

01:04:56 is going to fall, you know,

01:04:58 about four months, you can tell.

01:05:00 You know, there are various things like this.

01:05:02 And then things like gravity,

01:05:04 the fact that objects are not supposed to float in the air,

01:05:06 but are supposed to fall,

01:05:07 you learn this around the age of eight or nine months.

01:05:10 If you look at the data,

01:05:11 eight or nine months, if you look at a lot of,

01:05:14 you know, eight month old babies,

01:05:15 you give them a bunch of toys on their high chair.

01:05:19 First thing they do is they throw them on the ground

01:05:20 and they look at them.

01:05:21 It’s because, you know, they’re learning about,

01:05:23 actively learning about gravity.

01:05:26 Gravity, yeah.

01:05:26 Okay, so they’re not trying to annoy you,

01:05:29 but they, you know, they need to do the experiment, right?

01:05:32 Yeah.

01:05:33 So, you know, how do we get machines to learn like babies,

01:05:36 mostly by observation with a little bit of interaction

01:05:39 and learning those models of the world?

01:05:41 Because I think that’s really a crucial piece

01:05:43 of an intelligent autonomous system.

01:05:46 So if you think about the architecture

01:05:47 of an intelligent autonomous system,

01:05:49 it needs to have a predictive model of the world.

01:05:51 So something that says, here is a world at time T,

01:05:54 here is a state of the world at time T plus one,

01:05:55 if I take this action.

01:05:57 And it’s not a single answer, it can be a…

01:05:59 Yeah, it can be a distribution, yeah.

01:06:01 Yeah, well, but we don’t know how to represent

01:06:03 distributions in high dimensional continuous spaces.

01:06:04 So it’s gotta be something weaker than that, okay?

01:06:07 But with some representation of uncertainty.

01:06:09 If you have that, then you can do what optimal control

01:06:12 theorists call model predictive control,

01:06:14 which means that you can run your model

01:06:16 with a hypothesis for a sequence of action

01:06:18 and then see the result.

01:06:20 Now, what you need, the other thing you need

01:06:22 is some sort of objective that you want to optimize.

01:06:24 Am I reaching the goal of grabbing this object?

01:06:27 Am I minimizing energy?

01:06:28 Am I whatever, right?

01:06:30 So there is some sort of objective that you have to minimize.

01:06:33 And so in your head, if you have this model,

01:06:35 you can figure out the sequence of action

01:06:37 that will optimize your objective.

01:06:38 That objective is something that ultimately is rooted

01:06:42 in your basal ganglia, at least in the human brain.

01:06:44 That’s what the basal ganglia computes,

01:06:47 your level of contentment or miscontentment.

01:06:50 I don’t know if that’s a word.

01:06:52 Unhappiness, okay?

01:06:53 Yeah, yeah.

01:06:54 Discontentment.

01:06:55 Discontentment, maybe.

01:06:56 And so your entire behavior is driven towards

01:07:01 kind of minimizing that objective,

01:07:03 which is maximizing your contentment,

01:07:05 computed by your basal ganglia.

01:07:07 And what you have is an objective function,

01:07:10 which is basically a predictor

01:07:12 of what your basal ganglia is going to tell you.

01:07:14 So you’re not going to put your hand on fire

01:07:16 because you know it’s going to burn

01:07:19 and you’re going to get hurt.

01:07:21 And you’re predicting this because of your model

01:07:23 of the world and your sort of predictor

01:07:25 of this objective, right?

01:07:27 So if you have those three components,

01:07:31 you have four components,

01:07:32 you have the hardwired objective,

01:07:36 hardwired contentment objective computer,

01:07:41 if you want, calculator.

01:07:43 And then you have the three components.

01:07:45 One is the objective predictor,

01:07:46 which basically predicts your level of contentment.

01:07:48 One is the model of the world.

01:07:52 And there’s a third module I didn’t mention,

01:07:54 which is the module that will figure out

01:07:57 the best course of action to optimize an objective

01:08:00 given your model, okay?

01:08:03 Yeah.

01:08:04 And you can call this a policy network

01:08:07 or something like that, right?

01:08:09 Now, you need those three components

01:08:11 to act autonomously intelligently.

01:08:13 And you can be stupid in three different ways.

01:08:16 You can be stupid because your model of the world is wrong.

01:08:19 You can be stupid because your objective is not aligned

01:08:22 with what you actually want to achieve, okay?

01:08:27 In humans, that would be a psychopath.

01:08:30 And then the third way you can be stupid

01:08:33 is that you have the right model,

01:08:34 you have the right objective,

01:08:36 but you’re unable to figure out a course of action

01:08:38 to optimize your objective given your model.

01:08:41 Okay.

01:08:44 Some people who are in charge of big countries

01:08:45 actually have all three that are wrong.

01:08:47 All right.

01:08:50 Which countries?

01:08:51 I don’t know.
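
A minimal sketch of the architecture described above: a world model, a learned predictor of the hardwired objective, and a policy module that plans by imagining candidate action sequences (naive random shooting here). All functions, dimensions, and the toy dynamics are illustrative assumptions.

```python
# A sketch: world model + objective predictor + policy module, planning by
# naive random shooting over imagined action sequences. All toy assumptions.
import numpy as np

rng = np.random.default_rng(0)
STATE_DIM, ACTION_DIM, HORIZON, N_CANDIDATES = 4, 2, 10, 256

def world_model(state, action):
    # Hypothetical learned dynamics: "state at time T+1 if I take this action".
    # A linear toy model stands in for a neural net here.
    return state + 0.1 * np.ones((STATE_DIM, ACTION_DIM)) @ action

def objective_predictor(state):
    # Hypothetical critic: predicted discontentment of being in this state
    # (here, squared distance to a goal state of all ones).
    return float(np.sum((state - 1.0) ** 2))

def plan(state):
    # Imagine N candidate action sequences, roll each through the world model,
    # and keep the first action of the sequence with the lowest predicted cost.
    best_cost, best_first_action = np.inf, None
    for _ in range(N_CANDIDATES):
        actions = rng.uniform(-1, 1, (HORIZON, ACTION_DIM))
        s, cost = state, 0.0
        for a in actions:
            s = world_model(s, a)
            cost += objective_predictor(s)
        if cost < best_cost:
            best_cost, best_first_action = cost, actions[0]
    return best_first_action

print(plan(np.zeros(STATE_DIM)))   # first action of the best imagined sequence
```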

01:08:52 Okay, so if we think about this agent,

01:08:55 if we think about the movie Her,

01:08:58 you’ve criticized the art project

01:09:02 that is Sophia the Robot.

01:09:04 And what that project essentially does

01:09:07 is uses our natural inclination to anthropomorphize

01:09:11 things that look like humans and give them more.

01:09:14 Do you think that could be used by AI systems

01:09:17 like in the movie Her?

01:09:21 So do you think that body is needed

01:09:23 to create a feeling of intelligence?

01:09:27 Well, if Sophia was just an art piece,

01:09:29 I would have no problem with it,

01:09:30 but it’s presented as something else.

01:09:33 Let me, on that comment real quick,

01:09:35 if creators of Sophia could change something

01:09:38 about their marketing or behavior in general,

01:09:40 what would it be?

01:09:41 What’s?

01:09:42 Just about everything.

01:09:44 I mean, don’t you think, here’s a tough question.

01:09:50 Let me, so I agree with you.

01:09:51 So Sophia is not, the general public feels

01:09:56 that Sophia can do way more than she actually can.

01:09:59 That’s right.

01:10:00 And the people who created Sophia

01:10:02 are not honestly publicly communicating,

01:10:08 trying to teach the public.

01:10:09 Right.

01:10:10 But here’s a tough question.

01:10:13 Don’t you think the same thing is true, that scientists

01:10:19 in industry and research are taking advantage

01:10:22 of the same misunderstanding in the public

01:10:25 when they create AI companies or publish stuff?

01:10:29 Some companies, yes.

01:10:31 I mean, there is no sense of,

01:10:33 there’s no desire to delude.

01:10:34 There’s no desire to kind of over claim

01:10:37 when something is done, right?

01:10:38 You publish a paper on AI that has this result

01:10:41 on ImageNet, it’s pretty clear.

01:10:43 I mean, it’s not even interesting anymore,

01:10:44 but I don’t think there is that.

01:10:49 I mean, the reviewers are generally not very forgiving

01:10:52 of unsupported claims of this type.

01:10:57 And, but there are certainly quite a few startups

01:10:59 that have had a huge amount of hype around this

01:11:02 that I find extremely damaging

01:11:05 and I’ve been calling it out when I’ve seen it.

01:11:08 So yeah, but to go back to your original question,

01:11:10 like the necessity of embodiment,

01:11:13 I think, I don’t think embodiment is necessary.

01:11:15 I think grounding is necessary.

01:11:17 So I don’t think we’re gonna get machines

01:11:18 that really understand language

01:11:20 without some level of grounding in the real world.

01:11:22 And it’s not clear to me that language

01:11:24 is a high enough bandwidth medium

01:11:26 to communicate how the real world works.

01:11:28 So I think for this.

01:11:30 Can you talk to what grounding means?

01:11:32 So grounding means that,

01:11:34 so there is this classic problem of common sense reasoning,

01:11:37 you know, the Winograd schema, right?

01:11:41 And so I tell you the trophy doesn’t fit in the suitcase

01:11:44 because it’s too big,

01:11:46 or the trophy doesn’t fit in the suitcase

01:11:47 because it’s too small.

01:11:49 And the it in the first case refers to the trophy

01:11:51 in the second case to the suitcase.

01:11:53 And the reason you can figure this out

01:11:55 is because you know what the trophy and the suitcase are,

01:11:56 you know, one is supposed to fit in the other one

01:11:58 and you know the notion of size

01:12:00 and a big object doesn’t fit in a small object,

01:12:03 unless it’s a Tardis, you know, things like that, right?

01:12:05 So you have this knowledge of how the world works,

01:12:08 of geometry and things like that.

01:12:12 I don’t believe you can learn everything about the world

01:12:14 by just being told in language how the world works.

01:12:18 I think you need some low level perception of the world,

01:12:21 you know, be it visual, touch, you know, whatever,

01:12:23 but some higher bandwidth perception of the world.

01:12:26 By reading all the world’s text,

01:12:28 you still might not have enough information.

01:12:31 That’s right.

01:12:32 There’s a lot of things that just will never appear in text

01:12:35 and that you can’t really infer.

01:12:37 So I think common sense will emerge from,

01:12:41 you know, certainly a lot of language interaction,

01:12:43 but also with watching videos

01:12:45 or perhaps even interacting in virtual environments

01:12:48 and possibly, you know, robot interacting in the real world.

01:12:51 But I don’t actually believe necessarily

01:12:53 that this last one is absolutely necessary.

01:12:56 But I think that there’s a need for some grounding.

01:13:00 But the final product

01:13:01 doesn’t necessarily need to be embodied, you’re saying.

01:13:04 No.

01:13:05 It just needs to have an awareness, a grounding to.

01:13:07 Right, but it needs to know how the world works

01:13:11 to have, you know, to not be frustrating to talk to.

01:13:15 And you talked about emotions being important.

01:13:19 That’s a whole nother topic.

01:13:21 Well, so, you know, I talked about this,

01:13:24 the basal ganglia as the thing

01:13:29 that calculates your level of miscontentment.

01:13:32 And then there is this other module

01:13:34 that sort of tries to do a prediction

01:13:36 of whether you’re going to be content or not.

01:13:38 That’s the source of some emotion.

01:13:40 So fear, for example, is an anticipation

01:13:43 of bad things that can happen to you, right?

01:13:47 You have this inkling that there is some chance

01:13:49 that something really bad is going to happen to you

01:13:50 and that creates fear.

01:13:52 Well, if you know for sure

01:13:53 that something bad is going to happen to you,

01:13:54 you kind of give up, right?

01:13:55 It’s not fear anymore.

01:13:57 It’s uncertainty that creates fear.

01:13:59 So the punchline is,

01:14:01 we’re not going to have autonomous intelligence

01:14:02 without emotions.

01:14:07 Whatever the heck emotions are.

01:14:08 So you mentioned very practical things of fear,

01:14:11 but there’s a lot of other mess around it.

01:14:13 But there are kind of the results of, you know, drives.

01:14:16 Yeah, there’s deeper biological stuff going on.

01:14:19 And I’ve talked to a few folks on this.

01:14:21 There’s fascinating stuff

01:14:23 that ultimately connects to our brain.

01:14:27 If we create an AGI system, sorry.

01:14:30 Human level intelligence.

01:14:31 Human level intelligence system.

01:14:34 And you get to ask her one question.

01:14:37 What would that question be?

01:14:39 You know, I think the first one we’ll create

01:14:42 would probably not be that smart.

01:14:45 They’d be like a four year old.

01:14:47 Okay.

01:14:47 So you would have to ask her a question

01:14:50 to know she’s not that smart.

01:14:52 Yeah.

01:14:54 Well, what’s a good question to ask, you know,

01:14:56 to be impressed.

01:14:57 What is the cause of wind?

01:15:01 And if she answers,

01:15:02 oh, it’s because the leaves of the tree are moving

01:15:04 and that creates wind.

01:15:06 She’s onto something.

01:15:08 And if she says that’s a stupid question,

01:15:11 she’s really onto something.

01:15:12 No, and then you tell her,

01:15:14 actually, you know, here is the real thing.

01:15:18 She says, oh yeah, that makes sense.

01:15:20 So questions that reveal the ability

01:15:24 to do common sense reasoning about the physical world.

01:15:26 Yeah.

01:15:27 And you’ll sum it up with causal inference.

01:15:30 Causal inference.

01:15:31 Well, it was a huge honor.

01:15:33 Congratulations on your Turing Award.

01:15:35 Thank you so much for talking today.

01:15:37 Thank you.

01:15:38 Thank you for having me.