Transcript
00:00:00 The following is a conversation with Yann LeCun.
00:00:03 He’s considered to be one of the fathers of deep learning,
00:00:06 which, if you’ve been hiding under a rock,
00:00:09 is the recent revolution in AI that has captivated the world
00:00:12 with the possibility of what machines can learn from data.
00:00:16 He’s a professor at New York University,
00:00:18 a vice president and chief AI scientist at Facebook,
00:00:21 and co-recipient of the Turing Award
00:00:24 for his work on deep learning.
00:00:26 He’s probably best known as the founding father
00:00:28 of convolutional neural networks,
00:00:30 in particular their application
00:00:32 to optical character recognition
00:00:34 and the famed MNIST dataset.
00:00:37 He is also an outspoken personality,
00:00:40 unafraid to speak his mind in a distinctive French accent
00:00:43 and explore provocative ideas,
00:00:45 both in the rigorous medium of academic research
00:00:48 and the somewhat less rigorous medium
00:00:51 of Twitter and Facebook.
00:00:52 This is the Artificial Intelligence Podcast.
00:00:55 If you enjoy it, subscribe on YouTube,
00:00:57 give it five stars on iTunes, support it on Patreon,
00:01:00 or simply connect with me on Twitter at Lex Fridman,
00:01:03 spelled F R I D M A N.
00:01:06 And now, here’s my conversation with Yann LeCun.
00:01:11 You said that 2001: A Space Odyssey
00:01:13 is one of your favorite movies.
00:01:16 HAL 9000 decides to get rid of the astronauts
00:01:20 for people who haven’t seen the movie, spoiler alert,
00:01:23 because he, it, she believes that the astronauts,
00:01:29 they will interfere with the mission.
00:01:31 Do you see HAL as flawed in some fundamental way
00:01:34 or even evil, or did he do the right thing?
00:01:38 Neither.
00:01:39 There’s no notion of evil in that context,
00:01:43 other than the fact that people die,
00:01:44 but it was an example of what people call
00:01:48 value misalignment, right?
00:01:50 You give an objective to a machine,
00:01:52 and the machine strives to achieve this objective.
00:01:55 And if you don’t put any constraints on this objective,
00:01:58 like don’t kill people and don’t do things like this,
00:02:02 the machine, given the power, will do stupid things
00:02:06 just to achieve this objective,
00:02:08 or damaging things to achieve this objective.
00:02:10 It’s a little bit like, I mean, we’re used to this
00:02:12 in the context of human society.
00:02:15 We put in place laws to prevent people
00:02:20 from doing bad things, because spontaneously,
00:02:22 they would do those bad things, right?
00:02:24 So we have to shape their cost function,
00:02:28 their objective function, if you want,
00:02:29 through laws to kind of correct,
00:02:31 and education, obviously, to sort of correct for those.
00:02:36 So maybe just pushing a little further on that point,
00:02:41 how, you know, there’s a mission,
00:02:44 there’s this fuzziness around,
00:02:46 the ambiguity around what the actual mission is,
00:02:49 but, you know, do you think that there will be a time,
00:02:55 from a utilitarian perspective,
00:02:56 where an AI system, where it is not misalignment,
00:02:59 where it is alignment, for the greater good of society,
00:03:02 that an AI system will make decisions that are difficult?
00:03:05 Well, that’s the trick.
00:03:06 I mean, eventually we’ll have to figure out how to do this.
00:03:10 And again, we’re not starting from scratch,
00:03:12 because we’ve been doing this with humans for millennia.
00:03:16 So designing objective functions for people
00:03:19 is something that we know how to do.
00:03:20 And we don’t do it by, you know, programming things,
00:03:24 although the legal code is called code.
00:03:29 So that tells you something.
00:03:30 And it’s actually the design of an objective function.
00:03:33 That’s really what legal code is, right?
00:03:34 It tells you, here is what you can do,
00:03:36 here is what you can’t do.
00:03:37 If you do it, you pay that much,
00:03:39 that’s an objective function.
00:03:41 So there is this idea somehow that it’s a new thing
00:03:44 for people to try to design objective functions
00:03:46 that are aligned with the common good.
00:03:47 But no, we’ve been writing laws for millennia
00:03:49 and that’s exactly what it is.
00:03:52 So that’s where, you know, the science of lawmaking
00:03:57 and computer science will.
00:04:00 Come together.
00:04:01 Will come together.
00:04:02 So there’s nothing special about HAL or AI systems,
00:04:06 it’s just the continuation of tools used
00:04:09 to make some of these difficult ethical judgments
00:04:11 that laws make.
00:04:13 Yeah, and we have systems like this already
00:04:15 that make many decisions for ourselves in society
00:04:19 that need to be designed in a way that they,
00:04:22 like rules about things that sometimes have bad side effects
00:04:27 and we have to be flexible enough about those rules
00:04:29 so that they can be broken when it’s obvious
00:04:31 that they shouldn’t be applied.
00:04:34 So you don’t see this on the camera here,
00:04:35 but all the decoration in this room
00:04:36 is all pictures from 2001: A Space Odyssey.
00:04:41 Wow, is that by accident or is that by design?
00:04:43 No, it's not by accident, it's by design.
00:04:47 Oh, wow.
00:04:48 So if you were to build HAL 10,000,
00:04:52 so an improvement of HAL 9000, what would you improve?
00:04:57 Well, first of all, I wouldn’t ask it to hold secrets
00:05:00 and tell lies because that’s really what breaks it
00:05:03 in the end, that’s the fact that it’s asking itself
00:05:06 questions about the purpose of the mission
00:05:08 and it’s, you know, pieces things together that it’s heard,
00:05:11 you know, all the secrecy of the preparation of the mission
00:05:14 and the fact that it was the discovery
00:05:16 on the lunar surface that really was kept secret
00:05:19 and one part of HAL’s memory knows this
00:05:22 and the other part does not know it
00:05:24 and is supposed to not tell anyone
00:05:26 and that creates internal conflict.
00:05:28 So you think there’s never should be a set of things
00:05:32 that an AI system should not be allowed,
00:05:36 like a set of facts that should not be shared
00:05:39 with the human operators?
00:05:42 Well, I think, no, I think it should be a bit like
00:05:46 in the design of autonomous AI systems,
00:05:52 there should be the equivalent of, you know,
00:05:54 the Hippocratic oath
00:05:59 that doctors sign up to, right?
00:06:02 So there’s certain things, certain rules
00:06:04 that you have to abide by and we can sort of hardwire this
00:06:07 into our machines to kind of make sure they don’t go.
00:06:11 So I’m not, you know, an advocate of the three laws
00:06:14 of robotics, you know, the Asimov kind of thing
00:06:17 because I don’t think it’s practical,
00:06:18 but, you know, some level of limits.
00:06:23 But to be clear, these are not questions
00:06:27 that are kind of really worth asking today
00:06:32 because we just don’t have the technology to do this.
00:06:34 We don’t have autonomous intelligent machines,
00:06:36 we have intelligent machines.
00:06:37 Some are intelligent machines that are very specialized,
00:06:41 but they don’t really sort of satisfy an objective.
00:06:43 They’re just, you know, kind of trained to do one thing.
00:06:46 So until we have some idea for design
00:06:50 of a full fledged autonomous intelligent system,
00:06:53 asking the question of how we design this objective,
00:06:55 I think is a little too abstract.
00:06:58 It’s a little too abstract.
00:06:59 There’s useful elements to it in that it helps us understand
00:07:04 our own ethical codes, humans.
00:07:07 So even just as a thought experiment,
00:07:10 if you imagine that an AGI system is here today,
00:07:14 how would we program it is a kind of nice thought experiment
00:07:17 of constructing how should we have a law,
00:07:21 have a system of laws for us humans.
00:07:24 It’s just a nice practical tool.
00:07:26 And I think there’s echoes of that idea too
00:07:29 in the AI systems we have today
00:07:32 that don’t have to be that intelligent.
00:07:33 Yeah.
00:07:34 Like autonomous vehicles.
00:07:35 These things start creeping in that are worth thinking about,
00:07:39 but certainly they shouldn't be framed as HAL.
00:07:42 Yeah.
00:07:43 Looking back, what is the most,
00:07:46 I’m sorry if it’s a silly question,
00:07:49 but what is the most beautiful
00:07:51 or surprising idea in deep learning
00:07:53 or AI in general that you’ve ever come across?
00:07:56 Sort of personally, when you said back
00:08:00 and just had this kind of,
00:08:01 oh, that’s pretty cool moment.
00:08:03 That’s nice.
00:08:04 That’s surprising.
00:08:05 I don’t know if it’s an idea
00:08:06 rather than a sort of empirical fact.
00:08:12 The fact that you can build gigantic neural nets,
00:08:16 train them on relatively small amounts of data
00:08:23 with stochastic gradient descent
00:08:24 and that it actually works,
00:08:26 breaks everything you read in every textbook, right?
00:08:29 Every pre-deep-learning textbook that told you,
00:08:32 you need to have fewer parameters
00:08:33 than you have data samples.
00:08:37 If you have a non convex objective function,
00:08:38 you have no guarantee of convergence.
00:08:40 All those things that you read in textbooks
00:08:42 and they tell you to stay away from this
00:08:43 and they’re all wrong.
00:08:45 The huge number of parameters, non-convex,
00:08:48 and somehow, with very little data
00:08:50 relative to the number of parameters,
00:08:53 it’s able to learn anything.
00:08:54 Right.
00:08:55 Does that still surprise you today?
00:08:57 Well, it was kind of obvious to me
00:09:00 before I knew anything that this is a good idea.
00:09:04 And then it became surprising that it worked
00:09:06 because I started reading those textbooks.
00:09:09 Okay.
00:09:10 Okay.
00:09:10 So can you talk through the intuition
00:09:12 of why it was obvious to you if you remember?
00:09:14 Well, okay.
00:09:15 So the intuition was it’s sort of like,
00:09:17 those people in the late 19th century
00:09:19 who proved that heavier than air flight was impossible.
00:09:25 And of course you have birds, right?
00:09:26 They do fly.
00:09:28 And so on the face of it,
00:09:30 it’s obviously wrong as an empirical question, right?
00:09:33 And so we have the same kind of thing
00:09:34 that we know that the brain works.
00:09:38 We don’t know how, but we know it works.
00:09:39 And we know it’s a large network of neurons and interaction
00:09:43 and that learning takes place by changing the connection.
00:09:45 So kind of getting this level of inspiration
00:09:48 without copying the details,
00:09:49 but sort of trying to derive basic principles,
00:09:52 and that kind of gives you a clue
00:09:56 as to which direction to go.
00:09:58 There’s also the idea somehow that I’ve been convinced of
00:10:01 since I was an undergrad that, even before,
00:10:04 that intelligence is inseparable from learning.
00:10:06 So the idea somehow that you can create
00:10:10 an intelligent machine by basically programming,
00:10:14 for me it was a non starter from the start.
00:10:17 Every intelligent entity that we know about
00:10:20 arrives at this intelligence through learning.
00:10:24 So machine learning was a completely obvious path.
00:10:29 Also because I’m lazy, so, you know, kind of.
00:10:32 He’s automate basically everything
00:10:35 and learning is the automation of intelligence.
00:10:37 So do you think, so what is learning then?
00:10:42 What falls under learning?
00:10:44 Because do you think of reasoning as learning?
00:10:48 Well, reasoning is certainly a consequence
00:10:51 of learning as well, just like other functions of the brain.
00:10:56 The big question about reasoning is,
00:10:58 how do you make reasoning compatible
00:11:00 with gradient based learning?
00:11:02 Do you think neural networks can be made to reason?
00:11:04 Yes, there is no question about that.
00:11:07 Again, we have a good example, right?
00:11:10 The question is how?
00:11:11 So the question is how much prior structure
00:11:14 do you have to put in the neural net
00:11:15 so that something like human reasoning
00:11:17 will emerge from it, you know, from learning?
00:11:20 Another question is all of our kind of model
00:11:24 of what reasoning is that are based on logic
00:11:27 are discrete and are therefore incompatible
00:11:31 with gradient based learning.
00:11:32 And I’m a very strong believer
00:11:34 in this idea of gradient based learning.
00:11:35 I don’t believe that other types of learning
00:11:39 that don’t use kind of gradient information if you want.
00:11:41 So you don’t like discrete mathematics?
00:11:43 You don’t like anything discrete?
00:11:45 Well, that’s, it’s not that I don’t like it,
00:11:46 it’s just that it’s incompatible with learning
00:11:49 and I’m a big fan of learning, right?
00:11:51 So in fact, that’s perhaps one reason
00:11:53 why deep learning has been kind of looked at
00:11:57 with suspicion by a lot of computer scientists
00:11:58 because the math is very different.
00:11:59 The math that you use for deep learning,
00:12:02 you know, it kind of has more to do with,
00:12:05 you know, cybernetics, the kind of math you do
00:12:08 in electrical engineering than the kind of math
00:12:10 you do in computer science.
00:12:12 And, you know, nothing in machine learning is exact, right?
00:12:15 Computer science is all about sort of, you know,
00:12:18 obviously compulsive attention to details of like,
00:12:21 you know, every index has to be right.
00:12:23 And you can prove that an algorithm is correct, right?
00:12:26 Machine learning is the science of sloppiness, really.
00:12:30 That’s beautiful.
00:12:32 So, okay, maybe let’s feel around in the dark
00:12:38 of what is a neural network that reasons
00:12:41 or a system that works with continuous functions
00:12:47 that’s able to do, build knowledge,
00:12:52 however we think about reasoning,
00:12:54 build on previous knowledge, build on extra knowledge,
00:12:57 create new knowledge,
00:12:59 generalize outside of any training set to ever build.
00:13:03 What does that look like?
00:13:04 If, yeah, maybe give inklings of thoughts
00:13:08 of what that might look like.
00:13:10 Yeah, I mean, yes and no.
00:13:12 If I had precise ideas about this,
00:13:14 I think, you know, we’d be building it right now.
00:13:17 And there are people working on this
00:13:19 whose main research interest is actually exactly that, right?
00:13:22 So what you need to have is a working memory.
00:13:25 So you need to have some device, if you want,
00:13:29 some subsystem that can store a relatively large number
00:13:34 of factual episodic information for, you know,
00:13:39 a reasonable amount of time.
00:13:40 So, you know, in the brain, for example,
00:13:43 there are kind of three main types of memory.
00:13:45 One is the sort of memory of the state of your cortex.
00:13:53 And that sort of disappears within 20 seconds.
00:13:55 You can’t remember things for more than about 20 seconds
00:13:58 or a minute if you don’t have any other form of memory.
00:14:02 The second type of memory, which is longer term,
00:14:04 is still short term, is the hippocampus.
00:14:06 So you can, you know, you came into this building,
00:14:08 you remember where the exit is, where the elevators are.
00:14:14 You have some map of that building
00:14:15 that’s stored in your hippocampus.
00:14:17 You might remember something about what I said,
00:14:20 you know, a few minutes ago.
00:14:21 I forgot it all already.
00:14:22 Of course, it’s been erased, but, you know,
00:14:24 but that would be in your hippocampus.
00:14:27 And then the longer term memory is in the synapse,
00:14:30 the synapses, right?
00:14:32 So what you need if you want a system
00:14:34 that’s capable of reasoning
00:14:35 is that you want the hippocampus like thing, right?
00:14:40 And that’s what people have tried to do
00:14:41 with memory networks and, you know,
00:14:43 neural Turing machines and stuff like that, right?
00:14:45 And now with transformers,
00:14:47 which have sort of a memory in there,
00:14:50 kind of self attention system.
00:14:51 You can think of it this way.
00:14:55 So that’s one element you need.
00:14:57 Another thing you need is some sort of network
00:14:59 that can access this memory,
00:15:03 get an information back and then kind of crunch on it
00:15:08 and then do this iteratively multiple times
00:15:10 because a chain of reasoning is a process
00:15:15 by which you update your knowledge
00:15:19 about the state of the world,
00:15:20 about, you know, what’s going to happen, et cetera.
00:15:22 And that has to be this sort of
00:15:25 recurrent operation basically.
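A rough sketch of the kind of differentiable associative memory being described, in the content-based, attention-style form that transformers use. The dimensions and stored contents below are made up; this is an illustration, not any particular architecture.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Hypothetical associative memory: N slots, each with a key and a value vector.
rng = np.random.default_rng(0)
N, d = 8, 16
keys = rng.normal(size=(N, d))      # what each slot is "about"
values = rng.normal(size=(N, d))    # the content stored in each slot

def read(query):
    """Differentiable, content-based read (scaled dot-product attention):
    a soft mixture of stored values, weighted by key/query similarity."""
    weights = softmax(keys @ query / np.sqrt(d))   # attention weights, sum to 1
    return weights @ values

# A crude stand-in for the iterative "crunching": read memory, update an
# internal state, and repeat the operation a few times.
state = rng.normal(size=d)
for _ in range(3):
    state = 0.5 * state + 0.5 * read(state)
```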
00:15:27 And you think that kind of,
00:15:29 if we think about a transformer,
00:15:31 so that seems to be too small
00:15:32 to contain the knowledge that’s,
00:15:36 to represent the knowledge
00:15:37 that’s contained in Wikipedia, for example.
00:15:39 Well, a transformer doesn’t have this idea of recurrence.
00:15:42 It’s got a fixed number of layers
00:15:43 and that’s the number of steps that, you know,
00:15:44 limits basically its representation.
00:15:47 But recurrence would build on the knowledge somehow.
00:15:51 I mean, it would evolve the knowledge
00:15:54 and expand the amount of information perhaps
00:15:58 or useful information within that knowledge.
00:16:00 But is this something that just can emerge with size?
00:16:04 Because it seems like everything we have now is too small.
00:16:06 Not just, no, it’s not clear.
00:16:09 I mean, how you access and write
00:16:11 into an associative memory in an efficient way.
00:16:13 I mean, sort of the original memory network
00:16:15 maybe had something like the right architecture,
00:16:17 but if you try to scale up a memory network
00:16:20 so that the memory contains all the Wikipedia,
00:16:22 it doesn’t quite work.
00:16:24 Right.
00:16:25 So there’s a need for new ideas there, okay.
00:16:28 But it’s not the only form of reasoning.
00:16:30 So there’s another form of reasoning,
00:16:31 which is true, which is very classical also
00:16:34 in some types of AI.
00:16:36 And it’s based on, let’s call it energy minimization.
00:16:40 Okay, so you have some sort of objective,
00:16:44 some energy function that represents
00:16:47 the quality or the negative quality, okay.
00:16:53 Energy goes up when things get bad
00:16:54 and it gets low when things get good.
00:16:57 So let’s say you want to figure out,
00:17:00 what gestures do I need to do
00:17:03 to grab an object or walk out the door.
00:17:08 If you have a good model of your own body,
00:17:10 a good model of the environment,
00:17:12 using this kind of energy minimization,
00:17:14 you can do planning.
00:17:16 And in optimal control,
00:17:19 it’s called model predictive control.
00:17:22 You have a model of what’s gonna happen in the world
00:17:24 as a consequence of your actions.
00:17:25 And that allows you to, by energy minimization,
00:17:28 figure out the sequence of action
00:17:29 that optimizes a particular objective function,
00:17:32 which measures, minimizes the number of times
00:17:34 you’re gonna hit something
00:17:35 and the energy you’re gonna spend
00:17:36 doing the gesture and et cetera.
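A minimal sketch of planning by energy minimization in the model-predictive-control sense described here, assuming a known, differentiable world model. The toy dynamics and cost terms are invented for illustration only.

```python
import torch

# Hypothetical differentiable world model: next_state = f(state, action).
def dynamics(state, action):
    return state + 0.1 * action          # toy point-mass model, made up

start = torch.zeros(2)
goal = torch.tensor([1.0, 0.0])

def energy(actions):
    """Goes up when things get bad: ending far from the goal, or spending
    a lot of effort on the gestures."""
    s = start
    for a in actions:
        s = dynamics(s, a)
    return ((s - goal) ** 2).sum() + 0.01 * (actions ** 2).sum()

# Plan by minimizing the energy over a whole sequence of actions.
actions = torch.zeros(20, 2, requires_grad=True)
opt = torch.optim.Adam([actions], lr=0.1)
for _ in range(200):
    opt.zero_grad()
    energy(actions).backward()
    opt.step()
# `actions` now holds a planned sequence, found purely by gradient descent
# through the model -- the model-predictive-control flavor of reasoning.
```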
00:17:39 So that’s a form of reasoning.
00:17:42 Planning is a form of reasoning.
00:17:43 And perhaps what led to the ability of humans to reason
00:17:48 is the fact that, or species that appear before us
00:17:53 had to do some sort of planning
00:17:55 to be able to hunt and survive
00:17:56 and survive the winter in particular.
00:17:59 And so it’s the same capacity that you need to have.
00:18:03 So in your intuition is,
00:18:07 if we look at expert systems
00:18:09 and encoding knowledge as logic systems,
00:18:13 as graphs, in this kind of way,
00:18:16 is not a useful way to think about knowledge?
00:18:20 Graphs, or logic representations, are a little brittle.
00:18:23 So basically, variables that have values
00:18:27 and then constraint between them
00:18:29 that are represented by rules,
00:18:31 is a little too rigid and too brittle, right?
00:18:32 So some of the early efforts in that respect
00:18:38 were to put probabilities on them.
00:18:41 So a rule, if you have this and that symptom,
00:18:44 you have this disease with that probability
00:18:47 and you should prescribe that antibiotic
00:18:49 with that probability, right?
00:18:50 That’s the mycin system from the 70s.
00:18:54 And that’s what that branch of AI led to,
00:18:58 Bayesian networks and graphical models
00:19:00 and causal inference and variational methods.
00:19:04 So there is certainly a lot of interesting
00:19:10 work going on in this area.
00:19:11 The main issue with this is knowledge acquisition.
00:19:13 How do you reduce a bunch of data to a graph of this type?
00:19:18 Yeah, it relies on the expert, on the human being,
00:19:22 to encode, to add knowledge.
00:19:24 And that’s essentially impractical.
00:19:27 Yeah, it’s not scalable.
00:19:29 That’s a big question.
00:19:30 The second question is,
00:19:31 do you want to represent knowledge as symbols
00:19:34 and do you want to manipulate them with logic?
00:19:37 And again, that’s incompatible with learning.
00:19:39 So one suggestion, which Jeff Hinton
00:19:43 has been advocating for many decades,
00:19:45 is replace symbols by vectors.
00:19:49 Think of it as pattern of activities
00:19:50 in a bunch of neurons or units
00:19:53 or whatever you want to call them.
00:19:55 And replace logic by continuous functions.
00:19:59 Okay, and that becomes now compatible.
00:20:01 There’s a very good set of ideas
00:20:04 by, written in a paper about 10 years ago
00:20:07 by Léon Bottou, who is here at Facebook.
00:20:13 The title of the paper is,
00:20:14 From Machine Learning to Machine Reasoning.
00:20:15 And his idea is that a learning system
00:20:19 should be able to manipulate objects
00:20:20 that are in a space
00:20:23 and then put the result back in the same space.
00:20:24 So it’s this idea of working memory, basically.
00:20:28 And it’s very enlightening.
00:20:30 And in a sense, that might learn something
00:20:33 like the simple expert systems.
00:20:37 I mean, you can learn basic logic operations there.
00:20:42 Yeah, quite possibly.
00:20:43 There’s a big debate on sort of how much prior structure
00:20:46 you have to put in for this kind of stuff to emerge.
00:20:49 That’s the debate I have with Gary Marcus
00:20:50 and people like that.
00:20:51 Yeah, yeah, so, and the other person,
00:20:55 so I just talked to Judea Pearl,
00:20:57 from the causal inference world you mentioned.
00:21:00 So his worry is that the current neural networks
00:21:04 are not able to learn what causes
00:21:09 what causal inference between things.
00:21:12 So I think he’s right and wrong about this.
00:21:15 If he’s talking about the sort of classic
00:21:20 type of neural nets,
00:21:21 people sort of didn’t worry too much about this.
00:21:23 But there’s a lot of people now working on causal inference.
00:21:26 And there’s a paper that just came out last week
00:21:27 by Léon Bottou, among others,
00:21:29 David Lopez-Paz, and a bunch of other people,
00:21:32 exactly on that problem of how do you kind of
00:21:36 get a neural net to sort of pay attention
00:21:39 to real causal relationships,
00:21:41 which may also solve issues of bias in data
00:21:46 and things like this, so.
00:21:48 I’d like to read that paper
00:21:49 because that ultimately the challenges
00:21:51 also seems to fall back on the human expert
00:21:56 to ultimately decide causality between things.
00:22:01 People are not very good
00:22:02 at establishing causality, first of all.
00:22:04 So first of all, you talk to physicists
00:22:06 and physicists actually don’t believe in causality
00:22:08 because look at all the basic laws of microphysics
00:22:12 are time reversible, so there’s no causality.
00:22:15 The arrow of time is not real, yeah.
00:22:17 It’s as soon as you start looking at macroscopic systems
00:22:20 where there is unpredictable randomness,
00:22:22 where there is clearly an arrow of time,
00:22:25 but it’s a big mystery in physics, actually,
00:22:27 how that emerges.
00:22:29 Is it emergent or is it part of
00:22:31 the fundamental fabric of reality?
00:22:34 Or is it a bias of intelligent systems
00:22:36 that because of the second law of thermodynamics,
00:22:39 we perceive a particular arrow of time,
00:22:41 but in fact, it’s kind of arbitrary, right?
00:22:45 So yeah, physicists, mathematicians,
00:22:47 they don’t care about, I mean,
00:22:48 the math doesn’t care about the flow of time.
00:22:51 Well, certainly, macrophysics doesn’t.
00:22:54 People themselves are not very good
00:22:55 at establishing causal relationships.
00:22:58 If you ask, I think it was in one of Seymour Papert’s book
00:23:02 on children learning.
00:23:06 He studied with Jean Piaget.
00:23:08 He’s the guy who coauthored the book Perceptron
00:23:11 with Marvin Minsky that kind of killed
00:23:12 the first wave of neural nets,
00:23:14 but he was actually a learning person.
00:23:17 He, in the sense of studying learning in humans
00:23:21 and machines, that’s why he got interested in Perceptron.
00:23:24 And he wrote that if you ask a little kid
00:23:29 about what is the cause of the wind,
00:23:33 a lot of kids will say, they will think for a while
00:23:35 and they’ll say, oh, it’s the branches in the trees,
00:23:38 they move and that creates wind, right?
00:23:40 So they get the causal relationship backwards.
00:23:42 And it’s because their understanding of the world
00:23:44 and intuitive physics is not that great, right?
00:23:46 I mean, these are like, you know, four or five year old kids.
00:23:49 You know, it gets better,
00:23:50 and then you understand that this can't be, right?
00:23:54 But there are many things which we can,
00:23:57 because of our common sense understanding of things,
00:24:00 what people call common sense,
00:24:03 and our understanding of physics,
00:24:05 we can, there’s a lot of stuff
00:24:07 that we can figure out causality.
00:24:08 Even with diseases, we can figure out
00:24:10 what’s not causing what, often.
00:24:14 There’s a lot of mystery, of course,
00:24:16 but the idea is that you should be able
00:24:18 to encode that into systems,
00:24:20 because it seems unlikely they’d be able
00:24:21 to figure that out themselves.
00:24:22 Well, whenever we can do intervention,
00:24:24 but you know, all of humanity has been completely deluded
00:24:27 for millennia, probably since its existence,
00:24:30 about a very, very wrong causal relationship,
00:24:33 where whatever you can explain, you attribute it to,
00:24:35 you know, some deity, some divinity, right?
00:24:39 And that’s a cop out, that’s a way of saying like,
00:24:41 I don’t know the cause, so you know, God did it, right?
00:24:43 So you mentioned Marvin Minsky,
00:24:46 and the irony of, you know,
00:24:51 maybe causing the first AI winter.
00:24:54 You were there in the 90s, you were there in the 80s,
00:24:56 of course.
00:24:58 In the 90s, why do you think people lost faith
00:25:00 in deep learning, in the 90s, and found it again,
00:25:04 a decade later, over a decade later?
00:25:06 Yeah, it wasn’t called deep learning yet,
00:25:07 it was just called neural nets, but yeah,
00:25:11 they lost interest.
00:25:13 I mean, I think I would put that around 1995,
00:25:16 at least the machine learning community,
00:25:18 there was always a neural net community,
00:25:19 but it became kind of disconnected
00:25:23 from sort of mainstream machine learning, if you want.
00:25:26 There were, it was basically electrical engineering
00:25:30 that kept at it, and computer science gave up on neural nets.
00:25:38 I don’t know, you know, I was too close to it
00:25:40 to really sort of analyze it with sort of an unbiased eye,
00:25:46 if you want, but I would make a few guesses.
00:25:50 So the first one is, at the time, neural nets were,
00:25:55 it was very hard to make them work,
00:25:57 in the sense that you would implement backprop
00:26:02 in your favorite language, and that favorite language
00:26:06 was not Python, it was not MATLAB,
00:26:08 it was not any of those things,
00:26:09 because they didn’t exist, right?
00:26:10 You had to write it in Fortran or C,
00:26:13 or something like this, right?
00:26:16 So you would experiment with it,
00:26:18 you would probably make some very basic mistakes,
00:26:21 like, you know, badly initialize your weights,
00:26:23 make the network too small,
00:26:24 because you read in the textbook, you know,
00:26:25 you don’t want too many parameters, right?
00:26:27 And of course, you know, and you would train on XOR,
00:26:29 because you didn’t have any other data set to trade on.
00:26:32 And of course, you know, it works half the time.
00:26:33 So you would say, I give up.
00:26:36 Also, you would train it with batch gradient,
00:26:37 which, you know, isn’t that sufficient.
00:26:40 So there’s a lot of, there’s a bag of tricks
00:26:42 that you had to know to make those things work,
00:26:44 or you had to reinvent, and a lot of people just didn’t,
00:26:48 and they just couldn’t make it work.
00:26:51 So that’s one thing.
00:26:52 The investment in software platform
00:26:54 to be able to kind of, you know, display things,
00:26:58 figure out why things don’t work,
00:26:59 kind of get a good intuition for how to get them to work,
00:27:02 have enough flexibility so you can create, you know,
00:27:04 network architectures like convolutional nets
00:27:06 and stuff like that.
00:27:08 It was hard.
00:27:09 I mean, you had to write everything from scratch.
00:27:10 And again, you didn’t have any Python
00:27:11 or MATLAB or anything, right?
00:27:14 I read that, sorry to interrupt,
00:27:15 but I read that you wrote in Lisp
00:27:17 the first versions of LeNet with convolutional networks,
00:27:22 which, by the way, is one of my favorite languages.
00:27:25 That’s how I knew you were legit.
00:27:27 Turing award, whatever.
00:27:29 You programmed in Lisp, that’s…
00:27:30 It’s still my favorite language,
00:27:31 but it’s not that we programmed in Lisp,
00:27:34 it’s that we had to write our Lisp interpreter, okay?
00:27:38 Because it’s not like we used one that existed.
00:27:40 So we wrote a Lisp interpreter that we hooked up to,
00:27:43 you know, a backend library that we wrote also
00:27:46 for sort of neural net computation.
00:27:48 And then after a few years around 1991,
00:27:50 we invented this idea of basically having modules
00:27:54 that know how to forward propagate
00:27:56 and back propagate gradients,
00:27:57 and then interconnecting those modules in a graph.
00:28:01 Léon Bottou had made proposals
00:28:03 about this in the late eighties,
00:28:04 and we were able to implement this using our Lisp system.
00:28:08 Eventually we wanted to use that system
00:28:09 to build production code for character recognition
00:28:13 at Bell Labs.
00:28:14 So we actually wrote a compiler for that Lisp interpreter.
00:28:16 Patrice Simard, who is now at Microsoft,
00:28:19 kind of did the bulk of it, with Léon and me.
00:28:22 And so we could write our system in Lisp
00:28:24 and then compile to C,
00:28:26 and then we’ll have a self contained complete system
00:28:29 that could kind of do the entire thing.
00:28:33 Neither PyTorch nor TensorFlow can do this today.
00:28:36 Yeah, okay, it’s coming.
00:28:37 Yeah.
00:28:40 I mean, there’s something like that in PyTorch
00:28:42 called TorchScript.
00:28:44 And so, you know, we had to write our Lisp interpreter,
00:28:46 we had to write our Lisp compiler,
00:28:48 we had to invest a huge amount of effort to do this.
00:28:50 And not everybody,
00:28:52 if you don’t completely believe in the concept,
00:28:55 you’re not going to invest the time to do this.
00:28:57 Now at the time also, you know,
00:28:59 or today, this would turn into Torch or PyTorch
00:29:02 or TensorFlow or whatever,
00:29:03 we’d put it in open source, everybody would use it
00:29:05 and, you know, realize it’s good.
00:29:07 Back before 1995, working at AT&T,
00:29:11 there’s no way the lawyers would let you
00:29:13 release anything in open source of this nature.
00:29:17 And so we could not distribute our code really.
00:29:20 And on that point,
00:29:21 and sorry to go on a million tangents,
00:29:23 but on that point, I also read that there was some,
00:29:26 almost like a patent on convolutional neural networks
00:29:30 at Bell Labs.
00:29:32 So that, first of all, I mean, just.
00:29:35 There’s two actually.
00:29:38 That ran out.
00:29:39 Thankfully, in 2007.
00:29:41 In 2007.
00:29:42 So I’m gonna, what,
00:29:46 can we just talk about that for a second?
00:29:48 I know you’re a Facebook, but you’re also at NYU.
00:29:51 And what does it mean to patent ideas
00:29:55 like these software ideas, essentially?
00:29:58 Or what are mathematical ideas?
00:30:02 Or what are they?
00:30:03 Okay, so they’re not mathematical ideas.
00:30:05 They are, you know, algorithms.
00:30:07 And there was a period where the US Patent Office
00:30:11 would allow the patent of software
00:30:14 as long as it was embodied.
00:30:16 The Europeans are very different.
00:30:18 They don’t quite accept that.
00:30:20 They have a different concept.
00:30:21 But, you know, I don’t, I no longer,
00:30:24 I mean, I never actually strongly believed in this,
00:30:26 but I don’t believe in this kind of patent.
00:30:28 Facebook basically doesn’t believe in this kind of patent.
00:30:34 Google files patents because they've been burned by Apple.
00:30:39 And so now they do this for defensive purpose,
00:30:41 but usually they say,
00:30:42 we’re not gonna sue you if you infringe.
00:30:44 Facebook has a similar policy.
00:30:47 They say, you know, we file patents on certain things
00:30:49 for defensive purpose.
00:30:50 We’re not gonna sue you if you infringe,
00:30:52 unless you sue us.
00:30:54 So the industry does not believe in patents.
00:30:59 They are there because of, you know,
00:31:00 the legal landscape and various things.
00:31:03 But I don’t really believe in patents
00:31:06 for this kind of stuff.
00:31:07 So that’s a great thing.
00:31:09 So I…
00:31:10 I’ll tell you a worse story, actually.
00:31:11 So what happens was the first patent about convolutional net
00:31:15 was about kind of the early version of convolutional net
00:31:18 that didn’t have separate pooling layers.
00:31:19 It had convolutional layers
00:31:22 with stride more than one, if you want, right?
00:31:25 And then there was a second one on convolutional nets
00:31:28 with separate pooling layers, trained with backprop.
00:31:31 And they were filed in 1989 and 1990
00:31:35 or something like this.
00:31:36 At the time, the life of a patent was 17 years.
00:31:40 So here’s what happened over the next few years
00:31:42 is that we started developing character recognition
00:31:45 technology around convolutional nets.
00:31:48 And in 1994,
00:31:52 a check reading system was deployed in ATM machines.
00:31:56 In 1995, it was for large check reading machines
00:31:59 in back offices, et cetera.
00:32:00 And those systems were developed by an engineering group
00:32:04 that we were collaborating with at AT&T.
00:32:07 And they were commercialized by NCR,
00:32:08 which at the time was a subsidiary of AT&T.
00:32:11 Now AT&T split up in 1996,
00:32:17 early 1996.
00:32:18 And the lawyers just looked at all the patents
00:32:20 and they distributed the patents among the various companies.
00:32:23 They gave the convolutional net patent to NCR
00:32:26 because they were actually selling products that used it.
00:32:29 But nobody at NCR had any idea what a convolutional net was.
00:32:32 Yeah.
00:32:33 Okay.
00:32:34 So between 1996 and 2007,
00:32:38 so there’s a whole period until 2002
00:32:39 where I didn’t actually work on machine learning
00:32:42 or convolutional net.
00:32:42 I resumed working on this around 2002.
00:32:45 And between 2002 and 2007,
00:32:47 I was working on them, crossing my finger
00:32:49 that nobody at NCR would notice.
00:32:51 Nobody noticed.
00:32:52 Yeah, and I hope that this kind of somewhat,
00:32:55 as you said, lawyers aside,
00:32:58 relative openness of the community now will continue.
00:33:02 It accelerates the entire progress of the industry.
00:33:05 And the problems that Facebook and Google
00:33:11 and others are facing today
00:33:13 is not whether Facebook or Google or Microsoft or IBM
00:33:16 or whoever is ahead of the other.
00:33:18 It’s that we don’t have the technology
00:33:19 to build the things we want to build.
00:33:21 We want to build intelligent virtual assistants
00:33:23 that have common sense.
00:33:24 We don’t have monopoly on good ideas for this.
00:33:26 We don’t believe we do.
00:33:27 Maybe others believe they do, but we don’t.
00:33:30 Okay.
00:33:31 If a startup tells you they have the secret
00:33:33 to human level intelligence and common sense,
00:33:36 don’t believe them, they don’t.
00:33:38 And it’s gonna take the entire work
00:33:42 of the world research community for a while
00:33:45 to get to the point where you can go off
00:33:47 and each of those companies
00:33:49 kind of start to build things on this.
00:33:50 We’re not there yet.
00:33:51 It’s absolutely, and this calls to the gap
00:33:54 between the space of ideas
00:33:57 and the rigorous testing of those ideas
00:34:00 of practical application that you often speak to.
00:34:03 You’ve written advice saying don’t get fooled
00:34:06 by people who claim to have a solution
00:34:08 to artificial general intelligence,
00:34:10 who claim to have an AI system
00:34:11 that works just like the human brain
00:34:14 or who claim to have figured out how the brain works.
00:34:17 Ask them what the error rate they get
00:34:20 on MNIST or ImageNet.
00:34:23 So this is a little dated by the way.
00:34:25 2000, I mean five years, who’s counting?
00:34:28 Okay, but I think your opinion is still,
00:34:30 MNIST and ImageNet, yes, may be dated,
00:34:34 there may be new benchmarks, right?
00:34:36 But I think that philosophy is one you still
00:34:39 in somewhat hold, that benchmarks
00:34:43 and the practical testing, the practical application
00:34:45 is where you really get to test the ideas.
00:34:48 Well, it may not be completely practical.
00:34:49 Like for example, it could be a toy data set,
00:34:52 but it has to be some sort of task
00:34:54 that the community as a whole has accepted
00:34:57 as some sort of standard kind of benchmark if you want.
00:35:00 It doesn’t need to be real.
00:35:01 So for example, many years ago here at FAIR,
00:35:05 people, Jason Weston and Antoine Bordes
00:35:07 and a few others proposed the bAbI tasks,
00:35:09 which were kind of a toy problem to test
00:35:12 the ability of machines to reason actually
00:35:14 to access working memory and things like this.
00:35:16 And it was very useful even though it wasn’t a real task.
00:35:20 MNIST is kind of halfway real task.
00:35:23 So toy problems can be very useful.
00:35:26 It’s just that I was really struck by the fact
00:35:29 that a lot of people, particularly a lot of people
00:35:31 with money to invest would be fooled by people telling them,
00:35:34 oh, we have the algorithm of the cortex
00:35:37 and you should give us 50 million.
00:35:39 Yes, absolutely.
00:35:40 So there’s a lot of people who try to take advantage
00:35:45 of the hype for business reasons and so on.
00:35:48 But let me sort of talk to this idea
00:35:50 that sort of new ideas, the ideas that push the field
00:35:55 forward may not yet have a benchmark
00:35:58 or it may be very difficult to establish a benchmark.
00:36:00 I agree.
00:36:01 That’s part of the process.
00:36:02 Establishing benchmarks is part of the process.
00:36:04 So what are your thoughts about,
00:36:07 so we have these benchmarks on around stuff we can do
00:36:10 with images from classification to captioning
00:36:14 to just every kind of information you can pull off
00:36:16 from images and the surface level.
00:36:18 There’s audio data sets, there’s some video.
00:36:22 What can we start, natural language, what kind of stuff,
00:36:27 what kind of benchmarks do you see that start creeping
00:36:30 on to more something like intelligence, like reasoning,
00:36:34 like maybe you don’t like the term,
00:36:37 but AGI echoes of that kind of formulation.
00:36:41 A lot of people are working on interactive environments
00:36:44 in which you can train and test intelligence systems.
00:36:48 So there, for example, it’s the classical paradigm
00:36:54 of supervised learning is that you have a data set,
00:36:57 you partition it into a training set, validation set,
00:37:00 test set, and there’s a clear protocol, right?
00:37:03 But that assumes that the samples
00:37:06 are statistically independent, you can exchange them,
00:37:10 the order in which you see them shouldn’t matter,
00:37:12 things like that.
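For reference, the "clear protocol" being contrasted here is the usual i.i.d. split; a minimal sketch with hypothetical data (scikit-learn assumed available):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical dataset; the classical protocol assumes samples are i.i.d.,
# so the order in which they are seen does not matter.
X = np.random.randn(1000, 10)
y = np.random.randint(0, 2, size=1000)

# Partition once, up front: train / validation / test.
X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.3, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_rest, y_rest, test_size=0.5, random_state=0)

# Fit on train, tune on validation, report once on test. This protocol breaks
# as soon as the system's own answers influence which samples it sees next.
```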
00:37:13 But what if the answer you give determines
00:37:16 the next sample you see, which is the case, for example,
00:37:18 in robotics, right?
00:37:19 Your robot does something and then it gets exposed
00:37:22 to a new room, and depending on where it goes,
00:37:25 the room would be different.
00:37:26 So that creates the exploration problem.
00:37:30 That also creates a dependency
00:37:34 between samples, right?
00:37:35 You, if you move, if you can only move in space,
00:37:39 the next sample you’re gonna see is gonna be probably
00:37:41 in the same building, most likely, right?
00:37:44 So all the assumptions about the validity
00:37:47 of this training set, test set hypothesis break.
00:37:51 Whenever a machine can take an action
00:37:53 that has an influence in the world,
00:37:54 and it’s what it’s gonna see.
00:37:56 So people are setting up artificial environments
00:38:00 where that takes place, right?
00:38:02 The robot runs around a 3D model of a house
00:38:05 and can interact with objects and things like this.
00:38:08 So you do robotics based simulation,
00:38:10 you have those OpenAI Gym type things
00:38:14 or MuJoCo kind of simulated robots
00:38:18 and you have games, things like that.
00:38:21 So that’s where the field is going really,
00:38:23 this kind of environment.
00:38:25 Now, back to the question of AGI.
00:38:28 I don’t like the term AGI because it implies
00:38:33 that human intelligence is general
00:38:35 and human intelligence is nothing like general.
00:38:38 It’s very, very specialized.
00:38:40 We think it’s general.
00:38:41 We’d like to think of ourselves
00:38:42 as having general intelligence.
00:38:43 We don’t, we’re very specialized.
00:38:46 We’re only slightly more general than.
00:38:47 Why does it feel general?
00:38:48 So you kind of, the term general.
00:38:52 I think what’s impressive about humans is ability to learn,
00:38:56 as we were talking about learning,
00:38:58 to learn in just so many different domains.
00:39:01 It’s perhaps not arbitrarily general,
00:39:04 but just you can learn in many domains
00:39:06 and integrate that knowledge somehow.
00:39:08 Okay.
00:39:09 The knowledge persists.
00:39:09 So let me take a very specific example.
00:39:11 Yes.
00:39:12 It’s not an example.
00:39:13 It’s more like a quasi mathematical demonstration.
00:39:17 So you have about 1 million fibers
00:39:18 coming out of one of your eyes.
00:39:20 Okay, 2 million total,
00:39:21 but let’s talk about just one of them.
00:39:23 It’s 1 million nerve fibers, your optical nerve.
00:39:27 Let’s imagine that they are binary.
00:39:28 So they can be active or inactive, right?
00:39:30 So the input to your visual cortex is 1 million bits.
00:39:34 Mm hmm.
00:39:36 Now they’re connected to your brain in a particular way,
00:39:39 and your brain has connections
00:39:41 that are kind of a little bit like a convolutional net,
00:39:44 they’re kind of local, you know, in space
00:39:46 and things like this.
00:39:47 Now, imagine I play a trick on you.
00:39:50 It’s a pretty nasty trick, I admit.
00:39:53 I cut your optical nerve,
00:39:55 and I put a device that makes a random perturbation
00:39:58 of a permutation of all the nerve fibers.
00:40:01 So now what comes to your brain
00:40:04 is a fixed but random permutation of all the pixels.
00:40:09 There’s no way in hell that your visual cortex,
00:40:11 even if I do this to you in infancy,
00:40:14 will actually learn vision
00:40:16 to the same level of quality that you can.
00:40:20 Got it, and you’re saying there’s no way you’ve learned that?
00:40:22 No, because now two pixels that are nearby in the world
00:40:25 will end up in very different places in your visual cortex,
00:40:29 and your neurons there have no connections with each other
00:40:31 because they’re only connected locally.
00:40:33 So this whole, our entire, the hardware is built
00:40:36 in many ways to support?
00:40:38 The locality of the real world.
00:40:40 Yes, that’s specialization.
00:40:42 Yeah, but it’s still pretty damn impressive,
00:40:44 so it’s not perfect generalization, it’s not even close.
00:40:46 No, no, it’s not that it’s not even close, it’s not at all.
00:40:50 Yeah, it’s not, it’s specialized, yeah.
00:40:52 So how many Boolean functions?
00:40:54 So let’s imagine you want to train your visual system
00:40:58 to recognize particular patterns of those one million bits.
00:41:03 Okay, so that’s a Boolean function, right?
00:41:05 Either the pattern is here or not here,
00:41:07 this is a two way classification
00:41:09 with one million binary inputs.
00:41:13 How many such Boolean functions are there?
00:41:16 Okay, you have two to the one million
00:41:19 combinations of inputs,
00:41:21 for each of those you have an output bit,
00:41:24 and so you have two to the two to the one million
00:41:27 Boolean functions of this type, okay?
00:41:30 Which is an unimaginably large number.
00:41:33 How many of those functions can actually be computed
00:41:35 by your visual cortex?
00:41:37 And the answer is a tiny, tiny, tiny, tiny, tiny, tiny sliver.
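Written out, the counting argument is, with n = 10^6 binary input fibers:

```latex
\[
\#\{\text{input patterns}\} = 2^{n} = 2^{10^{6}},
\qquad
\#\{\text{Boolean functions}\} = 2^{2^{n}} = 2^{2^{10^{6}}},
\]
```

of which a locally connected system like a visual cortex (or a convolutional net), with its finite number of synapses, can realize only a vanishingly small fraction.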
00:41:41 Like an enormously tiny sliver.
00:41:43 Yeah, yeah.
00:41:44 So we are ridiculously specialized.
00:41:48 Okay.
00:41:49 But, okay, that’s an argument against the word general.
00:41:54 I think there’s a, I agree with your intuition,
00:41:59 but I’m not sure it’s, it seems the brain is impressively
00:42:06 capable of adjusting to things, so.
00:42:09 It’s because we can’t imagine tasks
00:42:13 that are outside of our comprehension, right?
00:42:16 So we think we’re general because we’re general
00:42:18 of all the things that we can apprehend.
00:42:20 But there is a huge world out there
00:42:23 of things that we have no idea.
00:42:24 We call that heat, by the way.
00:42:26 Heat.
00:42:27 Heat.
00:42:28 So, at least physicists call that heat,
00:42:30 or they call it entropy, which is kind of.
00:42:33 You have a thing full of gas, right?
00:42:39 Closed system for gas.
00:42:40 Right?
00:42:41 Closed or not closed.
00:42:42 It has pressure, it has temperature, it has, you know,
00:42:47 and you can write equations, PV equals nRT,
00:42:50 you know, things like that, right?
00:42:52 When you reduce the volume, the temperature goes up,
00:42:54 the pressure goes up, you know, things like that, right?
00:42:57 For perfect gas, at least.
00:42:59 Those are the things you can know about that system.
00:43:02 And it’s a tiny, tiny number of bits
00:43:04 compared to the complete information
00:43:06 of the state of the entire system.
00:43:08 Because the state of the entire system
00:43:09 will give you the position of momentum
00:43:11 of every molecule of the gas.
00:43:14 And what you don’t know about it is the entropy,
00:43:17 and you interpret it as heat.
00:43:20 The energy contained in that thing is what we call heat.
00:43:24 Now, it’s very possible that, in fact,
00:43:28 there is some very strong structure
00:43:30 in how those molecules are moving.
00:43:31 It’s just that they are in a way
00:43:33 that we are just not wired to perceive.
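The textbook relations behind this example: the ideal gas law summarizes the system in a handful of macroscopic variables, while Boltzmann's entropy counts the microscopic detail those variables leave out.

```latex
\[
PV = nRT, \qquad S = k_{B} \ln \Omega,
\]
```

where Omega is the number of microstates (positions and momenta of all the molecules) compatible with the few macroscopic quantities we actually track.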
00:43:35 Yeah, we’re ignorant to it.
00:43:36 And there’s, in your infinite amount of things,
00:43:40 we’re not wired to perceive.
00:43:41 And you’re right, that’s a nice way to put it.
00:43:44 We’re general to all the things we can imagine,
00:43:47 which is a very tiny subset of all things that are possible.
00:43:51 So it’s like comograph complexity
00:43:53 or the comograph chitin sum of complexity.
00:43:55 Yeah.
00:43:56 You know, every bit string or every integer is random,
00:44:02 except for all the ones that you can actually write down.
00:44:05 Yeah.
00:44:06 Yeah.
00:44:06 Yeah.
00:44:07 Yeah.
00:44:08 Yeah.
00:44:09 Yeah.
00:44:10 Yeah, okay.
00:44:12 So beautifully put.
00:44:13 But, you know, so we can just call it artificial intelligence.
00:44:15 We don’t need to have a general.
00:44:17 Or human level.
00:44:18 Human level intelligence is good.
00:44:20 You know, you’ll start, anytime you touch human,
00:44:24 it gets interesting because, you know,
00:44:30 it’s because we attach ourselves to human
00:44:33 and it’s difficult to define what human intelligence is.
00:44:36 Yeah.
00:44:37 Nevertheless, my definition is maybe damn impressive
00:44:42 intelligence, okay?
00:44:43 Damn impressive demonstration of intelligence, whatever.
00:44:46 And so on that topic, most successes in deep learning
00:44:51 have been in supervised learning.
00:44:53 What is your view on unsupervised learning?
00:44:57 Is there a hope to reduce involvement of human input
00:45:03 and still have successful systems
00:45:05 that have practical use?
00:45:08 Yeah, I mean, there’s definitely a hope.
00:45:09 It’s more than a hope, actually.
00:45:11 It’s mounting evidence for it.
00:45:13 And that’s basically all I do.
00:45:16 Like, the only thing I’m interested in at the moment is,
00:45:19 I call it self supervised learning, not unsupervised.
00:45:21 Because unsupervised learning is a loaded term.
00:45:25 People who know something about machine learning,
00:45:27 you know, tell you, so you’re doing clustering or PCA,
00:45:30 which is not the case.
00:45:31 And the wider public, you know,
00:45:32 when you say unsupervised learning,
00:45:33 oh my God, machines are gonna learn by themselves
00:45:35 without supervision.
00:45:37 You know, they see this as…
00:45:39 Where’s the parents?
00:45:40 Yeah, so I call it self supervised learning
00:45:42 because, in fact, the underlying algorithms that are used
00:45:46 are the same algorithms as the supervised learning
00:45:48 algorithms, except that what we train them to do
00:45:52 is not to predict a particular set of variables,
00:45:55 like the category of an image,
00:46:00 and not to predict a set of variables
00:46:02 that have been provided by human labelers.
00:46:06 But what you’re trying the machine to do
00:46:07 is basically reconstruct a piece of its input
00:46:10 that is being masked out, essentially.
00:46:14 You can think of it this way, right?
00:46:15 So show a piece of video to a machine
00:46:18 and ask it to predict what’s gonna happen next.
00:46:20 And of course, after a while, you can show what happens
00:46:23 and the machine will kind of train itself
00:46:26 to do better at that task.
00:46:28 You can do like all the latest, most successful models
00:46:32 in natural language processing,
00:46:33 use self supervised learning.
00:46:36 You know, sort of BERT style systems, for example, right?
00:46:38 You show it a window of a dozen words on a text corpus,
00:46:43 you take out 15% of the words,
00:46:46 and then you train the machine to predict the words
00:46:49 that are missing, that self supervised learning.
00:46:52 It’s not predicting the future,
00:46:53 it’s just predicting things in the middle,
00:46:56 but you could have it predict the future,
00:46:57 that’s what language models do.
00:46:59 So you construct, so in an unsupervised way,
00:47:01 you construct a model of language.
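A minimal sketch of the masking setup described here: hide roughly 15% of the tokens in a window and train the model to predict them. The vocabulary size, mask id, and window length below are placeholders, not those of any particular system.

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB_SIZE = 100_000   # order of magnitude of a lexicon
MASK_ID = 0            # hypothetical id reserved for the mask token

def mask_tokens(token_ids, mask_prob=0.15):
    """BERT-style self-supervision: hide ~15% of the tokens and keep track of
    what was hidden, so the model can be trained to predict it back."""
    token_ids = np.asarray(token_ids)
    hidden = rng.random(token_ids.shape) < mask_prob
    inputs = np.where(hidden, MASK_ID, token_ids)
    targets = np.where(hidden, token_ids, -1)      # -1 = ignore in the loss
    return inputs, targets

# Stand-in for a tokenized window of text; a real system would feed `inputs`
# to a network and minimize cross-entropy between its per-position probability
# vector over the vocabulary and the words recorded in `targets`.
tokens = rng.integers(1, VOCAB_SIZE, size=512)
inputs, targets = mask_tokens(tokens)
```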
00:47:03 Do you think…
00:47:05 Or video or the physical world or whatever, right?
00:47:09 How far do you think that can take us?
00:47:12 Do you think BERT understands anything?
00:47:18 To some level, it has a shallow understanding of text,
00:47:23 but it needs to, I mean,
00:47:24 to have kind of true human level intelligence,
00:47:26 I think you need to ground language in reality.
00:47:29 So some people are attempting to do this, right?
00:47:32 Having systems that kind of have some visual representation
00:47:35 of what is being talked about,
00:47:37 which is one reason you need
00:47:38 those interactive environments actually.
00:47:41 But this is like a huge technical problem
00:47:43 that is not solved,
00:47:45 and that explains why self supervised learning
00:47:47 works in the context of natural language,
00:47:49 but does not work in the context, or at least not well,
00:47:52 in the context of image recognition and video,
00:47:55 although it’s making progress quickly.
00:47:57 And the reason, that reason is the fact that
00:48:01 it’s much easier to represent uncertainty in the prediction
00:48:05 in a context of natural language
00:48:06 than it is in the context of things like video and images.
00:48:10 So for example, if I ask you to predict
00:48:12 what words are missing,
00:48:14 15% of the words that I’ve taken out.
00:48:17 The possibilities are small.
00:48:19 That means… It’s small, right?
00:47:20 There are 100,000 words in the lexicon,
00:48:23 and what the machine spits out
00:48:24 is a big probability vector, right?
00:48:27 It’s a bunch of numbers between zero and one
00:48:29 that sum to one.
00:48:30 And we know how to do this with computers.
00:48:34 So there, representing uncertainty in the prediction
00:48:36 is relatively easy, and that’s, in my opinion,
00:48:39 why those techniques work for NLP.
00:48:42 For images, if you ask…
00:48:45 If you block a piece of an image,
00:48:46 and you ask the system,
00:48:47 reconstruct that piece of the image,
00:48:49 there are many possible answers.
00:48:51 They are all perfectly legit, right?
00:48:54 And how do you represent this set of possible answers?
00:48:58 You can’t train a system to make one prediction.
00:49:00 You can’t train a neural net to say,
00:49:02 here it is, that’s the image,
00:49:04 because there’s a whole set of things
00:49:06 that are compatible with it.
00:49:07 So how do you get the machine to represent
00:49:08 not a single output, but a whole set of outputs?
00:49:13 And similarly with video prediction,
00:49:17 there’s a lot of things that can happen
00:49:19 in the future of video.
00:49:20 You’re looking at me right now.
00:49:21 I’m not moving my head very much,
00:49:22 but I might turn my head to the left or to the right.
00:49:26 If you don’t have a system that can predict this,
00:49:30 and you train it with least square
00:49:31 to minimize the error with the prediction
00:49:33 and what I’m doing,
00:49:34 what you get is a blurry image of myself
00:49:36 in all possible future positions that I might be in,
00:49:39 which is not a good prediction.
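The blur described here is just what least squares does to a multimodal future: the single output that minimizes squared error is the average of the possible outcomes, not any one of them. A toy numeric illustration with made-up values:

```python
import numpy as np

# Two equally likely futures: the head turns left (-1) or right (+1).
futures = np.array([-1.0, +1.0])

# Search for the single prediction with the lowest expected squared error.
candidates = np.linspace(-1.5, 1.5, 301)
mse = np.array([np.mean((futures - c) ** 2) for c in candidates])
print(candidates[mse.argmin()])   # ~0.0: the "blurry" average of both outcomes
```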
00:49:41 So there might be other ways
00:49:43 to do the self supervision for visual scenes.
00:49:48 Like what?
00:49:48 I mean, if I knew, I wouldn’t tell you,
00:49:52 I'd publish it first, I don't know.
00:49:55 No, there might be.
00:49:57 So I mean, these are kind of,
00:50:00 there might be artificial ways of like self play in games,
00:50:03 the way you can simulate part of the environment.
00:50:05 Oh, that doesn’t solve the problem.
00:50:06 It’s just a way of generating data.
00:50:10 But because you have more of a control,
00:50:12 like maybe you can control,
00:50:14 yeah, it’s a way to generate data.
00:50:16 That’s right.
00:50:16 And because you can do huge amounts of data generation,
00:50:20 that doesn’t, you’re right.
00:50:21 Well, it creeps up on the problem from the side of data,
00:50:26 and you don’t think that’s the right way to creep up.
00:50:27 It doesn’t solve this problem
00:50:28 of handling uncertainty in the world, right?
00:50:30 So if you have a machine learn a predictive model
00:50:35 of the world in a game that is deterministic
00:50:38 or quasi deterministic, it’s easy, right?
00:50:42 Just give a few frames of the game to a ConvNet,
00:50:45 put a bunch of layers,
00:50:47 and then have it generate the next few frames of the game.
00:50:49 And if the game is deterministic, it works fine.
00:50:54 And that includes feeding the system with the action
00:50:59 that your little character is gonna take.
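A sketch of that setup, with shapes and an architecture chosen only for brevity: a small ConvNet takes a stack of past frames plus a one-hot action and outputs the next frame.

```python
# Action-conditioned next-frame prediction for a (nearly) deterministic game.
# Frame counts, resolution, and the number of actions are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F

N_PAST, H, W, N_ACTIONS = 4, 64, 64, 8

class FramePredictor(nn.Module):
    def __init__(self):
        super().__init__()
        # the chosen action is broadcast as extra input planes next to the stacked frames
        self.net = nn.Sequential(
            nn.Conv2d(N_PAST + N_ACTIONS, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 1, 3, padding=1),          # one-channel next frame
        )

    def forward(self, frames, action_onehot):
        # frames: (batch, N_PAST, H, W); action_onehot: (batch, N_ACTIONS)
        planes = action_onehot[:, :, None, None].expand(-1, -1, H, W)
        return self.net(torch.cat([frames, planes], dim=1))

model = FramePredictor()
frames = torch.rand(2, N_PAST, H, W)
actions = F.one_hot(torch.tensor([0, 3]), N_ACTIONS).float()
next_frame = model(frames, actions)                  # (2, 1, 64, 64)
# In a deterministic game, minimizing MSE against the true next frame works fine;
# in a stochastic world the same loss gives the blurry averages discussed earlier.
```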
00:51:03 The problem comes from the fact that the real world
00:51:06 and most games are not entirely predictable.
00:51:09 And so there you get those blurry predictions
00:51:11 and you can’t do planning with blurry predictions, right?
00:51:14 So if you have a perfect model of the world,
00:51:17 you can, in your head, run this model
00:51:20 with a hypothesis for a sequence of actions,
00:51:24 and you’re going to predict the outcome
00:51:25 of that sequence of actions.
00:51:28 But if your model is imperfect, how can you plan?
00:51:32 Yeah, it quickly explodes.
00:51:34 What are your thoughts on the extension of this,
00:51:37 which topic I’m super excited about,
00:51:39 it’s connected to something you were talking about
00:51:41 in terms of robotics, is active learning.
00:51:44 So as opposed to sort of completely unsupervised
00:51:47 or self supervised learning,
00:51:51 you ask the system for human help
00:51:54 for selecting parts you want annotated next.
00:51:58 So if you think about a robot exploring a space
00:52:00 or a baby exploring a space
00:52:02 or a system exploring a data set,
00:52:05 every once in a while asking for human input,
00:52:07 do you see value in that kind of work?
00:52:12 I don’t see transformative value.
00:52:14 It’s going to make things that we can already do
00:52:18 more efficient or they will learn slightly more efficiently,
00:52:20 but it’s not going to make machines
00:52:21 sort of significantly more intelligent.
00:52:23 I think, and by the way, there is no opposition,
00:52:29 there’s no conflict between self supervised learning,
00:52:34 reinforcement learning and supervised learning
00:52:35 or imitation learning or active learning.
00:52:39 I see self supervised learning
00:52:40 as a preliminary to all of the above.
00:52:43 Yes.
00:52:44 So the example I use very often is how is it that,
00:52:50 so if you use classical reinforcement learning,
00:52:54 deep reinforcement learning, if you want,
00:52:57 the best methods today,
00:53:01 so called model free reinforcement learning
00:53:03 to learn to play Atari games,
00:53:04 take about 80 hours of training to reach the level
00:53:07 that any human can reach in about 15 minutes.
00:53:11 They get better than humans, but it takes them a long time.
00:53:16 AlphaStar, okay, you know,
00:53:20 Oriol Vinyals and his team’s
00:53:22 system to play StarCraft plays,
00:53:27 you know, a single map, a single type of player,
00:53:32 and can reach better than human level
00:53:38 with about the equivalent of 200 years of training
00:53:43 playing against itself.
00:53:45 It’s 200 years, right?
00:53:46 It’s not something that any human can ever do.
00:53:50 I mean, I’m not sure what lesson to take away from that.
00:53:52 Okay, now take those algorithms,
00:53:54 the best algorithms we have today
00:53:57 to train a car to drive itself.
00:54:00 It would probably have to drive millions of hours.
00:54:02 It will have to kill thousands of pedestrians.
00:54:04 It will have to run into thousands of trees.
00:54:06 It will have to run off cliffs.
00:54:08 And it would have to run off a cliff multiple times
00:54:10 before it figures out that it’s a bad idea, first of all.
00:54:14 And second of all, before it figures out how not to do it.
00:54:17 And so, I mean, this type of learning obviously
00:54:19 does not reflect the kind of learning
00:54:21 that animals and humans do.
00:54:23 There is something missing
00:54:24 that’s really, really important there.
00:54:26 And my hypothesis, which I’ve been advocating
00:54:28 for like five years now,
00:54:30 is that we have predictive models of the world
00:54:34 that include the ability to predict under uncertainty.
00:54:38 And what allows us to not run off a cliff
00:54:43 when we learn to drive,
00:54:44 most of us can learn to drive in about 20 or 30 hours
00:54:47 of training without ever crashing, causing any accident.
00:54:50 And if we drive next to a cliff,
00:54:53 we know that if we turn the wheel to the right,
00:54:55 the car is gonna run off the cliff
00:54:57 and nothing good is gonna come out of this.
00:54:58 Because we have a pretty good model of intuitive physics
00:55:00 that tells us the car is gonna fall.
00:55:02 We know about gravity.
00:55:04 Babies learn this around the age of eight or nine months
00:55:07 that objects don’t float, they fall.
00:55:11 And we have a pretty good idea of the effect
00:55:13 of turning the wheel on the car
00:55:15 and we know we need to stay on the road.
00:55:16 So there’s a lot of things that we bring to the table,
00:55:19 which is basically our predictive model of the world.
00:55:22 And that model allows us to not do stupid things.
00:55:25 And to basically stay within the context
00:55:28 of things we need to do.
00:55:29 We still face unpredictable situations
00:55:32 and that’s how we learn.
00:55:34 But that allows us to learn really, really, really quickly.
00:55:37 So that’s called model based reinforcement learning.
00:55:41 There’s some imitation and supervised learning
00:55:43 because we have a driving instructor
00:55:44 that tells us occasionally what to do.
00:55:47 But most of the learning is learning the model,
00:55:52 learning physics that we’ve done since we were babies.
00:55:55 That’s where almost all the learning is.
00:55:56 And the physics is somewhat transferable,
00:56:00 it’s transferable from scene to scene.
00:56:01 Stupid things are the same everywhere.
00:56:04 Yeah, I mean, if you have experience of the world,
00:56:07 you don’t need to be from a particularly intelligent species
00:56:11 to know that if you spill water from a container,
00:56:16 the rest is gonna get wet.
00:56:18 You might get wet.
00:56:20 So cats know this, right?
00:56:22 Yeah.
00:56:23 Right, so the main problem we need to solve
00:56:27 is how do we learn models of the world?
00:56:29 That’s what I’m interested in.
00:56:31 That’s what self supervised learning is all about.
00:56:34 If you were to try to construct a benchmark for,
00:56:39 let’s look at MNIST.
00:56:41 I love that data set.
00:56:44 Do you think it’s useful, interesting, slash possible
00:56:48 to perform well on MNIST with just one example
00:56:52 of each digit and how would we solve that problem?
00:56:58 The answer is probably yes.
00:56:59 The question is what other type of learning
00:57:02 are you allowed to do?
00:57:03 So if what you’re allowed to do is train
00:57:04 on some gigantic data set of labeled digits,
00:57:07 that’s called transfer learning.
00:57:08 And we know that works, okay?
00:57:11 We do this at Facebook, like in production, right?
00:57:13 We train large convolutional nets to predict hashtags
00:57:17 that people type on Instagram
00:57:18 and we train on billions of images, literally billions.
00:57:20 And then we chop off the last layer
00:57:22 and fine tune on whatever task we want.
00:57:24 That works really well.
00:57:26 You can beat the ImageNet record with this.
00:57:28 We actually open sourced the whole thing
00:57:30 like a few weeks ago.
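The "chop off the last layer and fine tune" recipe looks roughly like the sketch below, here with a generic ImageNet-pretrained ResNet from torchvision standing in for the large pretrained network, not the actual Instagram-hashtag model.

```python
# "Chop off the last layer and fine tune", with a generic torchvision ResNet
# standing in for the large pretrained network (not the actual hashtag model).
import torch
import torch.nn as nn
from torchvision import models

NUM_CLASSES = 10   # number of classes in the downstream task (illustrative)

backbone = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)

# Replace the final classifier ("chop off the last layer") with a fresh head.
backbone.fc = nn.Linear(backbone.fc.in_features, NUM_CLASSES)

# Optionally freeze everything except the new head for the first fine-tuning epochs.
for name, param in backbone.named_parameters():
    param.requires_grad = name.startswith("fc.")

optimizer = torch.optim.Adam(
    (p for p in backbone.parameters() if p.requires_grad), lr=1e-3
)
# ...then train on the small labeled dataset as usual.
```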
00:57:31 Yeah, that’s still pretty cool.
00:57:33 But yeah, so what would be impressive?
00:57:36 What’s useful and impressive?
00:57:38 What kind of transfer learning
00:57:39 would be useful and impressive?
00:57:40 Is it Wikipedia, that kind of thing?
00:57:42 No, no, so I don’t think transfer learning
00:57:44 is really where we should focus.
00:57:46 We should try to do,
00:57:48 you know, have a kind of scenario for a benchmark
00:57:51 where you have unlabeled data,
00:57:53 and it’s a very large amount of unlabeled data.
00:57:58 It could be video clips.
00:58:00 It could be where you do, you know, frame prediction.
00:58:03 It could be images where you could choose to,
00:58:06 you know, mask a piece of it, could be whatever,
00:58:10 but they’re unlabeled and you’re not allowed to label them.
00:58:13 So you do some training on this,
00:58:18 and then you train on a particular supervised task,
00:58:24 ImageNet or MNIST,
00:58:26 and you measure how your test error decreases
00:58:30 or validation error decreases
00:58:31 as you increase the number of label training samples.
00:58:35 Okay, and what you’d like to see is that,
00:58:40 you know, your error decreases much faster
00:58:43 than if you train from scratch from random weights.
00:58:46 So that to reach the same level of performance
00:58:48 that a completely supervised, purely supervised system
00:58:52 would reach, you would need way fewer samples.
00:58:54 So that’s the crucial question
00:58:55 because it will answer the question to like, you know,
00:58:58 people interested in medical image analysis.
00:59:01 Okay, you know, if I want to get to a particular level
00:59:05 of error rate for this task,
00:59:07 I know I need a million samples.
00:59:10 Can I do, you know, self supervised pre training
00:59:13 to reduce this to about 100 or something?
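The benchmark protocol described here can be sketched as a label-efficiency curve. The pretraining, supervised training, and evaluation routines below are placeholders passed in by the caller, standing in for whatever self-supervised method and metric are actually under test.

```python
# Label-efficiency benchmark: how fast does test error drop with more labels,
# with vs. without self-supervised pretraining on unlabeled data?
# The pretraining, training, and evaluation routines are supplied by the caller.
def label_efficiency_curves(unlabeled_data, labeled_data, test_data, make_model,
                            pretrain_unlabeled, train_supervised, evaluate,
                            label_budgets=(100, 1_000, 10_000, 100_000)):
    curves = {"pretrained": [], "from_scratch": []}

    for n_labels in label_budgets:
        subset = labeled_data[:n_labels]

        # (a) self-supervised pretraining, then supervised fine-tuning
        model = make_model()
        pretrain_unlabeled(model, unlabeled_data)   # e.g. masking or frame prediction
        train_supervised(model, subset)
        curves["pretrained"].append((n_labels, evaluate(model, test_data)))

        # (b) baseline: same architecture trained from random weights
        baseline = make_model()
        train_supervised(baseline, subset)
        curves["from_scratch"].append((n_labels, evaluate(baseline, test_data)))

    # The hope: the pretrained curve reaches a given error with far fewer labels.
    return curves
```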
00:59:15 And you think the answer there
00:59:16 is self supervised pre training?
00:59:18 Yeah, some form, some form of it.
00:59:23 I’m telling you, active learning, but you disagree.
00:59:26 No, it’s not useless.
00:59:28 It’s just not gonna lead to a quantum leap.
00:59:30 It’s just gonna make things that we already do more efficient.
00:59:32 So you’re way smarter than me.
00:59:33 I just disagree with you.
00:59:35 But I don’t have anything to back that.
00:59:37 It’s just intuition.
00:59:38 So I’ve worked a lot with large scale data sets
00:59:40 and there’s something that might be magic
00:59:43 in active learning, but okay.
00:59:45 And at least I said it publicly.
00:59:48 At least I’m being an idiot publicly.
00:59:50 Okay.
00:59:51 It’s not being an idiot.
00:59:52 It’s, you know, working with the data you have.
00:59:54 I mean, I mean, certainly people are doing things like,
00:59:56 okay, I have 3000 hours of, you know,
00:59:59 imitation learning for a self driving car,
01:00:01 but most of those are incredibly boring.
01:00:03 What I like is select, you know, 10% of them
01:00:05 that are kind of the most informative.
01:00:07 And with just that, I would probably reach the same performance.
01:00:10 So it’s a weak form of active learning if you want.
01:00:14 Yes, but there might be a much stronger version.
01:00:18 Yeah, that’s right.
01:00:18 And that’s an open question, whether it exists.
01:00:21 The question is how much stronger can you get?
01:00:24 Elon Musk is confident.
01:00:26 Talked to him recently.
01:00:28 He’s confident that large scale data and deep learning
01:00:30 can solve the autonomous driving problem.
01:00:33 What are your thoughts on the limits,
01:00:36 possibilities of deep learning in this space?
01:00:38 It’s obviously part of the solution.
01:00:40 I mean, I don’t think we’ll ever have a self driving system
01:00:43 or at least not in the foreseeable future
01:00:45 that does not use deep learning.
01:00:47 Let me put it this way.
01:00:48 Now, how much of it?
01:00:49 So in the history of sort of engineering,
01:00:54 particularly sort of AI like systems,
01:00:58 there’s generally a first phase where everything is built by hand.
01:01:01 Then there is a second phase.
01:01:02 And that was the case for autonomous driving 20, 30 years ago.
01:01:06 There’s a phase where a little bit of learning is used,
01:01:09 but there’s a lot of engineering that’s involved in kind of
01:01:12 taking care of corner cases and putting limits, et cetera,
01:01:16 because the learning system is not perfect.
01:01:18 And then as technology progresses,
01:01:21 we end up relying more and more on learning.
01:01:23 That’s the history of character recognition,
01:01:25 it’s the history of speech recognition,
01:01:27 now computer vision,
01:01:29 and natural language processing.
01:01:31 And I think the same is going to happen with autonomous driving
01:01:36 that currently the methods that are closest
01:01:40 to providing some level of autonomy,
01:01:43 some decent level of autonomy
01:01:44 where you don’t expect a driver to kind of do anything
01:01:48 is where you constrain the world.
01:01:50 So you only run within 100 square kilometers
01:01:53 or square miles in Phoenix where the weather is nice
01:01:56 and the roads are wide, which is what Waymo is doing.
01:02:00 You completely overengineer the car with tons of LIDARs
01:02:04 and sophisticated sensors that are too expensive
01:02:08 for consumer cars,
01:02:09 but they’re fine if you just run a fleet.
01:02:13 And you engineer the hell out of everything else.
01:02:16 You map the entire world.
01:02:17 So you have a complete 3D model of everything.
01:02:20 So the only thing that the perception system
01:02:22 has to take care of is moving objects
01:02:24 and construction and sort of things that weren’t in your map.
01:02:30 And you can engineer a good SLAM system and all that stuff.
01:02:34 So that’s kind of the current approach
01:02:35 that’s closest to some level of autonomy.
01:02:37 But I think eventually the longterm solution
01:02:39 is going to rely more and more on learning
01:02:43 and possibly using a combination
01:02:45 of self supervised learning and model based reinforcement
01:02:49 or something like that.
01:02:50 But ultimately learning will be not just at the core,
01:02:54 but really the fundamental part of the system.
01:02:57 Yeah, it already is, but it will become more and more.
01:03:00 What do you think it takes to build a system
01:03:02 with human level intelligence?
01:03:04 You talked about the AI system in the movie Her
01:03:07 being way out of reach, our current reach.
01:03:10 This might be outdated as well, but.
01:03:12 It’s still way out of reach.
01:03:13 It’s still way out of reach.
01:03:15 What would it take to build Her?
01:03:18 Do you think?
01:03:19 So I can tell you the first two obstacles
01:03:21 that we have to clear,
01:03:22 but I don’t know how many obstacles there are after this.
01:03:24 So the image I usually use is that
01:03:26 there is a bunch of mountains that we have to climb
01:03:28 and we can see the first one,
01:03:29 but we don’t know if there are 50 mountains behind it or not.
01:03:33 And this might be a good sort of metaphor
01:03:34 for why AI researchers in the past
01:03:38 have been overly optimistic about the result of AI.
01:03:43 You know, for example,
01:03:45 Newell and Simon wrote the General Problem Solver
01:03:49 and they called it the General Problem Solver.
01:03:51 General problem solver.
01:03:52 And of course, the first thing you realize
01:03:54 is that all the problems you want to solve are exponential.
01:03:56 And so you can’t actually use it for anything useful,
01:03:59 but you know.
01:04:00 Yeah, so yeah, all you see is the first peak.
01:04:02 So in general, what are the first couple of peaks for Her?
01:04:05 So the first peak, which is precisely what I’m working on
01:04:08 is self supervised learning.
01:04:10 How do we get machines to learn models of the world
01:04:12 by observation, kind of like babies and like young animals?
01:04:15 So we’ve been working with, you know, cognitive scientists.
01:04:21 So Emmanuel Dupoux, who’s at FAIR in Paris
01:04:24 half time, is also a researcher at a French university.
01:04:30 And he has this chart that shows
01:04:36 at how many months of life baby humans
01:04:38 kind of learn different concepts.
01:04:40 And you can measure this in sort of various ways.
01:04:44 So things like distinguishing animate objects
01:04:49 from inanimate objects,
01:04:50 you can tell the difference at age two, three months.
01:04:54 Whether an object is going to stay stable,
01:04:56 is going to fall, you know,
01:04:58 about four months, you can tell.
01:05:00 You know, there are various things like this.
01:05:02 And then things like gravity,
01:05:04 the fact that objects are not supposed to float in the air,
01:05:06 but are supposed to fall,
01:05:07 you learn this around the age of eight or nine months.
01:05:10 If you look at the data,
01:05:11 eight or nine months, if you look at a lot of,
01:05:14 you know, eight month old babies,
01:05:15 you give them a bunch of toys on their high chair.
01:05:19 First thing they do is they throw them on the ground
01:05:20 and they look at them.
01:05:21 It’s because, you know, they’re learning about,
01:05:23 actively learning about gravity.
01:05:26 Gravity, yeah.
01:05:26 Okay, so they’re not trying to annoy you,
01:05:29 but they, you know, they need to do the experiment, right?
01:05:32 Yeah.
01:05:33 So, you know, how do we get machines to learn like babies,
01:05:36 mostly by observation with a little bit of interaction
01:05:39 and learning those models of the world?
01:05:41 Because I think that’s really a crucial piece
01:05:43 of an intelligent autonomous system.
01:05:46 So if you think about the architecture
01:05:47 of an intelligent autonomous system,
01:05:49 it needs to have a predictive model of the world.
01:05:51 So something that says, here is the state of the world at time T,
01:05:54 here is the state of the world at time T plus one,
01:05:55 if I take this action.
01:05:57 And it’s not a single answer, it can be a…
01:05:59 Yeah, it can be a distribution, yeah.
01:06:01 Yeah, well, but we don’t know how to represent
01:06:03 distributions in high dimensional continuous spaces.
01:06:04 So it’s gotta be something weaker than that, okay?
01:06:07 But with some representation of uncertainty.
01:06:09 If you have that, then you can do what optimal control
01:06:12 theorists call model predictive control,
01:06:14 which means that you can run your model
01:06:16 with a hypothesis for a sequence of actions
01:06:18 and then see the result.
01:06:20 Now, what you need, the other thing you need
01:06:22 is some sort of objective that you want to optimize.
01:06:24 Am I reaching the goal of grabbing this object?
01:06:27 Am I minimizing energy?
01:06:28 Am I whatever, right?
01:06:30 So there is some sort of objective that you have to minimize.
01:06:33 And so in your head, if you have this model,
01:06:35 you can figure out the sequence of action
01:06:37 that will optimize your objective.
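A minimal sketch of that loop, assuming a learned world model and an objective function supplied by the caller: sample candidate action sequences, imagine each one with the model, and keep the first action of the cheapest rollout (random-shooting model predictive control).

```python
# Random-shooting model predictive control: imagine candidate action sequences
# with the world model, score them with the objective, act on the best one.
# `world_model(state, action) -> next_state` and `objective(state) -> cost`
# are assumed to be provided by the caller.
import random

def plan_with_model(state, world_model, objective, actions,
                    horizon=10, n_candidates=100):
    best_cost, best_first_action = float("inf"), None

    for _ in range(n_candidates):
        candidate = [random.choice(actions) for _ in range(horizon)]
        simulated, cost = state, 0.0
        for action in candidate:                 # "run this model in your head"
            simulated = world_model(simulated, action)
            cost += objective(simulated)
        if cost < best_cost:
            best_cost, best_first_action = cost, candidate[0]

    # Execute only the first action, then re-plan from the newly observed state.
    return best_first_action
```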
01:06:38 That objective is something that ultimately is rooted
01:06:42 in your basal ganglia, at least in the human brain.
01:06:44 That’s what the basal ganglia does,
01:06:47 it computes your level of contentment or miscontentment.
01:06:50 I don’t know if that’s a word.
01:06:52 Unhappiness, okay?
01:06:53 Yeah, yeah.
01:06:54 Discontentment.
01:06:55 Discontentment, maybe.
01:06:56 And so your entire behavior is driven towards
01:07:01 kind of minimizing that objective,
01:07:03 which is maximizing your contentment,
01:07:05 computed by your basal ganglia.
01:07:07 And what you have is an objective function,
01:07:10 which is basically a predictor
01:07:12 of what your basal ganglia is going to tell you.
01:07:14 So you’re not going to put your hand on fire
01:07:16 because you know it’s going to burn
01:07:19 and you’re going to get hurt.
01:07:21 And you’re predicting this because of your model
01:07:23 of the world and your sort of predictor
01:07:25 of this objective, right?
01:07:27 So you have those three components,
01:07:31 well, four components.
01:07:32 You have the hardwired objective,
01:07:36 the hardwired contentment objective computer,
01:07:41 or calculator, if you want.
01:07:43 And then you have the three components.
01:07:45 One is the objective predictor,
01:07:46 which basically predicts your level of contentment.
01:07:48 One is the model of the world.
01:07:52 And there’s a third module I didn’t mention,
01:07:54 which is the module that will figure out
01:07:57 the best course of action to optimize an objective
01:08:00 given your model, okay?
01:08:03 Yeah.
01:08:04 And you can call this a policy network
01:08:07 or something like that, right?
01:08:09 Now, you need those three components
01:08:11 to act autonomously intelligently.
01:08:13 And you can be stupid in three different ways.
01:08:16 You can be stupid because your model of the world is wrong.
01:08:19 You can be stupid because your objective is not aligned
01:08:22 with what you actually want to achieve, okay?
01:08:27 In humans, that would be a psychopath.
01:08:30 And then the third way you can be stupid
01:08:33 is that you have the right model,
01:08:34 you have the right objective,
01:08:36 but you’re unable to figure out a course of action
01:08:38 to optimize your objective given your model.
01:08:41 Okay.
01:08:44 Some people who are in charge of big countries
01:08:45 actually have all three that are wrong.
01:08:47 All right.
01:08:50 Which countries?
01:08:51 I don’t know.
01:08:52 Okay, so if we think about this agent,
01:08:55 if we think about the movie Her,
01:08:58 you’ve criticized the art project
01:09:02 that is Sophia the Robot.
01:09:04 And what that project essentially does
01:09:07 is uses our natural inclination to anthropomorphize
01:09:11 things that look like human and give them more.
01:09:14 Do you think that could be used by AI systems
01:09:17 like in the movie Her?
01:09:21 So do you think that body is needed
01:09:23 to create a feeling of intelligence?
01:09:27 Well, if Sophia was just an art piece,
01:09:29 I would have no problem with it,
01:09:30 but it’s presented as something else.
01:09:33 Let me, on that comment real quick,
01:09:35 if creators of Sophia could change something
01:09:38 about their marketing or behavior in general,
01:09:40 what would it be?
01:09:41 What’s?
01:09:42 Just about everything.
01:09:44 I mean, don’t you think, here’s a tough question.
01:09:50 Let me, so I agree with you.
01:09:51 So Sophia is not, the general public feels
01:09:56 that Sophia can do way more than she actually can.
01:09:59 That’s right.
01:10:00 And the people who created Sophia
01:10:02 are not honestly publicly communicating,
01:10:08 trying to teach the public.
01:10:09 Right.
01:10:10 But here’s a tough question.
01:10:13 Don’t you think the same thing is happening, that scientists
01:10:19 in industry and research are taking advantage
01:10:22 of the same misunderstanding in the public
01:10:25 when they create AI companies or publish stuff?
01:10:29 Some companies, yes.
01:10:31 I mean, there is no sense of,
01:10:33 there’s no desire to delude.
01:10:34 There’s no desire to kind of over claim
01:10:37 when something is done, right?
01:10:38 You publish a paper on AI that has this result
01:10:41 on ImageNet, it’s pretty clear.
01:10:43 I mean, it’s not even interesting anymore,
01:10:44 but I don’t think there is that.
01:10:49 I mean, the reviewers are generally not very forgiving
01:10:52 of unsupported claims of this type.
01:10:57 And, but there are certainly quite a few startups
01:10:59 that have had a huge amount of hype around this
01:11:02 that I find extremely damaging
01:11:05 and I’ve been calling it out when I’ve seen it.
01:11:08 So yeah, but to go back to your original question,
01:11:10 like the necessity of embodiment,
01:11:13 I think, I don’t think embodiment is necessary.
01:11:15 I think grounding is necessary.
01:11:17 So I don’t think we’re gonna get machines
01:11:18 that really understand language
01:11:20 without some level of grounding in the real world.
01:11:22 And it’s not clear to me that language
01:11:24 is a high enough bandwidth medium
01:11:26 to communicate how the real world works.
01:11:28 So I think for this.
01:11:30 Can you talk to what grounding means?
01:11:32 So grounding means that,
01:11:34 so there is this classic problem of common sense reasoning,
01:11:37 you know, the Winograd schema, right?
01:11:41 And so I tell you the trophy doesn’t fit in the suitcase
01:11:44 because it’s too big,
01:11:46 or the trophy doesn’t fit in the suitcase
01:11:47 because it’s too small.
01:11:49 And the it in the first case refers to the trophy
01:11:51 in the second case to the suitcase.
01:11:53 And the reason you can figure this out
01:11:55 is because you know where the trophy and the suitcase are,
01:11:56 you know, one is supposed to fit in the other one
01:11:58 and you know the notion of size
01:12:00 and a big object doesn’t fit in a small object,
01:12:03 unless it’s a Tardis, you know, things like that, right?
01:12:05 So you have this knowledge of how the world works,
01:12:08 of geometry and things like that.
01:12:12 I don’t believe you can learn everything about the world
01:12:14 by just being told in language how the world works.
01:12:18 I think you need some low level perception of the world,
01:12:21 you know, be it visual touch, you know, whatever,
01:12:23 but some higher bandwidth perception of the world.
01:12:26 By reading all the world’s text,
01:12:28 you still might not have enough information.
01:12:31 That’s right.
01:12:32 There’s a lot of things that just will never appear in text
01:12:35 and that you can’t really infer.
01:12:37 So I think common sense will emerge from,
01:12:41 you know, certainly a lot of language interaction,
01:12:43 but also with watching videos
01:12:45 or perhaps even interacting in virtual environments
01:12:48 and possibly, you know, robot interacting in the real world.
01:12:51 But I don’t actually believe necessarily
01:12:53 that this last one is absolutely necessary.
01:12:56 But I think that there’s a need for some grounding.
01:13:00 But the final product
01:13:01 doesn’t necessarily need to be embodied, you’re saying.
01:13:04 No.
01:13:05 It just needs to have an awareness, a grounding to.
01:13:07 Right, but it needs to know how the world works
01:13:11 to have, you know, to not be frustrating to talk to.
01:13:15 And you talked about emotions being important.
01:13:19 That’s a whole nother topic.
01:13:21 Well, so, you know, I talked about this,
01:13:24 the basal ganglia as the thing
01:13:29 that calculates your level of miscontentment.
01:13:32 And then there is this other module
01:13:34 that sort of tries to do a prediction
01:13:36 of whether you’re going to be content or not.
01:13:38 That’s the source of some emotion.
01:13:40 So fear, for example, is an anticipation
01:13:43 of bad things that can happen to you, right?
01:13:47 You have this inkling that there is some chance
01:13:49 that something really bad is going to happen to you
01:13:50 and that creates fear.
01:13:52 When you know for sure
01:13:53 that something bad is going to happen to you,
01:13:54 you kind of give up, right?
01:13:55 It’s not fear anymore.
01:13:57 It’s uncertainty that creates fear.
01:13:59 So the punchline is,
01:14:01 we’re not going to have autonomous intelligence
01:14:02 without emotions.
01:14:07 Whatever the heck emotions are.
01:14:08 So you mentioned very practical things of fear,
01:14:11 but there’s a lot of other mess around it.
01:14:13 But they are kind of the results of, you know, drives.
01:14:16 Yeah, there’s deeper biological stuff going on.
01:14:19 And I’ve talked to a few folks on this.
01:14:21 There’s fascinating stuff
01:14:23 that ultimately connects to our brain.
01:14:27 If we create an AGI system, sorry.
01:14:30 Human level intelligence.
01:14:31 Human level intelligence system.
01:14:34 And you get to ask her one question.
01:14:37 What would that question be?
01:14:39 You know, I think the first one we’ll create
01:14:42 would probably not be that smart.
01:14:45 They’d be like a four year old.
01:14:47 Okay.
01:14:47 So you would have to ask her a question
01:14:50 to know she’s not that smart.
01:14:52 Yeah.
01:14:54 Well, what’s a good question to ask, you know,
01:14:56 to be impressed.
01:14:57 What is the cause of wind?
01:15:01 And if she answers,
01:15:02 oh, it’s because the leaves of the tree are moving
01:15:04 and that creates wind.
01:15:06 She’s onto something.
01:15:08 And if she says that’s a stupid question,
01:15:11 she’s really onto something.
01:15:12 No, and then you tell her,
01:15:14 actually, you know, here is the real thing.
01:15:18 She says, oh yeah, that makes sense.
01:15:20 So questions that reveal the ability
01:15:24 to do common sense reasoning about the physical world.
01:15:26 Yeah.
01:15:27 And you’ll sum it up with causal inference.
01:15:30 Causal inference.
01:15:31 Well, it was a huge honor.
01:15:33 Congratulations on your Turing Award.
01:15:35 Thank you so much for talking today.
01:15:37 Thank you.
01:15:38 Thank you for having me.