Transcript
00:00:00 The following is a conversation with Jitendra Malik, a professor at Berkeley and one of
00:00:05 the seminal figures in the field of computer vision, the kind before the deep learning
00:00:10 revolution and the kind after.
00:00:13 He has been cited over 180,000 times and has mentored many world class researchers in computer
00:00:21 science.
00:00:22 Quick summary of the ads.
00:00:24 Two sponsors, one new one which is BetterHelp and an old goodie ExpressVPN.
00:00:31 Please consider supporting this podcast by going to betterhelp.com slash lex and signing
00:00:37 up at expressvpn.com slash lexpod.
00:00:40 Click the links, buy the stuff, it really is the best way to support this podcast and
00:00:45 the journey I’m on.
00:00:47 If you enjoy this thing, subscribe on YouTube, review it with 5 stars on Apple Podcast, support
00:00:52 it on Patreon, or connect with me on Twitter at Lex Fridman, however the heck you spell
00:00:57 that.
00:00:58 As usual, I’ll do a few minutes of ads now and never any ads in the middle that can break
00:01:02 the flow of the conversation.
00:01:05 This show is sponsored by BetterHelp, spelled H-E-L-P, help.
00:01:11 Check it out at betterhelp.com slash lex.
00:01:15 They figure out what you need and match you with a licensed professional therapist in
00:01:19 under 48 hours.
00:01:21 It’s not a crisis line, it’s not self help, it’s professional counseling done securely
00:01:26 online.
00:01:27 I’m a bit from the David Goggins line of creatures, as you may know, and so have some
00:01:33 demons to contend with, usually on long runs or all-night work sessions, and possibly
00:01:40 full of self-doubt.
00:01:42 It may be because I’m Russian, but I think suffering is essential for creation.
00:01:47 But I also think you can suffer beautifully, in a way that doesn’t destroy you.
00:01:52 For most people, I think a good therapist can help in this, so it’s at least worth a
00:01:56 try.
00:01:57 Check out their reviews, they’re good, it’s easy, private, affordable, available worldwide.
00:02:03 You can communicate by text, any time, and schedule weekly audio and video sessions.
00:02:09 I highly recommend that you check them out at betterhelp.com slash lex.
00:02:15 This show is also sponsored by ExpressVPN.
00:02:19 Get it at expressvpn.com slash lexpod to support this podcast and to get an extra three months
00:02:26 free on a one year package.
00:02:28 I’ve been using ExpressVPN for many years, I love it.
00:02:32 I think ExpressVPN is the best VPN out there.
00:02:36 They told me to say it, but it happens to be true.
00:02:39 It doesn’t log your data, it’s crazy fast, and is easy to use, literally just one big,
00:02:45 sexy power on button.
00:02:47 Again, for obvious reasons, it’s really important that they don’t log your data.
00:02:51 It works on Linux and everywhere else too, but really, why use anything else?
00:02:57 Shout out to my favorite flavor of Linux, Ubuntu MATE 20.04.
00:03:02 Once again, get it at expressvpn.com slash lexpod to support this podcast and to get
00:03:09 an extra three months free on a one year package.
00:03:13 And now, here’s my conversation with Jitendra Malik.
00:03:18 In 1966, Seymour Papert at MIT wrote up a proposal called the Summer Vision Project
00:03:25 to be given, as far as we know, to 10 students to work on and solve that summer.
00:03:31 So that proposal outlined many of the computer vision tasks we still work on today.
00:03:37 Why do you think we underestimate, and perhaps we did underestimate and perhaps still underestimate
00:03:43 how hard computer vision is?
00:03:46 Because most of what we do in vision, we do unconsciously or subconsciously.
00:03:51 In human vision.
00:03:52 In human vision.
00:03:53 So that effortlessness gives us the sense that, oh, this must be
00:03:58 very easy to implement on a computer.
00:04:02 Now, this is why the early researchers in AI got it so wrong.
00:04:09 However, if you go into neuroscience or psychology of human vision, then the complexity becomes
00:04:17 very clear.
00:04:19 The fact is that a very large part of the cerebral cortex is devoted to visual processing.
00:04:26 And this is true in other primates as well.
00:04:29 So once we looked at it from a neuroscience or psychology perspective, it becomes quite
00:04:35 clear that the problem is very challenging and it will take some time.
00:04:39 You said the higher level parts are the harder parts?
00:04:43 I think vision appears to be easy because most of visual processing is subconscious
00:04:52 or unconscious.
00:04:55 So we underestimate the difficulty, whereas when you are like proving a mathematical theorem
00:05:03 or playing chess, the difficulty is much more evident.
00:05:08 It is your conscious brain which is processing various aspects of the problem
00:05:15 solving behavior, whereas in vision, all this is happening, but it’s not in your awareness;
00:05:21 it’s operating below that level.
00:05:25 But it still seems strange.
00:05:27 Yes, that’s true, but it seems strange that as computer vision researchers, for example,
00:05:35 the community broadly, time and time again, makes the mistake of thinking the problem
00:05:41 is easier than it is, or maybe it’s not a mistake.
00:05:43 We’ll talk a little bit about autonomous driving, for example, how hard of a vision task that
00:05:48 is. Do you think, I mean, is it just human nature or is there something fundamental
00:05:56 to the vision problem that we underestimate?
00:06:01 We’re still not able to be cognizant of how hard the problem is.
00:06:05 Yeah, I think in the early days it could have been excused because in the early days, all
00:06:11 aspects of AI were regarded as too easy.
00:06:15 But I think today it is much less excusable.
00:06:19 And I think why people fall for this is because of what I call the fallacy of the successful
00:06:27 first step.
00:06:30 There are many problems in vision where you can get 50% of the solution in one minute,
00:06:37 getting to 90% can take you a day, getting to 99% may take you five years, and 99.99%
00:06:47 may not come in your lifetime.
00:06:49 I wonder if that’s unique to vision.
00:06:52 It seems that with language, people are not so confident; in natural language processing,
00:06:58 people are a little bit more cautious about our ability to solve that problem.
00:07:04 I think for language, people intuit that we have to be able to do natural language understanding.
00:07:10 For vision, it seems that we’re not cognizant or we don’t think about how much understanding
00:07:18 is required.
00:07:19 It’s probably still an open problem.
00:07:21 But in your sense, how much understanding is required to solve vision?
00:07:27 Or, to put it another way, how much of something called common sense reasoning is required
00:07:34 to really be able to interpret even static scenes?
00:07:39 Yeah.
00:07:40 So vision operates at all levels and there are parts which can be solved with what we
00:07:47 could call maybe peripheral processing.
00:07:50 So in the human vision literature, there used to be these terms, sensation, perception and
00:07:57 cognition, which roughly speaking referred to like the front end of processing, middle
00:08:04 stages of processing and higher level of processing.
00:08:08 And I think they made a big deal out of this and they wanted to study only perception
00:08:13 and then dismiss certain problems as being, quote, cognitive.
00:08:19 But really I think these are artificial divides.
00:08:23 The problem is continuous at all levels and there are challenges at all levels.
00:08:28 The techniques that we have today, they work better at the lower and mid levels of the
00:08:34 problem.
00:08:35 I think the higher levels of the problem, quote the cognitive levels of the problem
00:08:39 are there and we, in many real applications, we have to confront them.
00:08:46 Now how much that is necessary will depend on the application.
00:08:51 For some problems it doesn’t matter, for some problems it matters a lot.
00:08:55 So I am, for example, a pessimist on fully autonomous driving in the near future.
00:09:04 And the reason is because I think there will be that 0.01% of the cases where quite sophisticated
00:09:13 cognitive reasoning is called for.
00:09:16 However, there are tasks which are much more robust,
00:09:23 in the sense that error is not so much of a problem.
00:09:28 For example, let’s say you’re doing image search, you’re trying to get images
00:09:34 based on some description, some visual description.
00:09:41 We are very tolerant of errors there, right?
00:09:43 I mean, when Google image search gives you some images back and a few of them are wrong,
00:09:49 it’s okay.
00:09:50 It doesn’t hurt anybody.
00:09:51 It’s not a matter of life and death.
00:09:54 But making mistakes when you are driving at 60 miles per hour and you could potentially
00:10:02 kill somebody is much more important.
00:10:06 So just for the fun of it, since you mentioned it, let’s go there briefly about
00:10:11 autonomous vehicles.
00:10:12 So one of the companies in the space, Tesla, with Andrej Karpathy and Elon Musk, is working
00:10:19 on a system called Autopilot, which is primarily a vision based system with eight cameras and
00:10:26 basically a single neural network, a multitask neural network.
00:10:30 They call it HydraNet: multiple heads, so it does multiple tasks, but is forming the
00:10:35 same representation at the core.
00:10:38 Do you think driving can be converted in this way to purely a vision problem and then solved
00:10:47 with learning or even more specifically in the current approach, what do you think about
00:10:53 what the Tesla Autopilot team is doing?
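The multiple-heads idea described here is easy to sketch: one shared backbone computes a single representation, and several task-specific heads read from it. Below is a minimal PyTorch sketch; the layer sizes and the two task heads are illustrative assumptions, not Tesla's actual architecture.

```python
import torch
import torch.nn as nn

class MultiHeadNet(nn.Module):
    """Toy multi-task network: many heads, one shared representation."""
    def __init__(self):
        super().__init__()
        # Shared backbone: computed once per image.
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        # Hypothetical task heads; a real system would have many more.
        self.lane_head = nn.Linear(32, 2)     # e.g., lane present / absent
        self.object_head = nn.Linear(32, 10)  # e.g., 10-way object classification

    def forward(self, x):
        features = self.backbone(x)           # one forward pass through the trunk
        return self.lane_head(features), self.object_head(features)

net = MultiHeadNet()
lanes, objects = net(torch.rand(1, 3, 64, 64))  # two task outputs, shared features
```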
00:10:57 So the way I think about it is that there are certainly subsets of the visual based
00:11:02 driving problem, which are quite solvable.
00:11:05 So for example, driving in freeway conditions is quite a solvable problem.
00:11:11 I think there were demonstrations of that going back to the 1980s by someone called
00:11:18 Ernst Dickmanns in Munich.
00:11:22 In the 90s, there were approaches from Carnegie Mellon, there were approaches from our team
00:11:27 at Berkeley.
00:11:28 In the 2000s, there were approaches from Stanford and so on.
00:11:33 So autonomous driving in certain settings is very doable.
00:11:38 The challenge is to have an autopilot work under all kinds of driving conditions.
00:11:45 At that point, it’s not just a question of vision or perception, but really also of control
00:11:51 and dealing with all the edge cases.
00:11:54 So where do you think are most of the difficult cases? To me, even highway driving is
00:11:59 an open problem because it applies the same 50, 90, 95, 99 rule where the first step,
00:12:08 the fallacy of the first step, I forget how you put it, we fall victim to.
00:12:12 I think even highway driving has a lot of elements because to solve autonomous driving,
00:12:17 you have to completely relinquish the help of a human being.
00:12:22 The system is always in control, so it’s really going to feel the edge cases.
00:12:26 So I think even highway driving is really difficult.
00:12:29 But in terms of the general driving task, do you think vision is the fundamental problem
00:12:35 or is it also your action, the interaction with the environment, the ability to…
00:12:44 And then the middle ground, I don’t know if you put that under vision, which is trying
00:12:48 to predict the behavior of others, which is a little bit in the world of understanding
00:12:54 the scene, but it’s also trying to form a model of the actors in the scene and predict
00:13:00 their behavior.
00:13:01 Yeah.
00:13:02 I include that in vision because to me, perception blends into cognition, and building predictive
00:13:08 models of other agents in the world, which could be people, could
00:13:13 be other cars,
00:13:15 is part of the task of perception, because perception has to tell us not
00:13:22 just what is now, but what will happen, because what’s now is boring.
00:13:26 It’s done.
00:13:27 It’s over with.
00:13:28 Okay?
00:13:29 Yeah.
00:13:30 We care about the future because we act in the future.
00:13:33 And we care about the past in as much as it informs what’s going to happen in the future.
00:13:39 So I think we have to build predictive models of behaviors of people and those can get quite
00:13:45 complicated.
00:13:48 So I mean, I’ve seen examples of this; actually, I own a Tesla and it has various safety
00:13:59 features built in.
00:14:01 And what I see are these examples where, let’s say, there is a skateboarder, I mean,
00:14:09 and I don’t want to be too critical because obviously these systems are always being improved
00:14:16 and any specific criticism I have, maybe the system six months from now will not have that
00:14:23 particular failure mode.
00:14:25 So it had the wrong response and it’s because it couldn’t predict what this skateboarder
00:14:36 was going to do.
00:14:38 Okay?
00:14:39 And because it really required that higher level cognitive understanding of what skateboarders
00:14:45 typically do as opposed to a normal pedestrian.
00:14:48 So what might have been the typical behavior for a pedestrian
00:14:53 was not the typical behavior for a skateboarder, right?
00:14:59 Yeah.
00:15:00 And so therefore to do a good job there, you need to have enough data where you have pedestrians,
00:15:07 you also have skateboarders, you’ve seen enough skateboarders to see what kinds of patterns
00:15:14 of behavior they have.
00:15:16 So it is in principle with enough data, that problem could be solved.
00:15:21 But I think our current systems, computer vision systems, they need far, far more data
00:15:29 than humans do for learning those same capabilities.
00:15:33 So say that there is going to be a system that solves autonomous driving.
00:15:38 Do you think it will look similar to what we have today, but have a lot more data, perhaps
00:15:43 more compute, but the fundamental architecture is the same, like, well, in the case
00:15:48 of Tesla Autopilot, neural networks?
00:15:52 Do you think it will look similar in that regard and we’ll just have more data?
00:15:57 That’s a scientific hypothesis as to which way is it going to go.
00:16:01 I will tell you what I would bet on.
00:16:05 And this is my general philosophical position on how these learning systems have to evolve.
00:16:14 What we have found currently very effective in computer vision in the deep learning paradigm
00:16:20 is sort of tabula rasa learning and tabula rasa learning in a supervised way with lots
00:16:27 and lots of…
00:16:28 What’s tabula rasa learning?
00:16:29 Tabula rasa in the sense that blank slate, we just have the system, which is given a
00:16:35 series of experiences in this setting and then it learns there.
00:16:39 Now if let’s think about human driving, it is not tabula rasa learning.
00:16:44 So at the age of 16 in high school, a teenager goes into driver ed class, right?
00:16:55 And now at that point they learn, but at the age of 16, they are already visual geniuses
00:17:02 because from zero to 16, they have built a certain repertoire of vision.
00:17:07 In fact, most of it has probably been achieved by age two, right?
00:17:13 In this period of age up to age two, they know that the world is three dimensional.
00:17:18 They know what objects look like from different perspectives.
00:17:22 They know about occlusion.
00:17:24 They know about common dynamics of humans and other bodies.
00:17:29 They have some notion of intuitive physics.
00:17:32 So they built that up from their observations and interactions in early childhood and of
00:17:38 course reinforced through their growing up to age 16.
00:17:44 So then at age 16, when they go into driver ed, what are they learning?
00:17:49 They’re not learning afresh the visual world.
00:17:52 They have a mastery of the visual world.
00:17:54 What they are learning is control, okay?
00:17:58 They’re learning how to be smooth about control, about steering and brakes and so forth.
00:18:04 They’re learning a sense of typical traffic situations.
00:18:08 Now that education process can be quite short because they are coming in as visual geniuses.
00:18:17 And of course in their future, they’re going to encounter situations which are very novel,
00:18:23 right?
00:18:24 So during my driver ed class, I may not have had to deal with a skateboarder.
00:18:29 I may not have had to deal with a truck driving in front of me where the back opens up and
00:18:37 some junk gets dropped from the truck and I have to deal with it, right?
00:18:42 But I can deal with this as a driver even though I did not encounter this in my driver
00:18:47 ed class.
00:18:48 And the reason I can deal with it is because I have all this general visual knowledge and
00:18:52 expertise.
00:18:55 And do you think the learning mechanisms we have today can do that kind of long term accumulation
00:19:02 of knowledge?
00:19:03 Or do we have to do some kind of, you know, the work that led up to expert systems with
00:19:11 knowledge representation, you know, the broader field of artificial intelligence worked on
00:19:17 this kind of accumulation of knowledge.
00:19:20 Do you think neural networks can do the same?
00:19:22 I think I don’t see any in principle problem with neural networks doing it, but I think
00:19:29 the learning techniques would need to evolve significantly.
00:19:33 So the current learning techniques that we have are supervised learning.
00:19:41 You’re given lots of examples, (x, y) pairs, and you learn the functional mapping between
00:19:47 them.
00:19:48 I think that human learning is far richer than that.
00:19:52 It includes many different components.
00:19:54 A child explores the world: for example, a child takes an object and manipulates
00:20:05 it in his hand and therefore gets to see the object from different points of view.
00:20:12 And the child has commanded the movement.
00:20:14 So that’s a kind of learning data, but the learning data has been arranged by the child.
00:20:21 And this is a very rich kind of data.
00:20:23 The child can do various experiments with the world.
00:20:30 So there are many aspects of sort of human learning, and these have been studied in child
00:20:36 development by psychologists.
00:20:39 And what they tell us is that supervised learning is a very small part of it.
00:20:45 There are many different aspects of learning.
00:20:48 And what we would need to do is to develop models of all of these and then train our
00:20:57 systems with that kind of a protocol.
00:21:02 So new methods of learning, some of which might imitate the human brain, but you also
00:21:07 in your talks have mentioned sort of the compute side of things, in terms of the difference
00:21:13 in the human brain or referencing Moravec, Hans Moravec.
00:21:19 So do you think there’s something interesting, valuable to consider about the difference
00:21:25 in the computational power of the human brain versus the computers of today in terms of
00:21:32 instructions per second?
00:21:34 Yes, so if we go back, so this is a point I’ve been making for 20 years now.
00:21:41 And I think once upon a time, the way I used to argue this was that we just didn’t have
00:21:47 the computing power of the human brain.
00:21:49 Our computers were not quite there.
00:21:53 And I mean, there is a well known trade off: neurons are slow compared
00:22:03 to transistors, but we have a lot of them and they have very high connectivity.
00:22:09 Whereas in silicon, you have much faster devices, transistors switch at the order of nanoseconds,
00:22:18 but the connectivity is usually smaller.
00:22:21 At this point in time, I mean, we are now talking about 2020, we do have, if you consider
00:22:27 the latest GPUs and so on, amazing computing power.
00:22:31 And if we look back at Hans Moravec type of calculations, which he did in the 1990s, we
00:22:39 may be there today in terms of computing power comparable to the brain, but it’s not
00:22:44 of the same style; it’s of a very different style.
00:22:49 So I mean, for example, the style of computing that we have in our GPUs is far, far more
00:22:55 power hungry than the style of computing that is there in the human brain or other biological
00:23:02 entities.
00:23:03 Yeah.
00:23:04 And the efficiency part, we’re going to have to solve that in order to build actual
00:23:11 real world systems at large scale.
00:23:15 Let me ask sort of the high level question, taking a step back.
00:23:19 How would you articulate the general problem of computer vision?
00:23:24 Does such a thing exist?
00:23:25 So if you look at the computer vision conferences and the work that’s been going on, it’s often
00:23:30 separated into different little segments, breaking the problem of vision apart into
00:23:36 whether it’s segmentation, 3D reconstruction, object detection, I don’t know, image captioning,
00:23:44 whatever.
00:23:45 There’s benchmarks for each.
00:23:46 But if you were to sort of philosophically say, what is the big problem of computer vision?
00:23:52 Does such a thing exist?
00:23:54 Yes, but it’s not in isolation.
00:23:57 So for all intelligence tasks, I always go back to sort of biology or humans.
00:24:09 And if we think about vision or perception in that setting, we realize that perception
00:24:15 is always to guide action.
00:24:18 Perception for a biological system does not give any benefits unless it is coupled with action.
00:24:25 So we can go back and think about the first multicellular animals, which arose in the
00:24:30 Cambrian era, you know, 500 million years ago.
00:24:35 And these animals could move and they could see in some way.
00:24:40 And the two activities helped each other.
00:24:43 Because how does movement help?
00:24:47 Movement helps in that you can get food in different places.
00:24:52 But you need to know where to go.
00:24:54 And that’s really about perception or seeing; I mean, vision is perhaps the single most
00:25:00 important perceptual sense,
00:25:02 but all the others are also important.
00:25:06 So perception and action kind of go together.
00:25:10 So earlier, it was in these very simple feedback loops, which were about finding food or
00:25:17 avoiding becoming food if there’s a predator trying to, you know, eat you up,
00:25:24 and so forth.
00:25:25 So we must, at the fundamental level, connect perception to action.
00:25:30 Then as we evolved, perception became more and more sophisticated because it served many
00:25:37 more purposes.
00:25:39 And so today we have what seems like a fairly general purpose capability, which can look
00:25:46 at the external world and build a model of the external world inside the head.
00:25:53 We do have that capability.
00:25:55 That model is not perfect.
00:25:56 And psychologists have great fun in pointing out the ways in which the model in your head
00:26:01 is not a perfect model of the external world.
00:26:05 They create various illusions to show the ways in which it is imperfect.
00:26:11 But it’s amazing how far it has come from the very simple perception action loop that
00:26:17 existed in, you know, an animal 500 million years ago.
00:26:23 Once we have these very sophisticated visual systems, we can then impose a structure
00:26:29 on them.
00:26:30 It’s we as scientists who are imposing that structure, where we have chosen to characterize
00:26:36 this part of the system as this, quote, module of object detection, or, quote, this module
00:26:43 of 3D reconstruction.
00:26:45 What’s going on is really all of these processes are running simultaneously and they are running
00:26:55 simultaneously because originally their purpose was in fact to help guide action.
00:27:01 So as a guiding general statement of a problem, do you think we can say that the general problem
00:27:08 of computer vision, you said in humans, it was tied to action.
00:27:14 Do you think we should also say that ultimately the goal, the problem of computer vision is
00:27:20 to sense the world in a way that helps you act in the world?
00:27:27 Yes.
00:27:28 I think that’s the most fundamental, that’s the most fundamental purpose.
00:27:32 We have by now hyper evolved.
00:27:37 So we have this visual system which can be used for other things.
00:27:42 For example, judging the aesthetic value of a painting.
00:27:46 And this is not guiding action.
00:27:49 Maybe it’s guiding action in terms of how much money you will put in your auction bid,
00:27:54 but that’s a bit stretched.
00:27:56 But the basics are in fact in terms of action; we have really
00:28:06 hyper evolved our visual system.
00:28:08 Actually just to, sorry to interrupt, but perhaps it is fundamentally about action.
00:28:13 You kind of jokingly said about spending, but perhaps the capitalistic drive that drives
00:28:20 a lot of the development in this world is about the exchange of money and the fundamental
00:28:25 action is money.
00:28:26 If you watch Netflix, if you enjoy watching movies, you’re using your perception system
00:28:30 to interpret the movie, ultimately your enjoyment of that movie means you’ll subscribe to Netflix.
00:28:36 So the action is this extra layer that we’ve developed in modern society perhaps is fundamentally
00:28:44 tied to the action of spending money.
00:28:47 Well certainly with respect to interactions with firms.
00:28:54 So in this homo economicus role, when you’re interacting with firms, it does become that.
00:29:01 What else is there?
00:29:02 And that was a rhetorical question.
00:29:07 So to linger on the division between the static and the dynamic, so much of the work in computer
00:29:16 vision, so many of the breakthroughs that you’ve been a part of have been in the static
00:29:20 world and looking at static images.
00:29:24 And then you’ve also worked on, and to a much smaller degree the community
00:29:29 is starting to look at, the dynamic: video, dynamic scenes.
00:29:32 And then there is robotic vision, which is dynamic, but also where you actually have
00:29:38 a robot in the physical world interacting based on that vision.
00:29:43 Which problem is harder?
00:29:49 The trivial first answer is, well, of course one image is harder.
00:29:53 But if you look at a deeper question there, are we, what’s the term, cutting ourselves
00:30:03 off at the knees, or like making the problem harder by focusing on images?
00:30:08 That’s a fair question.
00:30:09 I think sometimes we can simplify a problem so much that we essentially lose part of the
00:30:20 juice that could enable us to solve the problem.
00:30:24 And one could reasonably argue that to some extent this happens when we go from video
00:30:29 to single images.
00:30:31 Now historically you have to consider the limits imposed by the computational capabilities
00:30:39 we had.
00:30:41 So many of the choices made in the computer vision community through the 70s, 80s, 90s
00:30:50 can be understood as choices which were forced upon us by the fact that we just didn’t have
00:30:59 have access to enough compute.
00:31:01 Not enough memory, not enough hardware.
00:31:04 Exactly.
00:31:05 Not enough compute, not enough storage.
00:31:08 So think of these choices.
00:31:09 So one of the choices is focusing on single images rather than video.
00:31:14 Okay.
00:31:15 The answer is clear:
00:31:16 storage and compute.
00:31:19 We used to detect edges and throw away the image.
00:31:24 Right?
00:31:25 So we would have an image which is, say, 256 by 256 pixels, and instead of keeping around
00:31:31 the grayscale values, what we did was detect edges, find the places where the brightness
00:31:37 changes a lot, and then throw away the rest.
00:31:42 So this was a major compression device, and the hope was that you could
00:31:47 still work with it; the logic was that humans can interpret a line drawing,
00:31:53 and this would save us computation.
00:31:58 So many of the choices were dictated by that.
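As a concrete toy version of that edge-based compression, here is a crude Sobel gradient detector: keeping a binary edge map in place of 8-bit grayscale is the compression, and the 20% threshold is an arbitrary assumption.

```python
import numpy as np

def sobel_edges(img, threshold=0.2):
    """Keep only pixels where brightness changes a lot; discard the rest."""
    kx = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=float)  # horizontal gradient
    ky = kx.T                                                          # vertical gradient
    h, w = img.shape
    gx, gy = np.zeros((h, w)), np.zeros((h, w))
    for i in range(1, h - 1):
        for j in range(1, w - 1):
            patch = img[i - 1:i + 2, j - 1:j + 2]
            gx[i, j] = np.sum(patch * kx)
            gy[i, j] = np.sum(patch * ky)
    mag = np.hypot(gx, gy)                    # gradient magnitude
    return mag > threshold * mag.max()        # 1 bit per pixel instead of 8

img = np.random.rand(256, 256)               # stand-in for a 256 by 256 grayscale image
edges = sobel_edges(img)
print(f"kept {edges.mean():.1%} of pixels as edges")
```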
00:32:00 I think today we are no longer detecting edges, right?
00:32:07 We process images with ConvNets because we don’t need to.
00:32:10 We don’t have those compute restrictions anymore.
00:32:14 Now video is still understudied because video compute is still quite challenging if you
00:32:19 are a university researcher.
00:32:22 I think video computing is not so challenging if you are at Google or Facebook or Amazon.
00:32:29 Still super challenging.
00:32:30 I just spoke with the VP of engineering at Google, head of the YouTube search and discovery
00:32:35 and they still struggle doing stuff on video.
00:32:38 It’s very difficult except using techniques that are essentially the techniques you used
00:32:44 in the 90s.
00:32:45 Some very basic computer vision techniques.
00:32:48 No, that’s when you want to do things at scale.
00:32:51 So if you want to operate at the scale of all the content of YouTube, it’s very challenging
00:32:56 and there are similar issues with Facebook.
00:32:59 But as a researcher, you have more opportunities.
00:33:05 You can train large networks with relatively large video data sets.
00:33:11 So I think that this is part of the reason why we have so emphasized static images.
00:33:17 I think that this is changing and over the next few years, I see a lot more progress
00:33:22 happening in video.
00:33:25 So I have this generic statement that to me, video recognition feels like 10 years behind
00:33:32 object recognition and you can quantify that because you can take some of the challenging
00:33:37 video data sets and their performance on action classification is like say 30%, which is kind
00:33:45 of what we used to have around 2009 in object detection.
00:33:51 It’s like about 10 years behind and whether it’ll take 10 years to catch up is a different
00:33:58 question.
00:33:59 Hopefully, it will take less than that.
00:34:01 Let me ask a similar question I’ve already asked, but once again, so for dynamic scenes,
00:34:08 do you think some kind of injection of knowledge bases and reasoning is required to help improve
00:34:17 like action recognition?
00:34:20 Like if we solved the general action recognition problem, what do you think the solution would
00:34:28 look like? That’s another way to put it.
00:34:31 So I completely agree that knowledge is called for and that knowledge can be quite sophisticated.
00:34:39 So the way I would say it is that perception blends into cognition and cognition brings
00:34:44 in issues of memory and this notion of a schema from psychology, which is, let me use the
00:34:54 classic example, which is you go to a restaurant, right?
00:34:58 Now there are things that happen in a certain order, you walk in, somebody takes you to
00:35:03 a table, waiter comes, gives you a menu, takes the order, food arrives, eventually bill arrives,
00:35:13 et cetera, et cetera.
00:35:15 This is a classic example of AI from the 1970s.
00:35:19 There were the terms frames and scripts and schemas; these are all quite similar
00:35:26 ideas.
00:35:27 Okay, and in the 70s, the way the AI of the time dealt with it was by hand coding this.
00:35:34 So they hand coded in this notion of a script and the various stages and the actors and
00:35:40 so on and so forth, and use that to interpret, for example, language.
00:35:45 I mean, if there’s a description of a story involving some people eating at a restaurant,
00:35:52 there are all these inferences you can make because you know what happens typically at
00:35:58 a restaurant.
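A toy version of what such a hand-coded script might look like, in the spirit of 1970s frames and scripts; the role and event names are invented for the example.

```python
# Hand-coded restaurant script: an ordered sequence of expected events.
restaurant_script = {
    "roles": ["customer", "host", "waiter"],
    "scenes": [
        ("enter", "customer walks in"),
        ("seat", "host takes customer to a table"),
        ("order", "waiter brings menu, takes the order"),
        ("eat", "food arrives, customer eats"),
        ("pay", "bill arrives, customer pays"),
        ("leave", "customer exits"),
    ],
}

def infer_between(script, before, after):
    """Infer events that must have happened between two observed ones."""
    names = [name for name, _ in script["scenes"]]
    return names[names.index(before) + 1 : names.index(after)]

# If a story mentions ordering and then paying, the eating is inferred.
print(infer_between(restaurant_script, "order", "pay"))  # ['eat']
```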
00:36:00 So I think this kind of knowledge is absolutely essential.
00:36:06 So I think that when we are going to do long form video understanding, we are going to
00:36:12 need to do this.
00:36:13 I think the kinds of technology that we have right now with 3D convolutions over a couple
00:36:19 of seconds of clip or video, it’s very much tailored towards short term video understanding,
00:36:26 not long term understanding.
00:36:28 Long term understanding requires this notion of schemas that I talked about, perhaps some
00:36:35 notions of goals, intentionality, functionality, and so on and so forth.
00:36:43 Now, how will we bring that in?
00:36:46 So we could either revert back to the 70s and say, OK, I’m going to hand code in a script
00:36:51 or we might try to learn it.
00:36:56 So I tend to believe that we have to find learning ways of doing this because I think
00:37:03 learning ways land up being more robust.
00:37:06 And there must be a learning version of the story because children acquire a lot of this
00:37:12 knowledge by sort of just observation.
00:37:16 It’s possible, but I think it’s not so typical, that at some moment in a child’s life
00:37:24 a mother coaches the child through all the stages of what happens in
00:37:29 a restaurant.
00:37:30 They just go as a family, they go to the restaurant, they eat, come back, and the child goes through
00:37:36 ten such experiences and the child has got a schema of what happens when you go to a
00:37:41 restaurant.
00:37:42 So we somehow need to provide that capability to our systems.
00:37:48 You mentioned the following line from the end of the Alan Turing paper, Computing Machinery
00:37:53 and Intelligence, which, like you said, many people know and very few have read,
00:37:59 where he proposes the Turing test.
00:38:03 This is how you know because it’s towards the end of the paper.
00:38:06 Instead of trying to produce a program to simulate the adult mind, why not rather try
00:38:10 to produce one which simulates the child’s?
00:38:14 So that’s a really interesting point.
00:38:17 If I think about the benchmarks we have before us, the tests of our computer vision systems,
00:38:24 they’re often kind of trying to get to the adult.
00:38:28 So what kind of benchmarks should we have?
00:38:31 What kind of tests for computer vision do you think we should have that mimic the learning
00:38:37 of a child?
00:38:38 I think we should have those and we don’t have those today.
00:38:42 And I think the part of the challenge is that we should really be collecting data of the
00:38:50 type that the child experiences.
00:38:55 So that gets into issues of privacy and so on and so forth.
00:38:59 But there are attempts in this direction to sort of try to collect the kind of data that
00:39:05 a child encounters growing up.
00:39:08 So what’s the child’s linguistic environment?
00:39:11 What’s the child’s visual environment?
00:39:13 So if we could collect that kind of data and then develop learning schemes based on that
00:39:20 data, that would be one way to do it.
00:39:25 I think that’s a very promising direction myself.
00:39:28 There might be people who would argue that we could just short circuit this in some way
00:39:33 and sometimes we have had success by not imitating nature in detail.
00:39:44 So the usual example is airplanes, right?
00:39:47 We don’t build flapping wings.
00:39:51 So yes, that’s one of the points of debate.
00:39:57 In my mind, I would bet on this learning like a child approach.
00:40:05 So one of the fundamental aspects of learning like a child is the interactivity.
00:40:11 So the child gets to play with the data set it’s learning from.
00:40:14 Yes.
00:40:15 So it gets to select.
00:40:16 I mean, you can call that active learning.
00:40:19 In the machine learning world, you can call it a lot of terms.
00:40:23 What are your thoughts about this whole space of being able to play with the data set or
00:40:27 select what you’re learning?
00:40:29 Yeah.
00:40:30 So I think that I believe in that and I think that we could achieve it in two ways and I
00:40:38 think we should use both.
00:40:40 So one is actually real robotics, right?
00:40:45 So real physical embodiments of agents who are interacting with the world and they have
00:40:52 a physical body with dynamics and mass and moment of inertia and friction and all the
00:40:59 rest and you learn your body, the robot learns its body by doing a series of actions.
00:41:08 The second is that simulation environments.
00:41:11 So I think simulation environments are getting much, much better.
00:41:17 In my role at Facebook AI Research, our group has worked on something called Habitat, which
00:41:24 is a simulation environment, a visually photorealistic environment of places like
00:41:34 houses or interiors of various urban spaces and so forth.
00:41:39 And as you move, you get a picture, which is a pretty accurate picture.
00:41:45 So now you can imagine that subsequent generations of these simulators will be accurate, not
00:41:53 just visually, but with respect to forces and masses and haptic interactions and so
00:42:01 on.
00:42:03 And then we have that environment to play with.
00:42:07 I think, let me state one reason why I think being able to act in the world is important.
00:42:16 I think that this is one way to break the correlation versus causation barrier.
00:42:23 So this is something which is of a great deal of interest these days.
00:42:27 I mean, people like Judea Pearl have talked a lot about that we are neglecting causality
00:42:34 and he describes the entire set of successes of deep learning as just curve fitting, right?
00:42:42 But I don’t quite agree with that.
00:42:45 He’s a troublemaker.
00:42:46 He is.
00:42:47 But causality is important, but causality is not like a single silver bullet.
00:42:54 It’s not like one single principle.
00:42:56 There are many different aspects here.
00:42:58 And one of the ways in which, one of our most reliable ways of establishing causal links
00:43:05 and this is the way, for example, the medical community does it: randomized controlled
00:43:11 trials.
00:43:12 So you pick some situations, and in some of them you perform an action and
00:43:18 for certain others you don’t, right?
00:43:22 So you have a controlled experiment.
00:43:23 Well, the child is in fact performing controlled experiments all the time, right?
00:43:28 Right.
00:43:29 Okay.
00:43:30 Small scale.
00:43:31 In a small scale.
00:43:32 But that is a way that the child gets to build and refine its causal models of the world.
00:43:41 And my colleague Alison Gopnik, together with a couple of coauthors, has this
00:43:47 book called The Scientist in the Crib, referring to children.
00:43:50 The part that I like about that is that the scientist wants to build
00:43:57 causal models, and the scientist does controlled experiments.
00:44:01 And I think the child is doing that.
00:44:03 So to enable that, we will need to have these active experiments.
00:44:10 And I think this could be done, some in the real world and some in simulation.
00:44:14 So you have hope for simulation.
00:44:16 I have hope for simulation.
00:44:18 That’s an exciting possibility, if we can get to not just photorealistic but, what’s that
00:44:22 called, life-realistic simulation.
00:44:27 So you don’t see any fundamental blocks to why we can’t eventually simulate the principles
00:44:35 of what it means to exist in the world as a physical being.
00:44:39 I don’t see any fundamental problems that, I mean, and look, the computer graphics community
00:44:43 has come a long way.
00:44:45 So in the early days, going back to the eighties and nineties, they were focusing
00:44:50 on visual realism, right?
00:44:52 And then they could do the easy stuff, but they couldn’t do stuff like hair or fur and
00:44:58 so on.
00:44:59 Okay, well, they managed to do that.
00:45:01 Then they couldn’t do physical actions, right?
00:45:04 Like there’s a glass bowl and it falls down and it shatters, but then they could
00:45:09 start to do pretty realistic models of that and so on and so forth.
00:45:13 So the graphics people have shown that they can do this forward direction, not just for
00:45:19 optical interactions, but also for physical interactions.
00:45:23 So I think, of course, some of that is very compute intensive, but I think by and by we
00:45:30 will find ways of making our models ever more realistic.
00:45:35 You break vision apart into, in one of your presentations, early vision, static scene
00:45:40 understanding, dynamic scene understanding, and raise a few interesting questions.
00:45:44 I thought I could just throw some at you to see if you want to talk about them.
00:45:50 So early vision, so it’s, what is it that you said, sensation, perception and cognition.
00:45:58 So is this a sensation?
00:46:00 Yes.
00:46:01 What can we learn from image statistics that we don’t already know?
00:46:05 So at the lowest level, what can we make from just the statistics, the basics, or the variations
00:46:15 in the raw pixels, the textures and so on?
00:46:18 Yeah.
00:46:19 So what we seem to have learned is that there’s a lot of redundancy in these images and as
00:46:28 a result, we are able to do a lot of compression and this compression is very important in
00:46:35 biological settings, right?
00:46:36 So you might have 10 to the 8 photoreceptors and only 10 to the 6 fibers in the optic nerve.
00:46:42 So you have to do this compression by a factor of 100 to 1.
00:46:46 And so there are analogs of that which are happening in our artificial neural
00:46:54 networks,
00:46:55 in the early layers.
00:46:56 So you think there’s a lot of compression that can be done in the beginning.
00:47:01 Just the statistics.
00:47:02 Yeah.
00:47:03 So how successful is image compression?
00:47:05 How much?
00:47:06 Well, I mean, the way to think about it is just how successful is image compression,
00:47:14 right?
00:47:15 And that’s been done with older technologies, but there are several
00:47:23 companies which are trying to use these more advanced neural network type techniques
00:47:29 for compression, both for static images as well as for video.
00:47:34 One of my former students has a company which is trying to do stuff like this.
00:47:41 And I think that they are showing quite interesting results.
00:47:47 And I think that that success is really about image statistics and
00:47:52 video statistics.
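A minimal sketch of the idea: a toy convolutional autoencoder that squeezes an image through a bottleneck and learns to reconstruct it, exploiting exactly that redundancy. The sizes are assumptions, not any particular company's codec.

```python
import torch
import torch.nn as nn

class TinyAutoencoder(nn.Module):
    def __init__(self):
        super().__init__()
        # Encoder: 1x64x64 image -> 4x8x8 code, a 16:1 compression.
        self.encoder = nn.Sequential(
            nn.Conv2d(1, 8, 4, stride=2, padding=1), nn.ReLU(),   # 64 -> 32
            nn.Conv2d(8, 16, 4, stride=2, padding=1), nn.ReLU(),  # 32 -> 16
            nn.Conv2d(16, 4, 4, stride=2, padding=1),             # 16 -> 8
        )
        # Decoder: reconstruct the image from the code.
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(4, 16, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(16, 8, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(8, 1, 4, stride=2, padding=1), nn.Sigmoid(),
        )

    def forward(self, x):
        return self.decoder(self.encoder(x))

model = TinyAutoencoder()
x = torch.rand(1, 1, 64, 64)                # stand-in grayscale image
loss = nn.functional.mse_loss(model(x), x)  # train to reconstruct the input
```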
00:47:53 But that’s still not doing compression of the kind where, when I see a picture of a cat, all
00:47:59 I have to say is it’s a cat; that’s a different, semantic kind of compression.
00:48:02 Yeah.
00:48:03 So this is at the lower level, right?
00:48:04 As I said, that’s focusing on low level statistics.
00:48:10 So to linger on that for a little bit, you mentioned how far can bottom up image segmentation
00:48:17 go.
00:48:18 You mentioned that the central question for scene understanding is the interplay
00:48:24 of bottom up and top down information.
00:48:26 Maybe this is a good time to elaborate on that.
00:48:29 Maybe define what is bottom up, what is top down in the context of computer vision.
00:48:37 Right.
00:48:38 So today what we have are very interesting systems because they work completely bottom
00:48:45 up.
00:48:46 What does bottom up mean, sorry?
00:48:47 So bottom up, in this case, means a feed forward neural network.
00:48:52 So starting from the raw pixels, they end up
00:48:57 with something like cat or not a cat, right?
00:49:00 So our systems are running totally feed forward.
00:49:04 They’re trained in a very top down way.
00:49:07 So they’re trained by saying, okay, this is a cat, there’s a cat, there’s a dog, there’s
00:49:11 a zebra, et cetera.
00:49:14 And I’m not happy with either of these choices fully.
00:49:18 Because we have completely separated these processes, right?
00:49:24 So what do we know compared to biology?
00:49:34 In biology, what we know is that at test time, at runtime, the processes
00:49:42 are not purely feed forward, but they involve feedback.
00:49:46 And they involve much shallower neural networks.
00:49:50 So the kinds of neural networks we are using in computer vision, say a ResNet 50 has 50
00:49:55 layers.
00:49:56 Well in the brain, in the visual cortex going from the retina to IT, maybe we have like
00:50:02 seven, right?
00:50:04 So they’re far shallower, but we have the possibility of feedback.
00:50:08 So there are backward connections.
00:50:11 And this might enable us to deal with the more ambiguous stimuli, for example.
00:50:18 So the biological solution seems to involve feedback, the solution in artificial vision
00:50:26 seems to be just feed forward, but with a much deeper network.
00:50:30 And the two are functionally equivalent because if you have a feedback network, which just
00:50:35 has like three rounds of feedback, you can just unroll it and make it three times the
00:50:40 depth and create it in a totally feed forward way.
00:50:44 So this is something, I mean, we have written some papers on this theme, but I really
00:50:49 feel that this theme should be pursued further.
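A small sketch of that unrolling equivalence: one shallow convolutional block applied recurrently for a few steps behaves like a deeper feed-forward network with shared weights. The channel count and step count here are arbitrary assumptions.

```python
import torch
import torch.nn as nn

class RecurrentBlock(nn.Module):
    """Shallow block with feedback: run it k times instead of stacking k layers."""
    def __init__(self, channels=16, steps=3):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels, 3, padding=1)
        self.steps = steps

    def forward(self, x):
        h = torch.zeros_like(x)                # feedback state starts empty
        for _ in range(self.steps):
            # Each pass combines the bottom-up input x with the fed-back state h.
            h = torch.relu(self.conv(x + h))
        return h

block = RecurrentBlock()
y = block(torch.rand(1, 16, 32, 32))           # same weights reused 3 times
```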
00:50:55 Some kind of recurrence mechanism.
00:50:57 Yeah.
00:50:58 Okay.
00:50:59 The other thing is, I want to have a little bit more top down at test time.
00:51:07 Okay.
00:51:08 And then at training time, we make use of a lot of top down knowledge right now.
00:51:13 So basically to learn to segment an object, we have to have all these examples of this
00:51:19 is the boundary of a cat, and this is the boundary of a chair, and this is the boundary
00:51:22 of a horse and so on.
00:51:24 And this is too much top down knowledge.
00:51:27 How do humans do this?
00:51:30 We manage with far less supervision, and we do it in a sort of bottom up way, because
00:51:36 for example, we are looking at a video stream and the horse moves and that enables me to
00:51:44 say that all these pixels are together.
00:51:47 So the Gestalt psychologist used to call this the principle of common fate.
00:51:53 So there was a bottom up process by which we were able to segment out these objects
00:51:58 and we have totally focused on this top down training signal.
00:52:04 So in my view, we have currently solved it in machine vision, this top down bottom up
00:52:10 interaction, but I don’t find the solution fully satisfactory and I would rather have
00:52:17 a bit of both at both stages.
00:52:20 For all computer vision problems, not just segmentation.
00:52:25 And the question that you can ask is, so for me, I’m inspired a lot by human vision and
00:52:30 I care about that.
00:52:31 You could be just a hard boiled engineer and not give a damn.
00:52:35 So to you, I would then argue that you would need far less training data if you could make
00:52:41 my research agenda fruitful.
00:52:45 Okay, so then maybe taking a step into segmentation, static scene understanding.
00:52:54 What is the interaction between segmentation and recognition?
00:52:57 You mentioned the movement of objects.
00:53:00 So for people who don’t know computer vision, segmentation is this weird activity that computer
00:53:07 vision folks have all agreed is very important: drawing outlines around objects, versus
00:53:15 a bounding box, and then classifying that object.
00:53:21 What’s the value of segmentation?
00:53:23 What is it as a problem in computer vision?
00:53:27 How is it fundamentally different from detection recognition and the other problems?
00:53:31 Yeah, so I think, so segmentation enables us to say that some set of pixels are an object
00:53:41 without necessarily even being able to name that object or knowing properties of that
00:53:47 object.
00:53:48 Oh, so you mean segmentation purely as the act of separating an object
00:53:55 from its background?
00:53:56 An object that’s unified in some way from its background.
00:54:01 Yeah, so entitification, if you will, making an entity out of it.
00:54:05 Entitification, beautifully termed.
00:54:09 So I think that we have that capability and that enables us to, as we are growing up,
00:54:17 to acquire names of objects with very little supervision.
00:54:23 So suppose the child, let’s posit that the child has this ability to separate out objects
00:54:28 in the world.
00:54:30 Then when the mother says, pick up your bottle or the cat’s behaving funny today, the word
00:54:42 cat suggests some object and then the child sort of does the mapping, right?
00:54:47 The mother doesn’t have to teach specific object labels by pointing to them.
00:54:55 Weak supervision works in the context that you have the ability to create objects.
00:55:01 So I think that, so to me, that’s a very fundamental capability.
00:55:07 There are applications where this is very important, for example, medical diagnosis.
00:55:13 So in medical diagnosis, you have some brain scan, I mean, this is some work that we did
00:55:20 in my group where you have CT scans of people who have had traumatic brain injury and what
00:55:26 the radiologist needs to do is to precisely delineate various places where there might
00:55:32 be bleeds, for example, and there are clear needs like that.
00:55:39 So there are certainly very practical applications of computer vision where segmentation is necessary,
00:55:46 but philosophically segmentation enables the task of recognition to proceed with much weaker
00:55:54 supervision than we require today.
00:55:58 And you think of segmentation as this kind of task that takes on a visual scene and breaks
00:56:03 it apart into interesting entities that might be useful for whatever the task is.
00:56:11 Yeah.
00:56:12 And it is not semantics free.
00:56:14 So I think, I mean, it involves both perception and cognition.
00:56:22 It is not, I think the mistake that we used to make in the early days of computer vision
00:56:28 was to treat it as a purely bottom up perceptual task.
00:56:32 It is not just that because we do revise our notion of segmentation with more experience,
00:56:41 right?
00:56:42 Because for example, there are objects which are nonrigid like animals or humans.
00:56:47 And I think understanding that all the pixels of a human are one entity is actually quite
00:56:53 a challenge because the parts of the human, they can move independently and the human
00:56:59 wears clothes, so they might be differently colored.
00:57:02 So it’s all sort of a challenge.
00:57:05 You mentioned the three R’s of computer vision are recognition, reconstruction and reorganization.
00:57:12 Can you describe these three R’s and how they interact?
00:57:15 Yeah.
00:57:16 So recognition is the easiest one because that’s what I think people generally think
00:57:24 of as computer vision achieving these days, which is labels.
00:57:30 So is this a cat?
00:57:31 Is this a dog?
00:57:32 Is this a chihuahua?
00:57:35 I mean, you know, it could be very fine grained like, you know, specific breed of a dog or
00:57:41 a specific species of bird, or it could be very abstract like animal.
00:57:47 But given a part of an image or a whole image, say put a label on it.
00:57:51 Yeah.
00:57:52 That’s recognition.
00:57:54 Reconstruction is essentially, you can think of it as inverse graphics.
00:58:03 I mean, that’s one way to think about it.
00:58:07 So graphics is, you have some internal computer representation
00:58:14 of some objects arranged in a scene.
00:58:17 And what you do is you produce a picture, you produce the pixels corresponding to a
00:58:22 rendering of that scene.
00:58:24 So let’s do the inverse of this.
00:58:28 We are given an image and we try to, we say, oh, this image arises from some objects in
00:58:38 a scene looked at with a camera from this viewpoint.
00:58:41 And we might have more information about the objects like their shape, maybe their textures,
00:58:47 maybe, you know, color, et cetera, et cetera.
00:58:51 So that’s the reconstruction problem.
00:58:53 In a way, you are in your head creating a model of the external world.
00:59:00 Right.
00:59:01 Okay.
00:59:02 Reorganization is to do with essentially finding these entities.
00:59:09 So it’s organization, the word organization implies structure.
00:59:15 So that in perception, in psychology, we use the term perceptual organization.
00:59:22 The world, an image, is not internally represented
00:59:30 as just a collection of pixels; we make these entities.
00:59:34 We create these entities, objects, whatever you want to call it.
00:59:38 And the relationship between the entities as well, or is it purely about the entities?
00:59:42 It could be about the relationships, but mainly we focus on the fact that there are entities.
00:59:47 Okay.
00:59:48 So I’m trying to pinpoint what the organization means.
00:59:52 So organization is that instead of like a uniform grid, we have this structure of objects.
01:00:02 So the segmentation is the small part of that.
01:00:05 So segmentation gets us going towards that.
01:00:09 Yeah.
01:00:10 And you kind of have this triangle where they all interact together.
01:00:13 Yes.
01:00:14 So how do you see that interaction? Sort of, reorganization is, yes, finding the entities
01:00:23 in the world.
01:00:25 Recognition is labeling those entities, and then reconstruction is, what, filling in
01:00:32 the gaps?
01:00:33 Well, for example, imputing some 3D objects corresponding to each of these entities.
01:00:43 That would be part of it.
01:00:44 So adding more information that’s not there in the raw data.
01:00:48 Correct.
01:00:49 I mean, I started pushing this kind of a view around 2010 or something like that.
01:00:58 Because at that time in computer vision, people were just working
01:01:06 on many different problems, but they treated each of them as a separate isolated problem,
01:01:11 each with its own data set.
01:01:13 And then you try to solve that and get good numbers on it.
01:01:17 I didn’t like that approach because I wanted to see the connection between these.
01:01:23 And if people divided up vision into various modules, the way they would do it
01:01:30 is as low level, mid level and high level vision corresponding roughly to the psychologist’s
01:01:36 notion of sensation, perception and cognition.
01:01:40 And that didn’t map to tasks that people cared about.
01:01:45 Okay.
01:01:46 So therefore I tried to promote this particular framework as a way of considering the problems
01:01:52 that people in computer vision were actually working on and trying to be more explicit
01:01:58 about the fact that they actually are connected to each other.
01:02:02 And I was at that time just doing this on the basis of information flow.
01:02:07 Now it turns out, in the last five years or so, post the deep learning revolution,
01:02:17 that this architecture has turned out to be very conducive to that.
01:02:25 Because basically in these neural networks, we are trying to build multiple representations.
01:02:33 They can be multiple output heads sharing common representations.
01:02:37 So in a certain sense today, given the reality of what solutions people have to this, I do
01:02:46 not need to preach this anymore.
01:02:48 It is just there.
01:02:50 It’s part of the solution space.
01:02:52 So speaking of neural networks, how much of this problem of computer vision, of reorganization,
01:03:02 recognition, and reconstruction,
01:03:09 How much of it can be learned end to end, do you think?
01:03:12 Sort of set it and forget it.
01:03:17 Just plug and play, have a giant data set, multiple, perhaps multimodal, and then just
01:03:23 learn the entirety of it.
01:03:25 Well, so I think that what end to end learning means nowadays is end
01:03:31 to end supervised learning.
01:03:34 And that I would argue is too narrow a view of the problem.
01:03:38 I like this child development view, this lifelong learning view, one where there are certain
01:03:46 capabilities that are built up and then there are certain capabilities which are built up
01:03:51 on top of that.
01:03:53 So that’s what I believe in.
01:03:58 So I think end to end learning in the supervised setting for a very precise task, to me, is
01:04:13 sort of a limited view of the learning process.
01:04:17 Got it.
01:04:18 So if we think beyond purely supervised learning, looking back to children, you mentioned six
01:04:25 lessons that we can learn from children: be multimodal, be incremental, be physical,
01:04:33 explore, be social, use language.
01:04:36 Can you speak to these, perhaps picking one that you find most fundamental to our time
01:04:42 today?
01:04:43 Yeah.
01:04:44 So I should say, to give due credit, this is from a paper by Smith and Gasser.
01:04:50 And it reflects essentially, I would say common wisdom among child development people.
01:05:00 It’s just that this is not common wisdom among people in computer vision and AI and machine
01:05:07 learning.
01:05:08 So I view my role as trying to bridge the two worlds.
01:05:15 So let’s take an example of multimodal.
01:05:18 I like that.
01:05:20 So multimodal, a canonical example is a child interacting with an object.
01:05:28 So then the child holds a ball and plays with it.
01:05:32 So at that point, it’s getting a touch signal.
01:05:35 So the touch signal is getting the notion of 3D shape, but it is sparse.
01:05:44 And then the child is also seeing a visual signal.
01:05:48 And these two, imagine, are in totally different spaces.
01:05:52 So one is the space of receptors on the skin of the fingers and the thumb and the palm.
01:05:59 And these map onto neuronal fibers which are getting activated somewhere.
01:06:06 These lead to some activation in somatosensory cortex.
01:06:10 I mean, a similar thing will happen if we have a robot hand.
01:06:15 And then we have the pixels corresponding to the visual view, but we know that they
01:06:20 correspond to the same object.
01:06:24 So that’s a very, very strong cross calibration signal.
01:06:28 And it is self supervisory, which is beautiful.
01:06:32 There’s nobody assigning a label.
01:06:34 The mother doesn’t have to come and assign a label.
01:06:37 The child doesn’t even have to know that this object is called a ball.
01:06:42 That the child is learning something about the three dimensional world from this signal.
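As a rough illustration of this cross-calibration idea, the sketch below, with made-up feature dimensions and names, trains a tactile encoder and a visual encoder contrastively so that signals from the same object agree; co-occurrence is the only supervision, no labels.

```python
# Hypothetical self-supervised cross-modal sketch (touch and vision).
import torch
import torch.nn as nn
import torch.nn.functional as F

class Encoder(nn.Module):
    def __init__(self, in_dim: int, embed_dim: int = 32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, 64), nn.ReLU(), nn.Linear(64, embed_dim)
        )

    def forward(self, x):
        return F.normalize(self.net(x), dim=-1)  # unit-norm embeddings

def cross_modal_loss(touch_z, vision_z, temperature=0.1):
    # Row i of each batch comes from the same object: that pairing is the
    # positive; every other pairing in the batch is a negative (InfoNCE).
    logits = touch_z @ vision_z.t() / temperature
    targets = torch.arange(touch_z.size(0))
    return F.cross_entropy(logits, targets)

touch_encoder = Encoder(in_dim=20)    # e.g., 20 pressure sensors (made up)
vision_encoder = Encoder(in_dim=128)  # e.g., a flattened image patch (made up)

touch = torch.randn(16, 20)
vision = torch.randn(16, 128)
loss = cross_modal_loss(touch_encoder(touch), vision_encoder(vision))
print(loss.item())
```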
01:06:49 On tactile and visual, there is some work; there is a lot of work currently
01:06:54 on audio and visual.
01:06:57 And audio visual, so there is some event that happens in the world, and that event has a
01:07:02 visual signature and it has an auditory signature.
01:07:07 So there is this glass bowl on the table and it falls and breaks and I hear the smashing
01:07:12 sound and I see the pieces of glass.
01:07:14 Okay, I’ve built that connection between the two, right?
01:07:19 We have people, I mean, this has become a hot topic in computer vision in the last couple
01:07:24 of years.
01:07:26 There are problems like separating out multiple speakers, right?
01:07:32 Which was a classic problem in audition.
01:07:35 They call this the problem of source separation or the cocktail party effect and so on.
01:07:40 But when you also have the visual signal, it becomes so much easier and so much
01:07:47 more useful.
01:07:50 So the multimodal, I mean, there’s so much more signal with multimodal and you can use
01:07:56 that for some kind of weak supervision as well.
01:08:00 Yes, because they are occurring at the same moment in time.
01:08:03 So you have time which links the two, right?
01:08:06 So at a certain moment, T1, you’ve got a certain signal in the auditory domain and a certain
01:08:10 signal in the visual domain, but they must be causally related.
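Here is a hedged sketch of that time-as-supervisor idea: audio and video features captured at the same moment are positives, misaligned pairs are negatives, and a small classifier learns correspondence from co-occurrence alone. The feature sizes and names are assumptions for illustration.

```python
# Hypothetical audio-visual synchronization sketch, supervised only by time.
import torch
import torch.nn as nn

class SyncClassifier(nn.Module):
    def __init__(self, audio_dim: int = 40, video_dim: int = 60):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(audio_dim + video_dim, 64), nn.ReLU(), nn.Linear(64, 1)
        )

    def forward(self, audio_feat, video_feat):
        return self.net(torch.cat([audio_feat, video_feat], dim=-1))

model = SyncClassifier()
audio = torch.randn(8, 40)  # e.g., spectrogram features at moment t (made up)
video = torch.randn(8, 60)  # e.g., frame features at the same moment t

# Positives: aligned pairs. Negatives: audio paired with video from a
# different moment, here produced by rolling the batch by one.
pos_logits = model(audio, video)
neg_logits = model(audio, torch.roll(video, shifts=1, dims=0))
logits = torch.cat([pos_logits, neg_logits])
labels = torch.cat([torch.ones(8, 1), torch.zeros(8, 1)])
loss = nn.functional.binary_cross_entropy_with_logits(logits, labels)
print(loss.item())
```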
01:08:14 Yeah, that’s an exciting area.
01:08:16 Not well studied yet.
01:08:17 Yeah, I mean, we have a little bit of work on this, but so much more needs to be done.
01:08:25 So this is a good example.
01:08:28 Be physical, that has to do with the thing we talked about earlier, that there’s
01:08:34 an embodied world.
01:08:36 To mention language: use language.
01:08:39 So Noam Chomsky believes that language may be at the core of cognition, at the core of
01:08:44 everything in the human mind.
01:08:46 What is the connection between language and vision to you?
01:08:50 What’s more fundamental?
01:08:51 Are they neighbors?
01:08:53 Is one the parent and the child, the chicken and the egg?
01:08:58 Oh, it’s very clear.
01:08:59 It is vision, which is the parent.
01:09:00 Which is the fundamental ability, okay.
01:09:07 It comes before. You think vision is more fundamental than language?
01:09:11 Correct.
01:09:12 And you can think of it either in phylogeny or in ontogeny.
01:09:18 So phylogeny means if you look at evolutionary time, right?
01:09:22 So we have vision that developed 500 million years ago, okay.
01:09:27 Then something like when we get to maybe like five million years ago, you have the first
01:09:33 bipedal primate.
01:09:34 So when we started to walk, then the hands became free.
01:09:38 And so then manipulation, the ability to manipulate objects and build tools and so on and so forth.
01:09:45 So you said 500,000 years ago?
01:09:47 No, sorry.
01:09:48 The first multicellular animals, which you can say had some intelligence arose 500 million
01:09:56 years ago.
01:09:57 Million.
01:09:58 Okay.
01:09:59 And now let’s fast forward to say the last seven million years, which is the development
01:10:05 of the hominid line, right, where from the other primates, we have the branch which leads
01:10:10 on to modern humans.
01:10:12 Now there are many of these hominids, but the one which, you know, people talk about is
01:10:21 Lucy, because that’s a skeleton from three million years ago.
01:10:25 And we know that Lucy walked, okay.
01:10:28 So at this stage the hand is free for manipulating objects, and the
01:10:34 ability to manipulate objects and build tools, and the brain size, grew in this era.
01:10:43 So okay, so now you have manipulation.
01:10:46 Now we don’t know exactly when language arose.
01:10:49 But after that.
01:10:50 Because no apes have it. I mean, Chomsky is correct in that it is a uniquely
01:10:57 human capability, and other primates don’t have that.
01:11:04 So it developed somewhere in this era, but I would argue that
01:11:12 it probably developed after we had this stage of humans, the human species already
01:11:19 able to manipulate, hands free, much bigger brain size.
01:11:25 And for that, a lot of vision had already had to develop.
01:11:31 So the sensation and the perception, maybe some of the cognition.
01:11:35 Yeah.
01:11:36 So these ancestors
01:11:45 of ours, you know, three, four million years ago, they had spatial intelligence.
01:11:53 So they knew that the world consists of objects.
01:11:56 They knew that the objects were in certain relationships to each other.
01:11:59 They had observed causal interactions among objects.
01:12:05 They could move in space.
01:12:06 So they had space and time and all of that.
01:12:09 So language builds on that substrate.
01:12:13 So language has a lot of, I mean, all human languages have constructs
01:12:19 which depend on a notion of space and time.
01:12:22 Where did that notion of space and time come from?
01:12:26 It had to come from perception and action in the world we live in.
01:12:30 Yeah.
01:12:31 Well, you’ve referred to the spatial intelligence.
01:12:33 Yeah.
01:12:34 Yeah.
01:12:35 So to linger a little bit, we’ll mention Turing and his suggestion that we should learn from
01:12:42 children.
01:12:43 Nevertheless, language is the fundamental piece of the test of intelligence that Turing
01:12:49 proposed.
01:12:50 Yes.
01:12:51 What do you think is a good test of intelligence?
01:12:53 What would impress the heck out of you?
01:12:56 Is it fundamentally natural language or is there something in vision?
01:13:02 I don’t think we should create a single test of intelligence.
01:13:10 So just like I don’t believe in IQ as a single number, I think generally there can be many
01:13:17 capabilities which are correlated perhaps.
01:13:21 So I think that there will be accomplishments which are visual accomplishments,
01:13:28 accomplishments in manipulation or robotics, and then accomplishments
01:13:36 in language.
01:13:37 But I do believe that language will be the hardest nut to crack.
01:13:40 Really?
01:13:41 Yeah.
01:13:42 So what’s harder: to pass the spirit of the Turing test, or whatever formulation
01:13:46 makes it convincingly natural language, like somebody you would want to
01:13:52 have a beer with, hang out and have a chat with, or general natural scene understanding?
01:13:59 You think language is the tougher problem?
01:14:01 I’m not a fan of the Turing test. I think Turing, as he proposed
01:14:09 the test in 1950, was trying to solve a certain problem.
01:14:13 Yeah, imitation.
01:14:14 Yeah.
01:14:15 And I think it made a lot of sense then.
01:14:18 Where we are today, 70 years later, I think we should not worry about that.
01:14:26 I think the Turing test is no longer the right way to channel research in AI, because
01:14:34 it takes us down this path of a chatbot which can fool us for five minutes or whatever.
01:14:39 Okay.
01:14:40 I think I would rather have a list of 10 different tasks.
01:14:44 I mean, there are tasks in the manipulation domain, tasks
01:14:50 in navigation, tasks in visual scene understanding, tasks in reading a story and answering questions
01:14:58 based on that.
01:14:59 So my favorite language understanding task would be, you know, reading a novel and
01:15:05 being able to answer arbitrary questions from it.
01:15:08 Okay.
01:15:09 Right.
01:15:10 That to me, and this is not an exhaustive list by any means,
01:15:15 is where we need to be going.
01:15:21 And each of these, on each of these axes, there’s a fair amount of work to be done.
01:15:26 So on the visual understanding side, in this intelligence Olympics that we’ve set up, what’s
01:15:31 a good test, one of many, of visual scene understanding?
01:15:39 Do you think such benchmarks exist?
01:15:41 Sorry to interrupt.
01:15:42 No, there aren’t any.
01:15:43 I think essentially, to me, a really good test would be a really good aid to the blind.
01:15:50 So suppose there was a blind person and I needed to assist the blind person.
01:15:57 So ultimately, like we said, vision that aids action and survival in this world,
01:16:05 maybe in a simulated world.
01:16:09 It may be easier to measure performance in a simulated world, but what we are ultimately after is performance
01:16:15 in the real world.
01:16:17 So David Hilbert in 1900 proposed 23 open problems in mathematics, some of which are
01:16:23 still unsolved, the most famous of which is probably the Riemann hypothesis.
01:16:29 You’ve thought about and presented about the Hilbert problems of computer vision.
01:16:33 So let me ask, I don’t know when you last presented that,
01:16:38 maybe in 2015, but versions of it, you’re kind of the face and the spokesperson for computer
01:16:44 vision.
01:16:45 It’s your job to state what the open problems are for the field.
01:16:51 So what today are the Hilbert problems of computer vision, do you think?
01:16:56 Let me pick one which I regard as clearly unsolved, which is what I would call long
01:17:05 form video understanding.
01:17:08 So we have a video clip and we want to understand the behavior in there in terms of agents,
01:17:20 their goals, intentionality and make predictions about what might happen.
01:17:30 So that kind of understanding goes beyond atomic visual actions.
01:17:37 So in the short range, the question is, are you sitting, are you standing, are you catching
01:17:41 a ball?
01:17:44 That we can do now, or even if we can’t do it fully accurately, if we can do it at 50%,
01:17:50 maybe next year we’ll do it at 65% and so forth.
01:17:54 But I think the long range video understanding, I don’t think we can do today.
01:18:01 And it blends into cognition, that’s the reason why it’s challenging.
01:18:06 So you have to understand the entities,
01:18:11 you have to track them, and you have to have some kind of model of their behavior.
01:18:16 Correct.
01:18:17 And these are agents, so they are not just passive objects;
01:18:24 they would exhibit goal directed behavior.
01:18:29 Okay, so this is one area.
01:18:32 Then I will talk about understanding the world in 3D.
01:18:37 This may seem paradoxical because in a way we have been able to do 3D understanding even
01:18:43 like 30 years ago, right?
01:18:45 But I don’t think we currently have the richness of 3D understanding in our computer vision
01:18:51 system that we would like.
01:18:55 So let me elaborate on that a bit.
01:18:57 So currently we have two kinds of techniques which are not fully unified.
01:19:03 So they are the kinds of techniques from multi view geometry that you have multiple pictures
01:19:08 of a scene and you do a reconstruction using stereoscopic vision or structure from motion.
01:19:14 But these techniques totally fail if you just have a single view, because
01:19:21 they are relying on multiple view geometry.
01:19:25 Okay, then we have some techniques that we have developed in the computer vision community
01:19:30 which try to guess 3D from single views.
01:19:34 And these techniques are based on supervised learning, and they are based on having, at training
01:19:41 time, 3D models of objects available.
01:19:46 And this is completely unnatural supervision, right?
01:19:50 CAD models are not injected into your brain.
01:19:54 Okay, so what would I like?
01:19:56 What I would like would be a kind of learn-as-you-move-around-the-world notion of 3D.
01:20:06 So we have a succession of visual experiences, and as part of that I might
01:20:19 see a chair from different viewpoints or a table from different viewpoints and so on.
01:20:24 And that enables me to build some internal representation.
01:20:31 And then next time I just see a single photograph and it may not even be of that chair, it’s
01:20:37 of some other chair.
01:20:38 And I have a guess of what its 3D shape is like.
01:20:42 So you’re almost learning the CAD model, kind of.
01:20:45 Yeah, implicitly.
01:20:46 Implicitly.
01:20:47 I mean, the CAD model need not be in the same form as used by computer graphics programs.
01:20:52 Hidden in the representation.
01:20:53 It’s hidden in the representation, the ability to predict new views.
01:20:58 And what I would see if I went to such and such position.
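Purely as a sketch of that idea, and not anyone's actual system: an encoder compresses one view into a latent code, and a decoder conditioned on a relative camera pose must predict the view from the new position, so the "CAD model" lives implicitly in the latent. Supervision comes free from moving around (pairs of views), and every size and name here is invented.

```python
# Hypothetical novel-view prediction sketch: implicit 3D in a latent code.
import torch
import torch.nn as nn

class NovelViewPredictor(nn.Module):
    def __init__(self, latent_dim: int = 128, pose_dim: int = 6):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Flatten(), nn.Linear(3 * 32 * 32, latent_dim), nn.ReLU()
        )
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim + pose_dim, 3 * 32 * 32), nn.Sigmoid()
        )

    def forward(self, source_view, target_pose):
        latent = self.encoder(source_view)  # the implicit "shape" code
        out = self.decoder(torch.cat([latent, target_pose], dim=-1))
        return out.view(-1, 3, 32, 32)

model = NovelViewPredictor()
source = torch.rand(4, 3, 32, 32)  # what I see now
pose = torch.randn(4, 6)           # where I move to (relative camera pose)
target = torch.rand(4, 3, 32, 32)  # what I actually see from there
loss = nn.functional.mse_loss(model(source, pose), target)
print(loss.item())  # training on such view pairs needs no CAD models
```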
01:21:04 By the way, on a small tangent on that, are you okay or comfortable with neural networks
01:21:14 that do achieve visual understanding, that, for example, achieve this kind of 3D understanding,
01:21:19 and you don’t know how they do it; you’re not able to visualize,
01:21:27 understand, or interact with the representation?
01:21:31 So the fact that they’re not or may not be explainable.
01:21:34 Yeah, I think that’s fine.
01:21:38 To me that is fine, but let me put some caveats on that.
01:21:44 So it depends on the setting.
01:21:46 So first of all, I think humans are not explainable.
01:21:55 So that’s a really good point.
01:21:57 One human to another human is not fully explainable.
01:22:02 I think there are settings where explainability matters and these might be, for example, questions
01:22:10 on medical diagnosis.
01:22:13 So I’m in a setting where maybe the doctor, maybe a computer program has made a certain
01:22:19 diagnosis and then depending on the diagnosis, perhaps I should have treatment A or treatment
01:22:25 B, right?
01:22:28 So now, is the computer program’s diagnosis based on data which was collected
01:22:38 for American males who are in their 30s and 40s, and maybe not so relevant to me?
01:22:45 Maybe it is relevant, you know, et cetera, et cetera.
01:22:48 I mean, in medical diagnosis, we have major issues to do with the reference class.
01:22:53 So we may have acquired statistics from one group of people and applying it to a different
01:22:58 group of people who may not share all the same characteristics.
01:23:02 There might be error bars in the prediction.
01:23:07 So that prediction should really be taken with a huge grain of salt.
01:23:14 But this has an impact on what treatments should be picked, right?
01:23:20 So there are settings where I want to know more than just, this is the answer.
01:23:26 But what I acknowledge is that, so in that sense, explainability and interpretability
01:23:33 may matter.
01:23:34 It’s about giving error bounds and a better sense of the quality of the decision.
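To make the error-bars point concrete, here is a toy illustration, with entirely synthetic data and an invented decision rule, of the reference-class worry: a rule fit to one group is bootstrapped on another, and the confidence interval exposes how much the prediction quality degrades.

```python
# Toy reference-class illustration: synthetic data, invented rule.
import numpy as np

rng = np.random.default_rng(0)

def make_group(n, threshold):
    # Synthetic "patients": one feature; outcome depends on a
    # group-specific threshold.
    x = rng.normal(size=n)
    return x, (x > threshold).astype(int)

def rule(x):
    # A decision rule implicitly fit to the training-like group
    # (whose true threshold is 0.0).
    return (x > 0.0).astype(int)

def bootstrap_accuracy_ci(x, y, n_boot=1000):
    # Resample the evaluation set to get error bars on accuracy.
    accs = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(x), len(x))
        accs.append((rule(x[idx]) == y[idx]).mean())
    return np.percentile(accs, [2.5, 97.5])

for name, threshold in [("training-like group", 0.0), ("shifted group", 0.7)]:
    x, y = make_group(200, threshold)
    low, high = bootstrap_accuracy_ci(x, y)
    print(f"{name}: 95% CI for accuracy = [{low:.2f}, {high:.2f}]")
```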
01:23:40 Where I’m willing to sacrifice interpretability is that I believe that there can be systems
01:23:50 which can be highly performant, but which are internally black boxes.
01:23:56 And that seems to be where it’s headed.
01:23:57 Some of the best performing systems are essentially black boxes, fundamentally by their construction.
01:24:04 You and I are black boxes to each other.
01:24:06 Yeah.
01:24:07 So the nice thing about the black boxes we are is, we ourselves are black boxes, but
01:24:13 those of us who are charming are able to convince others, to explain
01:24:20 what’s going on inside the black box with narratives, with stories.
01:24:25 So in some sense, neural networks don’t have to actually explain what’s going on inside.
01:24:31 They just have to come up with stories, real or fake that convince you that they know what’s
01:24:37 going on.
01:24:38 And I’m sure we can do that.
01:24:39 We can create those stories, neural networks can create those stories.
01:24:45 Yeah.
01:24:46 And the transformer will be involved.
01:24:50 Do you think we will ever build a system of human level or superhuman level intelligence?
01:24:56 We’ve kind of defined what it takes to try to approach that, but do you think that’s
01:25:01 within our reach?
01:25:02 The thing that we thought we could do, what Turing actually thought we could do by the year
01:25:07 2000, right?
01:25:09 Do you think we’ll ever be able to do it?
01:25:11 So I think there are two answers here.
01:25:12 One answer is: in principle, can we do this at some time?
01:25:18 And my answer is yes.
01:25:20 The second answer is a pragmatic one.
01:25:23 Do you think we will be able to do it in the next 20 years or whatever?
01:25:27 And to that my answer is no.
01:25:30 So of course that’s a wild guess.
01:25:34 I think that, you know, Donald Rumsfeld is not a favorite person of mine, but one of
01:25:40 his lines was very good, which is about known unknowns and unknown unknowns.
01:25:48 So in the business we are in, there are known unknowns and we have unknown unknowns.
01:25:55 So I think with respect to a lot of what’s the case in vision and robotics, I feel like
01:26:04 we have known unknowns.
01:26:06 So I have a sense of where we need to go and what the problems that need to be solved are.
01:26:13 I feel with respect to natural language, understanding and high level cognition, it’s not just known
01:26:21 unknowns, but also unknown unknowns.
01:26:24 So it is very difficult to put any kind of a timeframe to that.
01:26:30 Do you think some of the unknown unknowns might be positive in that they’ll surprise
01:26:36 us and make the job much easier?
01:26:38 So fundamental breakthroughs?
01:26:40 I think that is possible because certainly I have been very positively surprised by how
01:26:45 effective these deep learning systems have been because I certainly would not have believed
01:26:53 that in 2010.
01:26:57 I think what we knew from the mathematical theory was that convex optimization works.
01:27:06 When there’s a single global optimum, then these gradient descent techniques would work.
01:27:11 Now these are nonlinear, non-convex systems
01:27:16 with a huge number of variables, so over-parametrized.
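As a toy illustration of that gap between the convex theory and non-convex practice: below, plain gradient descent is run on a non-convex one-dimensional function from several starting points; different runs settle into different local minima, yet each run still optimizes smoothly. An illustration only, not a claim about deep networks.

```python
# Toy non-convex optimization: gradient descent from several starts.
import torch

def f(x):
    return x**4 - 3 * x**2 + x  # non-convex: two separate basins

for x0 in [-2.0, 0.5, 2.0]:
    x = torch.tensor(x0, requires_grad=True)
    opt = torch.optim.SGD([x], lr=0.01)
    for _ in range(500):
        opt.zero_grad()
        f(x).backward()
        opt.step()
    print(f"start {x0:+.1f} -> x = {x.item():+.3f}, f(x) = {f(x).item():+.3f}")
```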
01:27:18 And the people who used to play with them a lot, the ones who are totally immersed in
01:27:26 the lore and the black magic, they knew that they worked well, even though they were…
01:27:33 Really?
01:27:34 I thought like everybody…
01:27:35 No, the claim that I hear from my friends like Yann LeCun and so forth is that they
01:27:43 feel that they were comfortable with them.
01:27:45 But the community as a whole was certainly not.
01:27:50 And I think to me that was the surprise that they actually worked robustly for a wide range
01:27:59 of problems from a wide range of initializations and so on.
01:28:04 And so that was certainly more rapid progress than we expected.
01:28:13 But then there are certainly lots of times, in fact, most of the history of AI is when
01:28:19 we have made less progress at a slower rate than we expected.
01:28:24 So we just keep going.
01:28:27 I think what I regard as really unwarranted are these fears of AGI in 10 years and 20
01:28:39 years and that kind of stuff, because that’s based on completely unrealistic models of
01:28:44 how rapidly we will make progress in this field.
01:28:48 So I agree with you, but I’ve also gotten the chance to interact with very smart people
01:28:54 who really worry about existential threats of AI.
01:28:57 And I, as an open minded person, am sort of taking it in.
01:29:04 Do you think AI systems, in some way the unknown unknowns, not superintelligent AI,
01:29:12 but ways we don’t quite understand, the nature of superintelligence, will have a
01:29:18 detrimental effect on society?
01:29:20 Do you think this is something we should be worried about or we need to first allow the
01:29:25 unknown unknowns to become known unknowns?
01:29:29 I think we need to be worried about AI today.
01:29:32 I think that it is not just a worry we need to have when we get that AGI.
01:29:38 I think that AI is being used in many systems today.
01:29:43 And there might be settings, for example, when it causes biases or decisions which could
01:29:49 be harmful.
01:29:50 I mean, decisions which could be unfair to some people, or it could be a self driving
01:29:55 car which kills a pedestrian.
01:29:57 So AI systems are being deployed today, right?
01:30:02 And they’re being deployed in many different settings, maybe in medical diagnosis, maybe
01:30:05 in a self driving car, maybe in selecting applicants for an interview.
01:30:10 So I would argue that when these systems make mistakes, there are consequences.
01:30:18 And we are in a certain sense responsible for those consequences.
01:30:22 So I would argue that this is a continuous effort.
01:30:27 It is on us, and this is something that in a way is not so surprising.
01:30:32 It’s true of all engineering and scientific progress: with great power comes great responsibility.
01:30:40 So as these systems are deployed, we have to worry about them and it’s a continuous
01:30:44 problem.
01:30:45 I don’t think of it as something which will suddenly happen on some day in 2079 for which
01:30:51 I need to design some clever trick.
01:30:54 I’m saying that these problems exist today and we need to be continuously on the lookout
01:31:00 for worrying about safety, biases, risks, right?
01:31:06 I mean, a self driving car can kill a pedestrian, and they have, right?
01:31:11 I mean, this Uber incident in Arizona, right?
01:31:16 It has happened, right?
01:31:17 This is not about AGI.
01:31:18 In fact, it’s about a very dumb intelligence which is still killing people.
01:31:23 The worry people have with AGI is the scale.
01:31:28 But I think you’re 100% right. The thing that worries me about AI today, and it’s
01:31:34 happening at a huge scale, is recommender systems, recommendation systems.
01:31:39 So if you look at Twitter or Facebook or YouTube, they’re controlling the ideas that we have
01:31:47 access to, the news and so on.
01:31:50 And that’s a fundamental machine learning algorithm behind each of these recommendations.
01:31:55 And they, I mean, my life would not be the same without these sources of information.
01:32:00 I’m a totally new human being and the ideas that I know are very much because of the internet,
01:32:07 because of the algorithm that recommend those ideas.
01:32:09 And so as they get smarter and smarter, I mean, that is the AGI: the algorithm
01:32:16 that’s recommending the next YouTube video you should watch has control of millions,
01:32:23 billions of people. That algorithm is already super intelligent and has
01:32:30 control of the population, not complete control, but very strong control.
01:32:35 For now we can turn off YouTube, we can just go have a normal life outside of that.
01:32:39 But the more and more that gets into our lives, we start depending on
01:32:46 that algorithm and on the different companies that are working on the algorithm.
01:32:49 So I think it’s, you’re right, it’s already there.
01:32:53 And YouTube in particular is using computer vision, doing their hardest to try to understand
01:32:59 the content of videos so they could be able to connect videos with the people who would
01:33:05 benefit from those videos the most.
01:33:08 And so that development could go in a bunch of different directions, some of which might
01:33:12 be harmful.
01:33:14 So yeah, you’re right, the threats of AI are here already and we should be thinking about
01:33:19 them.
01:33:20 On a philosophical note, a personal one perhaps: if you could relive a moment in
01:33:29 your life outside of family, because it made you truly happy or it was a profound moment
01:33:36 that impacted the direction of your life, what moment would you go to?
01:33:44 I don’t think of single moments, but I look over the long haul.
01:33:49 I feel that I’ve been very lucky because I feel that, I think that in scientific research,
01:33:58 a lot of it is about being at the right place at the right time.
01:34:03 And you can work on problems at a time when they’re just too premature.
01:34:10 You butt your head against them and nothing happens because the prerequisites for success
01:34:18 are not there.
01:34:19 And then there are times when you are in a field which is all pretty mature and you can
01:34:25 only solve curlicues upon curlicues.
01:34:30 I’ve been lucky to have been in this field for 34 years, well, actually 34 years
01:34:36 as a professor at Berkeley, so longer than that. When I started in it, it was just
01:34:44 some little crazy, absolutely useless field which couldn’t really do anything, and it has come to
01:34:53 a time when it’s really, really solving a lot of practical problems, has offered a lot
01:35:01 of tools for scientific research, because computer vision is impactful for images in biology
01:35:08 or astronomy and so on and so forth.
01:35:12 So we have made great scientific progress which has had real practical impact
01:35:18 in the world.
01:35:19 And I feel lucky that I got in at a time when the field was very young, and at a time when
01:35:28 it’s now mature but not fully mature.
01:35:34 It’s mature but not done.
01:35:35 I mean, it’s really still in a productive phase.
01:35:39 Yeah, I think people 500 years from now would laugh at you calling this field mature.
01:35:45 That is very possible.
01:35:46 Yeah.
01:35:47 But, lest I forget to mention, you’ve also mentored some of the biggest names
01:35:53 in computer vision, computer science, and AI today.
01:35:59 So many questions I could ask, but really: what is it, how did you do it?
01:36:04 What does it take to be a good mentor?
01:36:06 What does it take to be a good guide?
01:36:09 Yeah, I feel I’ve been lucky to have had very, very smart and hardworking
01:36:17 and creative students.
01:36:18 I think some part of the credit just belongs to being at Berkeley.
01:36:25 Those of us who are at top universities are blessed because we have very, very smart and
01:36:32 capable students coming and knocking on our door.
01:36:37 So I have to be humble enough to acknowledge that.
01:36:40 But what have I added?
01:36:41 I think I have added something.
01:36:44 What I have added is, I think what I’ve always tried to teach them is a sense of picking
01:36:52 the right problems.
01:36:54 So I think that in science, in the short run, success is always based on technical competence.
01:37:04 You’re, you know, quick with math or whatever.
01:37:09 I mean, there’s certain technical capabilities which make for short range progress.
01:37:15 Long range progress is really determined by asking the right questions and focusing on
01:37:21 the right problems.
01:37:23 And I feel that what I’ve been able to bring to the table in terms of advising these students
01:37:31 is some sense of taste of what are good problems, what are problems that are worth attacking
01:37:38 now as opposed to waiting 10 years.
01:37:41 What’s a good problem?
01:37:42 If you could summarize, is that possible to even summarize, like what’s your sense of
01:37:47 a good problem?
01:37:48 I think I have a sense of what is a good problem. There is a British
01:37:55 scientist, in fact he won a Nobel Prize, Peter Medawar, who has a book on this.
01:38:02 And basically he calls research the art of the soluble.
01:38:08 So we need to sort of find problems which are not yet solved, but which are approachable.
01:38:18 And he sort of refers to this sense that there is this problem which isn’t quite solved yet,
01:38:25 but it has a soft underbelly.
01:38:26 There is some place where you can, you know, spear the beast.
01:38:32 And having that intuition that this problem is ripe is a good thing because otherwise
01:38:39 you can just beat your head and not make progress.
01:38:42 So I think that is important.
01:38:45 So if I have that and if I can convey that to students, it’s not just that they do great
01:38:52 research while they’re working with me, but that they continue to do great research.
01:38:56 So in a sense, I’m proud of my students and their achievements and their great research
01:39:01 even 20 years after they’ve ceased being my student.
01:39:05 So it’s in part helping them develop that sense that a problem is not yet solved,
01:39:11 but it’s solvable.
01:39:12 Correct.
01:39:13 The other thing which I have, which I think I bring to the table, is a certain intellectual
01:39:21 breadth.
01:39:22 I’ve spent a fair amount of time studying psychology, neuroscience, relevant areas of
01:39:29 applied math and so forth.
01:39:31 So I can probably help them see some connections to disparate things, which they might not
01:39:40 have otherwise.
01:39:42 So the smart students coming into Berkeley can be very deep, they can think very deeply,
01:39:50 meaning very hard down one particular path. Where I could help them is with the shallow
01:39:58 breadth, while they would have the narrow depth, and that’s of some value.
01:40:08 Well, it was beautifully refreshing just to hear you naturally jump from psychology back
01:40:14 to computer science in this conversation, back and forth.
01:40:18 That’s actually a rare quality and I think it’s certainly for students empowering to
01:40:23 think about problems in a new way.
01:40:25 So for that and for many other reasons, I really enjoyed this conversation.
01:40:29 Thank you so much.
01:40:30 It was a huge honor.
01:40:31 Thanks for talking to me.
01:40:32 It’s been my pleasure.
01:40:34 Thanks for listening to this conversation with Jitendra Malik and thank you to our sponsors,
01:40:39 BetterHelp and ExpressVPN.
01:40:43 Please consider supporting this podcast by going to betterhelp.com slash Lex and signing
01:40:49 up at expressvpn.com slash LexPod.
01:40:52 Click the links, buy the stuff.
01:40:55 That’s how they know I sent you and it really is the best way to support this podcast and
01:41:00 the journey I’m on.
01:41:02 If you enjoy this thing, subscribe on YouTube, review it with five stars on Apple podcast,
01:41:07 support it on Patreon or connect with me on Twitter at Lex Friedman.
01:41:12 Don’t ask me how to spell that.
01:41:13 I don’t remember it myself.
01:41:15 And now let me leave you with some words from Prince Mishkin in The Idiot by Dostoevsky.
01:41:22 Beauty will save the world.
01:41:24 Thank you for listening and hope to see you next time.