Jitendra Malik: Computer Vision #110

Transcript

00:00:00 The following is a conversation with Jitendra Malik, a professor at Berkeley and one of

00:00:05 the seminal figures in the field of computer vision, the kind before the deep learning

00:00:10 revolution and the kind after.

00:00:13 He has been cited over 180,000 times and has mentored many world class researchers in computer

00:00:21 science.

00:00:22 Quick summary of the ads.

00:00:24 Two sponsors: one new one, which is BetterHelp, and an old goodie, ExpressVPN.

00:00:31 Please consider supporting this podcast by going to betterhelp.com slash lex and signing

00:00:37 up at expressvpn.com slash lexpod.

00:00:40 Click the links, buy the stuff, it really is the best way to support this podcast and

00:00:45 the journey I’m on.

00:00:47 If you enjoy this thing, subscribe on YouTube, review it with 5 stars on Apple Podcast, support

00:00:52 it on Patreon, or connect with me on Twitter at Lex Fridman, however the heck you spell

00:00:57 that.

00:00:58 As usual, I’ll do a few minutes of ads now and never any ads in the middle that can break

00:01:02 the flow of the conversation.

00:01:05 This show is sponsored by BetterHelp, spelled H-E-L-P, help.

00:01:11 Check it out at betterhelp.com slash lex.

00:01:15 They figure out what you need and match you with a licensed professional therapist in

00:01:19 under 48 hours.

00:01:21 It’s not a crisis line, it’s not self help, it’s professional counseling done securely

00:01:26 online.

00:01:27 I’m a bit from the David Goggins line of creatures, as you may know, and so have some

00:01:33 demons to contend with, usually on long runs or all-nighters working, forever and possibly

00:01:40 full of self-doubt.

00:01:42 It may be because I’m Russian, but I think suffering is essential for creation.

00:01:47 But I also think you can suffer beautifully, in a way that doesn’t destroy you.

00:01:52 For most people, I think a good therapist can help in this, so it’s at least worth a

00:01:56 try.

00:01:57 Check out their reviews, they’re good, it’s easy, private, affordable, available worldwide.

00:02:03 You can communicate by text, any time, and schedule weekly audio and video sessions.

00:02:09 I highly recommend that you check them out at betterhelp.com slash lex.

00:02:15 This show is also sponsored by ExpressVPN.

00:02:19 Get it at expressvpn.com slash lexpod to support this podcast and to get an extra three months

00:02:26 free on a one year package.

00:02:28 I’ve been using ExpressVPN for many years, I love it.

00:02:32 I think ExpressVPN is the best VPN out there.

00:02:36 They told me to say it, but it happens to be true.

00:02:39 It doesn’t log your data, it’s crazy fast, and is easy to use, literally just one big,

00:02:45 sexy power on button.

00:02:47 Again, for obvious reasons, it’s really important that they don’t log your data.

00:02:51 It works on Linux and everywhere else too, but really, why use anything else?

00:02:57 Shout out to my favorite flavor of Linux, Ubuntu MATE 20.04.

00:03:02 Once again, get it at expressvpn.com slash lexpod to support this podcast and to get

00:03:09 an extra three months free on a one year package.

00:03:13 And now, here’s my conversation with Jitendra Malik.

00:03:18 In 1966, Seymour Papert at MIT wrote up a proposal called the Summer Vision Project

00:03:25 to be given, as far as we know, to 10 students to work on and solve that summer.

00:03:31 So that proposal outlined many of the computer vision tasks we still work on today.

00:03:37 Why do you think we underestimate, and perhaps we did underestimate and perhaps still underestimate

00:03:43 how hard computer vision is?

00:03:46 Because most of what we do in vision, we do unconsciously or subconsciously.

00:03:51 In human vision.

00:03:52 In human vision.

00:03:53 So that effortlessness gives us the sense that, oh, this must be

00:03:58 very easy to implement on a computer.

00:04:02 Now, this is why the early researchers in AI got it so wrong.

00:04:09 However, if you go into neuroscience or psychology of human vision, then the complexity becomes

00:04:17 very clear.

00:04:19 The fact is that a very large part of the cerebral cortex is devoted to visual processing.

00:04:26 And this is true in other primates as well.

00:04:29 So once we looked at it from a neuroscience or psychology perspective, it becomes quite

00:04:35 clear that the problem is very challenging and it will take some time.

00:04:39 You said the higher level parts are the harder parts?

00:04:43 I think vision appears to be easy because most of visual processing is subconscious

00:04:52 or unconscious.

00:04:55 So we underestimate the difficulty, whereas when you are like proving a mathematical theorem

00:05:03 or playing chess, the difficulty is much more evident.

00:05:08 It’s because it is your conscious brain which is processing various aspects of the problem

00:05:15 solving behavior, whereas in vision, all this is happening, but it’s not in your awareness;

00:05:21 it’s operating below that.

00:05:25 But it still seems strange.

00:05:27 Yes, that’s true, but it seems strange that as computer vision researchers, for example,

00:05:35 the community broadly, time and time again, makes the mistake of thinking the problem

00:05:41 is easier than it is, or maybe it’s not a mistake.

00:05:43 We’ll talk a little bit about autonomous driving, for example, how hard of a vision task that

00:05:48 is. Do you think, I mean, is it just human nature or is there something fundamental

00:05:56 to the vision problem that we underestimate?

00:06:01 We’re still not able to be cognizant of how hard the problem is.

00:06:05 Yeah, I think in the early days it could have been excused because in the early days, all

00:06:11 aspects of AI were regarded as too easy.

00:06:15 But I think today it is much less excusable.

00:06:19 And I think why people fall for this is because of what I call the fallacy of the successful

00:06:27 first step.

00:06:30 There are many problems in vision where getting 50% of the solution you can get in one minute,

00:06:37 getting to 90% can take you a day, getting to 99% may take you five years, and 99.99%

00:06:47 may be not in your lifetime.

00:06:49 I wonder if that’s unique to vision.

00:06:52 It seems that with language, people are not so confident. In natural language processing,

00:06:58 people are a little bit more cautious about our ability to solve that problem.

00:07:04 I think for language, people intuit that we have to be able to do natural language understanding.

00:07:10 For vision, it seems that we’re not cognizant or we don’t think about how much understanding

00:07:18 is required.

00:07:19 It’s probably still an open problem.

00:07:21 But in your sense, how much understanding is required to solve vision?

00:07:27 Put another way, how much of something called common sense reasoning is required

00:07:34 to really be able to interpret even static scenes?

00:07:39 Yeah.

00:07:40 So vision operates at all levels and there are parts which can be solved with what we

00:07:47 could call maybe peripheral processing.

00:07:50 So in the human vision literature, there used to be these terms, sensation, perception and

00:07:57 cognition, which roughly speaking referred to like the front end of processing, middle

00:08:04 stages of processing and higher level of processing.

00:08:08 And I think they made a big deal out of this and they wanted to study only perception

00:08:13 and then dismiss certain problems as being, quote, cognitive.

00:08:19 But really I think these are artificial divides.

00:08:23 The problem is continuous at all levels and there are challenges at all levels.

00:08:28 The techniques that we have today, they work better at the lower and mid levels of the

00:08:34 problem.

00:08:35 I think the higher levels of the problem, quote the cognitive levels of the problem

00:08:39 are there and we, in many real applications, we have to confront them.

00:08:46 Now how much that is necessary will depend on the application.

00:08:51 For some problems it doesn’t matter, for some problems it matters a lot.

00:08:55 So I am, for example, a pessimist on fully autonomous driving in the near future.

00:09:04 And the reason is because I think there will be that 0.01% of the cases where quite sophisticated

00:09:13 cognitive reasoning is called for.

00:09:16 However, there are tasks which are much more robust,

00:09:23 in the sense that error is not so much of a problem.

00:09:28 For example, let’s say you’re doing image search, you’re trying to get images

00:09:34 based on some description, some visual description.

00:09:41 We are very tolerant of errors there, right?

00:09:43 I mean, when Google image search gives you some images back and a few of them are wrong,

00:09:49 it’s okay.

00:09:50 It doesn’t hurt anybody.

00:09:51 It’s not a matter of life and death.

00:09:54 But making mistakes when you are driving at 60 miles per hour and you could potentially

00:10:02 kill somebody is much more important.

00:10:06 So just for the, for the fun of it, since you mentioned, let’s go there briefly about

00:10:11 autonomous vehicles.

00:10:12 So one of the companies in the space, Tesla, with Andrej Karpathy and Elon Musk, is working

00:10:19 on a system called Autopilot, which is primarily a vision based system with eight cameras and

00:10:26 basically a single neural network, a multitask neural network.

00:10:30 They call it HydraNet, multiple heads, so it does multiple tasks, but is forming the

00:10:35 same representation at the core.

00:10:38 Do you think driving can be converted in this way to purely a vision problem and then solved

00:10:47 with learning or even more specifically in the current approach, what do you think about

00:10:53 what Tesla Autopilot team is doing?

00:10:57 So the way I think about it is that there are certainly subsets of the visual based

00:11:02 driving problem, which are quite solvable.

00:11:05 So for example, driving in freeway conditions is quite a solvable problem.

00:11:11 I think there were demonstrations of that going back to the 1980s by someone called

00:11:18 Ernst Dickmanns in Munich.

00:11:22 In the 90s, there were approaches from Carnegie Mellon, there were approaches from our team

00:11:27 at Berkeley.

00:11:28 In the 2000s, there were approaches from Stanford and so on.

00:11:33 So autonomous driving in certain settings is very doable.

00:11:38 The challenge is to have an autopilot work under all kinds of driving conditions.

00:11:45 At that point, it’s not just a question of vision or perception, but really also of control

00:11:51 and dealing with all the edge cases.

00:11:54 So where do you think most of the difficult cases are? To me, even highway driving is

00:11:59 an open problem because it applies the same 50, 90, 95, 99 rule where we fall victim to

00:12:08 the fallacy of the successful first step, I forget exactly how you put it.

00:12:12 I think even highway driving has a lot of elements because to solve autonomous driving,

00:12:17 you have to completely relinquish the help of a human being.

00:12:22 The system is always in control, so it’s really going to feel the edge cases.

00:12:26 So I think even highway driving is really difficult.

00:12:29 But in terms of the general driving task, do you think vision is the fundamental problem

00:12:35 or is it also your action, the interaction with the environment, the ability to…

00:12:44 And then the middle ground, I don’t know if you put that under vision, which is trying

00:12:48 to predict the behavior of others, which is a little bit in the world of understanding

00:12:54 the scene, but it’s also trying to form a model of the actors in the scene and predict

00:13:00 their behavior.

00:13:01 Yeah.

00:13:02 I include that in vision because to me, perception blends into cognition and building predictive

00:13:08 models of other agents in the world, which could be other agents, could be people, other

00:13:13 agents could be other cars.

00:13:15 That is part of the task of perception because perception always has to tell us not what

00:13:22 is now, but what will happen, because what’s now is boring.

00:13:26 It’s done.

00:13:27 It’s over with.

00:13:28 Okay?

00:13:29 Yeah.

00:13:30 We care about the future because we act in the future.

00:13:33 And we care about the past in as much as it informs what’s going to happen in the future.

00:13:39 So I think we have to build predictive models of behaviors of people and those can get quite

00:13:45 complicated.

00:13:48 So I’ve seen examples of this. I own a Tesla and it has various safety

00:13:59 features built in.

00:14:01 And what I see are these examples where, let’s say, there is a skateboarder.

00:14:09 And I don’t want to be too critical, because obviously these systems are always being improved

00:14:16 and any specific criticism I have, maybe the system six months from now will not have that

00:14:23 particular failure mode.

00:14:25 So it had the wrong response and it’s because it couldn’t predict what this skateboarder

00:14:36 was going to do.

00:14:38 Okay?

00:14:39 And because it really required that higher level cognitive understanding of what skateboarders

00:14:45 typically do as opposed to a normal pedestrian.

00:14:48 So what might have been the correct behavior for a pedestrian, a typical behavior for pedestrian

00:14:53 was not the typical behavior for a skateboarder, right?

00:14:59 Yeah.

00:15:00 And so therefore to do a good job there, you need to have enough data where you have pedestrians,

00:15:07 you also have skateboarders, you’ve seen enough skateboarders to see what kinds of patterns

00:15:14 of behavior they have.

00:15:16 So in principle, with enough data, that problem could be solved.

00:15:21 But I think our current systems, computer vision systems, they need far, far more data

00:15:29 than humans do for learning those same capabilities.

00:15:33 So say that there is going to be a system that solves autonomous driving.

00:15:38 Do you think it will look similar to what we have today, but have a lot more data, perhaps

00:15:43 more compute, with the fundamental architecture staying the same, which in the case

00:15:48 of Tesla Autopilot is neural networks.

00:15:52 Do you think it will look similar in that regard and we’ll just have more data?

00:15:57 That’s a scientific hypothesis as to which way is it going to go.

00:16:01 I will tell you what I would bet on.

00:16:05 So this is my general philosophical position on how these learning systems should evolve.

00:16:14 What we have found currently very effective in computer vision in the deep learning paradigm

00:16:20 is sort of tabula rasa learning and tabula rasa learning in a supervised way with lots

00:16:27 and lots of…

00:16:28 What’s tabula rasa learning?

00:16:29 Tabula rasa in the sense of a blank slate: we just have the system, which is given a

00:16:35 series of experiences in this setting, and then it learns there.

00:16:39 Now if we think about human driving, it is not tabula rasa learning.

00:16:44 So at the age of 16 in high school, a teenager goes into driver ed class, right?

00:16:55 And now at that point they learn, but at the age of 16, they are already visual geniuses

00:17:02 because from zero to 16, they have built a certain repertoire of vision.

00:17:07 In fact, most of it has probably been achieved by age two, right?

00:17:13 In this period of age up to age two, they know that the world is three dimensional.

00:17:18 They know what objects look like from different perspectives.

00:17:22 They know about occlusion.

00:17:24 They know about common dynamics of humans and other bodies.

00:17:29 They have some notion of intuitive physics.

00:17:32 So they built that up from their observations and interactions in early childhood and of

00:17:38 course reinforced through their growing up to age 16.

00:17:44 So then at age 16, when they go into driver ed, what are they learning?

00:17:49 They’re not learning afresh the visual world.

00:17:52 They have a mastery of the visual world.

00:17:54 What they are learning is control, okay?

00:17:58 They’re learning how to be smooth about control, about steering and brakes and so forth.

00:18:04 They’re learning a sense of typical traffic situations.

00:18:08 Now that education process can be quite short because they are coming in as visual geniuses.

00:18:17 And of course in their future, they’re going to encounter situations which are very novel,

00:18:23 right?

00:18:24 So during my driver ed class, I may not have had to deal with a skateboarder.

00:18:29 I may not have had to deal with a truck driving in front of me where the back opens up and

00:18:37 some junk gets dropped from the truck and I have to deal with it, right?

00:18:42 But I can deal with this as a driver even though I did not encounter this in my driver

00:18:47 ed class.

00:18:48 And the reason I can deal with it is because I have all this general visual knowledge and

00:18:52 expertise.

00:18:55 And do you think the learning mechanisms we have today can do that kind of long term accumulation

00:19:02 of knowledge?

00:19:03 Or do we have to do some kind of, you know, the work that led up to expert systems with

00:19:11 knowledge representation, you know, the broader field of artificial intelligence worked on

00:19:17 this kind of accumulation of knowledge.

00:19:20 Do you think neural networks can do the same?

00:19:22 I think I don’t see any in principle problem with neural networks doing it, but I think

00:19:29 the learning techniques would need to evolve significantly.

00:19:33 So the current learning techniques that we have are supervised learning.

00:19:41 You’re given lots of examples, (x, y) pairs, and you learn the functional mapping between

00:19:47 them.
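
As a minimal sketch of the supervised setup described here, assuming scikit-learn and synthetic data (none of this code is from the conversation itself):

```python
# Toy supervised learning: example (x, y) pairs in, a functional mapping out.
import numpy as np
from sklearn.linear_model import LogisticRegression

x = np.random.randn(100, 5)             # 100 examples, 5 features each
y = (x[:, 0] > 0).astype(int)           # labels supplied by a supervisor
model = LogisticRegression().fit(x, y)  # learn the mapping from x to y
print(model.predict(x[:3]))             # apply the learned mapping
```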

00:19:48 I think that human learning is far richer than that.

00:19:52 It includes many different components.

00:19:54 There is exploration: a child explores the world and, for example, takes an object and manipulates

00:20:05 it in his hand, and therefore gets to see the object from different points of view.

00:20:12 And the child has commanded the movement.

00:20:14 So that’s a kind of learning data, but the learning data has been arranged by the child.

00:20:21 And this is a very rich kind of data.

00:20:23 The child can do various experiments with the world.

00:20:30 So there are many aspects of sort of human learning, and these have been studied in child

00:20:36 development by psychologists.

00:20:39 And what they tell us is that supervised learning is a very small part of it.

00:20:45 There are many different aspects of learning.

00:20:48 And what we would need to do is to develop models of all of these and then train our

00:20:57 systems with that kind of a protocol.

00:21:02 So new methods of learning, some of which might imitate the human brain. But you also

00:21:07 in your talks have mentioned the compute side of things, in terms of the difference

00:21:13 with the human brain, referencing Hans Moravec.

00:21:19 So do you think there’s something interesting, valuable to consider about the difference

00:21:25 in the computational power of the human brain versus the computers of today in terms of

00:21:32 instructions per second?

00:21:34 Yes, so if we go back, so this is a point I’ve been making for 20 years now.

00:21:41 And I think once upon a time, the way I used to argue this was that we just didn’t have

00:21:47 the computing power of the human brain.

00:21:49 Our computers were not quite there.

00:21:53 And I mean, there is a well known trade off, which we know that neurons are slow compared

00:22:03 to transistors, but we have a lot of them and they have a very high connectivity.

00:22:09 Whereas in silicon, you have much faster devices, transistors switch at the order of nanoseconds,

00:22:18 but the connectivity is usually smaller.

00:22:21 At this point in time, I mean, we are now talking about 2020, we do have, if you consider

00:22:27 the latest GPUs and so on, amazing computing power.

00:22:31 And if we look back at Hans Moravec type of calculations, which he did in the 1990s, we

00:22:39 may be there today in terms of computing power comparable to the brain, but it’s not

00:22:44 of the same style; it’s of a very different style.

00:22:49 So I mean, for example, the style of computing that we have in our GPUs is far, far more

00:22:55 power hungry than the style of computing that is there in the human brain or other biological

00:23:02 entities.

00:23:03 Yeah.

00:23:04 And the efficiency part, we’re going to have to solve that in order to build actual

00:23:11 real world systems at large scale.

00:23:15 Let me ask sort of the high level question, taking a step back.

00:23:19 How would you articulate the general problem of computer vision?

00:23:24 Does such a thing exist?

00:23:25 So if you look at the computer vision conferences and the work that’s been going on, it’s often

00:23:30 separated into different little segments, breaking the problem of vision apart into

00:23:36 whether segmentation, 3D reconstruction, object detection, I don’t know, image capturing,

00:23:44 whatever.

00:23:45 There’s benchmarks for each.

00:23:46 But if you were to sort of philosophically say, what is the big problem of computer vision?

00:23:52 Does such a thing exist?

00:23:54 Yes, but it’s not in isolation.

00:23:57 So for all intelligence tasks, I always go back to sort of biology or humans.

00:24:09 And if we think about vision or perception in that setting, we realize that perception

00:24:15 is always to guide action.

00:24:18 Perception for a biological system does not give any benefits unless it is coupled with action.

00:24:25 So we can go back and think about the first multicellular animals, which arose in the

00:24:30 Cambrian era, you know, 500 million years ago.

00:24:35 And these animals could move and they could see in some way.

00:24:40 And the two activities helped each other.

00:24:43 Because how does movement help?

00:24:47 Movement helps because you can get food in different places.

00:24:52 But you need to know where to go.

00:24:54 And that’s really about perception or seeing, I mean, vision is perhaps the single most

00:25:00 important perceptual sense.

00:25:02 But all the others are also important.

00:25:06 So perception and action kind of go together.

00:25:10 So earlier, it was in these very simple feedback loops, which were about finding food or

00:25:17 avoiding becoming food if there’s a predator running around, trying to, you know, eat you up,

00:25:24 and so forth.

00:25:25 So we must, at the fundamental level, connect perception to action.

00:25:30 Then as we evolved, perception became more and more sophisticated because it served many

00:25:37 more purposes.

00:25:39 And so today we have what seems like a fairly general purpose capability, which can look

00:25:46 at the external world and build a model of the external world inside the head.

00:25:53 We do have that capability.

00:25:55 That model is not perfect.

00:25:56 And psychologists have great fun in pointing out the ways in which the model in your head

00:26:01 is not a perfect model of the external world.

00:26:05 They create various illusions to show the ways in which it is imperfect.

00:26:11 But it’s amazing how far it has come from a very simple perception action loop that

00:26:17 existed in, you know, an animal 500 million years ago.

00:26:23 Once we have these very sophisticated visual systems, we can then impose a structure

00:26:29 on them.

00:26:30 It’s we as scientists who are imposing that structure, where we have chosen to characterize

00:26:36 this part of the system as this quote, module of object detection or quote, this module

00:26:43 of 3D reconstruction.

00:26:45 What’s going on is really all of these processes are running simultaneously and they are running

00:26:55 simultaneously because originally their purpose was in fact to help guide action.

00:27:01 So as a guiding general statement of a problem, do you think we can say that the general problem

00:27:08 of computer vision, you said in humans, it was tied to action.

00:27:14 Do you think we should also say that ultimately the goal, the problem of computer vision is

00:27:20 to sense the world in a way that helps you act in the world?

00:27:27 Yes.

00:27:28 I think that’s the most fundamental, that’s the most fundamental purpose.

00:27:32 We have by now hyper evolved.

00:27:37 So we have this visual system which can be used for other things.

00:27:42 For example, judging the aesthetic value of a painting.

00:27:46 And this is not guiding action.

00:27:49 Maybe it’s guiding action in terms of how much money you will put in your auction bid,

00:27:54 but that’s a bit of a stretch.

00:27:56 But the basics are in fact in terms of action; we have really

00:28:06 hyper evolved our visual system.

00:28:08 Actually just to, sorry to interrupt, but perhaps it is fundamentally about action.

00:28:13 You kind of jokingly said about spending, but perhaps the capitalistic drive that drives

00:28:20 a lot of the development in this world is about the exchange of money and the fundamental

00:28:25 action is spending money.

00:28:26 If you watch Netflix, if you enjoy watching movies, you’re using your perception system

00:28:30 to interpret the movie, ultimately your enjoyment of that movie means you’ll subscribe to Netflix.

00:28:36 So the action is this extra layer that we’ve developed in modern society perhaps is fundamentally

00:28:44 tied to the action of spending money.

00:28:47 Well certainly with respect to interactions with firms.

00:28:54 So in this homo economicus role, when you’re interacting with firms, it does become that.

00:29:01 What else is there?

00:29:02 And that was a rhetorical question.

00:29:07 So to linger on the division between the static and the dynamic, so much of the work in computer

00:29:16 vision, so many of the breakthroughs that you’ve been a part of have been in the static

00:29:20 world and looking at static images.

00:29:24 And then you’ve also started working on, though to a much smaller degree, as has the community,

00:29:29 the dynamic: looking at video, at dynamic scenes.

00:29:32 And then there is robotic vision, which is dynamic, but also where you actually have

00:29:38 a robot in the physical world interacting based on that vision.

00:29:43 Which problem is harder?

00:29:49 The trivial first answer is, well, of course one image is harder.

00:29:53 But if you look at a deeper question there, are we, what’s the term, cutting ourselves

00:30:03 off at the knees, like making the problem harder by focusing on images?

00:30:08 That’s a fair question.

00:30:09 I think sometimes we can simplify a problem so much that we essentially lose part of the

00:30:20 juice that could enable us to solve the problem.

00:30:24 And one could reasonably argue that to some extent this happens when we go from video

00:30:29 to single images.

00:30:31 Now historically you have to consider the limits imposed by the computation capabilities

00:30:39 we had.

00:30:41 So many of the choices made in the computer vision community through the 70s, 80s, 90s

00:30:50 can be understood as choices which were forced upon us by the fact that we just didn’t

00:30:59 have access to enough compute.

00:31:01 Not enough memory, not enough hardware.

00:31:04 Exactly.

00:31:05 Not enough compute, not enough storage.

00:31:08 So think of these choices.

00:31:09 So one of the choices is focusing on single images rather than video.

00:31:14 Okay.

00:31:15 A clear case of

00:31:16 storage and compute.

00:31:19 We used to detect edges and throw away the image.

00:31:24 Right?

00:31:25 So we would have an image which is, say, 256 by 256 pixels, and instead of keeping around

00:31:31 the grayscale value, what we did was we detected edges, found the places where the brightness

00:31:37 changes a lot, and then threw away the rest.

00:31:42 So this was a major compression device, and the hope was that you can

00:31:47 still work with it, and the logic was that humans can interpret a line drawing.

00:31:53 And yes, and this will save us computation.

00:31:58 So many of the choices were dictated by that.
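
As a rough sketch of the edge-detection-as-compression idea described here, assuming a Sobel operator and an arbitrary threshold (illustrative only, not any particular historical system):

```python
# Keep only the pixels where brightness changes sharply; discard the rest.
import numpy as np
from scipy import ndimage

def edge_map(image, threshold=50.0):
    """Binary map marking pixels with a large brightness change."""
    gx = ndimage.sobel(image.astype(float), axis=0)  # derivative along rows
    gy = ndimage.sobel(image.astype(float), axis=1)  # derivative along columns
    magnitude = np.hypot(gx, gy)                     # gradient strength
    return magnitude > threshold                     # 1 bit/pixel vs 8 bits

image = np.random.randint(0, 256, (256, 256))  # stand-in grayscale image
edges = edge_map(image)
# The binary edge map needs ~1/8 the bits of the grayscale image, and far
# less if the sparse edge locations are run-length encoded.
```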

00:32:00 I think today we are no longer detecting edges, right?

00:32:07 We process images with ConvNets because we don’t need to.

00:32:10 We don’t have those compute restrictions anymore.

00:32:14 Now video is still understudied because video compute is still quite challenging if you

00:32:19 are a university researcher.

00:32:22 I think video computing is not so challenging if you are at Google or Facebook or Amazon.

00:32:29 Still super challenging.

00:32:30 I just spoke with the VP of engineering at Google, head of the YouTube search and discovery

00:32:35 and they still struggle doing stuff on video.

00:32:38 It’s very difficult except using techniques that are essentially the techniques you used

00:32:44 in the 90s.

00:32:45 Some very basic computer vision techniques.

00:32:48 No, that’s when you want to do things at scale.

00:32:51 So if you want to operate at the scale of all the content of YouTube, it’s very challenging

00:32:56 and there are similar issues with Facebook.

00:32:59 But as a researcher, you have more opportunities.

00:33:05 You can train large networks with relatively large video data sets.

00:33:11 So I think that this is part of the reason why we have so emphasized static images.

00:33:17 I think that this is changing and over the next few years, I see a lot more progress

00:33:22 happening in video.

00:33:25 So I have this generic statement that to me, video recognition feels like 10 years behind

00:33:32 object recognition and you can quantify that because you can take some of the challenging

00:33:37 video data sets and their performance on action classification is like say 30%, which is kind

00:33:45 of what we used to have around 2009 in object detection.

00:33:51 It’s like about 10 years behind and whether it’ll take 10 years to catch up is a different

00:33:58 question.

00:33:59 Hopefully, it will take less than that.

00:34:01 Let me ask a similar question I’ve already asked, but once again, so for dynamic scenes,

00:34:08 do you think some kind of injection of knowledge bases and reasoning is required to help improve

00:34:17 like action recognition?

00:34:20 Like if we saw the general action recognition problem, what do you think the solution would

00:34:28 look like as another way to put it?

00:34:31 So I completely agree that knowledge is called for and that knowledge can be quite sophisticated.

00:34:39 So the way I would say it is that perception blends into cognition and cognition brings

00:34:44 in issues of memory and this notion of a schema from psychology, which is, let me use the

00:34:54 classic example, which is you go to a restaurant, right?

00:34:58 Now there are things that happen in a certain order, you walk in, somebody takes you to

00:35:03 a table, waiter comes, gives you a menu, takes the order, food arrives, eventually bill arrives,

00:35:13 et cetera, et cetera.

00:35:15 This is a classic example of AI from the 1970s.

00:35:19 There were the terms frames and scripts and schemas; these are all quite similar

00:35:26 ideas.

00:35:27 Okay, and in the 70s, the way the AI of the time dealt with it was by hand coding this.

00:35:34 So they hand coded in this notion of a script and the various stages and the actors and

00:35:40 so on and so forth, and use that to interpret, for example, language.

00:35:45 I mean, if there’s a description of a story involving some people eating at a restaurant,

00:35:52 there are all these inferences you can make because you know what happens typically at

00:35:58 a restaurant.

00:36:00 So I think this kind of knowledge is absolutely essential.

00:36:06 So I think that when we are going to do long form video understanding, we are going to

00:36:12 need to do this.

00:36:13 I think the kinds of technology that we have right now with 3D convolutions over a couple

00:36:19 of seconds of clip or video, it’s very much tailored towards short term video understanding,

00:36:26 not long term understanding.
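
A minimal sketch of that short-clip setup, assuming PyTorch; the clip size and channel counts are illustrative only:

```python
# A video tensor (batch, channels, frames, height, width) and one 3D conv.
import torch
import torch.nn as nn

clip = torch.randn(1, 3, 16, 112, 112)  # 16 RGB frames, ~2 seconds at 8 fps
conv3d = nn.Conv3d(in_channels=3, out_channels=64,
                   kernel_size=(3, 7, 7), padding=(1, 3, 3))
features = conv3d(clip)  # kernels span time as well as space, but only a
print(features.shape)    # few frames at once: short-term patterns only
```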

00:36:28 Long term understanding requires this notion of schemas that I talked about, perhaps some

00:36:35 notions of goals, intentionality, functionality, and so on and so forth.

00:36:43 Now, how will we bring that in?

00:36:46 So we could either revert back to the 70s and say, OK, I’m going to hand code in a script

00:36:51 or we might try to learn it.

00:36:56 So I tend to believe that we have to find learning ways of doing this because I think

00:37:03 learning ways land up being more robust.

00:37:06 And there must be a learning version of the story because children acquire a lot of this

00:37:12 knowledge by sort of just observation.

00:37:16 So it’s possible, but I think it’s not so typical, at any moment in a child’s life,

00:37:24 that a mother coaches a child through all the stages of what happens in

00:37:29 a restaurant.

00:37:30 They just go as a family, they go to the restaurant, they eat, come back, and the child goes through

00:37:36 ten such experiences and the child has got a schema of what happens when you go to a

00:37:41 restaurant.

00:37:42 So we somehow need to provide that capability to our systems.

00:37:48 You mentioned the following line from the end of the Alan Turing paper, Computing Machinery

00:37:53 and Intelligence, which, like you said, many people know and very few have read,

00:37:59 where he proposes the Turing test.

00:38:03 This is how you know because it’s towards the end of the paper.

00:38:06 Instead of trying to produce a program to simulate the adult mind, why not rather try

00:38:10 to produce one which simulates the child’s?

00:38:14 So that’s a really interesting point.

00:38:17 If I think about the benchmarks we have before us, the tests of our computer vision systems,

00:38:24 they’re often kind of trying to get to the adult.

00:38:28 So what kind of benchmarks should we have?

00:38:31 What kind of tests for computer vision do you think we should have that mimic the child’s

00:38:37 in computer vision?

00:38:38 I think we should have those and we don’t have those today.

00:38:42 And I think the part of the challenge is that we should really be collecting data of the

00:38:50 type that the child experiences.

00:38:55 So that gets into issues of privacy and so on and so forth.

00:38:59 But there are attempts in this direction to sort of try to collect the kind of data that

00:39:05 a child encounters growing up.

00:39:08 So what’s the child’s linguistic environment?

00:39:11 What’s the child’s visual environment?

00:39:13 So if we could collect that kind of data and then develop learning schemes based on that

00:39:20 data, that would be one way to do it.

00:39:25 I think that’s a very promising direction myself.

00:39:28 There might be people who would argue that we could just short circuit this in some way,

00:39:33 and sometimes we have had success by not imitating nature in detail.

00:39:44 So the usual example is airplanes, right?

00:39:47 We don’t build flapping wings.

00:39:51 So yes, that’s one of the points of debate.

00:39:57 In my mind, I would bet on this learning like a child approach.

00:40:05 So one of the fundamental aspects of learning like a child is the interactivity.

00:40:11 So the child gets to play with the data set it’s learning from.

00:40:14 Yes.

00:40:15 So it gets to select.

00:40:16 I mean, you can call that active learning.

00:40:19 In the machine learning world, it goes by a lot of terms.

00:40:23 What are your thoughts about this whole space of being able to play with the data set or

00:40:27 select what you’re learning?

00:40:29 Yeah.

00:40:30 So I believe in that, and I think that we could achieve it in two ways, and I

00:40:38 think we should use both.

00:40:40 So one is actually real robotics, right?

00:40:45 So real physical embodiments of agents who are interacting with the world and they have

00:40:52 a physical body with dynamics and mass and moment of inertia and friction and all the

00:40:59 rest, and the robot learns its body by doing a series of actions.

00:41:08 The second is simulation environments.

00:41:11 So I think simulation environments are getting much, much better.

00:41:17 At Facebook AI Research, our group has worked on something called Habitat, which

00:41:24 is a simulation environment, which is a visually photorealistic environment of places like

00:41:34 houses or interiors of various urban spaces and so forth.

00:41:39 And as you move, you get a picture, which is a pretty accurate picture.

00:41:45 So now you can imagine that subsequent generations of these simulators will be accurate, not

00:41:53 just visually, but with respect to forces and masses and haptic interactions and so

00:42:01 on.

00:42:03 And then we have that environment to play with.

00:42:07 I think, let me state one reason why I think being able to act in the world is important.

00:42:16 I think that this is one way to break the correlation versus causation barrier.

00:42:23 So this is something which is of a great deal of interest these days.

00:42:27 I mean, people like Judea Pearl have talked a lot about that we are neglecting causality

00:42:34 and he describes the entire set of successes of deep learning as just curve fitting, right?

00:42:42 But I don’t quite agree with that.

00:42:45 He’s a troublemaker.

00:42:46 He is.

00:42:47 But causality is important; it’s just not a single silver bullet.

00:42:54 It’s not like one single principle.

00:42:56 There are many different aspects here.

00:42:58 And one of our most reliable ways of establishing causal links,

00:43:05 and this is the way, for example, the medical community does this, is randomized controlled

00:43:11 trials.

00:43:12 So you pick some situations, and now in some you perform an action and

00:43:18 in certain others you don’t, right?

00:43:22 So you have a controlled experiment.
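
A minimal sketch of that randomized controlled trial logic, with synthetic numbers (nothing here is from a real study):

```python
# Randomly assign a treatment, compare group outcomes; the difference
# estimates the causal effect of the action.
import numpy as np

rng = np.random.default_rng(0)
n = 1000
treated = rng.random(n) < 0.5                   # random assignment
true_effect = 2.0                               # effect built into the data
outcome = rng.normal(size=n) + true_effect * treated
estimate = outcome[treated].mean() - outcome[~treated].mean()
print(f"estimated causal effect: {estimate:.2f}")  # close to 2.0
```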

00:43:23 Well, the child is in fact performing controlled experiments all the time, right?

00:43:28 Right.

00:43:29 Okay.

00:43:30 Small scale.

00:43:31 In a small scale.

00:43:32 But that is a way that the child gets to build and refine its causal models of the world.

00:43:41 And my colleague Alison Gopnik, together with a couple of coauthors, has this

00:43:47 book called The Scientist in the Crib, referring to children.

00:43:50 The part that I like about that is that the scientist wants to build

00:43:57 causal models and the scientist does controlled experiments.

00:44:01 And I think the child is doing that.

00:44:03 So to enable that, we will need to have these active experiments.

00:44:10 And I think this could be done, some in the real world and some in simulation.

00:44:14 So you have hope for simulation.

00:44:16 I have hope for simulation.

00:44:18 That’s an exciting possibility if we can get to not just photorealistic, but what’s that

00:44:22 called life realistic simulation.

00:44:27 So you don’t see any fundamental blocks to why we can’t eventually simulate the principles

00:44:35 of what it means to exist in the world as a physical entity.

00:44:39 I don’t see any fundamental problems that, I mean, and look, the computer graphics community

00:44:43 has come a long way.

00:44:45 So in the early days, back going back to the eighties and nineties, they were focusing

00:44:50 on visual realism, right?

00:44:52 And then they could do the easy stuff, but they couldn’t do stuff like hair or fur and

00:44:58 so on.

00:44:59 Okay, well, they managed to do that.

00:45:01 Then they couldn’t do physical actions, right?

00:45:04 Like there’s a glass bowl and it falls down and it shatters, but then they could

00:45:09 start to do pretty realistic models of that and so on and so forth.

00:45:13 So the graphics people have shown that they can do this forward direction, not just for

00:45:19 optical interactions, but also for physical interactions.

00:45:23 So I think, of course, some of that is very compute intensive, but I think by and by we

00:45:30 will find ways of making our models ever more realistic.

00:45:35 You break vision apart into, in one of your presentations, early vision, static scene

00:45:40 understanding, dynamic scene understanding, and raise a few interesting questions.

00:45:44 I thought I could just throw some at you to see if you want to talk about them.

00:45:50 So early vision. What is it that you said, sensation, perception and cognition?

00:45:58 So is this sensation?

00:46:00 Yes.

00:46:01 What can we learn from image statistics that we don’t already know?

00:46:05 So at the lowest level, what can we make from just the statistics, the basics, the variations

00:46:15 in the raw pixels, the textures and so on?

00:46:18 Yeah.

00:46:19 So what we seem to have learned is that there’s a lot of redundancy in these images and as

00:46:28 a result, we are able to do a lot of compression and this compression is very important in

00:46:35 biological settings, right?

00:46:36 So you might have 10 to the 8 photoreceptors and only 10 to the 6 fibers in the optic nerve.

00:46:42 So you have to do this compression by a factor of 100 to 1.

00:46:46 And so there are analogs of that which are happening in our artificial neural

00:46:54 networks.

00:46:55 That’s the early layers.

00:46:56 So you think there’s a lot of compression that can be done in the beginning.

00:47:01 Just the statistics.

00:47:02 Yeah.

00:47:03 So how successful is image compression?

00:47:05 How much?

00:47:06 Well, I mean, the way to think about it is just how successful is image compression,

00:47:14 right?

00:47:15 And that’s been done with older technologies, but there are several

00:47:23 companies which are trying to use sort of these more advanced neural network type techniques

00:47:29 for compression, both for static images as well as for video.

00:47:34 One of my former students has a company which is trying to do stuff like this.

00:47:41 And I think that they are showing quite interesting results.

00:47:47 And I think that that success is really about image statistics and

00:47:52 video statistics.
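
A minimal sketch of the learned-compression idea, assuming PyTorch and a toy autoencoder; no particular company's method is implied:

```python
# Squeeze the image through a narrow bottleneck, then decode it back.
import torch
import torch.nn as nn

encoder = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 32))  # 784 -> 32
decoder = nn.Sequential(nn.Linear(32, 28 * 28), nn.Sigmoid())

image = torch.rand(1, 1, 28, 28)       # stand-in grayscale image
code = encoder(image)                  # compact code: ~25x fewer numbers
reconstruction = decoder(code).view(1, 1, 28, 28)
loss = nn.functional.mse_loss(reconstruction, image)  # training objective
```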

00:47:53 But that’s still not doing compression of the kind where, when I see a picture of a cat, all

00:47:59 I have to say is it’s a cat. That’s a more semantic kind of compression.

00:48:02 Yeah.

00:48:03 So this is at the lower level, right?

00:48:04 So we are, as I said, yeah, that’s focusing on low level statistics.

00:48:10 So to linger on that for a little bit, you mentioned how far can bottom up image segmentation

00:48:17 go.

00:48:18 You know, what you mentioned that the central question for scene understanding is the interplay

00:48:24 of bottom up and top down information.

00:48:26 Maybe this is a good time to elaborate on that.

00:48:29 Maybe define what is bottom up, what is top down in the context of computer vision.

00:48:37 Right.

00:48:38 So today what we have are very interesting systems because they work completely bottom

00:48:45 up.

00:48:46 What does bottom up mean, sorry?

00:48:47 So bottom up means, in this case, a feed forward neural network.

00:48:52 So starting from the raw pixels, yeah, they start from the raw pixels and they end up

00:48:57 with something like cat or not a cat, right?

00:49:00 So our systems are running totally feed forward.

00:49:04 They’re trained in a very top down way.

00:49:07 So they’re trained by saying, okay, this is a cat, there’s a cat, there’s a dog, there’s

00:49:11 a zebra, et cetera.

00:49:14 And I’m not happy with either of these choices fully.

00:49:18 Because we have completely separated these processes, right?

00:49:24 So what do we know compared to biology?

00:49:34 So in biology, what we know is that at test time, at runtime, the processes

00:49:42 are not purely feed forward, but they involve feedback.

00:49:46 And they involve much shallower neural networks.

00:49:50 So the kinds of neural networks we are using in computer vision, say a ResNet 50 has 50

00:49:55 layers.

00:49:56 Well in the brain, in the visual cortex going from the retina to IT, maybe we have like

00:50:02 seven, right?

00:50:04 So they’re far shallower, but we have the possibility of feedback.

00:50:08 So there are backward connections.

00:50:11 And this might enable us to deal with the more ambiguous stimuli, for example.

00:50:18 So the biological solution seems to involve feedback, the solution in artificial vision

00:50:26 seems to be just feed forward, but with a much deeper network.

00:50:30 And the two are functionally equivalent because if you have a feedback network, which just

00:50:35 has like three rounds of feedback, you can just unroll it and make it three times the

00:50:40 depth and create it in a totally feed forward way.

00:50:44 So this is something which, I mean, we have written some papers on this theme, but I really

00:50:49 feel that this theme should be pursued further.
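
A minimal sketch of the unrolling equivalence just described, assuming PyTorch; the block and the number of rounds are illustrative:

```python
# One set of weights applied for three rounds of "feedback"; running the
# loop is the same computation as a three-times-deeper feed-forward net.
import torch
import torch.nn as nn

class RecurrentBlock(nn.Module):
    def __init__(self, dim=64, rounds=3):
        super().__init__()
        self.rounds = rounds
        self.block = nn.Sequential(nn.Conv2d(dim, dim, 3, padding=1),
                                   nn.ReLU())  # shared weights each round

    def forward(self, x):
        h = x
        for _ in range(self.rounds):  # unrolled: purely feed-forward
            h = self.block(h) + x
        return h

y = RecurrentBlock()(torch.randn(1, 64, 32, 32))
```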

00:50:55 Some kind of recurrence mechanism.

00:50:57 Yeah.

00:50:58 Okay.

00:50:59 The other, so that’s, so I want to have a little bit more top down at test time.

00:51:07 Okay.

00:51:08 And then at training time, we make use of a lot of top down knowledge right now.

00:51:13 So basically to learn to segment an object, we have to have all these examples of this

00:51:19 is the boundary of a cat, and this is the boundary of a chair, and this is the boundary

00:51:22 of a horse and so on.

00:51:24 And this is too much top down knowledge.

00:51:27 How do humans do this?

00:51:30 We manage with far less supervision, and we do it in a sort of bottom up way, because

00:51:36 for example, we are looking at a video stream and the horse moves and that enables me to

00:51:44 say that all these pixels are together.

00:51:47 So the Gestalt psychologist used to call this the principle of common fate.

00:51:53 So there was a bottom up process by which we were able to segment out these objects

00:51:58 and we have totally focused on this top down training signal.

00:52:04 So in my view, the way we have currently resolved this top down, bottom up

00:52:10 interaction in machine vision, I don’t find fully satisfactory, and I would rather have

00:52:17 a bit of both at both stages.

00:52:20 For all computer vision problems, not just segmentation.
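
A minimal sketch of the common fate idea mentioned above, with synthetic motion vectors standing in for real optical flow (scikit-learn's k-means assumed):

```python
# Pixels that move together get grouped together, with no labels involved.
import numpy as np
from sklearn.cluster import KMeans

h, w = 64, 64
flow = 0.1 * np.random.randn(h, w, 2)   # near-static, noisy background
flow[16:48, 16:48] += [3.0, 0.0]        # a "horse" region moving right

vectors = flow.reshape(-1, 2)           # one motion vector per pixel
labels = KMeans(n_clusters=2, n_init=10).fit_predict(vectors)
segmentation = labels.reshape(h, w)     # the moving pixels form one entity
```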

00:52:25 And the question that you can ask is, so for me, I’m inspired a lot by human vision and

00:52:30 I care about that.

00:52:31 You could be just a hard boiled engineer and not give a damn.

00:52:35 So to you, I would then argue that you would need far less training data if

00:52:41 my research agenda proved fruitful.

00:52:45 Okay, so then maybe taking a step into segmentation, static scene understanding.

00:52:54 What is the interaction between segmentation and recognition?

00:52:57 You mentioned the movement of objects.

00:53:00 So for people who don’t know computer vision, segmentation is this weird activity that computer

00:53:07 vision folks have all agreed is very important: drawing outlines around objects, versus

00:53:15 a bounding box, and then classifying that object.

00:53:21 What’s the value of segmentation?

00:53:23 What is it as a problem in computer vision?

00:53:27 How is it fundamentally different from detection recognition and the other problems?

00:53:31 Yeah, so I think, so segmentation enables us to say that some set of pixels are an object

00:53:41 without necessarily even being able to name that object or knowing properties of that

00:53:47 object.

00:53:48 Oh, so you mean segmentation purely as the act of separating an object.

00:53:55 From its background.

00:53:56 An object that’s unified in some way, from its background.

00:54:01 Yeah, so entitification, if you will, making an entity out of it.

00:54:05 Entitification, beautifully termed.

00:54:09 So I think that we have that capability and that enables us to, as we are growing up,

00:54:17 to acquire names of objects with very little supervision.

00:54:23 So suppose the child, let’s posit that the child has this ability to separate out objects

00:54:28 in the world.

00:54:30 Then when the mother says, pick up your bottle or the cat’s behaving funny today, the word

00:54:42 cat suggests some object and then the child sort of does the mapping, right?

00:54:47 The mother doesn’t have to teach specific object labels by pointing to them.

00:54:55 Weak supervision works in the context that you have the ability to create objects.

00:55:01 So I think that, so to me, that’s a very fundamental capability.

00:55:07 There are applications where this is very important, for example, medical diagnosis.

00:55:13 So in medical diagnosis, you have some brain scan, I mean, this is some work that we did

00:55:20 in my group where you have CT scans of people who have had traumatic brain injury and what

00:55:26 the radiologist needs to do is to precisely delineate various places where there might

00:55:32 be bleeds, for example, and there are clear needs like that.

00:55:39 So there are certainly very practical applications of computer vision where segmentation is necessary,

00:55:46 but philosophically segmentation enables the task of recognition to proceed with much weaker

00:55:54 supervision than we require today.

00:55:58 And you think of segmentation as this kind of task that takes in a visual scene and breaks

00:56:03 it apart into interesting entities that might be useful for whatever the task is.

00:56:11 Yeah.

00:56:12 And it is not semantics free.

00:56:14 So I think it involves both perception and cognition.

00:56:22 It is not, I think the mistake that we used to make in the early days of computer vision

00:56:28 was to treat it as a purely bottom up perceptual task.

00:56:32 It is not just that because we do revise our notion of segmentation with more experience,

00:56:41 right?

00:56:42 Because for example, there are objects which are nonrigid like animals or humans.

00:56:47 And I think understanding that all the pixels of a human are one entity is actually quite

00:56:53 a challenge because the parts of the human, they can move independently and the human

00:56:59 wears clothes, so they might be differently colored.

00:57:02 So it’s all sort of a challenge.

00:57:05 You mentioned the three R’s of computer vision are recognition, reconstruction and reorganization.

00:57:12 Can you describe these three R’s and how they interact?

00:57:15 Yeah.

00:57:16 So recognition is the easiest one because that’s what I think people generally think

00:57:24 of as computer vision achieving these days, which is labels.

00:57:30 So is this a cat?

00:57:31 Is this a dog?

00:57:32 Is this a chihuahua?

00:57:35 I mean, you know, it could be very fine grained like, you know, specific breed of a dog or

00:57:41 a specific species of bird, or it could be very abstract like animal.

00:57:47 But given a part of an image or a whole image, say put a label on it.

00:57:51 Yeah.

00:57:52 That’s recognition.

00:57:54 Reconstruction is essentially, you can think of it as inverse graphics.

00:58:03 I mean, that’s one way to think about it.

00:58:07 So graphics is: you have some internal computer representation

00:58:14 of some objects arranged in a scene.

00:58:17 And what you do is you produce a picture, you produce the pixels corresponding to a

00:58:22 rendering of that scene.

00:58:24 So let’s do the inverse of this.

00:58:28 We are given an image and we say, oh, this image arises from some objects in

00:58:38 a scene, looked at with a camera from this viewpoint.

00:58:41 And we might have more information about the objects like their shape, maybe their textures,

00:58:47 maybe, you know, color, et cetera, et cetera.

00:58:51 So that’s the reconstruction problem.

00:58:53 In a way, you are in your head creating a model of the external world.

00:59:00 Right.

00:59:01 Okay.
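
A minimal sketch of the forward, graphics direction just described, a pinhole projection with made-up camera parameters; reconstruction is the problem of inverting this mapping:

```python
# Project 3D points through a pinhole camera to get pixel coordinates.
import numpy as np

focal = 500.0                            # focal length in pixels
cx, cy = 320.0, 240.0                    # image center
points_3d = np.array([[0.5, 0.2, 2.0],   # points in camera coordinates
                      [-0.3, 0.1, 4.0]])

u = focal * points_3d[:, 0] / points_3d[:, 2] + cx  # perspective divide
v = focal * points_3d[:, 1] / points_3d[:, 2] + cy
pixels = np.stack([u, v], axis=1)        # the "rendering"; recovering the
print(pixels)                            # 3D points from these pixels is
                                         # the reconstruction problem
```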

00:59:02 Reorganization is to do with essentially finding these entities.

00:59:09 So it’s organization, the word organization implies structure.

00:59:15 So that in perception, in psychology, we use the term perceptual organization.

00:59:22 That the world, an image, is not internally represented

00:59:30 as just a collection of pixels; we make these entities.

00:59:34 We create these entities, objects, whatever you want to call it.

00:59:38 And the relationship between the entities as well, or is it purely about the entities?

00:59:42 It could be about the relationships, but mainly we focus on the fact that there are entities.

00:59:47 Okay.

00:59:48 So I’m trying to pinpoint what the organization means.

00:59:52 So organization is that instead of like a uniform grid, we have this structure of objects.

01:00:02 So segmentation is a small part of that.

01:00:05 So segmentation gets us going towards that.

01:00:09 Yeah.

01:00:10 And you kind of have this triangle where they all interact together.

01:00:13 Yes.

01:00:14 So how do you see that interaction? Reorganization is, yes, finding the entities

01:00:23 in the world.

01:00:25 Recognition is labeling those entities, and then reconstruction is, what, filling in

01:00:32 the gaps?

01:00:33 Well, for example, imputing some 3D objects corresponding to each of these entities.

01:00:43 That would be part of it.

01:00:44 So adding more information that’s not there in the raw data.

01:00:48 Correct.

01:00:49 I mean, I started pushing this kind of a view around 2010 or something like that.

01:00:58 Because at that time in computer vision, people were just working

01:01:06 on many different problems, but they treated each of them as a separate, isolated problem,

01:01:11 each with its own data set.

01:01:13 And then you try to solve that and get good numbers on it.

01:01:17 So I didn’t like that approach, because I wanted to see the connections between these.

01:01:23 And if people divided up vision into various modules, the way they would do it

01:01:30 is as low level, mid level and high level vision corresponding roughly to the psychologist’s

01:01:36 notion of sensation, perception and cognition.

01:01:40 And that didn’t map to tasks that people cared about.

01:01:45 Okay.

01:01:46 So therefore I tried to promote this particular framework as a way of considering the problems

01:01:52 that people in computer vision were actually working on and trying to be more explicit

01:01:58 about the fact that they actually are connected to each other.

01:02:02 And I was at that time just doing this on the basis of information flow.

01:02:07 Now it turns out, in the last five years or so, post the deep learning revolution,

01:02:17 that this architecture has turned out to be very conducive to that.

01:02:25 Because basically in these neural networks, we are trying to build multiple representations.

01:02:33 They can be multiple output heads sharing common representations.

01:02:37 So in a certain sense today, given the reality of what solutions people have to this, I do

01:02:46 not need to preach this anymore.

01:02:48 It is just there.

01:02:50 It’s part of the solution space.
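
A minimal sketch of the shared-representation, multiple-heads pattern just described, assuming PyTorch; the trunk, head names, and sizes are all illustrative:

```python
# One trunk computes a common representation; task heads read from it.
import torch
import torch.nn as nn

class SharedTrunkMultiHead(nn.Module):
    def __init__(self):
        super().__init__()
        self.trunk = nn.Sequential(           # shared representation
            nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten())
        self.recognition_head = nn.Linear(32, 10)    # e.g. class labels
        self.reconstruction_head = nn.Linear(32, 1)  # e.g. a depth value
        self.grouping_head = nn.Linear(32, 2)        # e.g. entity logits

    def forward(self, image):
        features = self.trunk(image)          # computed once, shared
        return (self.recognition_head(features),
                self.reconstruction_head(features),
                self.grouping_head(features))

outputs = SharedTrunkMultiHead()(torch.randn(1, 3, 64, 64))
```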

01:02:52 So speaking of neural networks, how much of this problem of computer vision, of reorganization,

01:03:02 recognition, and reconstruction,

01:03:09 how much of it can be learned end to end, do you think?

01:03:12 Sort of set it and forget it.

01:03:17 Just plug and play, have a giant data set, multiple, perhaps multimodal, and then just

01:03:23 learn the entirety of it.

01:03:25 Well, so I think that what end to end learning means nowadays is end

01:03:31 to end supervised learning.

01:03:34 And that I would argue is too narrow a view of the problem.

01:03:38 I like this child development view, this lifelong learning view, one where there are certain

01:03:46 capabilities that are built up and then there are certain capabilities which are built up

01:03:51 on top of that.

01:03:53 So that’s what I believe in.

01:03:58 So I think end to end learning in the supervised setting for a very precise task, to me, is

01:04:13 sort of a limited view of the learning process.

01:04:17 Got it.

01:04:18 So if we think beyond purely supervised learning, looking back to children, you mentioned six

01:04:25 lessons that we can learn from children: be multimodal, be incremental, be physical,

01:04:33 explore, be social, use language.

01:04:36 Can you speak to these, perhaps picking one that you find most fundamental to our time

01:04:42 today?

01:04:43 Yeah.

01:04:44 So I mean, I should say, to give due credit, this is from a paper by Smith and Gasser.

01:04:50 And it reflects essentially, I would say common wisdom among child development people.

01:05:00 It’s just that this is not common wisdom among people in computer vision and AI and machine

01:05:07 learning.

01:05:08 So I view my role as trying to bridge the two worlds.

01:05:15 So let’s take the example of multimodal.

01:05:18 I like that.

01:05:20 So multimodal, a canonical example is a child interacting with an object.

01:05:28 So then the child holds a ball and plays with it.

01:05:32 So at that point, it’s getting a touch signal.

01:05:35 So the touch signal is getting the notion of 3D shape, but it is sparse.

01:05:44 And then the child is also seeing a visual signal.

01:05:52 And these two, so imagine, these are in two totally different spaces.

01:05:52 So one is the space of receptors on the skin of the fingers and the thumb and the palm.

01:05:59 And then these map onto neuronal fibers, which are getting activated somewhere.

01:06:06 These lead to some activation in somatosensory cortex.

01:06:10 I mean, a similar thing will happen if we have a robot hand.

01:06:15 And then we have the pixels corresponding to the visual view, but we know that they

01:06:20 correspond to the same object.

01:06:24 So that’s a very, very strong cross calibration signal.

01:06:28 And it is self supervisory, which is beautiful.

01:06:32 There’s nobody assigning a label.

01:06:34 The mother doesn’t have to come and assign a label.

01:06:37 The child doesn’t even have to know that this object is called a ball.

01:06:42 Yet the child is learning something about the three dimensional world from this signal.

01:06:49 On tactile and visual there is some work; there is a lot of work currently

01:06:54 on audio and visual.

01:06:57 And audio visual, so there is some event that happens in the world, and that event has a

01:07:02 visual signature and it has an auditory signature.

01:07:07 So there is this glass bowl on the table and it falls and breaks and I hear the smashing

01:07:12 sound and I see the pieces of glass.

01:07:14 Okay, I’ve built that connection between the two, right?

01:07:19 We have people, I mean, this has become a hot topic in computer vision in the last couple

01:07:24 of years.

01:07:26 There are problems like separating out multiple speakers, right?

01:07:32 Which was a classic problem in audition.

01:07:35 They call this the problem of source separation or the cocktail party effect and so on.

01:07:40 But when you also have the visual signal, it becomes so much easier and so much

01:07:47 more useful.

01:07:50 So the multimodal, I mean, there’s so much more signal with multimodal and you can use

01:07:56 that for some kind of weak supervision as well.

01:08:00 Yes, because they are occurring at the same moment in time.

01:08:03 So you have time which links the two, right?

01:08:06 So at a certain moment, T1, you’ve got a certain signal in the auditory domain and a certain

01:08:10 signal in the visual domain, but they must be causally related.
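As an illustration of how that time linkage can serve as free supervision, here is a minimal sketch of a contrastive, InfoNCE-style cross-modal objective. The formulation, names and sizes are assumptions for illustration, not the specific work referenced:

```python
import torch
import torch.nn.functional as F

def cross_modal_contrastive_loss(audio_emb, visual_emb, temperature=0.1):
    """Audio and visual (or touch and visual) embeddings captured at the
    same moment are positives; other pairs in the batch are negatives.
    No human labels are needed: time alignment itself is the supervision."""
    a = F.normalize(audio_emb, dim=1)
    v = F.normalize(visual_emb, dim=1)
    logits = a @ v.t() / temperature       # similarity of every audio/visual pair
    targets = torch.arange(a.size(0))      # the i-th audio matches the i-th video
    # Symmetric loss: audio retrieves video, and video retrieves audio.
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2

# Hypothetical encoder outputs for a batch of 16 time-aligned clips.
audio_emb = torch.randn(16, 128, requires_grad=True)
visual_emb = torch.randn(16, 128, requires_grad=True)
loss = cross_modal_contrastive_loss(audio_emb, visual_emb)
loss.backward()  # in training, gradients would flow into both encoders
```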

01:08:14 Yeah, that’s an exciting area.

01:08:16 Not well studied yet.

01:08:17 Yeah, I mean, we have a little bit of work on this, but so much more needs to be done.

01:08:25 So this is a good example.

01:08:28 Be physical, that’s to do with the one thing we talked about earlier, that there’s

01:08:34 an embodied world.

01:08:36 To mention language: use language.

01:08:39 So Noam Chomsky believes that language may be at the core of cognition, at the core of

01:08:44 everything in the human mind.

01:08:46 What is the connection between language and vision to you?

01:08:50 What’s more fundamental?

01:08:51 Are they neighbors?

01:08:53 Is one the parent and the child, the chicken and the egg?

01:08:58 Oh, it’s very clear.

01:08:59 It is vision, which is the parent.

01:09:00 Which is the fundamental ability, okay.

01:09:07 It comes before. So you think vision is more fundamental than language?

01:09:11 Correct.

01:09:12 And you can think of it either in phylogeny or in ontogeny.

01:09:18 So phylogeny means if you look at evolutionary time, right?

01:09:22 So we have vision that developed 500 million years ago, okay.

01:09:27 Then something like when we get to maybe like five million years ago, you have the first

01:09:33 bipedal primate.

01:09:34 So when we started to walk, then the hands became free.

01:09:38 And so then manipulation, the ability to manipulate objects and build tools and so on and so forth.

01:09:45 So you said 500,000 years ago?

01:09:47 No, sorry.

01:09:48 The first multicellular animals, which you can say had some intelligence arose 500 million

01:09:56 years ago.

01:09:57 Million.

01:09:58 Okay.

01:09:59 And now let’s fast forward to say the last seven million years, which is the development

01:10:05 of the hominid line, right, where from the other primates, we have the branch which leads

01:10:10 on to modern humans.

01:10:12 Now there are many of these hominids, but the one which, you know, people talk about is

01:10:21 Lucy, because that’s like a skeleton from three million years ago.

01:10:25 And we know that Lucy walked, okay.

01:10:28 So at this stage the hand is free for manipulating objects, and then

01:10:34 the ability to manipulate objects, build tools, and the brain size grew in this era.

01:10:43 So okay, so now you have manipulation.

01:10:46 Now we don’t know exactly when language arose.

01:10:49 But after that.

01:10:50 Because no apes have it. I mean, Chomsky is correct in that it is a uniquely

01:10:57 human capability, and other primates don’t have that.

01:11:04 So it developed somewhere in this era, but I would argue that

01:11:12 it probably developed after we had this stage of humans, I mean, the human species already

01:11:19 able to manipulate, hands free, with a much bigger brain size.

01:11:25 And for that, a lot of vision had already had to develop.

01:11:31 So the sensation and the perception, maybe some of the cognition.

01:11:35 Yeah.

01:11:36 So these ancestors

01:11:45 of ours, you know, three, four million years ago, they had spatial intelligence.

01:11:53 So they knew that the world consists of objects.

01:11:56 They knew that the objects were in certain relationships to each other.

01:11:59 They had observed causal interactions among objects.

01:12:05 They could move in space.

01:12:06 So they had space and time and all of that.

01:12:09 So language builds on that substrate.

01:12:13 So language has a lot of, I mean, all human languages have constructs

01:12:19 which depend on a notion of space and time.

01:12:22 Where did that notion of space and time come from?

01:12:26 It had to come from perception and action in the world we live in.

01:12:30 Yeah.

01:12:31 Well, you’ve referred to the spatial intelligence.

01:12:33 Yeah.

01:12:34 Yeah.

01:12:35 So to linger a little bit, we’ll mention Turing and his suggestion that we should learn from

01:12:42 children.

01:12:43 Nevertheless, language is the fundamental piece of the test of intelligence that Turing

01:12:49 proposed.

01:12:50 Yes.

01:12:51 What do you think is a good test of intelligence?

01:12:56 What would impress the heck out of you?

01:12:56 Is it fundamentally natural language or is there something in vision?

01:13:02 I don’t think we should have a single test of intelligence.

01:13:10 So just like I don’t believe in IQ as a single number, I think generally there can be many

01:13:17 capabilities which are correlated perhaps.

01:13:21 So I think that there will be accomplishments which are visual accomplishments,

01:13:28 accomplishments in manipulation or robotics, and then accomplishments

01:13:36 in language.

01:13:37 But I do believe that language will be the hardest nut to crack.

01:13:40 Really?

01:13:41 Yeah.

01:13:42 So what’s harder: to pass the spirit of the Turing test, whatever formulation

01:13:46 makes it convincing natural language, like somebody you would want to

01:13:52 have a beer with, hang out and have a chat with, or general natural scene understanding?

01:13:59 You think language is the tougher problem?

01:14:01 I’m not a fan of the Turing test. I think Turing, as he proposed

01:14:09 the test in 1950, was trying to solve a certain problem.

01:14:13 Yeah, imitation.

01:14:14 Yeah.

01:14:15 And I think it made a lot of sense then.

01:14:18 Where we are today, 70 years later, I think we should not worry about that.

01:14:26 I think the Turing test is no longer the right way to channel research in AI, because

01:14:34 it takes us down this path of a chat bot which can fool us for five minutes or whatever.

01:14:39 Okay.

01:14:40 I think I would rather have a list of 10 different tasks.

01:14:44 I mean, I think there are tasks in the manipulation domain, tasks

01:14:50 in navigation, tasks in visual scene understanding, tasks in reading a story and answering questions

01:14:58 based on that.

01:14:59 I mean, so my favorite language understanding task would be, you know, reading a novel and

01:15:05 being able to answer arbitrary questions from it.

01:15:08 Okay.

01:15:09 Right.

01:15:10 I think that to me, and this is not an exhaustive list by any means.

01:15:15 So I think that’s where we need to be going.

01:15:21 And on each of these axes, there’s a fair amount of work to be done.

01:15:26 So on the visual understanding side, in this intelligence Olympics that we’ve set up, what’s

01:15:31 a good test, one of many, of visual scene understanding?

01:15:39 Do you think such benchmarks exist?

01:15:41 Sorry to interrupt.

01:15:42 No, there aren’t any.

01:15:43 I think essentially, to me, a really good aid to the blind.

01:15:50 So suppose there was a blind person and I needed to assist the blind person.

01:15:57 So ultimately, like we said, vision that aids in action and survival in this world,

01:16:05 maybe in the simulated world.

01:16:09 It may be easier to measure performance in a simulated world, but what we are ultimately after is performance

01:16:15 in the real world.

01:16:17 So David Hilbert in 1900 proposed 23 open problems in mathematics, some of which are

01:16:23 still unsolved, the most famous of which is probably the Riemann hypothesis.

01:16:29 You’ve thought about and presented about the Hilbert problems of computer vision.

01:16:33 So let me ask, I don’t know when you last presented that,

01:16:38 in 2015 perhaps, but versions of it. You’re kind of the face and the spokesperson for computer

01:16:44 vision.

01:16:45 It’s your job to state what the open problems are for the field.

01:16:51 So what today are the Hilbert problems of computer vision, do you think?

01:16:56 Let me pick one which I regard as clearly unsolved, which is what I would call long

01:17:05 form video understanding.

01:17:08 So we have a video clip and we want to understand the behavior in there in terms of agents,

01:17:20 their goals, intentionality and make predictions about what might happen.

01:17:30 So it’s that kind of understanding, which goes beyond atomic visual actions.

01:17:37 So in the short range, the question is, are you sitting, are you standing, are you catching

01:17:41 a ball?

01:17:44 That we can do now, or even if we can’t do it fully accurately, if we can do it at 50%,

01:17:50 maybe next year we’ll do it at 65% and so forth.

01:17:54 But I think the long range video understanding, I don’t think we can do today.

01:18:01 And it blends into cognition, that’s the reason why it’s challenging.

01:18:06 So you have to understand the entities,

01:18:11 you have to track them, and you have to have some kind of model of their behavior.

01:18:16 Correct.

01:18:17 And these are agents, so they are not just like passive objects,

01:18:24 and therefore they would exhibit goal directed behavior.

01:18:29 Okay, so this is one area.

01:18:32 Then I will talk about understanding the world in 3D.

01:18:37 This may seem paradoxical because in a way we have been able to do 3D understanding even

01:18:43 like 30 years ago, right?

01:18:45 But I don’t think we currently have the richness of 3D understanding in our computer vision

01:18:51 system that we would like.

01:18:55 So let me elaborate on that a bit.

01:18:57 So currently we have two kinds of techniques which are not fully unified.

01:19:03 There are the kinds of techniques from multi view geometry, where you have multiple pictures

01:19:08 of a scene and you do a reconstruction using stereoscopic vision or structure from motion.

01:19:14 But these techniques totally fail if you just have a single view, because

01:19:21 they are relying on this multiple view geometry.
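To make that reliance concrete, here is a minimal sketch of classical two-view triangulation via the standard direct linear transform, a textbook method rather than any particular system mentioned here. With a single camera, the linear system below is underdetermined, which is exactly why these techniques need multiple views:

```python
import numpy as np

def triangulate_point(P1, P2, x1, x2):
    """Linear (DLT) triangulation of one 3D point from two views.

    P1, P2: 3x4 camera projection matrices for the two views.
    x1, x2: (u, v) pixel coordinates of the same point in each image.
    Returns the 3D point in world coordinates.
    """
    # Each observation x ~ P X contributes two linear constraints on X.
    A = np.stack([
        x1[0] * P1[2] - P1[0],
        x1[1] * P1[2] - P1[1],
        x2[0] * P2[2] - P2[0],
        x2[1] * P2[2] - P2[1],
    ])
    # Solve A X = 0: take the right singular vector with the smallest
    # singular value, then de-homogenize. With only one view, A would be
    # 2x4 and the point could lie anywhere along the viewing ray.
    _, _, Vt = np.linalg.svd(A)
    X = Vt[-1]
    return X[:3] / X[3]

# Hypothetical example: two cameras observing the point (0, 0, 5).
P1 = np.hstack([np.eye(3), np.zeros((3, 1))])               # camera at origin
P2 = np.hstack([np.eye(3), np.array([[-1.0], [0], [0]])])   # translated camera
X_true = np.array([0.0, 0.0, 5.0, 1.0])
x1 = (P1 @ X_true)[:2] / (P1 @ X_true)[2]
x2 = (P2 @ X_true)[:2] / (P2 @ X_true)[2]
print(triangulate_point(P1, P2, x1, x2))  # ~ [0, 0, 5]
```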

01:19:25 Okay, then we have some techniques that we have developed in the computer vision community

01:19:30 which try to guess 3D from single views.

01:19:34 And these techniques are based on supervised learning, on having, at training

01:19:41 time, 3D models of objects available.

01:19:46 And this is completely unnatural supervision, right?

01:19:50 CAD models are not injected into your brain.

01:19:54 Okay, so what would I like?

01:19:56 What I would like would be a kind of learn-as-you-move-around-the-world notion of 3D.

01:20:06 So we have a succession of visual experiences, and as part of that I might

01:20:19 see a chair from different viewpoints or a table from different viewpoints and so on.

01:20:24 Now that enables me to build some internal representation.

01:20:31 And then next time I just see a single photograph and it may not even be of that chair, it’s

01:20:37 of some other chair.

01:20:38 And I have a guess of what its 3D shape is like.

01:20:42 So you’re almost learning the CAD model, kind of.

01:20:45 Yeah, implicitly.

01:20:46 Implicitly.

01:20:47 I mean, the CAD model need not be in the same form as used by computer graphics programs.

01:20:52 Hidden in the representation.

01:20:53 It’s hidden in the representation, the ability to predict new views.

01:20:58 And what I would see if I went to such and such position.
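As a rough sketch of what such a view-prediction objective could look like, here is a toy PyTorch encoder-decoder with made-up sizes and data. It illustrates the idea of an implicit 3D representation trained by predicting new views; it is not any specific published model:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ViewPredictor(nn.Module):
    """Toy sketch: encode one view of a scene into a latent code, then
    decode that code plus a target camera pose into the image expected
    from that pose. The implicit 3D "model" lives in the latent code;
    no CAD models or 3D labels are used."""

    def __init__(self, latent_dim=256, pose_dim=6, image_pixels=3 * 64 * 64):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(image_pixels, 512), nn.ReLU(),
            nn.Linear(512, latent_dim),
        )
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim + pose_dim, 512), nn.ReLU(),
            nn.Linear(512, image_pixels),
        )

    def forward(self, source_view, target_pose):
        z = self.encoder(source_view.flatten(1))            # implicit shape code
        return self.decoder(torch.cat([z, target_pose], dim=1))

# The "label" is free: it is simply what the agent actually sees after moving.
model = ViewPredictor()
source_view = torch.rand(8, 3, 64, 64)   # views at pose A (hypothetical data)
target_pose = torch.rand(8, 6)           # relative pose A -> B
target_view = torch.rand(8, 3, 64, 64)   # views actually observed at pose B
loss = F.mse_loss(model(source_view, target_pose), target_view.flatten(1))
loss.backward()
```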

01:21:04 By the way, on a small tangent on that, are you okay or comfortable with neural networks

01:21:14 that do achieve visual understanding, that, for example, achieve this kind of 3D understanding,

01:21:19 when you don’t know how they do it, when you’re not able to visualize

01:21:27 or understand or interact with the representation?

01:21:31 So the fact that they’re not or may not be explainable.

01:21:34 Yeah, I think that’s fine.

01:21:38 To me that is, so let me put some caveats on that.

01:21:44 So it depends on the setting.

01:21:46 So first of all, I think humans are not explainable.

01:21:55 So that’s a really good point.

01:21:57 One human to another human is not fully explainable.

01:22:02 I think there are settings where explainability matters and these might be, for example, questions

01:22:10 on medical diagnosis.

01:22:13 So I’m in a setting where maybe the doctor, maybe a computer program has made a certain

01:22:19 diagnosis and then depending on the diagnosis, perhaps I should have treatment A or treatment

01:22:25 B, right?

01:22:28 So now, is the computer program’s diagnosis based on data which was collected

01:22:38 for American males who are in their 30s and 40s, and maybe not so relevant to me?

01:22:45 Maybe it is relevant, you know, et cetera, et cetera.

01:22:48 I mean, in medical diagnosis, we have major issues to do with the reference class.

01:22:53 So we may have acquired statistics from one group of people and applying it to a different

01:22:58 group of people who may not share all the same characteristics.

01:23:02 There might be error bars in the prediction.

01:23:07 So that prediction should really be taken with a huge grain of salt.

01:23:14 But this has an impact on what treatments should be picked, right?

01:23:20 So there are settings where I want to know more than just, this is the answer.

01:23:26 But what I acknowledge is that, so in that sense, explainability and interpretability

01:23:33 may matter.

01:23:34 It’s about giving error bounds and a better sense of the quality of the decision.
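One simple way to surface such error bars is to look at the spread of an ensemble. Here is a toy sketch with a hypothetical ensemble of diagnosis models, an illustration of the idea only, not a recipe for clinical use:

```python
import numpy as np

def diagnosis_with_error_bars(models, patient_features):
    """Toy sketch: report the ensemble's mean score plus its spread as
    rough error bars, so a downstream treatment decision can weigh how
    reliable the prediction actually is."""
    scores = np.array([m(patient_features) for m in models])  # each model returns P(disease)
    low, high = np.percentile(scores, [2.5, 97.5])  # crude 95% interval from the spread
    return scores.mean(), (low, high)

# Hypothetical ensemble: models trained on different data subsamples,
# faked here as constant predictors with varying bias.
rng = np.random.default_rng(0)
models = [(lambda x, b=float(rng.normal(0.7, 0.05)): b) for _ in range(100)]
mean, (low, high) = diagnosis_with_error_bars(models, patient_features=None)
print(f"P(disease) ~ {mean:.2f}, 95% interval [{low:.2f}, {high:.2f}]")
```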

01:23:40 Where I’m willing to sacrifice interpretability is that I believe that there can be systems

01:23:50 which can be highly performant, but which are internally black boxes.

01:23:56 And that seems to be where it’s headed.

01:23:57 Some of the best performing systems are essentially black boxes, fundamentally by their construction.

01:24:04 You and I are black boxes to each other.

01:24:06 Yeah.

01:24:07 So the nice thing about the black boxes we are is that we ourselves are black boxes, but

01:24:13 those of us who are charming are able to convince others, to explain

01:24:20 what’s going on inside the black box with narratives and stories.

01:24:25 So in some sense, neural networks don’t have to actually explain what’s going on inside.

01:24:31 They just have to come up with stories, real or fake, that convince you that they know what’s

01:24:37 going on.

01:24:38 And I’m sure we can do that.

01:24:39 We can create those stories, neural networks can create those stories.

01:24:45 Yeah.

01:24:46 And the transformer will be involved.

01:24:50 Do you think we will ever build a system of human level or superhuman level intelligence?

01:24:56 We’ve kind of defined what it takes to try to approach that, but do you think that’s

01:25:01 within our reach?

01:25:02 The thing that we thought we could do, what Turing actually thought we could do by the year

01:25:07 2000, right?

01:25:09 Do you think we’ll ever be able to do it?

01:25:11 So I think there are two answers here.

01:25:12 One answer is: in principle, can we do this at some time?

01:25:18 And my answer is yes.

01:25:20 The second answer is a pragmatic one.

01:25:23 Do you think we will be able to do it in the next 20 years or whatever?

01:25:27 And to that my answer is no.

01:25:30 So of course that’s a wild guess.

01:25:34 I think that, you know, Donald Rumsfeld is not a favorite person of mine, but one of

01:25:40 his lines was very good, which is about known unknowns and unknown unknowns.

01:25:48 So in the business we are in, there are known unknowns and we have unknown unknowns.

01:25:55 So I think with respect to a lot of what we do in vision and robotics, I feel like

01:26:04 we have known unknowns.

01:26:06 So I have a sense of where we need to go and what the problems that need to be solved are.

01:26:13 I feel with respect to natural language understanding and high level cognition, it’s not just known

01:26:21 unknowns, but also unknown unknowns.

01:26:24 So it is very difficult to put any kind of a timeframe to that.

01:26:30 Do you think some of the unknown unknowns might be positive in that they’ll surprise

01:26:36 us and make the job much easier?

01:26:38 So fundamental breakthroughs?

01:26:40 I think that is possible because certainly I have been very positively surprised by how

01:26:45 effective these deep learning systems have been because I certainly would not have believed

01:26:53 that in 2010.

01:26:57 I think what we knew from the mathematical theory was that convex optimization works.

01:27:06 When there’s a single global optimum, then these gradient descent techniques would work.

01:27:11 Now these are nonlinear, non convex systems.

01:27:16 Huge number of variables, so over parametrized.

01:27:18 And the people who used to play with them a lot, the ones who are totally immersed in

01:27:26 the lore and the black magic, they knew that they worked well, even though they were…

01:27:33 Really?

01:27:34 I thought like everybody…

01:27:35 No, the claim that I hear from my friends like Yann LeCun and so forth is that they

01:27:43 feel that they were comfortable with them.

01:27:45 But the community as a whole was certainly not.

01:27:50 And I think to me that was the surprise that they actually worked robustly for a wide range

01:27:59 of problems from a wide range of initializations and so on.

01:28:04 And so that was certainly more rapid progress than we expected.

01:28:13 But then there are certainly lots of times, in fact, most of the history of AI is when

01:28:19 we have made less progress at a slower rate than we expected.

01:28:24 So we just keep going.

01:28:27 I think what I regard as really unwarranted are these fears of AGI in 10 years and 20

01:28:39 years and that kind of stuff, because that’s based on completely unrealistic models of

01:28:44 how rapidly we will make progress in this field.

01:28:48 So I agree with you, but I’ve also gotten the chance to interact with very smart people

01:28:54 who really worry about existential threats of AI.

01:28:57 And I, as an open minded person, am sort of taking it in.

01:29:04 Do you think AI systems, in some way, the unknown unknowns, not super intelligent AI,

01:29:12 but in ways whose nature we don’t quite understand, will have a

01:29:18 detrimental effect on society?

01:29:20 Do you think this is something we should be worried about or we need to first allow the

01:29:25 unknown unknowns to become known unknowns?

01:29:29 I think we need to be worried about AI today.

01:29:32 I think that it is not just a worry we need to have when we get that AGI.

01:29:38 I think that AI is being used in many systems today.

01:29:43 And there might be settings, for example, when it causes biases or decisions which could

01:29:49 be harmful.

01:29:50 I mean, decisions which could be unfair to some people, or it could be a self driving

01:29:55 car which kills a pedestrian.

01:29:57 So AI systems are being deployed today, right?

01:30:02 And they’re being deployed in many different settings, maybe in medical diagnosis, maybe

01:30:05 in a self driving car, maybe in selecting applicants for an interview.

01:30:10 So I would argue that when these systems make mistakes, there are consequences.

01:30:18 And we are in a certain sense responsible for those consequences.

01:30:22 So I would argue that this is a continuous effort.

01:30:27 It is on us, and this is something that in a way is not so surprising.

01:30:32 It’s true of all engineering and scientific progress: with great power comes great responsibility.

01:30:40 So as these systems are deployed, we have to worry about them and it’s a continuous

01:30:44 problem.

01:30:45 I don’t think of it as something which will suddenly happen on some day in 2079 for which

01:30:51 I need to design some clever trick.

01:30:54 I’m saying that these problems exist today and we need to be continuously on the lookout

01:31:00 for worrying about safety, biases, risks, right?

01:31:06 I mean, a self driving car kills a pedestrian, and they have, right?

01:31:11 I mean, this Uber incident in Arizona, right?

01:31:16 It has happened, right?

01:31:17 This is not about AGI.

01:31:18 In fact, it’s about a very dumb intelligence which is still killing people.

01:31:23 The worry people have with AGI is the scale.

01:31:28 But I think you’re 100% right. The thing that worries me about AI today, and it’s

01:31:34 happening at a huge scale, is recommender systems, recommendation systems.

01:31:39 So if you look at Twitter or Facebook or YouTube, they’re controlling the ideas that we have

01:31:47 access to, the news and so on.

01:31:50 And that’s a fundamental machine learning algorithm behind each of these recommendations.

01:31:55 I mean, my life would not be the same without these sources of information.

01:32:00 I’m a totally new human being, and the ideas that I know are very much because of the internet,

01:32:07 because of the algorithms that recommend those ideas.

01:32:09 And so as they get smarter and smarter, I mean, that is the AGI: the algorithm

01:32:16 that’s recommending the next YouTube video you should watch has control of millions,

01:32:23 of billions of people. That algorithm is already super intelligent and has

01:32:30 control of the population, not complete, but very strong control.

01:32:35 For now we can turn off YouTube, we can just go have a normal life outside of that.

01:32:39 But the more and more that gets into our lives, the more we start depending on

01:32:46 that algorithm and on the different companies that are working on the algorithm.

01:32:49 So I think it’s, you’re right, it’s already there.

01:32:53 And YouTube in particular is using computer vision, trying their hardest to understand

01:32:59 the content of videos so they can connect videos with the people who would

01:33:05 benefit from those videos the most.

01:33:08 And so that development could go in a bunch of different directions, some of which might

01:33:12 be harmful.

01:33:14 So yeah, you’re right, the threats of AI are here already and we should be thinking about

01:33:19 them.

01:33:20 On a philosophical, perhaps personal, note: if you could relive a moment in

01:33:29 your life, outside of family, because it made you truly happy, or it was a profound moment

01:33:36 that impacted the direction of your life, what moment would you go to?

01:33:44 I don’t think of single moments, but I look over the long haul.

01:33:49 I feel that I’ve been very lucky, because I think that in scientific research,

01:33:58 a lot of it is about being at the right place at the right time.

01:34:03 And you can work on problems at a time when they’re just too premature.

01:34:10 You butt your head against them and nothing happens because the prerequisites for success

01:34:18 are not there.

01:34:19 And then there are times when you are in a field which is already pretty mature, and you can

01:34:25 only solve curlicues upon curlicues.

01:34:30 I’ve been lucky to have been in this field for, well, 34 years

01:34:36 as a professor at Berkeley, so longer than that, which when I started was just

01:34:44 some little crazy, absolutely useless field which couldn’t really do anything, to

01:34:53 a time when it’s really, really solving a lot of practical problems and has offered a lot

01:35:01 of tools for scientific research, because computer vision is impactful for images in biology

01:35:08 or astronomy and so on and so forth.

01:35:12 So we have made great scientific progress, which has had real practical impact

01:35:18 in the world.

01:35:19 And I feel lucky that I got in at a time when the field was very young, and

01:35:28 it’s now mature but not fully mature.

01:35:34 It’s mature but not done.

01:35:35 I mean, it’s really still in a productive phase.

01:35:39 Yeah, I think people 500 years from now would laugh at you calling this field mature.

01:35:45 That is very possible.

01:35:46 Yeah.

01:35:47 But also, lest I forget to mention, you’ve mentored some of the biggest names

01:35:53 in computer vision, computer science and AI today.

01:35:59 So many questions I could ask, but really, what is it, how did you do it?

01:36:04 What does it take to be a good mentor?

01:36:06 What does it take to be a good guide?

01:36:09 Yeah, I feel I’ve been lucky to have had very, very smart and hardworking

01:36:17 and creative students.

01:36:18 I think some part of the credit just belongs to being at Berkeley.

01:36:25 Those of us who are at top universities are blessed because we have very, very smart and

01:36:32 capable students coming and knocking on our door.

01:36:37 So I have to be humble enough to acknowledge that.

01:36:40 But what have I added?

01:36:41 I think I have added something.

01:36:44 What I have added, I think, is that I’ve always tried to teach them a sense of picking

01:36:52 the right problems.

01:36:54 So I think that in science, in the short run, success is always based on technical competence.

01:37:04 You know, you’re quick with math or whatever.

01:37:09 I mean, there’s certain technical capabilities which make for short range progress.

01:37:15 Long range progress is really determined by asking the right questions and focusing on

01:37:21 the right problems.

01:37:23 And I feel that what I’ve been able to bring to the table in terms of advising these students

01:37:31 is some sense of taste of what are good problems, what are problems that are worth attacking

01:37:38 now as opposed to waiting 10 years.

01:37:41 What’s a good problem?

01:37:42 If you could summarize, is that possible to even summarize, like what’s your sense of

01:37:47 a good problem?

01:37:48 I think I have a sense of what is a good problem. There is a British

01:37:55 scientist, in fact he won a Nobel Prize, Peter Medawar, who has a book on this.

01:38:02 And basically he calls it: research is the art of the soluble.

01:38:08 So we need to sort of find problems which are not yet solved, but which are approachable.

01:38:18 And he sort of refers to this sense that there is this problem which isn’t quite solved yet,

01:38:25 but it has a soft underbelly.

01:38:26 There is some place where you can, you know, spear the beast.

01:38:32 And having that intuition that this problem is ripe is a good thing because otherwise

01:38:39 you can just beat your head and not make progress.

01:38:42 So I think that is important.

01:38:45 So if I have that and if I can convey that to students, it’s not just that they do great

01:38:52 research while they’re working with me, but that they continue to do great research.

01:38:56 So in a sense, I’m proud of my students and their achievements and their great research

01:39:01 even 20 years after they’ve ceased being my student.

01:39:05 So it’s in part helping them develop that sense that a problem is not yet solved,

01:39:11 but it’s solvable.

01:39:12 Correct.

01:39:13 The other thing which I have, which I think I bring to the table, is a certain intellectual

01:39:21 breadth.

01:39:22 I’ve spent a fair amount of time studying psychology, neuroscience, relevant areas of

01:39:29 applied math and so forth.

01:39:31 So I can probably help them see some connections to disparate things, which they might not

01:39:40 have otherwise.

01:39:42 So the smart students coming into Berkeley can be very deep, they can think very deeply,

01:39:50 meaning very hard down one particular path, but where I could help them is the shallow

01:39:58 breadth, but they would have the narrow depth, but that’s of some value.

01:40:08 Well, it was beautifully refreshing just to hear you naturally jump from psychology

01:40:14 to computer science and back in this conversation.

01:40:18 That’s actually a rare quality, and I think for students it’s certainly empowering to

01:40:23 think about problems in a new way.

01:40:25 So for that and for many other reasons, I really enjoyed this conversation.

01:40:29 Thank you so much.

01:40:30 It was a huge honor.

01:40:31 Thanks for talking to me.

01:40:32 It’s been my pleasure.

01:40:34 Thanks for listening to this conversation with Jitendra Malik and thank you to our sponsors,

01:40:39 BetterHelp and ExpressVPN.

01:40:43 Please consider supporting this podcast by going to betterhelp.com slash Lex and signing

01:40:49 up at expressvpn.com slash LexPod.

01:40:52 Click the links, buy the stuff.

01:40:55 That’s how they know I sent you and it really is the best way to support this podcast and

01:41:00 the journey I’m on.

01:41:02 If you enjoy this thing, subscribe on YouTube, review it with five stars on Apple podcast,

01:41:07 support it on Patreon or connect with me on Twitter at Lex Friedman.

01:41:12 Don’t ask me how to spell that.

01:41:13 I don’t remember it myself.

01:41:15 And now let me leave you with some words from Prince Mishkin in The Idiot by Dostoevsky.

01:41:22 Beauty will save the world.

01:41:24 Thank you for listening and hope to see you next time.