Ishan Misra: Self-Supervised Deep Learning in Computer Vision #206

Transcript

00:00:00 The following is a conversation with Ishan Misra,

00:00:03 research scientist at Facebook AI Research,

00:00:05 who works on self supervised machine learning

00:00:08 in the domain of computer vision,

00:00:10 or in other words, making AI systems understand

00:00:14 the visual world with minimal help from us humans.

00:00:18 Transformers and self attention have been successfully used

00:00:21 by OpenAI’s GPT-3 and other language models

00:00:25 to do self supervised learning in the domain of language.

00:00:28 Ishan, together with Yann LeCun and others,

00:00:31 is trying to achieve the same success

00:00:33 in the domain of images and video.

00:00:36 The goal is to leave a robot

00:00:38 watching YouTube videos all night,

00:00:40 and in the morning, come back to a much smarter robot.

00:00:43 I read the blog post, Self Supervised Learning,

00:00:46 The Dark Matter of Intelligence by Ishan and Yann LeCun,

00:00:50 and then listened to Ishan’s appearance

00:00:52 on the excellent Machine Learning Street Talk podcast,

00:00:57 and I knew I had to talk to him.

00:00:59 By the way, if you’re interested in machine learning and AI,

00:01:02 I cannot recommend the ML Street Talk podcast highly enough.

00:01:07 Those guys are great.

00:01:09 Quick mention of our sponsors.

00:01:11 Onnit, The Information, Grammarly, and Athletic Greens.

00:01:15 Check them out in the description to support this podcast.

00:01:18 As a side note, let me say that,

00:01:20 for those of you who may have been listening

00:01:22 for quite a while, this podcast used to be called

00:01:24 Artificial Intelligence Podcast,

00:01:27 because my life passion has always been,

00:01:29 will always be artificial intelligence,

00:01:32 both narrowly and broadly defined.

00:01:35 My goal with this podcast is still

00:01:37 to have many conversations with world class researchers

00:01:40 in AI, math, physics, biology, and all the other sciences,

00:01:45 but I also want to talk to historians, musicians, athletes,

00:01:49 and of course, occasionally comedians.

00:01:51 In fact, I’m trying out doing this podcast

00:01:53 three times a week now to give me more freedom

00:01:56 with guest selection and maybe get a chance

00:01:59 to have a bit more fun.

00:02:00 Speaking of fun, in this conversation,

00:02:03 I challenge the listener to count the number of times

00:02:05 the word banana is mentioned.

00:02:08 Ishan and I use the word banana as the canonical example

00:02:12 at the core of the hard problem of computer vision

00:02:15 and maybe the hard problem of consciousness.

00:02:19 This is the Lex Fridman Podcast,

00:02:22 and here is my conversation with Ishan Misra.

00:02:27 What is self supervised learning?

00:02:29 And maybe even give the bigger basics

00:02:32 of what is supervised and semi supervised learning,

00:02:35 and maybe why is self supervised learning

00:02:37 a better term than unsupervised learning?

00:02:40 Let’s start with supervised learning.

00:02:41 So typically for machine learning systems,

00:02:43 the way they’re trained is you get a bunch of humans,

00:02:46 the humans point out particular concepts.

00:02:48 So if it’s in the case of images,

00:02:50 you want the humans to come and tell you

00:02:52 what is present in the image,

00:02:54 draw boxes around them, draw masks of like things,

00:02:57 pixels, which are of particular categories or not.

00:03:00 For NLP, again, there are like lots

00:03:01 of these particular tasks, say about sentiment analysis,

00:03:04 about entailment and so on.

00:03:06 So typically for supervised learning,

00:03:08 we get a big corpus of such annotated or labeled data.

00:03:11 And then we feed that to a system

00:03:12 and the system is really trying to mimic.

00:03:14 So it’s taking this input of the data

00:03:16 and then trying to mimic the output.

00:03:18 So it looks at an image and the human has tagged

00:03:20 that this image contains a banana.

00:03:22 And now the system is basically trying to mimic that.

00:03:24 So that’s its learning signal.

00:03:26 And so for supervised learning,

00:03:28 we try to gather lots of such data

00:03:30 and we train these machine learning models

00:03:31 to imitate the input output.

00:03:33 And the hope is basically by doing so,

00:03:35 now on unseen or like new kinds of data,

00:03:38 this model can automatically learn

00:03:40 to predict these concepts.

00:03:41 So this is a standard sort of supervised setting.
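
As a rough sketch of the supervised setup Ishan describes, here is a minimal, hypothetical PyTorch example; random tensors stand in for a real labeled dataset like ImageNet, and the tiny model is only illustrative.

```python
# Minimal sketch of the supervised setting described above (assumes PyTorch).
# Random tensors stand in for a real human-annotated dataset.
import torch
import torch.nn as nn

images = torch.randn(32, 3, 224, 224)          # a batch of 32 RGB images
labels = torch.randint(0, 1000, (32,))         # human-provided class labels, e.g. "banana"

model = nn.Sequential(                         # a tiny stand-in for a real vision model
    nn.Conv2d(3, 16, 3, stride=2), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
    nn.Linear(16, 1000),                       # one score per concept
)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

logits = model(images)                         # the system's guess
loss = criterion(logits, labels)               # penalize mismatch with the human label
loss.backward()                                # learn to mimic the annotation
optimizer.step()
```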

00:03:43 For semi supervised setting,

00:03:45 the idea typically is that you have,

00:03:47 of course, all of the supervised data,

00:03:49 but you have lots of other data,

00:03:50 which is unsupervised or which is like not labeled.

00:03:53 Now, the problem basically with supervised learning

00:03:55 and why you actually have all of these alternate

00:03:57 sort of learning paradigms is,

00:03:59 supervised learning just does not scale.

00:04:01 So if you look at for computer vision,

00:04:03 the sort of largest,

00:04:05 one of the most popular data sets is ImageNet, right?

00:04:07 So the entire ImageNet data set has about 22,000 concepts

00:04:11 and about 14 million images.

00:04:13 So these concepts are basically just nouns

00:04:16 and they’re annotated on images.

00:04:18 And this entire data set was a mammoth data collection

00:04:20 effort that actually gave rise

00:04:22 to a lot of powerful learning algorithms

00:04:23 and is credited with like sort of the rise

00:04:25 of deep learning as well.

00:04:27 But this data set took about 22 human years

00:04:30 to collect, to annotate.

00:04:31 And it’s not even that many concepts, right?

00:04:33 It’s not even that many images,

00:04:34 14 million is nothing really.

00:04:36 Like you have about, I think 400 million images or so,

00:04:39 or even more than that uploaded to most of the popular

00:04:41 sort of social media websites today.

00:04:44 So now supervised learning just doesn’t scale.

00:04:46 If I want to now annotate more concepts,

00:04:48 if I want to have various types of fine grained concepts,

00:04:51 then it won’t really scale.

00:04:53 So now you come up to these sort of different

00:04:54 learning paradigms, for example, semi supervised learning,

00:04:57 where the idea is you, of course,

00:04:58 you have this annotated corpus of supervised data

00:05:01 and you have lots of these unlabeled images.

00:05:03 And the idea is that the algorithm should basically try

00:05:05 to measure some kind of consistency

00:05:08 or really try to measure some kind of signal

00:05:10 on this sort of unlabeled data

00:05:12 to make itself more confident

00:05:14 about what it’s really trying to predict.

00:05:16 So by access to this, lots of unlabeled data,

00:05:19 the idea is that the algorithm actually learns

00:05:22 to be more confident and actually gets better

00:05:24 at predicting these concepts.
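
One common way to realize the consistency idea Ishan mentions is a consistency-regularization loss on unlabeled data; the sketch below is a hedged illustration in PyTorch (a Pi-model-style objective, not necessarily what his own work uses), with random tensors and a toy classifier as stand-ins.

```python
# Hedged sketch of semi-supervised consistency regularization (assumes PyTorch).
import torch
import torch.nn as nn
import torch.nn.functional as F

model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 10))   # toy classifier

def augment(x):
    # Stand-in augmentation: add a little noise (real systems use crops, flips, etc.).
    return x + 0.05 * torch.randn_like(x)

labeled_x, labeled_y = torch.randn(8, 3, 32, 32), torch.randint(0, 10, (8,))
unlabeled_x = torch.randn(64, 3, 32, 32)        # far more unlabeled than labeled data

sup_loss = F.cross_entropy(model(labeled_x), labeled_y)

# The model should be consistent across two augmented views of the same unlabeled image.
p1 = F.softmax(model(augment(unlabeled_x)), dim=1)
p2 = F.softmax(model(augment(unlabeled_x)), dim=1)
consistency_loss = F.mse_loss(p1, p2)

total_loss = sup_loss + consistency_loss        # unlabeled data sharpens the predictions
```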

00:05:26 And now we come to the other extreme,

00:05:28 which is like self supervised learning.

00:05:30 The idea basically is that the machine or the algorithm

00:05:33 should really discover concepts or discover things

00:05:35 about the world or learn representations about the world

00:05:38 which are useful without access

00:05:40 to explicit human supervision.

00:05:41 So the word supervision is still

00:05:44 in the term self supervised.

00:05:46 So what is the supervision signal?

00:05:48 And maybe that perhaps is when Yann LeCun

00:05:51 and you argue that unsupervised

00:05:52 is the incorrect terminology here.

00:05:55 So what is the supervision signal

00:05:57 when the humans aren’t part of the picture

00:05:59 or not a big part of the picture?

00:06:02 Right, so self supervised,

00:06:04 the reason that it has the term supervised in itself

00:06:06 is because you’re using the data itself as supervision.

00:06:10 So because the data serves as its own source of supervision,

00:06:13 it’s self supervised in that way.

00:06:15 Now, the reason a lot of people,

00:06:16 I mean, we did it in that blog post with Yann,

00:06:18 but a lot of other people have also argued

00:06:20 for using this term self supervised.

00:06:22 So starting from like ’94 from Virginia de Sa’s group,

00:06:25 I think, and now she’s at UCSD.

00:06:28 Jitendra Malik has said this a bunch of times as well.

00:06:31 So you have supervised,

00:06:33 and then unsupervised basically means everything

00:06:35 which is not supervised,

00:06:36 but that includes stuff like semi supervised,

00:06:38 that includes other like transductive learning,

00:06:41 lots of other sort of settings.

00:06:43 So that’s the reason like now people are preferring

00:06:46 this term self supervised

00:06:47 because it explicitly says what’s happening.

00:06:49 The data itself is the source of supervision

00:06:51 and any sort of learning algorithm

00:06:53 which tries to extract just sort of data supervision signals

00:06:56 from the data itself is a self supervised algorithm.

00:06:59 But there is within the data,

00:07:02 a set of tricks which unlock the supervision.

00:07:05 So can you give maybe some examples

00:07:07 and there’s innovation ingenuity required

00:07:11 to unlock that supervision.

00:07:12 The data doesn’t just speak to you some ground truth,

00:07:15 you have to do some kind of trick.

00:07:17 So I don’t know what your favorite domain is.

00:07:19 So you specifically specialize in visual learning,

00:07:23 but is there favorite examples,

00:07:24 maybe in language or other domains?

00:07:26 Perhaps the most successful applications

00:07:28 have been in NLP, natural language processing.

00:07:31 So the idea basically being that you can train models

00:07:34 that can you have a sentence and you mask out certain words.

00:07:37 And now these models learn to predict the masked out words.

00:07:40 So if you have like the cat jumped over the dog,

00:07:44 so you can basically mask out cat.

00:07:45 And now you’re essentially asking the model

00:07:47 to predict what was missing, what did I mask out?

00:07:50 So the model is going to predict basically a distribution

00:07:53 over all the possible words that it knows.

00:07:55 And probably it has like if it’s a well trained model,

00:07:58 it has a sort of higher probability density

00:08:00 for this word cat.
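
A minimal sketch of the masked-word prediction Ishan describes, assuming the Hugging Face transformers library and the bert-base-uncased checkpoint are available; the sentence is the transcript's own example.

```python
# Hedged illustration of masked-word prediction (assumes the Hugging Face
# `transformers` library and an internet connection to fetch `bert-base-uncased`).
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")

# Mask out "cat" and ask the model for a distribution over the vocabulary.
for guess in fill_mask("The [MASK] jumped over the dog."):
    print(f"{guess['token_str']:>10s}  p={guess['score']:.3f}")
# A well-trained model puts high probability on words like "cat" or "dog".
```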

00:08:02 For vision, I would say the sort of more,

00:08:05 I mean, the easier example,

00:08:07 which is not as widely used these days,

00:08:09 is basically say, for example, video prediction.

00:08:12 So video is again, a sequence of things.

00:08:14 So you can ask the model,

00:08:15 so if you have a video of say 10 seconds,

00:08:17 you can feed in the first nine seconds to a model

00:08:19 and then ask it, hey, what happens basically

00:08:21 in the 10 second, can you predict what’s going to happen?

00:08:24 And the idea basically is because the model

00:08:26 is predicting something about the data itself.

00:08:29 Of course, you didn’t need any human

00:08:31 to tell you what was happening

00:08:32 because the 10 second video was naturally captured.

00:08:34 Because the model is predicting what’s happening there,

00:08:36 it’s going to automatically learn something

00:08:39 about the structure of the world, how objects move,

00:08:41 object permanence, and these kinds of things.

00:08:44 So like, if I have something at the edge of the table,

00:08:45 it will fall down.

00:08:47 Things like these, which you really don’t have to sit

00:08:49 and annotate.

00:08:50 In a supervised learning setting,

00:08:51 I would have to sit and annotate.

00:08:52 This is a cup, now I move this cup, this is still a cup,

00:08:55 and now I move this cup, it’s still a cup,

00:08:56 and then it falls down, and this is a fallen down cup.

00:08:58 So I won’t have to annotate all of these things

00:09:00 in a self supervised setting.
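
A toy sketch of the video-prediction idea, assuming PyTorch; random tensors stand in for real clips, and the "label" is simply the clip's own last frame, so no human annotation is involved.

```python
# Toy sketch of video prediction as self-supervision (assumes PyTorch).
import torch
import torch.nn as nn

clip = torch.randn(4, 10, 3, 16, 16)            # batch of 4 tiny clips, 10 frames each
context, target = clip[:, :9], clip[:, 9]       # first nine frames in, tenth frame out

predictor = nn.Sequential(
    nn.Flatten(start_dim=1),                    # crude: flatten the 9 context frames
    nn.Linear(9 * 3 * 16 * 16, 3 * 16 * 16),
)
pred = predictor(context).view(4, 3, 16, 16)

loss = nn.functional.mse_loss(pred, target)     # supervision comes from the video itself
loss.backward()
```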

00:09:02 Isn’t that kind of a brilliant little trick

00:09:05 of taking a series of data that is consistent

00:09:08 and removing one element in that series,

00:09:11 and then teaching the algorithm to predict that element?

00:09:17 Isn’t that, first of all, that’s quite brilliant.

00:09:20 It seems to be applicable in anything

00:09:23 that has the constraint of being a sequence

00:09:27 that is consistent with the physical reality.

00:09:30 The question is, are there other tricks like this

00:09:34 that can generate the self supervision signal?

00:09:37 So sequence is possibly the most widely used one in NLP.

00:09:41 For vision, the one that is actually used for images,

00:09:44 which is very popular these days,

00:09:45 is basically taking an image,

00:09:47 and now taking different crops of that image.

00:09:50 So you can basically decide to crop,

00:09:51 say the top left corner,

00:09:53 and you crop, say the bottom right corner,

00:09:55 and you ask a network, basically presenting it with a choice,

00:09:58 saying that, okay, now you have this image,

00:10:01 you have this image, are these the same or not?

00:10:04 And so the idea basically is that because different crop,

00:10:06 like in an image, different parts of the image

00:10:08 are going to be related.

00:10:09 So for example, if you have a chair and a table,

00:10:12 basically these things are going to be close by,

00:10:14 versus if you take, again,

00:10:16 if you have like a zoomed in picture of a chair,

00:10:19 if you’re taking different crops,

00:10:20 it’s going to be different parts of the chair.

00:10:22 So the idea basically is that different crops

00:10:25 of the image are related,

00:10:26 and so the features or the representations

00:10:27 that you get from these different crops

00:10:29 should also be related.

00:10:30 So this is possibly the most like widely used trick

00:10:32 these days for self supervised learning and computer vision.
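
A hedged sketch of the two-crop trick, assuming PyTorch and torchvision; a real system (SimCLR-style methods, for example) would use a proper encoder and a contrastive loss over a batch, but the core idea is just that features of two crops of the same image should be close.

```python
# Hedged sketch of the two-crop idea (assumes PyTorch and torchvision).
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision import transforms

crop = transforms.RandomResizedCrop(32)         # e.g. top-left vs. bottom-right regions
image = torch.rand(3, 64, 64)                   # stand-in for a real photo

view1, view2 = crop(image), crop(image)         # two different crops of the same image

encoder = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 128))
z1 = encoder(view1.unsqueeze(0))
z2 = encoder(view2.unsqueeze(0))

# Train the features of the two crops to be similar (maximize cosine similarity).
loss = 1 - F.cosine_similarity(z1, z2).mean()
loss.backward()
```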

00:10:35 So again, using the consistency that’s inherent

00:10:39 to physical reality in visual domain,

00:10:42 that’s, you know, parts of an image are consistent,

00:10:45 and then in the language domain,

00:10:48 or anything that has sequences,

00:10:50 like language or something that’s like a time series,

00:10:53 then you can chop up parts in time.

00:10:55 It’s similar to the story of RNNs and CNNs,

00:11:00 of RNNs and ConvNets.

00:11:02 You and Yann LeCun wrote the blog post in March, 2021,

00:11:06 titled, Self Supervised Learning,

00:11:08 The Dark Matter of Intelligence.

00:11:11 Can you summarize this blog post

00:11:12 and maybe explain the main idea or set of ideas?

00:11:15 The blog post was mainly about sort of just telling,

00:11:18 I mean, this is really a accepted fact,

00:11:21 I would say for a lot of people now,

00:11:22 that self supervised learning is something

00:11:24 that is going to play an important role

00:11:27 for machine learning algorithms

00:11:28 that come in the future, and even now.

00:11:30 Let me just comment that we don’t yet

00:11:33 have a good understanding of what dark matter is.

00:11:36 That’s true.

00:11:37 So the idea basically being…

00:11:40 So maybe the metaphor doesn’t exactly transfer,

00:11:41 but maybe it actually transfers perfectly,

00:11:44 that we don’t know, we have an inkling

00:11:47 that it’ll be a big part

00:11:49 of whatever solving intelligence looks like.

00:11:51 Right, so I think self supervised learning,

00:11:52 the way it’s done right now is,

00:11:54 I would say like the first step towards

00:11:56 what it probably should end up like learning

00:11:58 or what it should enable us to do.

00:12:00 So the idea for that particular piece was,

00:12:03 self supervised learning is going to be a very powerful way

00:12:06 to learn common sense about the world,

00:12:08 or like stuff that is really hard to label.

00:12:10 For example, like is this piece

00:12:13 over here heavier than the cup?

00:12:15 Now, for all these kinds of things,

00:12:17 you’ll have to sit and label these things.

00:12:18 So supervised learning is clearly not going to scale.

00:12:21 So what is the thing that’s actually going to scale?

00:12:23 It’s probably going to be an agent

00:12:25 that can either actually interact with it to lift it up,

00:12:27 or observe me doing it.

00:12:29 So if I’m basically lifting these things up,

00:12:31 it can probably reason about,

00:12:32 hey, this is taking him more time to lift up,

00:12:34 or the velocity is different,

00:12:36 whereas the velocity for this is different,

00:12:37 probably this one is heavier.

00:12:39 So essentially, by observations of the data,

00:12:42 you should be able to infer a lot of things about the world

00:12:44 without someone explicitly telling you,

00:12:46 this is heavy, this is not,

00:12:48 this is something that can pour,

00:12:50 this is something that cannot pour,

00:12:51 this is somewhere that you can sit,

00:12:52 this is not somewhere that you can sit.

00:12:53 But you just mentioned ability to interact with the world.

00:12:57 There’s so many questions that are yet,

00:13:01 that are still open, which is,

00:13:02 how do you select the set of data

00:13:04 over which the self supervised learning process works?

00:13:08 How much interactivity like in the active learning

00:13:11 or the machine teaching context is there?

00:13:14 What are the reward signals?

00:13:16 Like how much actual interaction there is

00:13:18 with the physical world?

00:13:20 That kind of thing.

00:13:21 So that could be a huge question.

00:13:24 And then on top of that,

00:13:26 which I have a million questions about,

00:13:28 which we don’t know the answers to,

00:13:30 but it’s worth talking about is,

00:13:32 how much reasoning is involved?

00:13:35 How much accumulation of knowledge

00:13:38 versus something that’s more akin to learning

00:13:40 or whether that’s the same thing.

00:13:43 But so we’re like, it is truly dark matter.

00:13:46 We don’t know how exactly to do it.

00:13:49 But we are, I mean, a lot of us are actually convinced

00:13:52 that it’s going to be a sort of major thing

00:13:54 in machine learning.

00:13:55 So let me reframe it then,

00:13:56 that human supervision cannot be at large scale

00:14:01 the source of the solution to intelligence.

00:14:04 So the machines have to discover the supervision

00:14:08 in the natural signal of the world.

00:14:10 I mean, the other thing is also

00:14:11 that humans are not particularly good labelers.

00:14:14 They’re not very consistent.

00:14:16 For example, like what’s the difference

00:14:17 between a dining table and a table?

00:14:19 Is it just the fact that one,

00:14:21 like if you just look at a particular table,

00:14:23 what makes us say one is dining table

00:14:24 and the other is not?

00:14:26 Humans are not particularly consistent.

00:14:28 They’re not like very good sources of supervision

00:14:30 for a lot of these kinds of edge cases.

00:14:32 So it may be also the fact that if we want an algorithm

00:14:37 or want a machine to solve a particular task for us,

00:14:39 we can maybe just specify the end goal

00:14:42 and like the stuff in between,

00:14:44 we really probably should not be specifying

00:14:46 because we’re maybe just going to confuse it a lot actually.

00:14:49 Well, humans can’t even answer the meaning of life.

00:14:51 So I’m not sure if we’re good supervisors

00:14:53 of the end goal either.

00:14:55 So let me ask you about categories.

00:14:56 Humans are not very good at telling the difference

00:14:59 between what is and isn’t a table, like you mentioned.

00:15:02 Do you think it’s possible,

00:15:04 let me ask you like pretend you’re Plato.

00:15:10 Is it possible to create a pretty good taxonomy

00:15:14 of objects in the world?

00:15:16 It seems like a lot of approaches in machine learning

00:15:19 kind of assume a hopeful vision

00:15:21 that it’s possible to construct a perfect taxonomy

00:15:24 or it exists perhaps out of our reach,

00:15:26 but we can always get closer and closer to it.

00:15:28 Or is that a hopeless pursuit?

00:15:31 I think it’s hopeless in some way.

00:15:33 So the thing is for any particular categorization

00:15:36 that you create,

00:15:36 if you have a discrete sort of categorization,

00:15:38 I can always take the nearest two concepts

00:15:40 or I can take a third concept and I can blend it in

00:15:42 and I can create a new category.

00:15:44 So if you were to enumerate N categories,

00:15:46 I will always find an N plus one category for you.

00:15:48 That’s not going to be in the N categories.

00:15:50 And I can actually create not just N plus one,

00:15:52 I can very easily create far more than N categories.

00:15:55 The thing is a lot of things we talk about

00:15:57 are actually compositional.

00:15:58 So it’s really hard for us to come and sit

00:16:01 and enumerate all of these out.

00:16:03 And they compose in various weird ways, right?

00:16:05 Like you have like a croissant and a donut come together

00:16:08 to form a cronut.

00:16:09 So if you were to like enumerate all the foods up until,

00:16:12 I don’t know, whenever the cronut was about 10 years ago

00:16:15 or 15 years ago,

00:16:16 then this entire thing called cronut would not exist.

00:16:19 Yeah, I remember there was the most awesome video

00:16:21 of a cat wearing a monkey costume.

00:16:23 Yeah, yes.

00:16:26 People should look it up, it’s great.

00:16:28 So is that a monkey or is that a cat?

00:16:31 It’s a very difficult philosophical question.

00:16:33 So there is a concept of similarity between objects.

00:16:37 So you think that can take us very far?

00:16:39 Just kind of getting a good function,

00:16:43 a good way to tell which parts of things are similar

00:16:47 and which parts of things are very different.

00:16:50 I think so, yeah.

00:16:51 So you don’t necessarily need to name everything

00:16:54 or assign a name to everything to be able to use it, right?

00:16:57 So there are like lots of…

00:16:59 Shakespeare said that, what’s in a name?

00:17:01 What’s in a name, yeah, okay.

00:17:03 And I mean, lots of like, for example, animals, right?

00:17:05 They don’t have necessarily a well formed

00:17:08 like syntactic language,

00:17:09 but they’re able to go about their day perfectly.

00:17:11 The same thing happens for us.

00:17:12 So, I mean, we probably look at things and we figure out,

00:17:17 oh, this is similar to something else that I’ve seen before.

00:17:19 And then I can probably learn how to use it.

00:17:22 So I haven’t seen all the possible doorknobs in the world.

00:17:26 But if you show me,

00:17:27 like I was able to get into this particular place

00:17:29 fairly easily, I’ve never seen that particular doorknob.

00:17:32 So I of course related to all the doorknobs that I’ve seen

00:17:34 and I know exactly how it’s going to open.

00:17:36 I have a pretty good idea of how it’s going to open.

00:17:39 And I think this kind of translation between experiences

00:17:41 only happens because of similarity.

00:17:43 Because I’m able to relate it to a doorknob.

00:17:45 If I related it to a hairdryer,

00:17:46 I would probably be stuck still outside, not able to get in.

00:17:50 Again, a bit of a philosophical question,

00:17:52 but can similarity take us all the way

00:17:55 to understanding a thing?

00:17:58 Can having a good function that compares objects

00:18:01 get us to understand something profound

00:18:04 about singular objects?

00:18:07 I think I’ll ask you a question back.

00:18:08 What does it mean to understand objects?

00:18:11 Well, let me tell you what that’s similar to.

00:18:13 No, so there’s an idea of sort of reasoning

00:18:17 by analogy kind of thing.

00:18:19 I think understanding is the process of placing that thing

00:18:24 in some kind of network of knowledge that you have.

00:18:28 That it perhaps is fundamentally related to other concepts.

00:18:33 So it’s not like understanding is fundamentally related

00:18:36 by composition of other concepts

00:18:39 and maybe in relation to other concepts.

00:18:43 And maybe deeper and deeper understanding

00:18:45 is maybe just adding more edges to that graph somehow.

00:18:51 So maybe it is a composition of similarities.

00:18:55 I mean, ultimately, I suppose it is a kind of embedding

00:18:59 in that wisdom space.

00:19:02 Yeah, okay, wisdom space is good.

00:19:06 I think, I do think, right?

00:19:08 So similarity does get you very, very far.

00:19:10 Is it the answer to everything?

00:19:12 I mean, I don’t even know what everything is,

00:19:14 but it’s going to take us really far.

00:19:16 And I think the thing is things are similar

00:19:19 in very different contexts, right?

00:19:21 So an elephant is similar to, I don’t know,

00:19:24 another sort of wild animal.

00:19:25 Let’s just pick, I don’t know, lion in a different way

00:19:28 because they’re both four legged creatures.

00:19:30 They’re also land animals.

00:19:32 But of course they’re very different

00:19:33 in a lot of different ways.

00:19:33 So elephants are like herbivores, lions are not.

00:19:37 So similarity and particularly dissimilarity

00:19:40 also actually helps us understand a lot about things.

00:19:43 And so that’s actually why I think

00:19:45 discrete categorization is very hard.

00:19:47 Just like forming this particular category of elephant

00:19:50 and a particular category of lion,

00:19:51 maybe it’s good for just like taxonomy,

00:19:54 biological taxonomies.

00:19:55 But when it comes to other things which are not as maybe,

00:19:59 for example, like grilled cheese, right?

00:20:01 I have a grilled cheese,

00:20:02 I dip it in tomato and I keep it outside.

00:20:03 Now, is that still a grilled cheese

00:20:05 or is that something else?

00:20:06 Right, so categorization is still very useful

00:20:09 for solving problems.

00:20:11 But is your intuition then sort of the self supervised

00:20:15 should be the, to borrow Yann LeCun’s terminology,

00:20:20 should be the cake and then categorization,

00:20:23 the classification, maybe the supervised like layer

00:20:27 should be just like the thing on top,

00:20:29 the cherry or the icing or whatever.

00:20:31 So if you make it the cake,

00:20:32 it gets in the way of learning.

00:20:35 If you make it the cake,

00:20:36 then you won’t be able to sit and annotate everything.

00:20:39 That’s as simple as it is.

00:20:40 Like that’s my very practical view on it.

00:20:43 It’s just, I mean, in my PhD,

00:20:44 I sat down and annotated like a bunch of cars

00:20:47 for one of my projects.

00:20:48 And very quickly, I was just like, it was in a video

00:20:50 and I was basically drawing boxes around all these cars.

00:20:53 And I think I spent about a week doing all of that

00:20:55 and I barely got anything done.

00:20:57 And basically this was, I think my first year of my PhD

00:21:00 or like a second year of my master’s.

00:21:02 And then by the end of it, I’m like, okay,

00:21:04 this is just hopeless.

00:21:05 I can keep doing it.

00:21:05 And when I’d done that, someone came up to me

00:21:08 and they basically told me, oh, this is a pickup truck.

00:21:10 This is not a car.

00:21:12 And that’s when like, aha, this actually makes sense

00:21:14 because a pickup truck is not really like,

00:21:16 what was I annotating?

00:21:17 Was I annotating anything that is mobile

00:21:19 or was I annotating particular sedans

00:21:21 or was I annotating SUVs?

00:21:22 What was I doing?

00:21:23 By the way, the annotation was bounding boxes?

00:21:25 Bounding boxes, yeah.

00:21:26 There’s so many deep, profound questions here

00:21:30 that you’re almost cheating your way out of

00:21:32 by doing self supervised learning, by the way,

00:21:34 which is like, what makes for an object?

00:21:37 As opposed to solve intelligence,

00:21:39 maybe you don’t ever need to answer that question.

00:21:42 I mean, this is the question

00:21:43 that anyone that’s ever done annotation

00:21:45 because it’s so painful gets to ask,

00:21:48 like, why am I drawing very careful line around this object?

00:21:55 Like, what is the value?

00:21:57 I remember when I first saw semantic segmentation

00:22:00 where you have like instance segmentation

00:22:03 where you have a very exact line

00:22:06 around the object in a 2D plane

00:22:09 of a fundamentally 3D object projected on a 2D plane.

00:22:13 So you’re drawing a line around a car

00:22:15 that might be occluded.

00:22:16 There might be another thing in front of it,

00:22:18 but you’re still drawing the line

00:22:20 of the part of the car that you see.

00:22:23 How is that the car?

00:22:25 Why is that the car?

00:22:27 Like, I had like an existential crisis every time.

00:22:31 Like, how’s that going to help us understand

00:22:33 or solve computer vision?

00:22:35 I’m not sure I have a good answer to what’s better.

00:22:38 And I’m not sure I share the confidence that you have

00:22:41 that self supervised learning can take us far.

00:22:46 I think I’m more and more convinced

00:22:48 that it’s a very important component,

00:22:50 but I still feel like we need to understand

00:22:52 what makes like this dream of maybe what it’s called

00:23:00 like symbolic AI of arriving,

00:23:03 like once you have this common sense base,

00:23:05 be able to play with these concepts and build graphs

00:23:10 or hierarchies of concepts on top

00:23:13 in order to then like form a deep sense

00:23:18 of this three dimensional world or four dimensional world

00:23:22 and be able to reason and then project that onto 2D plane

00:23:25 in order to interpret a 2D image.

00:23:28 Can I ask you just an out there question?

00:23:30 I remember, I think Andrej Karpathy had a blog post

00:23:35 about computer vision, like being really hard.

00:23:39 I forgot what the title was, but it was many, many years ago.

00:23:42 And he had, I think President Obama stepping on a scale

00:23:44 and there was humor and there was a bunch of people laughing

00:23:47 and whatever.

00:23:48 And there’s a lot of interesting things about that image

00:23:52 and I think Andrej highlighted a bunch of things

00:23:55 about the image that us humans are able

00:23:56 to immediately understand.

00:23:59 Like the idea, I think of gravity

00:24:00 and that you have the concept of a weight.

00:24:04 You immediately project because of our knowledge of pose

00:24:08 and how human bodies are constructed,

00:24:10 you understand how the forces are being applied

00:24:13 with the human body.

00:24:14 The really interesting other thing

00:24:16 that you’re able to understand,

00:24:17 there’s multiple people looking at each other in the image.

00:24:20 You’re able to have a mental model

00:24:22 of what the people are thinking about.

00:24:23 You’re able to infer like,

00:24:25 oh, this person is probably thinks,

00:24:27 like is laughing at how humorous the situation is.

00:24:31 And this person is confused about what the situation is

00:24:34 because they’re looking this way.

00:24:35 We’re able to infer all of that.

00:24:37 So that’s human vision.

00:24:41 How difficult is computer vision?

00:24:45 Like in order to achieve that level of understanding

00:24:48 and maybe how big of a part

00:24:51 does self supervised learning play in that, do you think?

00:24:54 And do you still, you know, back,

00:24:56 that was like over a decade ago,

00:24:58 I think Andrej and I think a lot of people agreed

00:25:00 that computer vision is really hard.

00:25:03 Do you still think computer vision is really hard?

00:25:06 I think it is, yes.

00:25:07 And getting to that kind of understanding,

00:25:10 I mean, it’s really out there.

00:25:12 So if you ask me to solve just that particular problem,

00:25:15 I can do it the supervised learning route.

00:25:17 I can always construct a data set and basically predict,

00:25:19 oh, is there humor in this or not?

00:25:21 And of course I can do it.

00:25:22 Actually, that’s a good question.

00:25:23 Do you think you can, okay, okay.

00:25:25 Do you think you can do human supervised annotation of humor?

00:25:29 To some extent, yes.

00:25:29 I’m sure it will work.

00:25:30 I mean, it won’t be as bad as like randomly guessing.

00:25:34 I’m sure it can still predict whether it’s humorous or not

00:25:36 in some way.

00:25:37 Yeah, maybe like Reddit upvotes is the signal.

00:25:40 I don’t know.

00:25:41 I mean, it won’t do a great job, but it’ll do something.

00:25:43 It may actually be like, it may find certain things

00:25:46 which are not humorous, humorous as well,

00:25:47 which is going to be bad for us.

00:25:49 But I mean, it’ll do, it won’t be random.

00:25:52 Yeah, kind of like my sense of humor.

00:25:54 Okay, so fine.

00:25:55 So you can, that particular problem, yes.

00:25:57 But the general problem you’re saying is hard.

00:25:59 The general problem is hard.

00:26:00 And I mean, self supervised learning

00:26:02 is not the answer to everything.

00:26:03 Of course it’s not.

00:26:04 I think if you have machines that are going to communicate

00:26:07 with humans at the end of it,

00:26:08 you want to understand what the algorithm is doing, right?

00:26:10 You want it to be able to produce an output

00:26:13 that you can decipher, that you can understand,

00:26:15 or it’s actually useful for something else,

00:26:17 which again is a human.

00:26:19 So at some point in this sort of entire loop,

00:26:22 a human steps in.

00:26:23 And now this human needs to understand what’s going on.

00:26:26 And at that point, this entire notion of language

00:26:28 or semantics really comes in.

00:26:30 If the machine just spits out something

00:26:32 and if we can’t understand it,

00:26:34 then it’s not really that useful for us.

00:26:36 So self supervised learning is probably going to be useful

00:26:38 for a lot of the things before that part,

00:26:40 before the machine really needs to communicate

00:26:42 a particular kind of output with a human.

00:26:46 Because, I mean, otherwise,

00:26:47 how is it going to do that without language?

00:26:49 Or some kind of communication.

00:26:51 But you’re saying that it’s possible to build

00:26:53 a big base of understanding or whatever,

00:26:55 of what’s a better word? Concepts.

00:26:58 Of concepts. Concepts, yeah.

00:26:59 Like common sense concepts. Right.

00:27:02 Self supervised learning in the context of computer vision

00:27:06 is something you’ve focused on,

00:27:07 but that’s a really hard domain.

00:27:09 And it’s kind of the cutting edge

00:27:10 of what we’re, as a community, working on today.

00:27:13 Can we take a little bit of a step back

00:27:14 and look at language?

00:27:16 Can you summarize the history of success

00:27:19 of self supervised learning in natural language processing,

00:27:22 language modeling?

00:27:23 What are transformers?

00:27:25 What is the masking, the sentence completion

00:27:28 that you mentioned before?

00:27:31 How does it lead us to understand anything?

00:27:33 Semantic meaning of words,

00:27:34 syntactic role of words and sentences?

00:27:37 So I’m, of course, not the expert on NLP.

00:27:40 I kind of follow it a little bit from the sides.

00:27:43 So the main sort of reason

00:27:45 why all of this masking stuff works is,

00:27:47 I think it’s called the distributional hypothesis in NLP.

00:27:50 The idea basically being that words

00:27:52 that occur in the same context

00:27:54 should have similar meaning.

00:27:55 So if you have the blank jumped over the blank,

00:27:59 it basically, whatever is like in the first blank

00:28:01 is basically an object that can actually jump,

00:28:04 is going to be something that can jump.

00:28:05 So a cat or a dog, or I don’t know, sheep, something,

00:28:08 all of these things can basically be in that particular context.

00:28:11 And now, so essentially the idea is that

00:28:13 if you have words that are in the same context

00:28:16 and you predict them,

00:28:17 you’re going to learn lots of useful things

00:28:20 about how words are related,

00:28:21 because you’re predicting by looking at their context

00:28:23 where the word is going to be.

00:28:24 So in this particular case, the blank jumped over the fence.

00:28:28 So now if it’s a sheep, the sheep jumped over the fence,

00:28:30 the dog jumped over the fence.

00:28:32 So essentially the algorithm or the representation

00:28:35 basically puts together these two concepts together.

00:28:37 So it says, okay, dogs are going to be kind of related to sheep

00:28:40 because both of them occur in the same context.

00:28:42 Of course, now you can decide

00:28:44 depending on your particular application downstream,

00:28:46 you can say that dogs are absolutely not related to sheep

00:28:49 because well, I don’t, I really care about dog food,

00:28:52 for example, I’m a dog food person

00:28:54 and I really want to give this dog food

00:28:55 to this particular animal.

00:28:57 So depending on what your downstream application is,

00:29:00 of course, this notion of similarity or this notion

00:29:03 or this common sense that you’ve learned

00:29:04 may not be applicable.

00:29:05 But the point is basically that this,

00:29:08 just predicting what the blanks are

00:29:09 is going to take you really, really far.
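
A toy, NumPy-only illustration of the distributional hypothesis using a hypothetical mini-corpus: words that share the "jumped over the fence" context end up with more similar co-occurrence vectors than words that do not.

```python
# Toy illustration of the distributional hypothesis with plain co-occurrence counts
# (hypothetical mini-corpus; assumes only NumPy).
import numpy as np

corpus = [
    "the sheep jumped over the fence",
    "the dog jumped over the fence",
    "the cat jumped over the fence",
    "the dog ate the food",
    "the fence was painted red",
]
vocab = sorted({w for s in corpus for w in s.split()})
index = {w: i for i, w in enumerate(vocab)}

# Each word is represented by the words that occur near it (window of +/- 2).
counts = np.zeros((len(vocab), len(vocab)))
for sentence in corpus:
    words = sentence.split()
    for i, w in enumerate(words):
        for j in range(max(0, i - 2), min(len(words), i + 3)):
            if j != i:
                counts[index[w], index[words[j]]] += 1

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8)

# Words sharing the "jumped over" context (sheep, dog) come out more similar
# than words that do not share it (sheep, food).
print(cosine(counts[index["sheep"]], counts[index["dog"]]))
print(cosine(counts[index["sheep"]], counts[index["food"]]))
```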

00:29:11 So there’s a nice feature of language

00:29:14 that the number of words in a particular language

00:29:18 is very large, but it’s finite

00:29:20 and it’s actually not that large

00:29:22 in the grand scheme of things.

00:29:24 I still got it because we take it for granted.

00:29:26 So first of all, when you say masking,

00:29:28 you’re talking about this very process of the blank,

00:29:31 of removing words from a sentence

00:29:33 and then having the knowledge of what word went there

00:29:36 in the initial data set,

00:29:38 that’s the ground truth that you’re training on

00:29:41 and then you’re asking the neural network

00:29:43 to predict what goes there.

00:29:46 That’s like a little trick.

00:29:49 It’s a really powerful trick.

00:29:50 The question is how far that takes us.

00:29:53 And the other question is, is there other tricks?

00:29:56 Because to me, it’s very possible

00:29:58 there’s other very fascinating tricks.

00:30:00 I’ll give you an example in autonomous driving,

00:30:05 there’s a bunch of tricks

00:30:06 that give you the self supervised signal back.

00:30:10 For example, very similar to sentences, but not really,

00:30:16 which is you have signals from humans driving the car

00:30:20 because a lot of us drive cars to places.

00:30:23 And so you can ask the neural network to predict

00:30:27 what’s going to happen the next two seconds

00:30:30 for a safe navigation through the environment.

00:30:33 And the signal comes from the fact

00:30:36 that you also have knowledge of what happened

00:30:38 in the next two seconds, because you have video of the data.

00:30:42 The question in autonomous driving, as it is in language,

00:30:46 can we learn how to drive autonomously

00:30:50 based on that kind of self supervision?

00:30:53 Probably the answer is no.

00:30:55 The question is how good can we get?

00:30:57 And the same with language, how good can we get?

00:31:00 And are there other tricks?

00:31:02 Like we get sometimes super excited by this trick

00:31:04 that works really well.

00:31:05 But I wonder, it’s almost like mining for gold.

00:31:09 I wonder how many signals there are in the data

00:31:12 that could be leveraged that are like there.

00:31:17 I just wanted to kind of linger on that

00:31:18 because sometimes it’s easy to think

00:31:20 that maybe this masking process is self supervised learning.

00:31:24 No, it’s only one method.

00:31:27 So there could be many, many other methods,

00:31:29 many tricky methods, maybe interesting ways

00:31:33 to leverage human computation in very interesting ways

00:31:36 that might actually border on semi supervised learning,

00:31:39 something like that.

00:31:40 Obviously the internet is generated by humans

00:31:43 at the end of the day.

00:31:44 So all that to say is what’s your sense

00:31:48 in this particular context of language,

00:31:50 how far can that masking process take us?

00:31:54 So it has stood the test of time, right?

00:31:56 I mean, so Word2vec, the initial sort of NLP technique

00:31:59 that was using this to now, for example,

00:32:02 like all the BERT and all these big models that we get,

00:32:05 BERT and RoBERTa, for example,

00:32:07 all of them are still sort of based

00:32:08 on the same principle of masking.

00:32:10 It’s taken us really far.

00:32:12 I mean, you can actually do things like,

00:32:14 oh, these two sentences are similar or not,

00:32:16 whether this particular sentence follows this other sentence

00:32:18 in terms of logic, so entailment,

00:32:20 you can do a lot of these things

00:32:21 with just this masking trick.

00:32:23 So I’m not sure if I can predict how far it can take us,

00:32:28 because when it first came out, when Word2vec was out,

00:32:31 I don’t think a lot of us would have imagined

00:32:33 that this would actually help us do some kind

00:32:35 of entailment problems and really that well.

00:32:38 And so just the fact that by just scaling up

00:32:40 the amount of data that we’re training on

00:32:42 and using better and more powerful neural network

00:32:45 architectures has taken us from that to this,

00:32:47 is just showing you how maybe poor predictors we are,

00:32:52 as humans, how poor we are at predicting

00:32:54 how successful a particular technique is going to be.

00:32:57 So I think I can say something now,

00:32:58 but like 10 years from now,

00:33:00 I look completely stupid basically predicting this.

00:33:02 In the language domain, is there something in your work

00:33:07 that you find useful and insightful

00:33:09 and transferable to computer vision,

00:33:12 but also just, I don’t know, beautiful and profound

00:33:15 that I think carries through to the vision domain?

00:33:18 I mean, the idea of masking has been very powerful.

00:33:21 It has been used in vision as well for predicting,

00:33:23 like you say, the next frame; so if you have

00:33:25 a sort of sequence of frames and you predict

00:33:28 what’s going to happen in the next frame.

00:33:29 So that’s been very powerful.

00:33:30 In terms of modeling, like in just terms

00:33:32 in terms of architecture, I think you would have asked

00:33:34 about transformers a while back.

00:33:36 That has really become like,

00:33:38 it has become super exciting for computer vision now.

00:33:40 Like in the past, I would say year and a half,

00:33:42 it’s become really powerful.

00:33:44 What’s a transformer?

00:33:45 Right.

00:33:46 I mean, the core part of a transformer

00:33:47 is something called the self attention model.

00:33:49 So it came out of Google

00:33:50 and the idea basically is that if you have N elements,

00:33:53 what you’re creating is a way for all of these N elements

00:33:56 to talk to each other.

00:33:57 So the idea basically is that you are paying attention.

00:34:01 Each element is paying attention

00:34:03 to each of the other element.

00:34:04 And basically by doing this,

00:34:06 it’s really trying to figure out,

00:34:08 you’re basically getting a much better view of the data.

00:34:11 So for example, if you have a sentence of like four words,

00:34:14 the point is if you get a representation

00:34:16 or a feature for this entire sentence,

00:34:18 it’s constructed in a way such that each word

00:34:21 has paid attention to everything else.

00:34:23 Now, the reason it’s like different from say,

00:34:26 what you would do in a ConvNet

00:34:28 is basically that in the ConvNet,

00:34:29 you would only pay attention to a local window.

00:34:31 So each word would only pay attention

00:34:33 to its next neighbor or like one neighbor after that.

00:34:36 And the same thing goes for images.

00:34:37 In images, you would basically pay attention to pixels

00:34:40 in a three cross three or a seven cross seven neighborhood.

00:34:42 And that’s it.

00:34:43 Whereas with the transformer, the self attention mainly,

00:34:46 the sort of idea is that each element

00:34:48 needs to pay attention to each other element.
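
A minimal sketch of the self-attention computation being described, assuming PyTorch: a single head with random projection matrices, where the N x N attention matrix is exactly "each element paying attention to each other element."

```python
# Minimal single-head self-attention sketch (assumes PyTorch; random weights, no training).
import torch
import torch.nn.functional as F

N, d = 4, 16                                    # e.g. a four-word sentence, 16-dim features
x = torch.randn(N, d)

Wq, Wk, Wv = (torch.randn(d, d) for _ in range(3))
Q, K, V = x @ Wq, x @ Wk, x @ Wv

# N x N attention matrix: how much each element pays attention to each other element.
attn = F.softmax(Q @ K.T / d ** 0.5, dim=-1)
out = attn @ V                                  # each output mixes information from all N inputs
print(attn.shape)                               # torch.Size([4, 4])
```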

00:34:50 And when you say attention,

00:34:51 maybe another way to phrase that

00:34:53 is you’re considering a context,

00:34:57 a wide context in terms of the wide context of the sentence

00:35:01 in understanding the meaning of a particular word

00:35:05 and in computer vision that’s understanding

00:35:06 a larger context to understand the local pattern

00:35:10 of a particular local part of an image.

00:35:13 Right, so basically if you have say,

00:35:14 again, a banana in the image,

00:35:16 you’re looking at the full image first.

00:35:18 So whether it’s like, you know,

00:35:19 you’re looking at all the pixels that are off a kitchen

00:35:22 or for dining table and so on.

00:35:23 And then you’re basically looking at the banana also.

00:35:25 Yeah, by the way, in terms of,

00:35:27 if we were to train the funny classifier,

00:35:29 there’s something funny about the word banana.

00:35:32 Just wanted to anticipate that.

00:35:33 I am wearing a banana shirt, so yeah.

00:35:36 Is there bananas on it?

00:35:39 Okay, so masking has worked for the vision context as well.

00:35:42 And so this transformer idea has worked as well.

00:35:44 So basically looking at all the elements

00:35:46 to understand a particular element

00:35:48 has been really powerful in vision.

00:35:49 The reason is like a lot of things

00:35:52 when you’re looking at them in isolation.

00:35:53 So if you look at just a blob of pixels,

00:35:55 so Antonio Torralba at MIT used to have

00:35:57 this like really famous image,

00:35:58 which I looked at when I was a PhD student.

00:36:01 But he would basically have a blob of pixels

00:36:02 and he would ask you, hey, what is this?

00:36:04 And it looked basically like a shoe

00:36:06 or like it could look like a TV remote.

00:36:08 It could look like anything.

00:36:10 And it turns out it was a beer bottle.

00:36:12 But I’m not sure it was one of these three things,

00:36:14 but basically he showed you the full picture

00:36:15 and then it was very obvious what it was.

00:36:17 But the point is just by looking at

00:36:19 that particular local window, you couldn’t figure it out.

00:36:21 Because of resolution, because of other things,

00:36:23 it’s just not easy always to just figure it out

00:36:26 by looking at just the neighborhood of pixels,

00:36:27 what these pixels are.

00:36:29 And the same thing happens for language as well.

00:36:32 For the parameters that have to learn

00:36:33 something about the data,

00:36:35 you need to give it the capacity

00:36:37 to learn the essential things.

00:36:39 Like if it’s not actually able to receive the signal at all,

00:36:42 then it’s not gonna be able to learn that signal.

00:36:44 And in order to understand images, to understand language,

00:36:47 you have to be able to see words in their full context.

00:36:50 Okay, what is harder to solve, vision or language?

00:36:54 Visual intelligence or linguistic intelligence?

00:36:57 So I’m going to say computer vision is harder.

00:36:59 My reason for this is basically that

00:37:02 language of course has a big structure to it

00:37:05 because we developed it.

00:37:06 Whereas vision is something that is common

00:37:08 in a lot of animals.

00:37:09 Everyone is able to get by a lot of these animals

00:37:12 on earth are actually able to get by without language.

00:37:15 And a lot of these animals we also deem to be intelligent.

00:37:18 So clearly intelligence does have

00:37:20 like a visual component to it.

00:37:22 And yes, of course, in the case of humans,

00:37:24 it of course also has a linguistic component.

00:37:26 But it means that there is something far more fundamental

00:37:28 about vision than there is about language.

00:37:30 And I’m sorry to anyone who disagrees,

00:37:32 but yes, this is what I feel.

00:37:34 So that’s being a little bit reflected in the challenges

00:37:38 that have to do with the progress

00:37:40 of self supervised learning, would you say?

00:37:42 Or is that just a peculiar accidents

00:37:45 of the progress of the AI community

00:37:47 that we focused on like,

00:37:48 or we discovered self attention and transformers

00:37:51 in the context of language first?

00:37:53 So like the self supervised learning success

00:37:55 was actually for vision has not much to do

00:37:58 with the transformers part.

00:37:59 I would say it’s actually been independent a little bit.

00:38:02 I think it’s just that the signal was a little bit different

00:38:05 for vision than there was for like NLP

00:38:08 and probably NLP folks discovered it before.

00:38:11 So for vision, the main success

00:38:12 has basically been this like crops so far,

00:38:14 like taking different crops of images.

00:38:16 Whereas for NLP, it was this masking thing.

00:38:18 But also the level of success

00:38:20 is still much higher for language.

00:38:22 It has.

00:38:22 So that has a lot to do with,

00:38:24 I mean, I can get into a lot of details.

00:38:26 For this particular question, let’s go for it, okay.

00:38:29 So the first thing is language is very structured.

00:38:32 So you are going to produce a distribution

00:38:34 over a finite vocabulary.

00:38:35 English has a finite number of words.

00:38:37 It’s actually not that large.

00:38:39 And you need to produce basically,

00:38:41 when you’re doing this masking thing,

00:38:42 all you need to do is basically tell me

00:38:44 which one of these like 50,000 words it is.

00:38:46 That’s it.

00:38:47 Now for vision, let’s imagine doing the same thing.

00:38:49 Okay, we’re basically going to blank out

00:38:51 a particular part of the image

00:38:52 and we ask the network or this neural network

00:38:54 to predict what is present in this missing patch.

00:38:58 It’s combinatorially large, right?

00:38:59 You have 256 pixel values.

00:39:02 If you’re even producing basically a seven cross seven

00:39:04 or a 14 cross 14 like window of pixels,

00:39:07 at each of these 196 or each of these 49 locations,

00:39:11 you have 256 values to predict.

00:39:13 And so it’s really, really large.

00:39:15 And very quickly, the kind of like prediction problems

00:39:18 that we’re setting up are going to be extremely

00:39:20 like intractable for us.

00:39:22 And so the thing is for NLP, it has been really successful

00:39:24 because we are very good at predicting,

00:39:27 like doing this like distribution over a finite set.

00:39:30 And the problem is when this set becomes really large,

00:39:33 we are going to become really, really bad

00:39:35 at making these predictions

00:39:36 and at solving basically this particular set of problems.
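
As a hedged back-of-envelope comparison of the two prediction spaces, the sketch below treats the patch as grayscale with 256 values per pixel for simplicity; the exact numbers are illustrative only.

```python
# Back-of-envelope comparison of the two prediction spaces mentioned above.
vocab_size = 50_000                             # order of magnitude for an English vocabulary

# A 14x14 patch with 256 possible values per pixel (grayscale, for simplicity):
patch_outcomes = 256 ** (14 * 14)
print(f"words to choose from: {vocab_size:,}")
print(f"possible 14x14 patches: ~10^{len(str(patch_outcomes)) - 1}")   # roughly 10^472
```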

00:39:41 So if you were to do it exactly in the same way

00:39:44 as NLP for vision, there is very limited success.

00:39:47 The way stuff is working right now

00:39:48 is actually not by predicting these masks.

00:39:51 It’s basically by saying that you take these two

00:39:53 like crops from the image,

00:39:55 you get a feature representation from it.

00:39:57 And just saying that these two features,

00:39:58 so they’re like vectors,

00:40:00 just saying that the distance between these vectors

00:40:02 should be small.

00:40:03 And so it’s a very different way of learning

00:40:06 from the visual signal than there is from NLP.

00:40:09 Okay, the other reason is the distributional hypothesis

00:40:11 that we talked about for NLP, right?

00:40:12 So a word given its context,

00:40:15 basically the context actually supplies

00:40:16 a lot of meaning to the word.

00:40:18 Now, because there are just finite number of words

00:40:22 and there is a finite way in like which we compose them.

00:40:25 Of course, the same thing holds for pixels,

00:40:27 but in language, there’s a lot of structure, right?

00:40:29 So I always say whatever,

00:40:31 the dash jumped over the fence, for example.

00:40:33 There are lots of these sentences that you’ll get.

00:40:36 And from this, you can actually look at

00:40:38 this particular sentence might occur

00:40:40 in a lot of different contexts as well.

00:40:41 This exact same sentence

00:40:42 might occur in a different context.

00:40:44 So the sheep jumped over the fence,

00:40:45 the cat jumped over the fence,

00:40:46 the dog jumped over the fence.

00:40:48 So you immediately get a lot of these words,

00:40:50 which are because this particular token itself

00:40:52 has so much meaning,

00:40:53 you get a lot of these tokens or these words,

00:40:55 which are actually going to have sort of

00:40:57 this related meaning across given this context.

00:41:00 Whereas for vision, it’s much harder

00:41:02 because just by like pure,

00:41:04 like the way we capture images,

00:41:05 lighting can be different.

00:41:07 There might be like different noise in the sensor.

00:41:09 So the thing is you’re capturing a physical phenomenon

00:41:12 and then you’re basically going through

00:41:13 a very complicated pipeline of like image processing.

00:41:16 And then you’re translating that into

00:41:18 some kind of like digital signal.

00:41:20 Whereas with language, you write it down

00:41:23 and you transfer it to a digital signal,

00:41:25 almost like it’s a lossless like transfer.

00:41:27 And each of these tokens are very, very well defined.

00:41:30 There could be a little bit of an argument there

00:41:32 because language as written down

00:41:36 is a projection of thought.

00:41:39 This is one of the open questions is

00:41:42 if you perfectly can solve language,

00:41:46 are you getting close to being able to sail easily

00:41:50 with flying colors past the Turing test kind of thing.

00:41:52 So that’s, it’s similar, but different

00:41:56 and the computer vision problem is in the 2D plane

00:41:59 is a projection with three dimensional world.

00:42:02 So perhaps there are similar problems there.

00:42:05 Maybe this is a good.

00:42:06 I mean, I think what I’m saying is NLP is not easy.

00:42:08 Of course, don’t get me wrong.

00:42:09 Like abstract thought expressed in knowledge

00:42:12 or knowledge basically expressed in language

00:42:14 is really hard to understand, right?

00:42:16 I mean, we’ve been communicating with language for so long

00:42:19 and it is of course a very complicated concept.

00:42:22 The thing is at least getting like somewhat reasonable,

00:42:27 like being able to solve some kind of reasonable tasks

00:42:29 with language, I would say slightly easier

00:42:32 than it is with computer vision.

00:42:33 Yeah, I would say, yeah.

00:42:35 So that’s well put.

00:42:36 I would say getting impressive performance on language

00:42:40 is easier.

00:42:43 I feel like for both language and computer vision,

00:42:45 there’s going to be this wall of like,

00:42:49 like this hump you have to overcome

00:42:52 to achieve superhuman level performance

00:42:54 or human level performance.

00:42:56 And I feel like for language, that wall is farther away.

00:43:00 So you can get pretty nice.

00:43:01 You can do a lot of tricks.

00:43:04 You can show really impressive performance.

00:43:06 You can even fool people that you’re tweeting

00:43:09 or you write blog posts writing

00:43:11 or your question answering has intelligence behind it.

00:43:16 But to truly demonstrate understanding of dialogue,

00:43:22 of continuous long form dialogue

00:43:25 that would require perhaps big breakthroughs.

00:43:28 In the same way in computer vision,

00:43:30 I think the big breakthroughs need to happen earlier

00:43:33 to achieve impressive performance.

00:43:36 This might be a good place to, you already mentioned it,

00:43:38 but what is contrastive learning

00:43:41 and what are energy based models?

00:43:43 Contrastive learning is sort of the paradigm of learning

00:43:46 where the idea is that you are learning this embedding space

00:43:50 or so you’re learning this sort of vector space

00:43:52 of all your concepts.

00:43:54 And the way you learn that is basically by contrasting.

00:43:56 So the idea is that you have a sample,

00:43:59 you have another sample that’s related to it.

00:44:01 So that’s called the positive

00:44:02 and you have another sample that’s not related to it.

00:44:05 So that’s negative.

00:44:06 So for example, let’s just take an NLP

00:44:08 or in a simple example in computer vision.

00:44:10 So you have an image of a cat, you have an image of a dog

00:44:14 and for whatever application that you’re doing,

00:44:16 say you’re trying to figure out what the pets are,

00:44:18 you’re saying that these two images are related.

00:44:20 So image of a cat and dog are related,

00:44:22 but now you have another third image of a banana

00:44:25 because you don’t like that word.

00:44:26 So now you basically have this banana.

00:44:28 Thank you for speaking to the crowd.

00:44:30 And so you take both of these images

00:44:32 and you take the image from the cat,

00:44:34 the image from the dog,

00:44:35 you get a feature from both of them.

00:44:36 And now what you’re training the network to do

00:44:38 is basically pull both of these features together

00:44:42 while pushing them away from the feature of a banana.

00:44:44 So this is the contrastive part.

00:44:45 So you’re contrasting against the banana.

00:44:47 So there’s always this notion of a negative and a positive.
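
A minimal sketch of that pull-together, push-apart idea, written as generic PyTorch; this is an illustrative contrastive loss, not the exact formulation of any particular paper, and the feature tensors are placeholders.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(anchor, positive, negative, temperature=0.1):
    # Normalize so similarity is just a dot product (cosine similarity).
    anchor, positive, negative = (F.normalize(x, dim=-1)
                                  for x in (anchor, positive, negative))
    pos_sim = (anchor * positive).sum(-1) / temperature  # pull together
    neg_sim = (anchor * negative).sum(-1) / temperature  # push apart
    # Softmax-style objective: the positive pair has to win over the negative.
    logits = torch.stack([pos_sim, neg_sim], dim=-1)
    labels = torch.zeros(anchor.shape[0], dtype=torch.long)
    return F.cross_entropy(logits, labels)

# Placeholder features for the cat, dog, and banana images,
# e.g. all coming from the same feature network.
cat_feat, dog_feat, banana_feat = (torch.randn(8, 128) for _ in range(3))
loss = contrastive_loss(cat_feat, dog_feat, banana_feat)
```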

00:44:51 Now, energy based models are like one way

00:44:54 that Yann sort of explains a lot of these methods.

00:44:57 So Yann basically, I think a couple of years

00:45:00 or more than that, like when I joined Facebook,

00:45:02 Yann used to keep mentioning this word, energy based models.

00:45:05 And of course I had no idea what he was talking about.

00:45:07 So then one day I caught him in one of the conference rooms

00:45:09 and I’m like, can you please tell me what this is?

00:45:11 So then like very patiently,

00:45:13 he sat down with like a marker and a whiteboard.

00:45:15 And his idea basically is that

00:45:18 rather than talking about probability distributions,

00:45:20 you can talk about energies of models.

00:45:21 So models are trying to minimize certain energies

00:45:24 in certain space,

00:45:24 or they’re trying to maximize a certain kind of energy.

00:45:28 And the idea basically is that

00:45:29 you can explain a lot of the contrastive models,

00:45:32 GANs, for example,

00:45:33 which are like Generative Adversarial Networks.

00:45:36 A lot of these modern learning methods

00:45:37 or VAEs, which are Variational Autoencoders,

00:45:39 you can really explain them very nicely

00:45:41 in terms of an energy function

00:45:43 that they’re trying to minimize or maximize.

00:45:45 And so by putting this common sort of language

00:45:48 for all of these models,

00:45:49 what looks very different in machine learning

00:45:51 that, oh, VAEs are very different from what GANs are,

00:45:54 are very, very different from what contrastive models are,

00:45:56 you actually get a sense of like,

00:45:57 oh, these are actually very, very related.

00:46:00 It’s just that the way or the mechanism

00:46:02 in which they’re sort of maximizing

00:46:04 or minimizing this energy function is slightly different.

00:46:07 It’s revealing the commonalities

00:46:08 between all these approaches

00:46:10 and putting a sexy word on top of it, like energy.

00:46:13 And so similarities,

00:46:14 two things that are similar have low energy.

00:46:16 Like the low energy signifying similarity.

00:46:20 Right, exactly.

00:46:21 So basically the idea is that if you were to imagine

00:46:23 like the embedding as a manifold, a 2D manifold,

00:46:26 you would get a hill or like a high sort of peak

00:46:28 in the energy manifold,

00:46:30 wherever two things are not related.

00:46:32 And basically you would have like a dip

00:46:34 where two things are related.

00:46:35 So you’d get a dip in the manifold.

00:46:37 And in the self supervised context,

00:46:40 how do you know two things are related

00:46:42 and two things are not related?

00:46:44 Right.

00:46:44 So this is where all the sort of ingenuity or tricks

00:46:46 comes in, right?

00:46:47 So for example, like you can take

00:46:50 the fill in the blank problem,

00:46:52 or you can take in the context problem.

00:46:54 And what you can say is two words

00:46:55 that are in the same context are related.

00:46:57 Two words that are in different contexts are not related.

00:47:00 For images, basically two crops

00:47:02 from the same image are related.

00:47:03 And whereas a third image is not related at all.

00:47:06 Or for a video, it can be two frames

00:47:08 from that video are related

00:47:09 because they’re likely to contain

00:47:10 the same sort of concepts in them.

00:47:12 Whereas a third frame

00:47:13 from a different video is not related.

00:47:15 So it basically is, it’s a very general term.

00:47:18 Contrastive learning is nothing really

00:47:19 to do with self supervised learning.

00:47:20 It actually is very popular in for example,

00:47:23 like any kind of metric learning

00:47:25 or any kind of embedding learning.

00:47:26 So it’s also used in supervised learning.

00:47:28 And the thing is because we are not really using labels

00:47:32 to get these positive or negative pairs,

00:47:34 it can basically also be used for self supervised learning.

00:47:37 So you mentioned one of the ideas

00:47:39 in the vision context that works

00:47:42 is to have different crops.

00:47:45 So you could think of that as a way

00:47:47 to sort of manipulating the data

00:47:49 to generate examples that are similar.

00:47:53 Obviously, there’s a bunch of other techniques.

00:47:55 You mentioned lighting as a very,

00:47:58 in images lighting is something that varies a lot

00:48:01 and you can artificially change those kinds of things.

00:48:04 There’s the whole broad field of data augmentation,

00:48:07 which manipulates images in order to increase arbitrarily

00:48:11 the size of the data set.

00:48:13 First of all, what is data augmentation?

00:48:15 And second of all, what’s the role of data augmentation

00:48:18 in self supervised learning and contrastive learning?

00:48:22 So data augmentation is just a way like you said,

00:48:24 it’s basically a way to augment the data.

00:48:26 So you have say n samples.

00:48:28 And what you do is you basically define

00:48:30 some kind of transforms for the sample.

00:48:32 So you take your say image

00:48:33 and then you define a transform

00:48:34 where you can just increase, say, the colors

00:48:37 or the brightness of the image

00:48:39 or increase or decrease the contrast of the image

00:48:41 for example, or take different crops of it.

00:48:44 So data augmentation is just a process

00:48:46 to like basically perturb the data

00:48:49 or like augment the data, right?

00:48:51 And so it has played a fundamental role

00:48:53 for computer vision for self supervised learning especially.

00:48:56 The way most of the current methods work

00:48:59 contrastive or otherwise is by taking an image

00:49:02 in the case of images is by taking an image

00:49:05 and then computing basically two perturbations of it.

00:49:08 So these can be two different crops of the image

00:49:11 with like different types of lighting

00:49:12 or different contrast or different colors.

00:49:15 So you jitter the colors a little bit and so on.

00:49:17 And now the idea is basically because it’s the same object

00:49:21 or because it’s like related concepts

00:49:23 in both of these perturbations,

00:49:25 you want the features from both of these perturbations

00:49:27 to be similar.

00:49:28 So now you can use a variety of different ways

00:49:31 to enforce this constraint,

00:49:32 like these features being similar.

00:49:34 You can do this by contrastive learning.

00:49:36 So basically, both of these things are positives,

00:49:38 a third sort of image is negative.

00:49:40 You can do this basically by like clustering.

00:49:43 For example, you can say that both of these images should,

00:49:46 the features from both of these images

00:49:48 should belong in the same cluster because they’re related,

00:49:50 whereas image like another image

00:49:52 should belong to a different cluster.

00:49:53 So there’s a variety of different ways

00:49:55 to basically enforce this particular constraint.
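
A rough sketch of that two-perturbations recipe, assuming standard torchvision transforms; the specific parameters are illustrative, not the settings of any particular method.

```python
from torchvision import transforms

augment = transforms.Compose([
    transforms.RandomResizedCrop(224),            # different crops
    transforms.RandomHorizontalFlip(),
    transforms.ColorJitter(0.4, 0.4, 0.4, 0.1),   # jitter the colors
    transforms.GaussianBlur(kernel_size=23),      # change the blur
    transforms.ToTensor(),
])

def two_views(image):
    # Same image, two random perturbations; the training objective then
    # asks the features of these two views to be similar.
    return augment(image), augment(image)
```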

00:49:57 By the way, when you say features,

00:49:59 it means there’s a very large neural network

00:50:01 that extracting patterns from the image

00:50:03 and the kind of patterns that extracts

00:50:05 should be either identical or very similar.

00:50:08 That’s what that means.

00:50:09 So the neural network basically takes in the image

00:50:11 and then outputs a set of like,

00:50:14 basically a vector of like numbers,

00:50:16 and that’s the feature.

00:50:17 And you want this feature for both of these

00:50:20 like different crops that you computed to be similar.

00:50:22 So you want this vector to be identical

00:50:24 in its like entries, for example.

00:50:26 Be like literally close

00:50:28 in this multi dimensional space to each other.

00:50:31 And like you said,

00:50:32 close can mean part of the same cluster or something like that

00:50:35 in this large space.

00:50:37 First of all, that,

00:50:38 I wonder if there is connection

00:50:40 to the way humans learn to this,

00:50:43 almost like maybe subconsciously,

00:50:48 in order to understand a thing,

00:50:50 you kind of have to see it from two, three multiple angles.

00:50:54 I wonder, I have a lot of friends

00:50:57 who are neuroscientists maybe and cognitive scientists.

00:51:00 I wonder if that’s in there somewhere.

00:51:03 Like in order for us to place a concept in its proper place,

00:51:08 we have to basically crop it in all kinds of ways,

00:51:12 do basic data augmentation on it

00:51:14 in whatever very clever ways that the brain likes to do.

00:51:17 Right.

00:51:19 Like spinning around in our minds somehow

00:51:21 that that is very effective.

00:51:23 So I think for some of them, we like need to do it.

00:51:25 So like babies, for example, pick up objects,

00:51:27 like move them and put them close to their eye and whatnot.

00:51:30 But for certain other things,

00:51:31 actually we are good at imagining it as well, right?

00:51:33 So if you, I have never seen, for example,

00:51:35 an elephant from the top.

00:51:36 I’ve never basically looked at it from like top down.

00:51:39 But if you showed me a picture of it,

00:51:40 I could very well tell you that that’s an elephant.

00:51:43 So I think some of it, we’re just like,

00:51:45 we naturally build it or transfer it from other objects

00:51:47 that we’ve seen to imagine what it’s going to look like.

00:51:50 Has anyone done that with augmentation?

00:51:53 Like imagine all the possible things

00:51:56 that are occluded or not there,

00:51:59 but not just like normal things, like wild things,

00:52:03 but they’re nevertheless physically consistent.

00:52:06 So, I mean, people do kind of like

00:52:09 occlusion based augmentation as well.

00:52:11 So you place in like a random like box, gray box

00:52:14 to sort of mask out a certain part of the image.

00:52:17 And the thing is basically you’re kind of occluding it.

00:52:20 For example, you place it say on half of a person’s face.

00:52:23 So basically saying that, you know,

00:52:24 something below their nose is occluded

00:52:26 because it’s grayed out.
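
A tiny illustration of that kind of occlusion augmentation, assuming torchvision's RandomErasing, which blanks out a random box in an image tensor; the parameters are just for illustration.

```python
from torchvision import transforms

# Gray out a random box covering roughly 10-30% of an image tensor (C, H, W).
occlude = transforms.RandomErasing(p=1.0, scale=(0.1, 0.3), value=0.5)
# occluded = occlude(image_tensor)
```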

00:52:28 So, you know, I meant like, you have like, what is it?

00:52:31 A table and you can’t see behind the table.

00:52:33 And you imagine there’s a bunch of elves

00:52:37 with bananas behind the table.

00:52:38 Like, I wonder if there’s useful

00:52:40 to have a wild imagination for the network

00:52:44 because that’s possible or maybe not elves,

00:52:46 but like puppies and kittens or something like that.

00:52:49 Just have a wild imagination

00:52:51 and like constantly be generating that wild imagination.

00:52:55 Because in terms of data augmentation,

00:52:57 as currently applied, it’s super ultra, very boring.

00:53:01 It’s very basic data augmentation.

00:53:02 I wonder if there’s a benefit to being wildly imaginative

00:53:07 while trying to be consistent with physical reality.

00:53:11 I think it’s a kind of a chicken and egg problem, right?

00:53:14 Because to have like amazing data augmentation,

00:53:16 you need to understand what the scene is.

00:53:18 And we’re trying to do data augmentation

00:53:20 to learn what a scene is anyway.

00:53:22 So it’s basically just keeps going on.

00:53:23 Before you understand it,

00:53:24 just put elves with bananas

00:53:26 until you know it not to be true.

00:53:29 Just like children have a wild imagination

00:53:31 until the adults ruin it all.

00:53:33 Okay, so what are the different kinds of data augmentation

00:53:36 that you’ve seen to be effective in visual intelligence?

00:53:40 For like vision,

00:53:42 it’s a lot of these image filtering operations.

00:53:44 So like blurring the image,

00:53:46 you know, all the kind of Instagram filters

00:53:48 that you can think of.

00:53:49 So like arbitrarily like make the red super red,

00:53:52 make the green super greens, like saturate the image.

00:53:55 Rotation, cropping.

00:53:56 Rotation, cropping, exactly.

00:53:58 All of these kinds of things.

00:53:59 Like I said, lighting is a really interesting one to me.

00:54:02 Like that feels like really complicated to do.

00:54:04 I mean, they don’t,

00:54:05 the augmentations that we work on aren’t like

00:54:08 that involved,

00:54:08 they’re not going to be like

00:54:09 physically realistic versions of lighting.

00:54:11 It’s not that you’re assuming

00:54:12 that there’s a light source up

00:54:13 and then you’re moving it to the right

00:54:15 and then what does the thing look like?

00:54:17 It’s really more about like brightness of the image,

00:54:19 overall brightness of the image

00:54:20 or overall contrast of the image and so on.

00:54:22 But this is a really important point to me.

00:54:25 I always thought that data augmentation

00:54:28 holds an important key

00:54:31 to big improvements in machine learning.

00:54:33 And it seems that it is an important aspect

00:54:36 of self supervised learning.

00:54:39 So I wonder if there’s big improvements to be achieved

00:54:42 on much more intelligent kinds of data augmentation.

00:54:46 For example, currently,

00:54:48 maybe you can correct me if I’m wrong,

00:54:50 data augmentation is not parameterized.

00:54:52 Yeah.

00:54:53 You’re not learning.

00:54:54 You’re not learning.

00:54:55 To me, it seems like data augmentation potentially

00:54:59 should involve more learning

00:55:02 than the learning process itself.

00:55:04 Right.

00:55:05 You’re almost like thinking of like generative kind of,

00:55:08 it’s the elves with bananas.

00:55:10 You’re trying to,

00:55:11 it’s like very active imagination

00:55:13 of messing with the world

00:55:14 and teaching that mechanism for messing with the world

00:55:17 to be realistic.

00:55:19 Right.

00:55:20 Because that feels like,

00:55:22 I mean, it’s imagination.

00:55:24 It’s just, as you said,

00:55:25 it feels like us humans are able to,

00:55:29 maybe sometimes subconsciously,

00:55:30 imagine before we see the thing,

00:55:33 imagine what we’re expecting to see,

00:55:35 like maybe several options.

00:55:37 And especially, we probably forgot,

00:55:38 but when we were younger,

00:55:40 probably the possibilities were wilder, more numerous.

00:55:44 And then as we get older,

00:55:45 we come to understand the world

00:55:47 and the possibilities of what we might see

00:55:51 becomes less and less and less.

00:55:53 So I wonder if you think there’s a lot of breakthroughs

00:55:55 yet to be had in data augmentation.

00:55:57 And maybe also can you just comment on the stuff we have,

00:55:59 is that a big part of self supervised learning?

00:56:02 Yes.

00:56:02 So data augmentation is like key to self supervised learning,

00:56:05 at least the kind of augmentation that we’re using.

00:56:08 And basically the fact that we’re trying to learn

00:56:11 these neural networks that are predicting these features

00:56:13 from images that are robust under data augmentation

00:56:17 has been the key for visual self supervised learning.

00:56:19 And they play a fairly fundamental role to it.

00:56:22 Now, the irony of all of this is that

00:56:24 for like deep learning purists will say

00:56:26 the entire point of deep learning is that

00:56:28 you feed in the pixels to the neural network

00:56:31 and it should figure out the patterns on its own.

00:56:33 So if it really wants to look at edges,

00:56:34 it should look at edges.

00:56:35 You shouldn’t really like really go

00:56:36 and handcraft these like features, right?

00:56:38 You shouldn’t go tell it to look at edges.

00:56:41 So data augmentation

00:56:42 should basically be in the same category, right?

00:56:44 Why should we tell the network

00:56:46 or tell this entire learning paradigm

00:56:48 what kinds of data augmentation that we’re looking for?

00:56:50 We are encoding a very sort of human specific bias there

00:56:55 that we know things are like,

00:56:57 if you change the contrast of the image,

00:56:59 it should still be an apple

00:57:00 or it should still see apple, not banana.

00:57:02 And basically if we change like colors,

00:57:05 it should still be the same kind of concept.

00:57:08 Of course,

00:57:09 this doesn’t feel super satisfactory

00:57:12 because a lot of our human knowledge

00:57:14 or our human supervision

00:57:15 is actually going into the data augmentation.

00:57:17 So although we are calling it self supervised learning,

00:57:19 a lot of the human knowledge

00:57:21 is actually being encoded in the data augmentation process.

00:57:23 So it’s really like,

00:57:24 we’ve kind of sneaked away the supervision at the input

00:57:27 and we’re like really designing

00:57:28 this nice list of data augmentations

00:57:30 that are working very well.

00:57:31 Of course, the idea is that it’s much easier

00:57:33 to design a list of data augmentations than it is to label all the data.

00:57:36 So humans are doing nevertheless doing less and less work

00:57:39 and maybe leveraging their creativity more and more.

00:57:42 And when we say data augmentation is not parameterized,

00:57:45 it means it’s not part of the learning process.

00:57:48 Do you think it’s possible to integrate

00:57:50 some of the data augmentation into the learning process?

00:57:53 I think so.

00:57:54 I think so.

00:57:54 And in fact, it will be really beneficial for us

00:57:57 because a lot of these data augmentations

00:57:59 that we use in vision are very extreme.

00:58:01 For example, like when you have certain concepts,

00:58:05 again, a banana, you take the banana

00:58:08 and then basically you change the color of the banana, right?

00:58:10 So you make it a purple banana.

00:58:12 Now this data augmentation process

00:58:14 is actually independent of the,

00:58:15 like it has no notion of what is present in the image.

00:58:18 So it can change this color arbitrarily.

00:58:20 It can make it a red banana as well.

00:58:22 And now what we’re doing is we’re telling

00:58:24 the neural network that this red banana

00:58:26 and so a crop of this image which has the red banana

00:58:29 and a crop of this image where I changed the color

00:58:30 to a purple banana should be,

00:58:32 the features should be the same.

00:58:34 Now bananas aren’t red or purple mostly.

00:58:36 So really the data augmentation process

00:58:38 should take into account what is present in the image

00:58:41 and what are the kinds of physical realities

00:58:43 that are possible.

00:58:43 It shouldn’t be completely independent of the image.

00:58:45 So you might get big gains if you,

00:58:48 instead of being drastic, do subtle augmentation

00:58:51 but realistic augmentation.

00:58:53 Right, realistic.

00:58:54 I’m not sure if it’s subtle, but like realistic for sure.

00:58:56 If it’s realistic, then even subtle augmentation

00:58:59 will give you big benefits.

00:59:00 Exactly, yeah.

00:59:01 And it will be like for particular domains

00:59:05 you might actually see like,

00:59:06 if for example, now we’re doing medical imaging,

00:59:08 there are going to be certain kinds

00:59:10 of like geometric augmentation

00:59:11 which are not really going to be very valid

00:59:13 for the human body.

00:59:15 So if you were to like actually loop in data augmentation

00:59:18 into the learning process,

00:59:19 it will actually be much more useful.

00:59:21 Now this actually does take us

00:59:23 to maybe a semi supervised kind of a setting

00:59:25 because you do want to understand

00:59:27 what is it that you’re trying to solve.

00:59:29 So currently self supervised learning

00:59:30 kind of operates in the wild, right?

00:59:32 So you do the self supervised learning

00:59:34 and the purists and all of us basically say that,

00:59:37 okay, this should learn useful representations

00:59:39 and they should be useful for any kind of end task,

00:59:42 no matter it’s like banana recognition

00:59:44 or like autonomous driving.

00:59:46 Now it’s a tall order.

00:59:47 Maybe the first baby step for us should be that,

00:59:50 okay, if you’re trying to loop in this data augmentation

00:59:52 into the learning process,

00:59:53 then we at least need to have some sense

00:59:56 of what we’re trying to do.

00:59:56 Are we trying to distinguish

00:59:57 between different types of bananas

00:59:59 or are we trying to distinguish between banana and apple

01:00:02 or are we trying to do all of these things at once?

01:00:04 And so some notion of like what happens at the end

01:00:07 might actually help us do much better at this side.

01:00:10 Let me ask you a ridiculous question.

01:00:14 If I were to give you like a black box,

01:00:16 like a choice to have an arbitrary large data set

01:00:19 of real natural data

01:00:22 versus really good data augmentation algorithms,

01:00:26 which would you like to train in a self supervised way on?

01:00:31 So natural data from the internet are arbitrary large,

01:00:35 so unlimited data,

01:00:37 or it’s like more controlled good data augmentation

01:00:41 on the finite data set.

01:00:43 The thing is like,

01:00:44 because our learning algorithms for vision right now

01:00:47 really rely on data augmentation,

01:00:49 even if you were to give me

01:00:50 like an infinite source of like image data,

01:00:52 I still need a good data augmentation algorithm.

01:00:54 You need something that tells you

01:00:56 that two things are similar.

01:00:57 Right.

01:00:58 And so something,

01:00:59 because you’ve given me an arbitrary large data set,

01:01:01 I still need to use data augmentation

01:01:03 to take that image, construct

01:01:05 these two perturbations of it,

01:01:06 and then learn from it.

01:01:08 So the thing is our learning paradigm

01:01:09 is very primitive right now.

01:01:11 Yeah.

01:01:12 Even if you were to give me lots of images,

01:01:13 it’s still not really useful.

01:01:15 A good data augmentation algorithm

01:01:16 is actually going to be more useful.

01:01:18 So you can like reduce down the amount of data

01:01:21 that you give me by like 10 times,

01:01:22 but if you were to give me

01:01:23 a good data augmentation algorithm,

01:01:25 that would probably do better

01:01:26 than giving me like 10 times the size of that data,

01:01:29 but me having to rely on

01:01:30 like a very primitive data augmentation algorithm.

01:01:32 Like through tagging and all those kinds of things,

01:01:35 is there a way to discover things

01:01:37 that are semantically similar on the internet?

01:01:39 Obviously there is, but they might be extremely noisy.

01:01:42 And the difference might be farther away

01:01:45 than you would be comfortable with.

01:01:47 So, I mean, yes, tagging will help you a lot.

01:01:49 It’ll actually go a very long way

01:01:51 in figuring out what images are related or not.

01:01:54 And then, so, but then the purists would argue

01:01:57 that when you’re using human tags,

01:01:58 because these tags are like supervision,

01:02:01 is it really self supervised learning now?

01:02:03 Because you’re using human tags

01:02:05 to figure out which images are like similar.

01:02:07 Hashtag no filter means a lot of things.

01:02:10 Yes.

01:02:11 I mean, there are certain tags

01:02:12 which are going to be applicable pretty much to anything.

01:02:15 So they’re pretty useless for learning.

01:02:18 But I mean, certain tags are actually like

01:02:20 the Eiffel Tower, for example,

01:02:22 or the Taj Mahal, for example.

01:02:23 These tags are like very indicative of what’s going on.

01:02:26 And they are, I mean, they are human supervision.

01:02:29 Yeah.

01:02:30 This is one of the tasks of discovering

01:02:31 from human generated data strong signals

01:02:34 that could be leveraged for self supervision.

01:02:39 Like humans are doing so much work already.

01:02:42 Like many years ago, there was something that was called,

01:02:45 I guess, human computation back in the day.

01:02:48 Humans are doing so much work.

01:02:50 It’d be exciting to discover ways to leverage

01:02:53 the work they’re doing to teach machines

01:02:55 without any extra effort from them.

01:02:57 An example could be, like we said, driving,

01:03:00 humans driving and machines can learn from the driving.

01:03:03 I always hope that there could be some supervision signal

01:03:06 discovered in video games,

01:03:08 because there’s so many people that play video games

01:03:10 that it feels like so much effort is put into video games,

01:03:15 into playing video games,

01:03:17 and you can design video games somewhat cheaply

01:03:21 to include whatever signals you want.

01:03:24 It feels like that could be leveraged somehow.

01:03:27 So people are using that.

01:03:28 Like there are actually folks right here in UT Austin,

01:03:30 like Philipp Krähenbühl is a professor at UT Austin.

01:03:33 He’s been like working on video games

01:03:36 as a source of supervision.

01:03:38 I mean, it’s really fun.

01:03:39 Like as a PhD student,

01:03:40 getting to basically play video games all day.

01:03:42 Yeah, but so I do hope that kind of thing scales

01:03:44 and like ultimately boils down to discovering

01:03:48 some undeniably very good signal.

01:03:51 It’s like masking in NLP.

01:03:54 But that said, there’s non contrastive methods.

01:03:57 What do non contrastive energy based

01:04:00 self supervised learning methods look like?

01:04:03 And why are they promising?

01:04:05 So like I said about contrastive learning,

01:04:07 you have this notion of a positive and a negative.

01:04:10 Now, the thing is, this entire learning paradigm

01:04:13 really requires access to a lot of negatives

01:04:17 to learn a good sort of feature space.

01:04:19 The idea is if I tell you, okay,

01:04:21 so a cat and a dog are similar,

01:04:23 and they’re very different from a banana.

01:04:25 The thing is, this is a fairly simple analogy, right?

01:04:28 Because bananas look visually very different

01:04:30 from what cats and dogs do.

01:04:32 So very quickly, if this is the only source

01:04:34 of supervision that I’m giving you,

01:04:36 your learning is not going to be like,

01:04:38 after a point, the neural network

01:04:39 is really not going to learn a lot.

01:04:41 Because the negative that you’re getting

01:04:42 is going to be so random.

01:04:43 So it can be, oh, a cat and a dog are very similar,

01:04:46 but they’re very different from a Volkswagen Beetle.

01:04:49 Now, like this car looks very different

01:04:51 from these animals again.

01:04:52 So the thing is in contrastive learning,

01:04:54 the quality of the negative sample really matters a lot.

01:04:58 And so what has happened is basically that

01:05:00 typically these methods that are contrastive

01:05:02 really require access to lots of negatives,

01:05:04 which becomes harder and harder to sort of scale

01:05:06 when designing a learning algorithm.

01:05:09 So that’s been one of the reasons

01:05:10 why non contrastive methods have become like popular

01:05:13 and why people think that they’re going to be more useful.

01:05:16 So a non contrastive method, for example,

01:05:18 like clustering is one non contrastive method.

01:05:20 The idea basically being that you have

01:05:22 two of these samples, so the cat and dog

01:05:25 or two crops of this image,

01:05:27 they belong to the same cluster.

01:05:30 And so essentially you’re basically doing clustering online

01:05:33 when you’re learning this network,

01:05:35 and which is very different from having access

01:05:36 to a lot of negatives explicitly.

01:05:38 The other way which has become really popular

01:05:40 is something called self distillation.

01:05:43 So the idea basically is that you have a teacher network

01:05:45 and a student network,

01:05:47 and the teacher network produces a feature.

01:05:49 So it takes in the image

01:05:51 and basically the neural network figures out the patterns

01:05:53 gets the feature out.

01:05:55 And there’s another neural network

01:05:56 which is the student neural network

01:05:57 and that also produces a feature.

01:05:59 And now all you’re doing is basically saying

01:06:01 that the features produced by the teacher network

01:06:03 and the student network should be very similar.

01:06:06 That’s it.

01:06:06 There is no notion of a negative anymore.

01:06:09 And that’s it.

01:06:10 So it’s all about similarity maximization

01:06:11 between these two features.

01:06:13 And so all I need to now do is figure out

01:06:16 how to have these two sorts of parallel networks,

01:06:18 a student network and a teacher network.

01:06:20 And basically researchers have figured out

01:06:23 very cheap methods to do this.

01:06:24 So you can actually have for free really

01:06:26 two types of neural networks.

01:06:29 They’re kind of related,

01:06:30 but they’re different enough that you can actually

01:06:32 basically have a learning problem set up.

01:06:34 So you can ensure that they always remain different enough.
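
A rough sketch of that teacher-student (self-distillation) setup, under the assumption that the cheap trick for keeping the two networks related but different is an exponential moving average of the student's weights, which methods in this family commonly use; the tiny model and the numbers are placeholders, not any one paper's recipe.

```python
import copy
import torch
import torch.nn.functional as F

student = torch.nn.Sequential(torch.nn.Flatten(),
                              torch.nn.Linear(3 * 224 * 224, 256))
teacher = copy.deepcopy(student)       # same architecture, separate weights
for p in teacher.parameters():
    p.requires_grad = False            # the teacher is never trained by backprop

def self_distillation_step(view_a, view_b, momentum=0.99):
    s_feat = F.normalize(student(view_a), dim=-1)
    with torch.no_grad():
        t_feat = F.normalize(teacher(view_b), dim=-1)
    # Pure similarity maximization: no negatives anywhere.
    loss = -(s_feat * t_feat).sum(-1).mean()
    loss.backward()
    # The teacher slowly tracks the student, so the two stay different enough.
    with torch.no_grad():
        for tp, sp in zip(teacher.parameters(), student.parameters()):
            tp.mul_(momentum).add_(sp, alpha=1 - momentum)
    return loss
```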

01:06:38 So the thing doesn’t collapse into something boring.

01:06:41 Exactly.

01:06:41 So the main sort of enemy of self supervised learning,

01:06:44 any kind of similarity maximization technique is collapse.

01:06:47 Collapse means that you learn the same feature

01:06:50 representation for all the images in the world,

01:06:53 which is completely useless.

01:06:54 Everything’s a banana.

01:06:55 Everything is a banana.

01:06:56 Everything is a cat.

01:06:57 Everything is a car.

01:06:59 And so all we need to do is basically come up with ways

01:07:02 to prevent collapse.

01:07:03 Contrastive learning is one way of doing it.

01:07:05 And then for example, like clustering or self distillation

01:07:07 or other ways of doing it.

01:07:09 We also had a recent paper where we used like

01:07:11 decorrelation between like two sets of features

01:07:15 to prevent collapse.

01:07:16 So that’s inspired a little bit by like Horace Barlow’s

01:07:18 neuroscience principles.

01:07:20 By the way, I should comment that whoever counts

01:07:23 the number of times the word banana, apple, cat and dog

01:07:27 were using this conversation wins the internet.

01:07:30 I wish you luck.

01:07:32 What is SwAV and the main improvement proposed

01:07:36 in the paper, Unsupervised Learning of Visual Features

01:07:40 by Contrasting Cluster Assignments?

01:07:42 SwAV basically is a clustering based technique,

01:07:46 which is for again, the same thing for self supervised

01:07:49 learning in vision where we have two crops.

01:07:52 And the idea basically is that you want the features

01:07:55 from these two crops of an image to lie in the same cluster

01:07:58 and basically crops that are coming from different images

01:08:02 to be in different clusters.

01:08:03 Now, typically in a sort of,

01:08:05 if you were to do this clustering,

01:08:07 you would perform clustering offline.

01:08:09 What that means is you would,

01:08:11 if you have a dataset of N examples,

01:08:13 you would run over all of these N examples,

01:08:15 get features for them, perform clustering.

01:08:17 So basically get some clusters

01:08:19 and then repeat the process again.

01:08:21 So this is offline basically because I need to do one pass

01:08:24 through the data to compute its clusters.

01:08:27 SwAV is basically just a simple way of doing this online.

01:08:30 So as you’re going through the data,

01:08:31 you’re actually computing these clusters online.

01:08:34 And so of course there is like a lot of tricks involved

01:08:37 in how to do this in a robust manner without collapsing,

01:08:40 but this is this sort of key idea to it.

01:08:42 Is there a nice way to say what is the key methodology

01:08:45 of the clustering that enables that?

01:08:47 Right, so the idea basically is that

01:08:51 when you have N samples,

01:08:52 we assume that we have access to,

01:08:54 like there are always K clusters in a dataset.

01:08:57 K is a fixed number.

01:08:57 So for example, K is 3000.

01:09:00 And so if you have any,

01:09:02 when you look at any sort of small number of examples,

01:09:04 all of them must belong to one of these K clusters.

01:09:08 And we impose this equipartition constraint.

01:09:10 What this means is that basically

01:09:15 your entire set of N samples

01:09:16 should be equally partitioned into K clusters.

01:09:19 So all your K clusters are basically equal,

01:09:21 they have equal contribution to these N samples.

01:09:24 And this ensures that we never collapse.

01:09:26 So collapse can be viewed as a way

01:09:28 in which all samples belong to one cluster, right?

01:09:30 So all this, if all features become the same,

01:09:33 then you have basically just one mega cluster.

01:09:35 You don’t even have like 10 clusters or 3000 clusters.

01:09:38 So SwAV basically ensures that at each point,

01:09:40 all these 3000 clusters are being used

01:09:42 in the clustering process.

01:09:45 And that’s it.

01:09:46 Basically just figure out how to do this online.

01:09:48 And again, basically just make sure

01:09:50 that two crops from the same image belong to the same cluster

01:09:54 and others don’t.

01:09:55 And the fact they have a fixed K makes things simpler.

01:09:58 Fixed K makes things simpler.

01:10:00 Our clustering is not like really hard clustering,

01:10:02 it’s soft clustering.

01:10:03 So basically you can be 0.2 to cluster number one

01:10:06 and 0.8 to cluster number two.

01:10:08 So it’s not really hard.

01:10:09 So essentially, even though we have like 3000 clusters,

01:10:12 we can actually represent a lot of clusters.
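
A compact sketch of that online, soft clustering with the equipartition constraint; the few rounds of alternating normalization below stand in for the Sinkhorn-style procedure the actual method uses, and all the sizes are illustrative.

```python
import torch
import torch.nn.functional as F

K, dim = 3000, 128                                # prototypes, feature size
prototypes = F.normalize(torch.randn(K, dim), dim=-1)

def soft_assignments(features, n_iters=3, eps=0.05):
    # Similarity of each feature to each prototype -> soft assignment matrix.
    scores = features @ prototypes.T              # (batch, K)
    q = torch.exp(scores / eps)
    for _ in range(n_iters):
        q = q / q.sum(dim=0, keepdim=True)        # every cluster gets used
        q = q / q.sum(dim=1, keepdim=True)        # each sample sums to one
    return q                                      # soft, e.g. 0.2 / 0.8 splits

feats = F.normalize(torch.randn(256, dim), dim=-1)
codes = soft_assignments(feats)
# Training then asks two crops of the same image to predict each other's codes.
```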

01:10:15 What is SEER, S E E R?

01:10:19 And what are the key results and insights in the paper,

01:10:23 Self Supervised Pre Training of Visual Features in the Wild?

01:10:27 What is this big, beautiful SEER system?

01:10:30 SEER, so I’ll first go to SwAV

01:10:32 because SwAV is actually like one

01:10:34 of the key components for SEER.

01:10:35 So SwAV was, when we used SwAV,

01:10:37 it was demonstrated on ImageNet.

01:10:39 So typically like self supervised methods,

01:10:42 the way we sort of operate is like in the research community,

01:10:46 we kind of cheat.

01:10:47 So we take ImageNet, which of course I talked about

01:10:49 as having lots of labels.

01:10:51 And then we throw away the labels,

01:10:52 like throw away all the hard work that went behind

01:10:54 basically the labeling process.

01:10:56 And we pretend that it is unsupervised.

01:11:00 But the problem here is that we have,

01:11:02 like when we collected these images,

01:11:05 the ImageNet dataset has a particular distribution

01:11:08 of concepts, right?

01:11:09 So these images are very curated.

01:11:11 And what that means is these images, of course,

01:11:15 belong to a certain set of noun concepts.

01:11:17 And also ImageNet has this bias that all images

01:11:20 contain an object, which is like very big

01:11:22 and it’s typically in the center.

01:11:24 So when you’re talking about a dog, it’s a well framed dog,

01:11:26 it’s towards the center of the image.

01:11:28 So a lot of the data augmentation,

01:11:29 a lot of the sort of hidden assumptions

01:11:31 in self supervised learning,

01:11:33 actually really exploit this bias of ImageNet.

01:11:37 And so, I mean, a lot of my work,

01:11:39 a lot of work from other people always uses ImageNet

01:11:42 sort of as the benchmark to show the success

01:11:44 of self supervised learning.

01:11:45 So you’re implying that there’s particular limitations

01:11:47 to this kind of dataset?

01:11:49 Yes, I mean, it’s basically because our data augmentation

01:11:51 that we designed, like all data augmentation

01:11:55 that we designed for self supervised learning in vision

01:11:57 are kind of overfit to ImageNet.

01:11:59 But you’re saying a little bit hard coded

01:12:02 like the cropping.

01:12:03 Exactly, the cropping parameters,

01:12:05 the kind of lighting that we’re using,

01:12:07 the kind of blurring that we’re using.

01:12:08 Yeah, but you would, for more in the wild dataset,

01:12:11 you would need to be clever or more careful

01:12:16 in setting the range of parameters

01:12:17 and those kinds of things.

01:12:18 So for SEER, our main goal was twofold.

01:12:21 One, basically to move away from ImageNet for training.

01:12:24 So the images that we used were like uncurated images.

01:12:27 Now there’s a lot of debate

01:12:28 whether they’re actually curated or not,

01:12:30 but I’ll talk about that later.

01:12:32 But the idea was basically,

01:12:33 these are going to be random internet images

01:12:36 that we’re not going to filter out

01:12:37 based on like particular categories.

01:12:40 So we did not say that, oh, images that belong to dogs

01:12:42 and cats should be the only images

01:12:44 that come in this dataset, banana.

01:12:47 And basically, other images should be thrown out.

01:12:50 So we didn’t do any of that.

01:12:51 So these are random internet images.

01:12:53 And of course, it also goes back to like the problem

01:12:56 of scale that you talked about.

01:12:57 So these were basically about a billion or so images.

01:13:00 And for context ImageNet,

01:13:01 the ImageNet version that we use

01:13:02 was 1 million images earlier.

01:13:04 So this is basically going like

01:13:05 three orders of magnitude more.

01:13:07 The idea was basically to see

01:13:08 if we can train a very large convolutional model

01:13:11 in a self supervised way on this uncurated,

01:13:14 but really large set of images.

01:13:16 And how well would this model do?

01:13:18 So is self supervised learning really overfit to ImageNet

01:13:21 or can it actually work in the wild?

01:13:23 And it was also out of curiosity,

01:13:25 what kind of things will this model learn?

01:13:27 Will it actually be able to still figure out

01:13:30 different types of objects and so on?

01:13:32 Would there be particular kinds of tasks

01:13:33 that would actually do better than an ImageNet train model?

01:13:38 And so for SEER, one of our main findings was that

01:13:40 we can actually train very large models

01:13:43 in a completely self supervised way

01:13:44 on lots of internet images

01:13:46 without really necessarily filtering them out.

01:13:48 Which was in itself a good thing

01:13:49 because it’s a fairly simple process, right?

01:13:51 So you get images which are uploaded

01:13:54 and you basically can immediately use them

01:13:55 to train a model in an unsupervised way.

01:13:57 You don’t really need to sit and filter them out.

01:13:59 These images can be cartoons, these can be memes,

01:14:02 these can be actual pictures uploaded by people.

01:14:04 And you don’t really care about what these images are.

01:14:06 You don’t even care about what concepts they contain.

01:14:08 So this was a very sort of simple setup.

01:14:10 What image selection mechanism would you say

01:14:12 is there like inherent in some aspect of the process?

01:14:18 So you’re kind of implying that there’s almost none,

01:14:21 but what is there would you say if you were to introspect?

01:14:24 Right, so it’s not completely uncurated.

01:14:28 Like one way of imagining uncurated

01:14:30 is basically you have like cameras

01:14:32 that can take pictures at random viewpoints.

01:14:35 When people upload pictures to the internet,

01:14:37 they are typically going to care about the framing of it.

01:14:40 They’re not going to upload, say,

01:14:41 the picture of a zoomed in wall, for example.

01:14:43 Well, when you say internet, do you mean social networks?

01:14:46 Yes. Okay.

01:14:47 So these are not going to be like pictures

01:14:48 of like a zoomed in table or a zoomed in wall.

01:14:51 So it’s not really completely uncurated

01:14:53 because people do have the like photographer’s bias

01:14:55 where they do want to keep things

01:14:57 towards the center a little bit,

01:14:58 or like really have like nice looking things

01:15:01 and so on in the picture.

01:15:02 So that’s the kind of bias that typically exists

01:15:05 in this data set and also the user base, right?

01:15:07 You’re not going to get lots of pictures

01:15:09 from different parts of the world

01:15:10 because there are certain parts of the world

01:15:12 where people may not actually be uploading

01:15:14 a lot of pictures to the internet

01:15:15 or may not even have access to a lot of internet.

01:15:17 So this is a giant data set and a giant neural network.

01:15:21 I don’t think we’ve talked about what architectures

01:15:24 work well for SSL, for self supervised learning.

01:15:29 For SEER and for SwAV, we were using convolutional networks,

01:15:32 but recently in a work called DINO,

01:15:34 we’ve basically started using transformers for vision.

01:15:36 Both seem to work really well, ConvNets and transformers.

01:15:39 And depending on what you want to do,

01:15:41 you might choose to use a particular formulation.

01:15:43 So for SEER, it was a ConvNet.

01:15:45 It was particularly a RegNet model,

01:15:47 which was also a work from Facebook.

01:15:49 RegNets are like really good when it comes to compute

01:15:52 versus like accuracy.

01:15:54 So because it was a very efficient model,

01:15:56 compute and memory wise efficient,

01:15:59 and basically it worked really well in terms of scaling.

01:16:02 So we used a very large RegNet model

01:16:04 and trained it on a billion images.

01:16:05 Can you maybe quickly comment on what RegNets are?

01:16:09 It comes from this paper, Designing Network Design Spaces.

01:16:13 This is a super interesting concept

01:16:15 that emphasizes how to create efficient neural networks,

01:16:18 large neural networks.

01:16:19 So one of the sort of key takeaways from this paper,

01:16:21 which the authors, like whenever you hear them

01:16:23 present this work, they keep saying is,

01:16:26 a lot of neural networks are characterized

01:16:27 in terms of flops, right?

01:16:29 Flops basically being the floating point operations.

01:16:31 And people really love to use flops to say,

01:16:33 this model is like really computationally heavy,

01:16:36 or like our model is computationally cheap and so on.

01:16:39 Now it turns out that flops are really not a good indicator

01:16:41 of how well a particular network performs,

01:16:43 like how efficient it really is.

01:16:45 And what a better indicator is, is the activation

01:16:49 or the memory that is being used by this particular model.

01:16:52 And so designing, like one of the key findings

01:16:55 from this paper was basically that you need to design

01:16:57 network families or neural network architectures

01:17:00 that are actually very efficient in the memory space as well,

01:17:02 not just in terms of pure flops.

01:17:04 So RegNet is basically a network architecture family

01:17:07 that came out of this paper that is particularly good

01:17:10 at both flops and the sort of memory required for it.

01:17:13 And of course it builds upon like earlier work,

01:17:15 like ResNet being like the sort of more popular inspiration

01:17:18 for it, where you have residual connections.

01:17:20 But one of the things in this work is basically

01:17:22 they also use like squeeze excitation blocks.

01:17:25 So it’s a lot of nice sort of technical innovation

01:17:27 in all of this from prior work,

01:17:28 and a lot of the ingenuity of these particular authors

01:17:31 in how to combine these multiple building blocks.

01:17:34 But the key constraint was optimize for both flops

01:17:36 and memory when you’re basically doing this,

01:17:38 don’t just look at flops.

01:17:39 And that allows you to, what,

01:17:42 sort of have very large networks through this process

01:17:47 that are optimized for efficiency, for low memory.

01:17:51 Also in just in terms of pure hardware,

01:17:53 they fit very well on GPU memory.

01:17:55 So they can be like really powerful neural network

01:17:57 architectures with lots of parameters, lots of flops,

01:18:00 but also because they’re like efficient in terms of

01:18:02 the amount of memory that they’re using,

01:18:04 you can actually fit a lot of these on like a,

01:18:06 you can fit a very large model on a single GPU for example.

01:18:09 Would you say that the choice of architecture

01:18:14 matters more than the choice of maybe data augmentation

01:18:17 techniques?

01:18:18 Is there a possibility to say what matters more?

01:18:21 You kind of imply that you can probably go really far

01:18:27 with just using basic ConvNets.

01:18:27 All right, I think like data and data augmentation,

01:18:30 the algorithm being used for the self supervised training

01:18:33 matters a lot more than the particular kind of architecture.

01:18:36 With different types of architecture,

01:18:37 you will get different like properties in the resulting

01:18:40 sort of representation.

01:18:41 But really, I mean, the secret sauce is in the augmentation

01:18:44 and the algorithm being used to train them.

01:18:47 The architectures, I mean, at this point,

01:18:49 a lot of them perform very similarly,

01:18:51 depending on like the particular task that you care about,

01:18:53 they have certain advantages and disadvantages.

01:18:56 Is there something interesting to be said about what it

01:18:58 takes with SEER to train a giant neural network?

01:19:01 You’re talking about a huge amount of data,

01:19:04 a huge neural network.

01:19:05 Is there something interesting to be said of how to

01:19:08 effectively train something like that fast?

01:19:11 Lots of GPUs.

01:19:13 Okay.

01:19:15 I mean, so the model was like a billion parameters.

01:19:18 And it was trained on a billion images.

01:19:20 So if like, basically the same number of parameters

01:19:23 as the number of images, and it took a while.

01:19:26 I don’t remember the exact number, it’s in the paper,

01:19:28 but it took a while.

01:19:31 I guess I’m trying to get at is,

01:19:34 when you’re thinking of scaling this kind of thing,

01:19:38 I mean, one of the exciting possibilities of self

01:19:42 supervised learning is the several orders of magnitude

01:19:45 scaling of everything, both the neural network

01:19:49 and the size of the data.

01:19:50 And so the question is,

01:19:52 do you think there’s some interesting tricks to do large

01:19:56 scale distributed compute,

01:19:57 or is that really outside of even deep learning?

01:20:00 That’s more about like hardware engineering.

01:20:04 I think more and more there is like this,

01:20:07 a lot of like systems are designed,

01:20:10 basically taking into account

01:20:11 the machine learning needs, right?

01:20:12 So because whenever you’re doing this kind of

01:20:14 distributed training, there is a lot of intercommunication

01:20:17 between nodes.

01:20:17 So like gradients or the model parameters are being passed.

01:20:20 So you really want to minimize communication costs

01:20:22 when you really want to scale these models up.

01:20:25 You basically want to be able to do

01:20:29 as limited an amount of communication as possible.

01:20:31 So currently like a dominant paradigm

01:20:33 is synchronized sort of training.

01:20:35 So essentially after every sort of gradient step,

01:20:38 you basically have like a synchronization step

01:20:41 between all the sort of compute chips

01:20:43 that you’re running on.

01:20:45 I think asynchronous training was popular,

01:20:47 but it doesn’t seem to perform as well.

01:20:50 But in general, I think that’s sort of the,

01:20:53 I guess it’s outside my scope as well.

01:20:55 But the main thing is like minimize the amount of

01:21:00 synchronization steps that you have.

01:21:01 That has been the key takeaway, at least in my experience.

01:21:04 The others I have no idea about, how to design the chip.

01:21:06 Yeah, there’s very few things that I see Jim Keller’s eyes

01:21:11 light up as much as talking about giant computers doing

01:21:15 like that fast communication that you’re talking about

01:21:18 when they’re training machine learning systems.

01:21:21 What is VISSL, V I S S L, the PyTorch based SSL library?

01:21:27 What are the use cases that you might have?

01:21:30 VISSL basically was born out of a lot of us at Facebook

01:21:33 are doing the self supervised learning research.

01:21:35 So it’s a common framework in which we have like a lot of

01:21:38 self supervised learning methods implemented for vision.

01:21:41 It’s also, it has in itself like a benchmark of tasks

01:21:45 that you can evaluate the self supervised representations on.

01:21:48 So the use case for it is basically for anyone who’s either

01:21:51 trying to evaluate their self supervised model

01:21:53 or train their self supervised model,

01:21:56 or a researcher who’s trying to build

01:21:57 a new self supervised technique.

01:21:59 So it’s basically supposed to be all of these things.

01:22:01 So as a researcher before VISSL, for example,

01:22:04 or like when we started doing this work fairly seriously

01:22:06 at Facebook, it was very hard for us to go and implement

01:22:09 every self supervised learning model,

01:22:11 test it out in a like sort of consistent manner.

01:22:14 The experimental setup was very different

01:22:16 across different groups.

01:22:18 Even when someone said that they were reporting

01:22:20 ImageNet accuracy, it could mean lots of different things.

01:22:23 So with VISSL, we tried to really sort of standardize that

01:22:25 as much as possible.

01:22:26 And there was a paper like we did in 2019

01:22:28 just about benchmarking.

01:22:29 And so VISSL basically builds upon a lot of this kind of work

01:22:32 that we did about like benchmarking.

01:22:35 And then every time we try to like,

01:22:37 we come up with a self supervised learning method,

01:22:39 a lot of us try to push that into VISSL as well,

01:22:41 just so that it basically is like the central piece

01:22:43 where a lot of these methods can reside.

01:22:46 Just out of curiosity, people may be,

01:22:49 so certainly outside of Facebook, but just researchers,

01:22:52 or just even people that know how to program in Python

01:22:54 and know how to use PyTorch, what would be the use case?

01:22:58 What would be a fun thing to play around with VISSL on?

01:23:01 Like what’s a fun thing to play around

01:23:04 with self supervised learning on, would you say?

01:23:07 Is there a good Hello World program?

01:23:09 Like is it always about big size that’s important to have,

01:23:14 or is there fun little smaller case playgrounds

01:23:18 to play around with?

01:23:19 So we’re trying to like push something towards that.

01:23:22 I think there are a few setups out there,

01:23:24 but nothing like super standard on the smaller scale.

01:23:26 I mean, ImageNet in itself is actually pretty big also.

01:23:29 So that is not something

01:23:31 which is like feasible for a lot of people.

01:23:33 But we are trying to like push up

01:23:34 with like smaller sort of use cases.

01:23:36 The thing is, at a smaller scale,

01:23:39 a lot of the observations

01:23:40 or a lot of the algorithms that work

01:23:41 don’t necessarily translate into the medium

01:23:43 or the larger scale.

01:23:45 So it’s really tricky to come up

01:23:46 with a good small scale setup

01:23:47 where a lot of your empirical observations

01:23:49 will really translate to the other setup.

01:23:51 So it’s been really challenging.

01:23:53 I’ve been trying to do that for a little bit as well

01:23:54 because it does take time to train stuff on ImageNet.

01:23:56 It does take time to train on like more images,

01:23:59 but pretty much every time I’ve tried to do that,

01:24:02 it’s been unsuccessful

01:24:03 because all the observations I draw

01:24:04 from my set of experiments on a smaller data set

01:24:07 don’t translate into ImageNet

01:24:09 or like don’t translate into another sort of data set.

01:24:11 So it’s been hard for us to figure this one out,

01:24:14 but it’s an important problem.

01:24:15 So there’s this really interesting idea

01:24:17 of learning across multiple modalities.

01:24:20 You have a CVPR 2021 best paper candidate

01:24:26 titled Audio-Visual Instance Discrimination

01:24:29 with Cross-Modal Agreement.

01:24:31 What are the key results, insights in this paper

01:24:33 and what can you say in general

01:24:35 about the promise and power of multimodal learning?

01:24:37 For this paper, it actually came as a little bit

01:24:40 of a shock to me at how well it worked.

01:24:41 So I can describe what the problem set up was.

01:24:44 So it’s been used in the past by lots of folks

01:24:46 like for example, Andrew Owens from MIT,

01:24:48 Alyosha Efros from Berkeley,

01:24:49 Andrew Zisserman from Oxford.

01:24:51 So a lot of these people have been

01:24:52 sort of showing results in this.

01:24:53 Of course, I was aware of this result,

01:24:55 but I wasn’t really sure how well it would work in practice

01:24:58 for like other sort of downstream tasks.

01:25:00 So the results kept getting better.

01:25:02 And I wasn’t sure if like a lot of our insights

01:25:04 from self supervised learning would translate

01:25:05 into this multimodal learning problem.

01:25:08 So multimodal learning is when you have like,

01:25:12 when you have multiple modalities.

01:25:14 That’s not even cool.

01:25:15 Okay, so the particular modalities

01:25:19 that we worked on in this work were audio and video.

01:25:22 So the idea was basically, if you have a video,

01:25:23 you have its corresponding audio track.

01:25:25 And you want to use both of these signals,

01:25:27 the audio signal and the video signal

01:25:29 to learn a good representation for video

01:25:31 and good representation for audio.

01:25:32 Like this podcast.

01:25:33 Like this podcast, exactly.

01:25:35 So what we did in this work was basically train

01:25:38 two different neural networks,

01:25:39 one on the video signal, one on the audio signal.

01:25:41 And what we wanted is basically the features

01:25:43 that we get from both of these neural networks

01:25:45 should be similar.

01:25:46 So it should basically be able to produce

01:25:48 the same kinds of features from the video

01:25:51 and the same kinds of features from the audio.

01:25:53 Now, why is this useful?

01:25:54 Well, for a lot of these objects that we have,

01:25:56 there is a characteristic sound, right?

01:25:58 So trains, when they go by,

01:25:59 they make a particular kind of sound.

01:26:00 Boats make a particular kind of sound.

01:26:02 People, when they’re jumping around,

01:26:03 will like shout, whatever.

01:26:06 Bananas don’t make a sound.

01:26:07 So where you can’t learn anything about bananas there.

01:26:09 Or when humans mentioned bananas.

01:26:11 Well, yes, when they say the word banana, then.

01:26:13 So you can’t trust basically anything

01:26:15 that comes out of a human’s mouth as a source,

01:26:17 that source of audio is useless.

01:26:19 The typical use case is basically like,

01:26:20 for example, someone playing a musical instrument.

01:26:22 So guitars have a particular kind of sound and so on.

01:26:24 So because a lot of these things are correlated,

01:26:27 the idea in multimodal learning

01:26:28 is to take these two kinds of modalities,

01:26:30 video and audio, and learn a common embedding space,

01:26:33 a common feature space where both of these

01:26:35 related modalities can basically be close together.

01:26:38 And again, you use contrastive learning for this.

01:26:40 So in contrastive learning, basically the video

01:26:43 and the corresponding audio are positives.

01:26:45 And you can take any other video or any other audio

01:26:48 and that becomes a negative.

01:26:49 And so basically that’s it.

01:26:51 It’s just a simple application of contrastive learning.

01:26:53 The main sort of finding from this work for us

01:26:56 was basically that you can actually learn

01:26:58 very, very powerful feature representations,

01:27:00 very, very powerful video representations.

01:27:02 So you can learn the sort of video network

01:27:05 that we ended up learning can actually be used

01:27:07 for downstream, for example, recognizing human actions

01:27:11 or recognizing different types of sounds, for example.

01:27:14 So this was sort of the key finding.

01:27:17 Can you give kind of an example of a human action

01:27:20 or like just so we can build up intuition

01:27:23 of what kind of thing?

01:27:24 Right, so there is this data set called Kinetics,

01:27:26 for example, which has like 400 different types

01:27:28 of human actions.

01:27:29 So people jumping, people doing different kinds of sports

01:27:32 or different types of swimming.

01:27:34 So like different strokes in swimming, golf and so on.

01:27:37 So there are like just different types of actions

01:27:39 right there.

01:27:40 And the point is this kind of video network

01:27:42 that you learn in a self supervised way

01:27:44 can be used very easily to kind of recognize

01:27:46 these different types of actions.

01:27:48 It can also be used for recognizing

01:27:50 different types of objects.

01:27:53 And what we did is we tried to visualize

01:27:54 whether the network can figure out

01:27:56 where the sound is coming from.

01:27:57 So basically, give it a video

01:27:59 and basically play it, say of a person just strumming a guitar,

01:28:03 but of course, there is no audio in this.

01:28:04 And now you give it this sound of a guitar.

01:28:07 And you ask like basically try to visualize

01:28:08 where the network thinks the sound is coming from.

01:28:12 And that can kind of basically draw like

01:28:14 when you visualize it,

01:28:15 you can see that it’s basically focusing on the guitar.
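
One plausible way such a localization map can be computed, sketched here under the assumption that the video network exposes a spatial feature map before pooling; the output resolution and function names are illustrative, not the paper's exact method.

    import torch
    import torch.nn.functional as F

    def sound_localization_map(video_feature_map, audio_emb):
        # video_feature_map: (dim, H, W) spatial features before pooling.
        # audio_emb: (dim,) embedding of the audio clip.
        # Cosine similarity at every spatial location gives a coarse heatmap of
        # where the network "thinks" the sound is coming from.
        d, h, w = video_feature_map.shape
        v = F.normalize(video_feature_map.reshape(d, h * w), dim=0)
        a = F.normalize(audio_emb, dim=0)
        heatmap = (a @ v).reshape(h, w)
        # Upsample to image resolution so it can be overlaid on the frame.
        return F.interpolate(heatmap[None, None], size=(224, 224),
                             mode="bilinear", align_corners=False)[0, 0]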

01:28:17 Yeah, that’s surreal.

01:28:18 And the same thing, for example,

01:28:20 for certain people’s voices,

01:28:21 like famous celebrities voices,

01:28:22 it can actually figure out where their mouth is.

01:28:26 So it can actually distinguish different people’s voices,

01:28:28 for example, a little bit as well.

01:28:30 Without that ever being annotated in any way.

01:28:33 Right, so this is all what it had discovered.

01:28:35 We never pointed out that this is a guitar

01:28:38 and this is the kind of sound it produces.

01:28:40 It can actually naturally figure that out

01:28:41 because it’s seen so many correlations of this sound

01:28:44 coming with this kind of like an object

01:28:46 that it basically learns to associate this sound

01:28:49 with this kind of an object.

01:28:50 Yeah, that’s really fascinating, right?

01:28:52 That’s really interesting.

01:28:53 So the idea with this kind of network

01:28:55 is then you then fine tune it for a particular task.

01:28:57 So this is forming like a really good knowledge base

01:29:01 within a neural network based on which you could then

01:29:04 then train a little bit more to accomplish a specific task.

01:29:07 Well, so you don’t need a lot of videos of humans

01:29:11 doing actions annotated.

01:29:12 You can just use a few of them to basically get your.
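
A hedged sketch of what that fine-tuning step might look like: freeze the self supervised video backbone and train only a small linear head on a handful of labeled clips (e.g. Kinetics-style action labels). The backbone interface, feature_dim, and optimizer settings are assumptions for illustration.

    import torch
    import torch.nn as nn

    def build_action_classifier(pretrained_video_net, feature_dim, num_classes=400):
        # Assumes the backbone maps a batch of clips to (batch, feature_dim) features.
        for p in pretrained_video_net.parameters():
            p.requires_grad = False          # keep the self supervised features fixed
        head = nn.Linear(feature_dim, num_classes)
        model = nn.Sequential(pretrained_video_net, head)
        # Only the small head is trained, so a few labeled examples can go a long way.
        optimizer = torch.optim.SGD(head.parameters(), lr=0.01, momentum=0.9)
        return model, optimizer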

01:29:16 How much insight do you draw from the fact

01:29:18 that it can figure out where the sound is coming from?

01:29:23 I’m trying to see, so that’s kind of very,

01:29:26 it’s very CVPR beautiful, right?

01:29:28 It’s a cool little insight.

01:29:30 I wonder how profound that is.

01:29:33 Does it speak to the idea that multiple modalities

01:29:39 are somehow much bigger than the sum of their parts?

01:29:44 Or is it really, really useful to have multiple modalities?

01:29:48 Or is it just that cool thing that there’s parts

01:29:50 of our world that can be revealed like effectively

01:29:57 through multiple modalities,

01:29:58 but most of it is really all about vision

01:30:01 or about one of the modalities.

01:30:03 I would say a little tending more towards the second part.

01:30:07 So most of it can be sort of figured out with one modality,

01:30:10 but having an extra modality always helps you.

01:30:13 So in this case, for example,

01:30:14 like one thing is when you’re,

01:30:17 if you observe someone cutting something

01:30:19 and you don’t have any sort of sound there,

01:30:21 whether it’s an apple or whether it’s an onion,

01:30:25 it’s very hard to figure that out.

01:30:26 But if you hear someone cutting it,

01:26:28 it’s very easy to figure it out because apples and onions

01:26:30 make very different characteristic sounds

01:26:33 when they’re cut.

01:30:34 So if you figure this out based on audio,

01:30:36 it’s much easier.

01:30:38 So your life will become much easier

01:30:40 when you have access to different kinds of modalities.

01:30:42 And the other thing is, so I like to relate it in this way,

01:30:45 it may be like completely wrong,

01:30:46 but the distributional hypothesis in NLP,

01:30:49 where context basically gives kind of meaning to that word,

01:30:53 sound kind of does that too.

01:30:55 So if you have the same sound,

01:30:57 so that’s the same context across different videos,

01:30:59 you’re very likely to be observing the same kind of concept.

01:31:03 So that’s the kind of reason

01:31:04 why it figures out the guitar thing, right?

01:31:06 It observed the same sound across multiple different videos

01:31:09 and it figures out maybe this is the common factor

01:31:11 that’s actually doing it.

01:31:13 I wonder, I used to have this argument with my dad a bunch

01:31:17 for creating general intelligence,

01:31:19 whether smell is an important,

01:31:22 like if that’s important sensory information,

01:31:25 mostly we’re talking about like falling in love

01:31:27 with an AI system and for him,

01:31:30 smell and touch are important.

01:31:31 And I was arguing that it’s not at all.

01:31:33 It’s important, it’s nice and everything,

01:31:35 but like you can fall in love with just language really,

01:31:38 but a voice is very powerful and vision is next

01:31:41 and smell is not that important.

01:31:43 Can I ask you about this process of active learning?

01:31:46 You mentioned interactivity.

01:31:49 Right.

01:31:50 Is there some value

01:31:52 within the self supervised learning context

01:31:57 to select parts of the data in intelligent ways

01:32:02 such that they would most benefit the learning process?

01:32:06 So I think so.

01:32:07 I mean, I know I’m talking to an active learning fan here,

01:32:10 so of course I know the answer.

01:32:12 First you were talking bananas

01:32:14 and now you’re talking about active learning.

01:32:15 I love it.

01:36:16 I think Yann LeCun told me that active learning

01:32:18 is not that interesting.

01:32:20 I think back then I didn’t want to argue with him too much,

01:32:24 but when we talk again,

01:32:26 we’re gonna spend three hours arguing about active learning.

01:32:28 My sense was you can go extremely far with active learning,

01:32:32 perhaps farther than anything else.

01:32:34 Like to me, there’s this kind of intuition

01:32:37 that similar to data augmentation,

01:32:40 you can get a lot from the data,

01:32:45 from intelligent optimized usage of the data.

01:32:50 I’m trying to speak generally in such a way

01:32:53 that includes data augmentation

01:32:55 and active learning,

01:32:57 that there’s something about maybe interactive exploration

01:32:59 of the data that at least is part

01:33:03 of the solution to intelligence, like an important part.

01:33:07 I don’t know what your thoughts are

01:33:08 on active learning in general.

01:33:09 I actually really like active learning.

01:33:10 So back in the day we did this largely ignored CVPR paper

01:33:14 called learning by asking questions.

01:33:16 So the idea was basically you would train an agent

01:33:18 that would ask a question about the image.

01:33:20 It would get an answer

01:33:21 and basically then it would update itself.

01:33:23 It would see the next image.

01:33:24 It would decide what’s the next hardest question

01:33:26 that I can ask to learn the most.

01:33:28 And the idea was basically because it was being smart

01:33:31 about the kinds of questions it was asking,

01:33:33 it would learn in fewer samples.

01:33:35 It would be more efficient at using data.

01:33:37 And we did find to some extent

01:33:39 that it was actually better than randomly asking questions.
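
A rough sketch, in pseudocode-like Python, of the kind of loop described above; agent, oracle, and all of their methods are hypothetical placeholders rather than the actual system from the paper.

    def learning_by_asking(agent, oracle, image_stream, rounds=1000):
        # Hypothetical training loop in the spirit of "learning by asking questions":
        # the agent proposes the question it expects to learn the most from,
        # an oracle (e.g. a human or a VQA annotation source) answers it,
        # and the agent updates on that (image, question, answer) triple.
        for _ in range(rounds):
            image = next(image_stream)
            candidates = agent.propose_questions(image)
            # Pick the question whose answer the agent is currently least sure about.
            question = max(candidates, key=agent.expected_information_gain)
            answer = oracle.answer(image, question)
            agent.update(image, question, answer)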

01:33:42 Kind of weird thing about active learning

01:33:43 is it’s also a chicken and egg problem

01:33:45 because when you look at an image,

01:33:47 to ask a good question about the image,

01:33:48 you need to understand something about the image.

01:33:50 You can’t ask a completely arbitrarily random question.

01:33:53 It may not even apply to that particular image.

01:33:55 So there is some amount of understanding or knowledge

01:33:57 that basically keeps getting built

01:33:59 when you’re doing active learning.

01:34:01 So I think active learning by itself is really good.

01:34:04 And the main thing we need to figure out is basically

01:34:07 how do we come up with a technique

01:34:09 to first model what the model knows

01:34:13 and also model what the model does not know.

01:34:16 I think that’s the sort of beauty of it.

01:34:18 Because when you know that there are certain things

01:34:20 that you don’t know anything about,

01:34:22 asking a question about those concepts

01:34:23 is actually going to bring you the most value.

01:34:26 And I think that’s the sort of key challenge.

01:34:28 Now, self supervised learning by itself,

01:34:29 like selecting data for it and so on,

01:34:31 that’s actually really useful.

01:34:32 But I think that’s a very narrow view

01:34:33 of looking at active learning.

01:34:35 If you look at it more broadly,

01:34:36 it is basically about if the model has a knowledge

01:34:40 about N concepts,

01:34:41 and it is weak basically about certain things.

01:34:43 So it needs to ask questions

01:34:45 either to discover new concepts

01:34:46 or to basically increase its knowledge

01:34:49 about these N concepts.

01:34:50 So at that level, it’s a very powerful technique.

01:34:53 I actually do think it’s going to be really useful.

01:34:56 Even in like simple things such as like data labeling,

01:34:59 it’s super useful.

01:35:00 So here is like one simple way

01:35:02 that you can use active learning.

01:35:04 For example, you have your self supervised model,

01:35:06 which is very good at predicting similarities

01:35:08 and dissimilarities between things.

01:35:10 And so if you label a picture as basically say a banana,

01:35:15 now you know that all the images

01:35:17 that are very similar to this image

01:35:19 are also likely to contain bananas.

01:35:21 So probably when you want to understand

01:35:24 what else is a banana,

01:35:25 you’re not going to use these other images.

01:35:26 You’re actually going to use an image

01:35:28 that is not completely dissimilar,

01:35:31 but somewhere in between,

01:35:32 which is not super similar to this image,

01:35:33 but not super dissimilar either.

01:35:35 And that’s going to tell you a lot more

01:35:37 about what this concept of a banana is.

01:35:39 So that’s kind of a heuristic.
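
Here is a small sketch of that heuristic, assuming the embeddings come from a self supervised model; the similarity thresholds and the helper name are arbitrary illustrative values, not a prescribed recipe.

    import torch
    import torch.nn.functional as F

    def pick_images_to_label(labeled_emb, unlabeled_emb, low=0.4, high=0.7, k=16):
        # labeled_emb: (dim,) embedding of an image already labeled, say "banana".
        # unlabeled_emb: (n, dim) embeddings of the unlabeled pool.
        # Skip near-duplicates (they are almost certainly also bananas) and skip
        # very dissimilar images; ask for labels on the in-between band, which
        # tells you the most about where the concept's boundary actually is.
        sims = F.normalize(unlabeled_emb, dim=1) @ F.normalize(labeled_emb, dim=0)
        band = (sims > low) & (sims < high)
        idx = band.nonzero(as_tuple=True)[0]
        # Within the band, prefer the most ambiguous examples (closest to the middle).
        order = (sims[idx] - (low + high) / 2).abs().argsort()
        return idx[order[:k]]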

01:35:41 I wonder if it’s possible to also learn ways

01:35:46 to discover the most likely,

01:35:50 the most beneficial image.

01:35:52 So like, so not just looking at a thing

01:35:54 that’s somewhat similar to a banana,

01:35:58 but not exactly similar,

01:35:59 but have some kind of more complicated learning system,

01:36:03 like learned discovering mechanism

01:36:07 that tells you what image to look for.

01:36:09 Like how, yeah, like actually in a self supervised way,

01:36:14 learning strictly a function that says,

01:36:17 is this image going to be very useful to me

01:36:20 given what I currently know?

01:36:22 I think there’s a lot of synergy there.

01:36:23 It’s just, I think, yeah, it’s going to be explored.

01:36:27 I think very much related to that.

01:36:29 I kind of think of what Tesla Autopilot is doing

01:36:33 currently as kind of active learning.

01:36:38 There’s something that Andrej Karpathy and their team

01:36:39 are calling a data engine.

01:36:41 So you’re basically deploying a bunch of instantiations

01:36:45 of a neural network into the wild,

01:36:47 and they’re collecting a bunch of edge cases

01:36:50 that are then sent back for annotation for particular,

01:36:53 and edge cases as defined as near failure

01:36:56 or some weirdness on a particular task

01:36:59 that’s then sent back.

01:37:01 It’s that not exactly a banana,

01:37:04 but almost the banana cases sent back for annotation.

01:37:07 And then there’s this loop that keeps going

01:37:09 and you keep retraining and retraining.

01:37:11 And the active learning step there,

01:37:13 or whatever you want to call it,

01:37:14 is the cars themselves that are sending you back the data.

01:37:19 Like, what the hell happened here?

01:37:20 This was weird.

01:37:22 What are your thoughts about that sort of deployment

01:37:26 of neural networks in the wild?

01:37:28 Another way to ask the question: first, your thoughts.

01:37:31 And maybe if you want to comment,

01:37:33 are there applications for autonomous driving,

01:37:36 like computer vision based autonomous driving,

01:37:40 applications of self supervised learning

01:37:42 in the context of computer vision based autonomous driving?

01:37:47 So I think so.

01:37:48 I think for self supervised learning

01:37:49 to be used in autonomous driving,

01:37:50 there are lots of opportunities.

01:37:51 I mean, just like pure consistency in predictions

01:37:54 is one way, right?

01:37:55 So because you have this nice sequence of data

01:38:00 that is coming in, a video stream of it,

01:38:02 associated of course with the actions

01:38:04 that say the car took,

01:38:05 you can form a very nice predictive model

01:38:07 of what’s happening.

01:38:08 So for example, like all the way,

01:38:11 like one way possibly in which how they’re figuring out

01:38:14 what data to get labeled is basically

01:38:15 through prediction uncertainty, right?

01:38:17 So you predict that the car was going to turn right.

01:38:20 So this was the action that was going to happen,

01:38:21 say in the shadow mode.

01:38:23 And now the driver turned left.

01:38:24 And this is a really big surprise.

01:38:27 So basically by forming these good predictive models,

01:38:30 you are, I mean, these are kind of self supervised models.

01:38:32 Prediction models are basically being trained

01:38:34 just by looking at what’s going to happen next

01:38:36 and asking them to predict what’s going to happen next.

01:38:38 So I would say this is really like one use

01:38:40 of self supervised learning.

01:38:42 It’s a predictive model

01:38:43 and you’re learning a predictive model

01:38:44 basically just by looking at what data you have.
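
A minimal sketch of that kind of surprise-based filtering, under the assumption of a scalar action such as a steering angle; the model interface and threshold are hypothetical illustrations, not Tesla's actual pipeline.

    def flag_surprising_frames(model, driving_log, surprise_threshold=2.0):
        # Hypothetical shadow-mode filter: the model predicts the next action
        # while a human drives; frames where the human's action strongly
        # disagrees with the prediction are sent back for labeling.
        flagged = []
        for frame, human_action in driving_log:
            predicted_action = model.predict(frame)      # hypothetical interface
            surprise = abs(human_action - predicted_action)
            if surprise > surprise_threshold:
                flagged.append((frame, human_action, surprise))
        return flagged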

01:38:46 Is there something about that active learning context

01:38:49 that you find insights from?

01:38:53 Like that kind of deployment of the system,

01:38:54 seeing cases where it doesn’t perform as you expected

01:38:59 and then retraining the system based on that?

01:39:01 I think that, I mean, that really resonates with me.

01:39:03 It’s super smart to do it that way.

01:39:05 Because I mean, the thing is with any kind

01:39:08 of like practical system, like autonomous driving,

01:39:11 there are those edge cases that are the things

01:39:13 that are actually the problem, right?

01:39:14 I mean, highway driving or like freeway driving

01:39:17 has basically been like,

01:39:19 there has been a lot of success in that particular part

01:39:21 of autonomous driving for a long time.

01:39:22 I would say like since the eighties or something.

01:39:25 Now the point is all these failure cases

01:39:28 are the sort of reason why autonomous driving

01:39:30 hasn’t become like super, super mainstream and available

01:39:33 like in every possible car right now.

01:39:35 And so basically by really scaling this problem out

01:39:38 by really trying to get all of these edge cases out

01:39:40 as quickly as possible,

01:39:41 and then just like using those to improve your model,

01:39:43 that’s super smart.

01:39:45 And prediction uncertainty to do that

01:39:47 is like one really nice way of doing it.

01:39:49 Let me put you on the spot.

01:39:52 So we mentioned offline Jitendra,

01:39:55 he thinks that the Tesla computer vision approach

01:39:58 or really any approach for autonomous driving

01:40:00 is very far away.

01:40:02 How many years away,

01:40:05 if you have to bet all your money on it,

01:40:06 are we to solving autonomous driving

01:40:09 with this kind of computer vision only

01:40:12 machine learning based approach?

01:40:13 Okay, so what does solving autonomous driving mean?

01:40:15 Does it mean solving it in the US?

01:40:17 Does it mean solving it in India?

01:40:18 Because I can tell you

01:40:19 that very different types of driving are happening.

01:40:21 Not India, not Russia.

01:40:23 In the United States, autonomous,

01:40:26 so what solving means is when the car says it has control,

01:40:31 it is fully liable.

01:40:34 You can go to sleep, it’s driving by itself.

01:40:37 So this is highway and city driving,

01:40:39 but not everywhere, but mostly everywhere.

01:40:42 And it’s, let’s say significantly better,

01:40:45 like say five times less accidents than humans.

01:40:50 Sufficiently safer such that the public feels

01:40:53 like that transition is enticing and beneficial

01:40:57 both for our safety and financially

01:40:59 and all those kinds of things.

01:41:01 Okay, so first disclaimer,

01:41:02 I’m not an expert in autonomous driving.

01:41:04 So let me put it out there.

01:41:05 I would say like at least five to 10 years.

01:41:09 This would be my guess from now.

01:41:12 Yeah, I’m actually very impressed.

01:41:14 Like when I sat in a friend’s Tesla recently

01:41:16 and of course, like looking on that screen,

01:41:20 it basically shows all the detections and everything.

01:41:22 The car is doing as you’re driving by

01:41:24 and that’s super distracting for me as a person

01:41:26 because all I keep looking at is like the bounding boxes

01:41:29 in the cars it’s tracking and it’s really impressive.

01:41:31 Like especially when it’s raining and it’s able to do that,

01:41:34 that was the most impressive part for me.

01:41:36 It’s actually able to get through rain and do that.

01:41:38 And one of the reasons why like a lot of us believed

01:41:41 and I would put myself in that category

01:41:44 is LIDAR based sort of technology for autonomous driving

01:41:47 was the key driver, right?

01:41:48 So Waymo was using it for the longest time.

01:41:50 And Tesla then decided to go this completely other route

01:41:53 that we are not going to even use LIDAR.

01:41:55 So their initial system I think was camera and radar based

01:41:58 and now they’re actually moving

01:41:59 to a completely like vision based system.

01:42:02 And so that was just like, it sounded completely crazy.

01:42:04 Like LIDAR is very useful in cases

01:42:07 where you have low visibility.

01:42:09 Of course it comes with its own set of complications.

01:42:11 But now to see that happen in like on a live Tesla

01:42:15 that basically just proves everyone wrong

01:42:16 I would say in a way.

01:42:18 And that’s just working really well.

01:42:20 I think there were also like a lot of advancements

01:42:22 in camera technology.

01:42:23 Now there were like, I know at CMU when I was there

01:42:26 there was a particular kind of camera

01:42:27 that had been developed that was really good

01:42:30 at basically low visibility setting.

01:42:32 So like lots of snow and lots of rain

01:42:34 it could actually still have a very reasonable visibility.

01:42:37 And I think there are lots of these kinds of innovations

01:42:39 that will happen on the sensor side itself

01:42:40 which is actually going to make this very easy

01:42:42 in the future.

01:42:43 And so maybe that’s actually why I’m more optimistic

01:42:46 about vision based self, like autonomous driving.

01:42:49 I was going to call it self supervised driving, but.

01:42:51 Vision based autonomous driving.

01:42:53 That’s the reason I’m quite optimistic about it

01:42:55 because I think there are going to be lots

01:42:56 of these advances on the sensor side itself.

01:42:58 So acquiring this data

01:43:00 we’re actually going to get much better about it.

01:43:02 And then of course, once we’re able to scale out

01:43:05 and get all of these edge cases in

01:43:06 as like Andrej described

01:43:08 I think that’s going to make us go very far away.

01:43:11 Yeah, so it’s funny.

01:43:13 I’m very much with you on the five to 10 years

01:43:16 maybe 10 years

01:43:17 but you made it, I’m not sure how you made it sound

01:43:21 but for some people that seem

01:43:23 that might seem like really far away.

01:43:25 And then for other people, it might seem like very close.

01:43:30 There’s a lot of fundamental questions

01:43:32 about how much game theory is in this whole thing.

01:43:36 So like, how much is this simply a collision avoidance

01:43:41 problem and how much of it is you still interacting

01:43:45 with other humans in the scene

01:43:46 and you’re trying to create an experience

01:43:48 that’s compelling.

01:43:49 So you want to get from point A to point B quickly

01:43:53 you want to navigate the scene in a safe way

01:43:55 but you also want to show some level of aggression

01:43:58 because well, certainly this is why you’re screwed in India

01:44:02 because you have to show aggression.

01:44:03 Or Jersey or New Jersey.

01:44:04 Or Jersey, right.

01:44:05 So like, or New York or basically any major city

01:44:11 but I think it’s probably Elon

01:44:13 that I talked the most about this

01:44:14 which is a surprise to the level of which

01:44:17 they’re not considering human beings

01:44:20 as a huge problem in this, as a source of problem.

01:44:22 Like the driving is fundamentally a robot on robot

01:44:29 versus the environment problem

01:44:31 versus like you can just consider humans

01:44:33 not part of the problem.

01:44:35 I used to think humans are almost certainly

01:44:38 have to be modeled really well.

01:44:41 Pedestrians and cyclists and humans inside other cars

01:44:44 you have to have like mental models for them.

01:44:46 You cannot just see it as objects

01:44:48 but more and more it’s like the

01:44:51 it’s the same kind of intuition breaking thing

01:44:53 that self supervised learning does, which is

01:44:57 well maybe through the learning

01:44:58 you’ll get all the human like human information you need.

01:45:04 Right?

01:45:04 Like maybe you’ll get it just with enough data.

01:45:07 You don’t need to have explicit good models

01:45:09 of human behavior.

01:45:10 Maybe you get it through the data.

01:45:12 So, I mean my skepticism also just knowing

01:45:14 a lot of automotive companies

01:45:16 and how difficult it is to be innovative.

01:45:18 I was skeptical that they would be able at scale

01:45:22 to convert the driving scene across the world

01:45:27 into digital form such that you can create

01:45:30 this data engine at scale.

01:45:33 And the fact that Tesla is at least getting there

01:45:36 or are already there makes me think that

01:45:41 it’s now starting to be coupled

01:45:43 to this self supervised learning vision

01:45:47 which is like if that’s gonna work

01:45:49 if through purely this process you can get really far

01:45:52 then maybe you can solve driving that way.

01:45:54 I don’t know.

01:45:55 I tend to believe we don’t give enough credit

01:46:00 to the how amazing humans are both at driving

01:46:05 and at supervising autonomous systems.

01:46:09 And also we don’t, this is, I wish we were.

01:46:13 I wish there was much more driver sensing inside Teslas

01:46:17 and much deeper consideration of human factors

01:46:21 like understanding psychology and drowsiness

01:46:24 and all those kinds of things

01:46:26 when the car does more and more of the work.

01:46:28 How to keep utilizing the little human supervision

01:46:32 that are needed to keep this whole thing safe.

01:46:35 I mean it’s a fascinating dance of human robot interaction.

01:46:38 To me autonomous driving for a long time

01:46:42 is a human robot interaction problem.

01:46:45 It is not a robotics problem or computer vision problem.

01:46:48 Like you have to have a human in the loop.

01:46:50 But so which is why I think it’s 10 years plus.

01:46:53 But I do think there’ll be a bunch of cities and contexts

01:46:56 where geo restricted it will work really, really damn well.

01:47:02 So I think for me that gets five if I’m being optimistic

01:47:05 and it’s going to be five for a lot of cases

01:47:07 and 10 plus, yeah, I agree with you.

01:47:09 10 plus basically if we want to cover most of the,

01:47:13 say, contiguous United States or something.

01:47:15 Oh, interesting.

01:47:16 So my optimistic is five and pessimistic is 30.

01:47:20 30.

01:47:21 I have a long tail on this one.

01:47:22 I haven’t watched enough driving videos.

01:47:24 I’ve watched enough pedestrians to think like we may be,

01:47:29 like there’s a small part of me still, not a small,

01:47:31 like a pretty big part of me that thinks

01:47:34 we will have to build AGI to solve driving.

01:47:37 Oh, well.

01:47:38 Like there’s something to me,

01:47:39 like because humans are part of the picture,

01:47:41 deeply part of the picture,

01:47:44 and also human society is part of the picture

01:47:46 in that human life is at stake.

01:47:47 Anytime a robot kills a human,

01:47:50 it’s not clear to me that that’s not a problem

01:47:54 that machine learning will also have to solve.

01:47:56 Like it has to, you have to integrate that

01:47:59 into the whole thing.

01:48:00 Just like Facebook or social networks,

01:48:03 one thing is to say how to make

01:48:04 a really good recommender system.

01:48:06 And then the other thing is to integrate

01:48:08 into that recommender system,

01:48:10 all the journalists that will write articles

01:48:12 about that recommender system.

01:48:13 Like you have to consider the society

01:48:15 within which the AI system operates.

01:48:18 And in order to, and like politicians too,

01:48:21 this is the regulatory stuff for autonomous driving.

01:48:24 It’s kind of fascinating that the more successful

01:48:26 your AI system becomes,

01:48:28 the more it gets integrated in society

01:48:31 and the more precious politicians

01:48:33 and the public and the clickbait journalists

01:48:36 and all the different fascinating forces

01:48:38 of our society start acting on it.

01:48:40 And then it’s no longer how good you are

01:48:42 at doing the initial task.

01:48:43 It’s also how good you are at navigating human nature,

01:48:47 which is a fascinating space.

01:48:49 What do you think are the limits of deep learning?

01:48:52 If you allow me, we’ll zoom out a little bit

01:48:54 into the big question of artificial intelligence.

01:48:58 You said dark matter of intelligence is self supervised

01:49:02 learning, but there could be more.

01:49:04 What do you think the limits of self supervised learning

01:49:07 and just learning in general, deep learning are?

01:49:10 I think like for deep learning in particular,

01:49:12 because self supervised learning is I would say

01:49:14 a little bit more vague right now.

01:49:16 So I wouldn’t, like for something that’s so vague,

01:49:18 it’s hard to predict what its limits are going to be.

01:49:21 But like I said, I think anywhere you want to interact

01:49:25 with human self supervised learning kind of hits a boundary

01:49:27 very quickly because you need to have an interface

01:49:29 to be able to communicate with the human.

01:49:31 So really like if you have just like vacuous concepts

01:49:35 or like just like nebulous concepts discovered

01:49:37 by a network, it’s very hard to communicate those

01:49:39 with the human without like inserting some kind

01:49:41 of human knowledge or some kind of like human bias there.

01:49:45 In general, I think for deep learning,

01:49:47 the biggest challenge is just like data efficiency.

01:49:50 Even with self supervised learning,

01:49:52 even with anything else, if you just see

01:49:54 a single concept once, like one image of like,

01:49:59 I don’t know, whatever you want to call it,

01:50:01 like any concept, it’s really hard for these methods

01:50:03 to generalize by looking at just one or two samples

01:50:07 of things and that has been a real challenge.

01:50:09 I think that’s actually why like these edge cases,

01:50:11 for example, for Tesla are actually that important.

01:50:14 Because if you see just one instance of the car failing

01:50:18 and if you just annotate that and you get that

01:50:20 into your data set, you have like very limited guarantee

01:50:23 that it’s not going to happen again.

01:50:25 And you’re actually going to be able to recognize

01:50:26 this kind of instance in a very different scenario.

01:50:28 So like when it was snowing, so you got that thing labeled

01:50:31 when it was snowing, but now when it’s raining,

01:50:33 you’re actually not able to get it.

01:50:34 Or you basically have the same scenario

01:50:36 in a different part of the world.

01:50:37 So the lighting was different or so on.

01:50:39 So it’s just really hard for these models,

01:50:41 like deep learning especially to do that.

01:50:42 What’s your intuition?

01:50:43 How do we solve handwritten digit recognition problem

01:50:47 when we only have one example for each number?

01:50:51 It feels like humans are using something like learning.

01:50:54 Right.

01:50:55 I think we are good at transferring knowledge a little bit.

01:50:59 We are just better at like for a lot of these problems

01:51:02 where we are generalizing from a single sample

01:51:04 or recognizing from a single sample,

01:51:06 we are using a lot of our own domain knowledge

01:51:08 and a lot of our like inductive bias

01:51:10 into that one sample to generalize it.

01:51:12 So I’ve never seen you write the number nine, for example.

01:51:15 And if you were to write it, I would still get it.

01:51:17 And if you were to write a different kind of alphabet

01:51:19 and like write it in two different ways,

01:51:20 I would still probably be able to figure out

01:51:22 that these are the same two characters.

01:51:24 It’s just that I have been very used

01:51:26 to seeing handwritten digits in my life.

01:51:29 The other sort of problem with any deep learning system

01:51:31 or any kind of machine learning system is like,

01:51:33 it’s guarantees, right?

01:51:34 There are no guarantees for it.

01:51:35 Now you can argue that humans also don’t have any guarantees.

01:51:38 Like there is no guarantee that I can recognize a cat

01:51:41 in every scenario.

01:51:42 I’m sure there are going to be lots of cats

01:51:43 that I don’t recognize, lots of scenarios

01:51:45 in which I don’t recognize cats in general.

01:51:48 But I think from just a sort of application perspective,

01:51:52 you do need guarantees, right?

01:51:54 We call these things algorithms.

01:51:56 Now algorithms, like traditional CS algorithms

01:51:59 have guarantees.

01:51:59 Sorting is a guarantee.

01:52:01 If you were to call sort on a particular array of numbers,

01:52:05 you are guaranteed that it’s going to be sorted.

01:52:07 Otherwise it’s a bug.

01:52:09 Now for machine learning,

01:52:10 it’s very hard to characterize this.

01:52:12 We know for a fact that a cat recognition model

01:52:15 is not going to recognize cats,

01:52:17 every cat in the world in every circumstance.

01:52:19 I think most people would agree with that statement,

01:52:22 but we are still okay with it.

01:52:23 We still don’t call this as a bug.

01:52:25 Whereas in traditional computer science

01:52:26 or traditional science,

01:52:27 like if you have this kind of failure case existing,

01:52:29 then you think of it as like something is wrong.

01:52:33 I think there is this sort of notion

01:52:34 of nebulous correctness for machine learning.

01:52:37 And that’s something we just need to be very comfortable

01:52:38 with.

01:52:39 And for deep learning,

01:52:40 or like for a lot of these machine learning algorithms,

01:52:42 it’s not clear how do we characterize

01:52:44 this notion of correctness.

01:52:46 I think that’s a limitation in our understanding,

01:52:48 or at least a limitation in our phrasing of this.

01:52:51 And if we were to come up with better ways

01:52:53 to understand this limitation,

01:52:55 then it would actually help us a lot.

01:52:57 Do you think there’s a distinction

01:52:58 between the concept of learning

01:53:01 and the concept of reasoning?

01:53:04 Do you think it’s possible for neural networks to reason?

01:53:10 So I think of it slightly differently.

01:53:11 So for me, learning is whenever

01:53:14 I can like make a snap judgment.

01:53:16 So if you show me a picture of a dog,

01:53:17 I can immediately say it’s a dog.

01:53:18 But if you give me like a puzzle,

01:53:20 like whatever, a Rube Goldberg machine

01:53:23 of like things going to happen,

01:53:24 then I have to reason because I’ve never,

01:53:26 it’s a very complicated setup.

01:53:27 I’ve never seen that particular setup.

01:53:29 And I really need to draw and like imagine in my head

01:53:32 what’s going to happen to figure it out.

01:53:34 So I think, yes, neural networks are really good

01:53:36 at recognition, but they’re not very good at reasoning.

01:53:41 Because they have seen something before

01:53:44 or seen something similar before, they’re very good

01:53:46 at making those sort of snap judgments.

01:53:48 But if you were to give them a very complicated thing

01:53:50 that they’ve not seen before,

01:53:52 they have very limited ability right now

01:53:55 to compose different things.

01:53:56 Like, oh, I’ve seen this particular part before.

01:53:58 I’ve seen this particular part before.

01:54:00 And now probably like this is how

01:54:01 they’re going to work in tandem.

01:54:02 It’s very hard for them to come up

01:54:04 with these kinds of things.

01:54:05 Well, there’s a certain aspect to reasoning

01:54:08 that you can maybe convert into the process of programming.

01:54:11 And so there’s the whole field of program synthesis

01:54:14 and people have been applying machine learning

01:54:17 to the problem of program synthesis.

01:54:18 And the question is, can they, the step of composition,

01:54:22 why can’t that be learned?

01:54:25 You know, this step of like building things on top of you,

01:54:29 like little intuitions, concepts on top of each other,

01:54:33 can that be learnable?

01:54:35 What’s your intuition there?

01:54:36 Or like, I guess similar set of techniques,

01:54:39 do you think that will be applicable?

01:54:42 So I think it is, of course, it is learnable

01:54:44 because like we are prime examples of machines

01:54:47 that have like, or individuals that have learned this, right?

01:54:49 Like humans have learned this.

01:54:51 So it is, of course, it is a technique

01:54:52 that is very easy to learn.

01:54:55 I think where we are kind of hitting a wall

01:54:58 basically with like current machine learning

01:55:00 is the fact that when the network learns

01:55:03 all of this information,

01:55:04 we basically are not able to figure out

01:55:07 how well it’s going to generalize to an unseen thing.

01:55:10 And we have no, like a priori, no way of characterizing that.

01:55:15 And I think that’s basically telling us a lot about,

01:55:18 like a lot about the fact that we really don’t know

01:55:20 what this model has learned and how well it’s basically,

01:55:22 because we don’t know how well it’s going to transfer.

01:55:25 There’s also a sense in which it feels like

01:55:28 we humans may not be aware of how much like background,

01:55:34 how good our background model is,

01:55:36 how much knowledge we just have slowly building

01:55:39 on top of each other.

01:55:41 It feels like neural networks

01:55:42 are constantly throwing stuff out.

01:55:43 Like you’ll do some incredible thing

01:55:45 where you’re learning a particular task in computer vision,

01:55:49 you celebrate your state of the art successes

01:55:51 and you throw that out.

01:55:52 Like, it feels like it’s,

01:55:54 you’re never using stuff you’ve learned

01:55:56 for your future successes in other domains.

01:56:00 And humans are obviously doing that exceptionally well,

01:56:03 still throwing stuff away in their mind,

01:56:05 but keeping certain kernels of truth.

01:56:07 Right, so I think we’re like,

01:56:09 continual learning is sort of the paradigm

01:56:11 for this in machine learning.

01:56:11 And I don’t think it’s a very well explored paradigm.

01:56:15 We have like things in deep learning, for example,

01:56:17 catastrophic forgetting is like one of the standard things.

01:56:20 The thing basically being that if you teach a network

01:56:23 like to recognize dogs,

01:56:24 and now you teach that same network to recognize cats,

01:56:27 it basically forgets how to recognize dogs.

01:56:29 So it forgets very quickly.

01:56:30 I mean, and whereas a human,

01:56:32 if you were to teach someone to recognize dogs

01:56:34 and then to recognize cats,

01:56:35 they don’t forget immediately how to recognize these dogs.

01:56:38 I think that’s basically sort of what you’re trying to get.
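
A toy sketch of catastrophic forgetting, with the data loaders and the eval_a accuracy function as placeholders; nothing here is a proposed fix, it just shows why naive sequential training loses the earlier task.

    import torch
    import torch.nn.functional as F

    def train_sequentially(model, optimizer, task_a_loader, task_b_loader, eval_a):
        # Train on task A (say, dogs), then on task B (say, cats), and watch
        # accuracy on task A drop, because nothing constrains the weights to
        # remember A while fitting B.
        def fit(loader):
            model.train()
            for images, labels in loader:
                optimizer.zero_grad()
                loss = F.cross_entropy(model(images), labels)
                loss.backward()
                optimizer.step()

        fit(task_a_loader)
        acc_before = eval_a(model)   # accuracy on task A right after learning it
        fit(task_b_loader)
        acc_after = eval_a(model)    # typically much lower: the network "forgot" A
        return acc_before, acc_after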

01:56:40 Yeah, I just, I wonder if like

01:56:42 the long term memory mechanisms

01:56:44 or the mechanisms that store not just memories,

01:56:47 but concepts that allow you to then reason

01:56:54 and compose concepts,

01:56:57 if those things will look very different

01:56:59 than neural networks,

01:56:59 or if you can do that within a single neural network

01:57:02 with some particular sort of architecture quirks,

01:57:06 that seems to be a really open problem.

01:57:07 And of course I go up and down on that

01:57:09 because there’s something so compelling to the symbolic AI

01:57:14 or to the ideas of logic based sort of expert systems.

01:57:20 You have like human interpretable facts

01:57:22 that build on top of each other.

01:57:24 It’s really annoying like with self supervised learning

01:57:27 that the AI is not very explainable.

01:57:31 Like you can’t like understand

01:57:33 all the beautiful things it has learned.

01:57:35 You can’t ask it like questions,

01:57:38 but then again, maybe that’s a stupid thing

01:57:40 for us humans to want.

01:57:42 Right, I think whenever we try to like understand it,

01:57:45 we are putting our own subjective human bias into it.

01:57:47 Yeah.

01:57:48 And I think that’s the sort of problem

01:57:50 with self supervised learning,

01:57:51 the goal is that it should learn naturally from the data.

01:57:54 So now if you try to understand it,

01:57:55 you are using your own preconceived notions

01:57:58 of what this model has learned.

01:58:00 And that’s the problem.

01:58:03 High level question.

01:58:04 What do you think it takes to build a system

01:58:07 with superhuman, maybe let’s say human level

01:58:10 or superhuman level general intelligence?

01:58:13 We’ve already kind of started talking about this,

01:58:15 but what’s your intuition?

01:58:17 Like, does this thing have to have a body?

01:58:20 Does it have to interact richly with the world?

01:58:25 Does it have to have some more human elements

01:58:27 like self awareness?

01:58:30 I think emotion.

01:58:32 I think emotion is something which is like,

01:58:35 it’s not really attributed typically

01:58:37 in standard machine learning.

01:58:38 It’s not something we think about,

01:58:39 like there is NLP, there is vision,

01:58:41 there is no like emotion.

01:58:42 Emotion is never a part of all of this.

01:58:44 And that just seems a little bit weird to me.

01:58:47 I think the reason basically being that there is surprise

01:58:53 and like, basically one of the reasons

01:58:55 emotions arise is like what happens

01:58:55 and what do you expect to happen, right?

01:58:57 There is like a mismatch between these things.

01:58:59 And so that gives rise to like,

01:59:01 I can either be surprised or I can be saddened

01:59:03 or I can be happy and all of this.

01:59:05 And so this basically indicates

01:59:07 that I already have a predictive model in my head

01:59:10 and something that I predicted or something

01:59:11 that I thought was likely to happen.

01:59:13 And then there was something that I observed

01:59:15 that happened that there was a disconnect

01:59:16 between these two things.

01:59:18 And that basically is like maybe one of the reasons

01:59:21 like you have a lot of emotions.

01:59:24 Yeah, I think, so I talk to people a lot about them

01:59:26 like Lisa Feldman Barrett.

01:59:29 I think that’s an interesting concept of emotion

01:59:31 but I have a sense that emotion primarily

01:59:36 in the way we think about it,

01:59:38 which is the display of emotion

01:59:40 is a communication mechanism between humans.

01:59:43 So it’s a part of basically human to human interaction,

01:59:48 an important part, but just the part.

01:59:50 So it’s like, I would throw it into the full mix

01:59:55 of communication.

01:59:58 And to me, communication can be done with objects

02:00:01 that don’t look at all like humans.

02:00:04 Okay.

02:00:05 I’ve seen our ability to anthropomorphize

02:00:07 our ability to connect with things that look like a Roomba

02:00:10 our ability to connect.

02:00:12 First of all, let’s talk about other biological systems

02:00:14 like dogs, our ability to love things

02:00:17 that are very different than humans.

02:00:19 But they do display emotion, right?

02:00:20 I mean, dogs do display emotion.

02:00:23 So they don’t have to be anthropomorphic

02:00:25 for them to like display the kind of emotions

02:00:27 that we do.

02:00:28 Exactly.

02:00:29 So, I mean, but then the word emotion starts to lose.

02:00:33 So then we have to be, I guess specific, but yeah.

02:00:36 So have rich flavorful communication.

02:00:39 Communication, yeah.

02:00:40 Yeah, so like, yes, it’s full of emotion.

02:00:43 It’s full of wit and humor and moods

02:00:49 and all those kinds of things, yeah.

02:00:50 So you’re talking about like flavor.

02:00:53 Flavor, yeah.

02:00:54 Okay, let’s call it that.

02:00:55 So there’s content and then there is flavor

02:00:57 and I’m talking about the flavor.

02:00:58 Do you think it needs to have a body?

02:01:00 Do you think like to interact with the physical world?

02:01:02 Do you think you can understand the physical world

02:01:04 without being able to directly interact with it?

02:01:07 I don’t think so, yeah.

02:01:08 I think at some point we will need to bite the bullet

02:01:10 and actually interact with the physical,

02:01:12 as much as I like working on like passive computer vision

02:01:15 where I just like sit in my arm chair

02:01:17 and look at videos and learn.

02:01:19 I do think that we will need to have some kind of embodiment

02:01:22 or some kind of interaction

02:01:24 to figure out things about the world.

02:01:26 What about consciousness?

02:01:28 Do you think, how often do you think about consciousness

02:01:32 when you think about your work?

02:01:34 You could think of it

02:01:35 as the more simple thing of self awareness,

02:01:38 of being aware that you are a perceiving,

02:01:43 sensing, acting thing in this world.

02:01:46 Or you can think about the bigger version of that,

02:01:50 which is consciousness,

02:01:51 which is having it feel like something to be that entity,

02:01:57 the subjective experience of being in this world.

02:01:59 So I think of self awareness a little bit more

02:02:01 than like the broader goal of it,

02:02:03 because I think self awareness is pretty critical

02:02:06 for like any kind of like any kind of AGI

02:02:09 or whatever you want to call it that we build,

02:02:10 because it needs to contextualize what it is

02:02:13 and what role it’s playing

02:02:15 with respect to all the other things that exist around it.

02:02:17 I think that requires self awareness.

02:02:19 It needs to understand that it’s an autonomous car, right?

02:02:23 And what does that mean?

02:02:24 What are its limitations?

02:02:26 What are the things that it is supposed to do and so on?

02:02:29 What is its role in some way?

02:02:30 Or, I mean, these are the kinds of things

02:02:34 that we kind of expect from it, I would say.

02:02:36 And so that’s the level of self awareness

02:02:39 that’s, I would say, basically required at least,

02:02:42 if not more than that.

02:02:44 Yeah, I tend to, on the emotion side,

02:02:46 believe that it has to have,

02:02:48 it has to be able to display consciousness.

02:02:52 Display consciousness, what do you mean by that?

02:02:54 Meaning like for us humans to connect with each other

02:02:57 or to connect with other living entities,

02:03:01 I think we need to feel,

02:03:04 like in order for us to truly feel

02:03:06 like that there’s another being there,

02:03:09 we have to believe that they’re conscious.

02:03:11 And so we won’t ever connect with something

02:03:14 that doesn’t have elements of consciousness.

02:03:17 Now I tend to think that that’s easier to achieve

02:03:21 than it may sound,

02:03:23 because we anthropomorphize stuff so hard.

02:03:25 Like you have a mug that just like has wheels

02:03:28 and like rotates every once in a while and makes a sound.

02:03:31 I think a couple of days in,

02:03:34 especially if you don’t hang out with humans,

02:03:39 you might start to believe that mug on wheels is conscious.

02:03:42 So I think we anthropomorphize pretty effectively

02:03:44 as human beings.

02:03:46 But I do think that it’s in the same bucket

02:03:49 that we’ll call emotion,

02:03:50 that show that you’re,

02:03:54 I think of consciousness as the capacity to suffer.

02:03:58 And if you’re an entity that’s able to feel things

02:04:02 in the world and to communicate that to others,

02:04:06 I think that’s a really powerful way

02:04:08 to interact with humans.

02:04:10 And in order to create an AGI system,

02:04:13 I believe you should be able to richly interact with humans.

02:04:18 Like humans would need to want to interact with you.

02:04:21 Like it can’t be like,

02:04:22 it’s the self supervised learning versus like,

02:04:27 like the robot shouldn’t have to pay you

02:04:29 to interact with it.

02:04:30 So like it should be a natural fun thing.

02:04:33 And then you’re going to scale up significantly

02:04:36 how much interaction it gets.

02:04:39 It’s the Alexa prize,

02:04:40 which they were trying to get me to be a judge

02:04:43 on their contest.

02:04:44 Let’s see if I want to do that.

02:04:46 But their challenge is to talk to you,

02:04:50 make the human sufficiently interested

02:04:53 that the human keeps talking for 20 minutes.

02:04:56 To Alexa?

02:04:57 To Alexa, yeah.

02:04:58 And right now they’re not even close to that

02:05:00 because it just gets so boring when you’re like,

02:05:02 when the intelligence is not there,

02:05:04 it gets very not interesting to talk to it.

02:05:06 And so the robot needs to be interesting.

02:05:08 And one of the ways it can be interesting

02:05:10 is display the capacity to love, to suffer.

02:05:14 And I would say that essentially means

02:05:17 the capacity to display consciousness.

02:05:20 Like it is an entity, much like a human being.

02:05:25 Of course, what that really means,

02:05:27 I don’t know if that’s fundamentally a robotics problem

02:05:30 or some kind of problem that we’re not yet even aware.

02:05:33 Like if it is truly a hard problem of consciousness,

02:05:36 I tend to maybe optimistically think it’s a,

02:05:38 we can pretty effectively fake it till we make it.

02:05:42 So we can display a lot of human like elements for a while.

02:05:46 And that will be sufficient to form

02:05:49 really close connections with humans.

02:05:52 What to you is the most beautiful idea

02:05:53 in self supervised learning?

02:05:55 Like when you sit back with, I don’t know,

02:05:59 with a glass of wine and an armchair

02:06:03 and just at a fireplace,

02:06:06 just thinking how beautiful this world that you get

02:06:08 to explore is, what do you think

02:06:10 is the especially beautiful idea?

02:06:13 The fact that like object level,

02:06:16 what objects are and some notion of objectness emerges

02:06:19 from these models by just like self supervised learning.

02:06:23 So for example, like one of the things like the DINO paper

02:06:28 that I was a part of at Facebook is that the object sort

02:06:33 of boundaries emerge from these representations.

02:06:35 So if you have like a dog running in the field,

02:06:38 the boundaries around the dog,

02:06:39 the network is basically able to figure out

02:06:42 what the boundaries of this dog are automatically.

02:06:45 And it was never trained to do that.

02:06:47 It was never trained to, no one taught it

02:06:50 that this is a dog and these pixels belong to a dog.

02:06:52 It’s able to group these things together automatically.

02:06:55 So that’s one.

02:06:56 I think in general, that entire notion that this dumb idea

02:07:00 that you take like these two crops of an image

02:07:01 and then you say that the features should be similar,

02:07:04 that has resulted in something like this,

02:07:06 like the model is able to figure out

02:07:07 what the dog pixels are and so on.

02:07:10 That just seems like so surprising.

02:07:13 And I mean, I don’t think a lot of us even understand

02:07:16 how that is happening really.

02:07:18 And it’s something we are taking for granted,

02:07:20 maybe like a lot in terms of how we’re setting up

02:07:23 these algorithms, but it’s just,

02:07:24 it’s a very beautiful and powerful idea.

02:07:26 So it’s really fundamentally telling us something about

02:07:30 that there is so much signal in the pixels

02:07:32 that we can be super dumb about it,

02:07:34 about how we are setting up

02:07:35 the self supervised learning problem.

02:07:37 And despite being like super dumb about it,

02:07:39 we’ll actually get very good,

02:07:41 like we’ll actually get something that is able to do

02:07:44 very like surprising things.
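
A minimal sketch of that two-crop idea, with augment and model as placeholder assumptions; note that DINO itself uses a student/teacher pair with extra machinery (centering, sharpening, momentum updates) that this sketch deliberately leaves out.

    import torch
    import torch.nn.functional as F

    def two_crop_consistency_loss(model, images, augment):
        # The "dumb" idea stated in the conversation: take two random crops of the
        # same images and ask the network to produce similar features for both.
        # augment() stands in for random cropping / color jitter on a batch, and
        # model maps a batch of crops to a batch of embeddings.
        crop_a, crop_b = augment(images), augment(images)
        z_a = F.normalize(model(crop_a), dim=-1)
        z_b = F.normalize(model(crop_b), dim=-1)
        # Maximize cosine similarity between the two views of each image.
        return -(z_a * z_b).sum(dim=-1).mean()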

02:07:45 I wonder if there’s other like objectness

02:07:48 of other concepts that can emerge.

02:07:51 I don’t know if you follow Francois Chollet,

02:07:53 he had the competition for intelligence

02:07:56 that basically it’s kind of like an IQ test,

02:07:59 but for machines, but for an IQ test,

02:08:02 you have to have a few concepts that you want to apply.

02:08:05 One of them is objectness.

02:08:07 I wonder if those concepts can emerge

02:08:11 through self supervised learning on billions of images.

02:08:14 I think something like object permanence

02:08:16 can definitely emerge, right?

02:08:17 So that’s like a fundamental concept which we have,

02:08:20 maybe not through images, through video,

02:08:21 but that’s another concept that should be emerging from it

02:08:25 because it’s not something that,

02:08:26 like even if we don’t teach humans

02:08:29 about this concept of object permanence,

02:08:31 it actually emerges.

02:08:32 And the same thing for like animals, like dogs,

02:08:34 I think object permanence automatically

02:08:36 is something that they are born with.

02:08:38 So I think it should emerge from the data.

02:08:40 It should emerge basically very quickly.

02:08:42 I wonder if ideas like symmetry, rotation,

02:08:45 these kinds of things might emerge.

02:08:47 So I think rotation, probably yes.

02:08:50 Yeah, rotation, yes.

02:08:51 I mean, there’s some constraints in the architecture itself,

02:08:55 but it’s interesting if all of them could be,

02:08:59 like counting was another one, being able to kind of

02:09:04 understand that there’s multiple objects

02:09:06 of the same kind in the image and be able to count them.

02:09:10 I wonder if all of that could be,

02:09:11 if constructed correctly, they can emerge

02:09:14 because then you can transfer those concepts

02:09:16 to then interpret images at a deeper level.

02:09:20 Right.

02:09:21 Counting, I do believe, I mean, it should be possible.

02:09:24 We don’t know yet,

02:09:25 but I do think it’s not that far in the realm of possibility.

02:09:29 Yeah, that’d be interesting

02:09:30 if using self supervised learning on images

02:09:33 can then be applied to then solving those kinds of IQ tests,

02:09:36 which seem currently to be kind of impossible.

02:09:40 What idea do you believe might be true

02:09:43 that most people think is not true

02:09:46 or don’t agree with you on?

02:09:48 Is there something like that?

02:09:50 So this is going to be a little controversial,

02:09:52 but okay, sure.

02:09:53 I don’t believe in simulation.

02:09:55 Like actually using simulation to do things very much.

02:09:58 Just to clarify, because this is a podcast

02:10:01 where you talk about, are we living in a simulation often?

02:10:03 You’re referring to using simulation to construct worlds

02:10:08 that you then leverage for machine learning.

02:10:10 Right, yeah.

02:10:11 For example, like one example would be like

02:10:13 to train an autonomous car driving system.

02:10:15 You basically first build a simulator,

02:10:17 which builds like the environment of the world.

02:10:19 And then you basically have a lot of like,

02:10:22 you train your machine learning system in that.

02:10:25 So I believe it is possible,

02:10:27 but I think it’s a really expensive way of doing things.

02:10:30 And at the end of it, you do need the real world.

02:10:33 So I’m not sure.

02:10:35 So maybe for certain settings,

02:10:36 like maybe the payout is so large,

02:10:38 like for autonomous driving, the payout is so large

02:10:40 that you can actually invest that much money to build it.

02:10:43 But I think as a general sort of principle,

02:10:45 it does not apply to a lot of concepts.

02:10:47 You can’t really build simulations of everything.

02:10:49 Not only because like one, it’s expensive,

02:10:51 but second, it's also not possible for a lot of things.

02:10:54 So in general, like there’s a lot of work

02:10:59 on like using synthetic data and like synthetic simulators.

02:11:02 I generally am not very, like I don’t believe in that.

02:11:05 So you’re saying it’s very challenging visually,

02:11:09 like to correctly like simulate the visual,

02:11:11 like the lighting, all those kinds of things.

02:11:13 I mean, all these companies that you have, right?

02:11:15 So like Pixar and like whatever,

02:11:17 all these companies are,

02:11:19 all this like computer graphics stuff,

02:11:21 a lot of it is really about

02:11:22 accurately trying to figure out

02:11:26 how the lighting is and like how things reflect off

02:11:28 of one another and so on,

02:11:30 and like how sparkly things look and so on.

02:11:32 So it’s a very hard problem.

02:11:34 So do we really need to solve that first

02:11:37 to be able to like do computer vision?

02:11:39 Probably not.

02:11:40 And for me, in the context of autonomous driving,

02:11:44 it’s very tempting to be able to use simulation, right?

02:11:48 Because it’s a safety critical application,

02:11:50 but the other limitation of simulation that perhaps

02:11:54 is a bigger one than the visual limitation

02:11:58 is the behavior of objects.

02:12:00 So you’re ultimately interested in edge cases.

02:12:03 And the question is,

02:12:05 how well can you generate edge cases in simulation,

02:12:08 especially with human behavior?

02:12:11 I think another problem is like for autonomous driving,

02:12:13 it’s a constantly changing world.

02:12:15 So say autonomous driving like in 10 years from now,

02:12:18 like there are lots of autonomous cars,

02:12:20 but they’re still going to be humans.

02:12:22 So now there are 50% of the agents say, which are humans,

02:12:25 50% of the agents that are autonomous,

02:12:26 like car driving agents.

02:12:28 So now the mixture has changed.

02:12:30 So now the kinds of behaviors that you actually expect

02:12:32 from the other agents or other cars on the road

02:12:35 are actually going to be very different.

02:12:36 And as the proportion of the number of autonomous cars

02:12:39 to humans keeps changing,

02:12:40 this behavior will actually change a lot.

02:12:42 So now if you were to build a simulator

02:12:44 based on just how things are right now, today,

02:12:46 you don’t have that many autonomous cars on the road.

02:12:48 So you would try to like make all of the other agents

02:12:50 in that simulator behave as humans,

02:12:52 but that’s not really going to hold true 10, 15, 20,

02:12:55 30 years from now.

02:12:57 Do you think we’re living in a simulation?

02:12:59 No.

02:13:01 How hard is it?

02:13:02 This is why I think it’s an interesting question.

02:13:04 How hard is it to build a video game,

02:13:07 like virtual reality game where it is so real,

02:13:12 forget like ultra realistic to where

02:13:15 you can’t tell the difference,

02:13:17 but like it’s so nice that you just want to stay there.

02:13:20 You just want to stay there and you don’t want to come back.

02:13:24 Do you think that’s doable within our lifetime?

02:13:29 Within our lifetime, probably.

02:13:31 Yeah.

02:13:32 I eat healthy, I live long.

02:13:33 Does that make you sad that there’ll be like

02:13:39 a population of kids that basically spend 95%,

02:13:44 99% of their time in a virtual world?

02:13:50 Very, very hard question to answer.

02:13:53 For certain people, it might be something

02:13:55 that they really derive a lot of value out of,

02:13:58 derive a lot of enjoyment and like happiness out of,

02:14:00 and maybe the real world wasn’t giving them that.

02:14:03 That’s why they did that.

02:14:03 So maybe it is good for certain people.

02:14:05 So ultimately, if it maximizes happiness,

02:14:09 Right, I think if.

02:14:10 Who are we to judge?

02:14:11 Yeah, I think if it’s making people happy,

02:14:12 maybe it’s okay.

02:14:14 Again, I think this is a very hard question.

02:14:18 So like you’ve been a part of a lot of amazing papers.

02:14:23 What advice would you give to somebody

02:14:25 on what it takes to write a good paper?

02:14:29 Grad students writing papers now,

02:14:31 is there common things that you’ve learned along the way

02:14:34 that you think it takes,

02:14:35 both for a good idea and a good paper?

02:14:39 Right, so I think both of these have picked up

02:14:44 from like lots of people I’ve worked with in the past.

02:14:46 So one of them is picking the right problem

02:14:48 to work on in research is as important

02:14:51 as like finding the solution to it.

02:14:53 So I mean, there are multiple reasons for this.

02:14:56 So one is that there are certain problems

02:14:59 that can actually be solved in a particular timeframe.

02:15:02 So now say you want to work on finding the meaning of life.

02:15:06 This is a great problem.

02:15:07 I think most people will agree with that.

02:15:09 But do you believe that your talents

02:15:12 and like the energy that you’ll spend on it

02:15:13 will make some kind of meaningful progress

02:15:17 in your lifetime?

02:15:18 If you are optimistic about it, then go ahead.

02:15:21 That’s why I started this podcast.

02:15:22 I keep asking people about the meaning of life.

02:15:24 I’m hoping by episode like 2.20, I’ll figure it out.

02:15:27 Oh, not too many episodes to go.

02:15:30 All right, cool.

02:15:31 Maybe today, I don’t know, but you’re right.

02:15:33 So that seems intractable at the moment.

02:15:36 Right, so I think it’s just the fact of like,

02:15:39 if you’re starting a PhD, for example,

02:15:41 what is one problem that you want to focus on

02:15:43 that you do think is interesting enough,

02:15:45 and you will be able to make a reasonable amount

02:15:47 of headway into it in the time you'll be doing a PhD for?

02:15:50 So in that kind of a timeframe.

02:15:53 So that’s one.

02:15:53 Of course, there’s the second part,

02:15:54 which is what excites you genuinely.

02:15:56 So you shouldn’t just pick problems

02:15:57 that you are not excited about,

02:15:59 because as a grad student or as a researcher,

02:16:01 you really need to be passionate about it

02:16:03 to continue doing that,

02:16:04 because there are so many other things

02:16:05 that you could be doing in life.

02:16:07 So you really need to believe in that

02:16:08 to be able to do that for that long.

02:16:10 In terms of papers, I think the one thing

02:16:12 that I’ve learned is,

02:16:15 like in the past, whenever I used to write things,

02:16:17 and even now, whenever I do that,

02:16:18 I try to cram in a lot of things into the paper,

02:16:21 whereas what really matters

02:16:22 is just pushing one simple idea, that’s it.

02:16:25 That’s all because the paper is going to be like,

02:16:29 whatever, eight or nine pages.

02:16:32 If you keep cramming in lots of ideas,

02:16:34 it’s really hard for the single thing

02:16:36 that you believe in to stand out.

02:16:38 So if you really try to just focus,

02:16:40 especially in terms of writing,

02:16:41 really try to focus on one particular idea

02:16:43 and articulate it out in multiple different ways,

02:16:46 it’s far more valuable to the reader as well,

02:16:49 basically to the reader, of course,

02:16:51 because they know

02:16:53 that this particular idea

02:16:54 is associated with this paper,

02:16:56 and also for you, because

02:16:59 when you write about a particular idea in different ways,

02:17:01 you think about it more deeply.

02:17:02 So as a grad student, I used to always wait until

02:17:06 maybe the last week or whatever, to write the paper,

02:17:08 because I used to always believe

02:17:10 that doing the experiments

02:17:11 was actually the bigger part of research than writing.

02:17:13 And my advisor always told me

02:17:15 that you should start writing very early on,

02:17:16 and I thought, oh, it doesn’t matter,

02:17:17 I don’t know what he’s talking about.

02:17:19 But I think more and more I realized that’s the case.

02:17:22 Whenever I write something that I’m doing,

02:17:24 I actually think much better about it.

02:17:26 And so if you start writing early on,

02:17:28 you actually, I think, get better ideas,

02:17:31 or at least you figure out holes in your theory,

02:17:33 or particular experiments that you should run

02:17:36 to plug those holes, and so on.

02:17:38 Yeah, I’m continually surprised

02:17:40 how many really good papers throughout history

02:17:43 are quite short and quite simple.

02:17:48 And there’s a lesson to that.

02:17:50 If you want to dream about writing a paper

02:17:52 that changes the world,

02:17:54 and you wanna go by example, they’re usually simple.

02:17:58 And that’s, it’s not cramming,

02:18:01 or it’s focusing on one idea, and thinking deeply.

02:18:07 And you’re right that the writing process itself

02:18:10 reveals the idea.

02:18:12 It challenges you to really think about what is the idea

02:18:15 that explains it, the thread that ties it all together.

02:18:19 And so a lot of famous researchers I know

02:18:21 actually would start off, like, first they were,

02:18:24 even before the experiments were in,

02:18:27 a lot of them would actually start

02:18:28 with writing the introduction of the paper,

02:18:30 with zero experiments in.

02:18:32 Because that at least helps them figure out

02:18:33 what they’re trying to solve,

02:18:35 and how it fits in the context of things right now.

02:18:38 And that would really guide their entire research.

02:18:40 So a lot of them would actually first write intros

02:18:42 with zero experiments in,

02:18:43 and that’s how they would start projects.

02:18:46 Some basic questions about people maybe

02:18:49 that are more like beginners in this field.

02:18:51 What’s the best programming language to learn

02:18:54 if you’re interested in machine learning?

02:18:56 I would say Python,

02:18:57 just because it’s the easiest one to learn.

02:19:00 And also a lot of like programming

02:19:03 and machine learning happens in Python.

02:19:05 So if you don’t know any other programming language,

02:19:07 Python is actually going to get you a long way.

02:19:09 Yeah, it seems like sort of a,

02:19:11 it’s a toss up question because it seems like Python

02:19:14 is so much dominating the space now.

02:19:16 But I wonder if there’s an interesting alternative.

02:19:18 Obviously there’s like Swift,

02:19:19 and there’s a lot of interesting alternatives popping up,

02:19:22 even JavaScript.

02:19:23 Or R, more like for the data science applications.

02:19:28 But it seems like Python more and more

02:19:31 is actually being used to teach like introduction

02:19:34 to programming at universities.

02:19:35 So it just combines everything very nicely.

02:19:39 Even harder question.

02:19:41 What are the pros and cons of PyTorch versus TensorFlow?

02:19:46 I see.

02:19:48 Okay.

02:19:49 You can go with no comment.

02:19:51 So a disclaimer to this is that the last time

02:19:53 I used TensorFlow was probably like four years ago.

02:19:56 And so it was right when it had come out

02:19:58 because I started on like deep learning in 2014 or so,

02:20:02 and the dominant sort of framework for us then

02:20:06 for vision was Caffe, which was out of Berkeley.

02:20:09 And we used Caffe a lot, it was really nice.

02:20:12 And then TensorFlow came in,

02:20:13 which was basically like Python first.

02:20:15 So Caffe was mainly C++,

02:20:17 and it had like very loose kind of Python binding.

02:20:19 So Python wasn’t really the first language you would use.

02:20:21 You would really use either MATLAB or C++

02:20:24 to get stuff done in like Caffe.

02:20:28 And then Python of course became popular a little bit later.

02:20:30 So TensorFlow was basically around that time.

02:20:32 So 2015, 2016 is when I last used it.

02:20:36 It’s been a while.

02:20:37 And then what, did you use Torch or did you?

02:20:40 So then I moved to LuaTorch, which was the torch in Lua.

02:20:44 And then in 2017, I think I basically pretty much moved

02:20:46 to PyTorch completely.

02:20:48 Oh, interesting.

02:20:49 So you went to Lua, cool.

02:20:50 Yeah.

02:20:51 Huh, so you were there before it was cool.

02:20:54 Yeah, I mean, so LuaTorch was really good

02:20:56 because it actually allowed you

02:20:59 to do a lot of different kinds of things.

02:21:01 Caffe, on the other hand, was very rigid in terms of its structure.

02:21:03 Like you would create a neural network once and that’s it.

02:21:06 Whereas if you wanted like very dynamic graphs and so on,

02:21:09 it was very hard to do that.

02:21:10 And LuaTorch was much more friendly

02:21:11 for all of these things.
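
As a hedged illustration of the dynamic-graph idea being described here (my own minimal PyTorch sketch, not code from the conversation), the point is that ordinary Python control flow inside forward() lets the computation depend on the data itself, which a define-the-network-once framework like Caffe could not easily express:

```python
# Minimal sketch of a "dynamic graph": the forward pass uses plain Python
# control flow, so a different graph can be built on every call.
import torch
import torch.nn as nn

class DynamicDepthNet(nn.Module):
    def __init__(self, dim=16):
        super().__init__()
        self.layer = nn.Linear(dim, dim)
        self.head = nn.Linear(dim, 1)

    def forward(self, x):
        # Data-dependent number of steps (illustrative, between 1 and 3).
        steps = int(x.abs().mean().item() * 10) % 3 + 1
        for _ in range(steps):
            x = torch.relu(self.layer(x))
        return self.head(x)

net = DynamicDepthNet()
out = net(torch.randn(4, 16))  # the graph is rebuilt each forward pass
out.sum().backward()           # autograd still traces through the loop
```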

02:21:13 Okay, so in terms of PyTorch and TensorFlow,

02:21:15 my personal bias is PyTorch

02:21:17 just because I’ve been using it longer

02:21:19 and I’m more familiar with it.

02:21:20 And also that PyTorch is much easier to debug

02:21:23 is what I find because it’s imperative in nature

02:21:26 compared to like TensorFlow, which is not imperative.

02:21:28 But that’s telling you a lot that basically

02:21:30 the imperative design is sort of a way

02:21:33 in which a lot of people are taught programming

02:21:35 and that’s what actually makes debugging easier for them.

02:21:38 So like I learned programming in C, C++.

02:21:40 And so for me, imperative way of programming is more natural.
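
A rough sketch of why imperative (eager) execution makes debugging feel natural, again my own illustrative example rather than anything quoted from the conversation: every intermediate tensor is a concrete value you can print, or pause on with pdb, line by line.

```python
# Eager/imperative debugging: inspect intermediate tensors as real values.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(8, 8), nn.ReLU(), nn.Linear(8, 2))
x = torch.randn(3, 8)

h = model[0](x)
print("pre-activation mean/std:", h.mean().item(), h.std().item())
# import pdb; pdb.set_trace()  # you could pause here and poke at h interactively

h = model[1](h)
logits = model[2](h)
print("any NaNs in logits?", torch.isnan(logits).any().item())
```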

02:21:44 Do you think it’s good to have

02:21:45 kind of these two communities, this kind of competition?

02:21:48 I think PyTorch is kind of more and more

02:21:50 becoming dominant in the research community,

02:21:52 but TensorFlow is still very popular

02:21:54 in the more sort of application machine learning community.

02:21:57 So do you think it’s good to have

02:21:59 that kind of split in code bases?

02:22:02 Or so like the benefit there is the competition challenges

02:22:06 the library developers to step up their game.

02:22:09 But the downside is there’s these code bases

02:22:12 that are in different libraries.

02:22:15 Right, so I think the downside is that,

02:22:17 I mean, for a lot of research code

02:22:18 that’s released in one framework

02:22:19 and if you’re using the other one,

02:22:20 it’s really hard to like really build on top of it.

02:22:23 But thankfully the open source community

02:22:25 in machine learning is amazing.

02:22:27 So whenever like something pops up in TensorFlow,

02:22:30 you wait a few days and someone who’s like super sharp

02:22:33 will actually come and translate that particular code

02:22:35 base into PyTorch and basically have figured

02:22:38 all the nooks and crannies out.

02:22:39 So the open source community is amazing

02:22:41 and they really like figure out this gap.

02:22:44 So I think in terms of like having these two frameworks

02:22:47 or multiple, I think of course there are different use cases

02:22:49 so there are going to be benefits

02:22:51 to using one or the other framework.

02:22:52 And like you said, I think competition is just healthy

02:22:54 because both of these frameworks keep

02:22:57 or like all of these frameworks really sort of

02:22:59 keep learning from each other

02:23:00 and keep incorporating different things

02:23:01 to just make them better and better.

02:23:03 What advice would you have for someone

02:23:06 new to machine learning, you know,

02:23:09 maybe just started or haven’t even started

02:23:11 but are curious about it and who want to get in the field?

02:23:14 Don’t be afraid to get your hands dirty.

02:23:16 I think that’s the main thing.

02:23:17 So if something doesn’t work,

02:23:19 like really drill into why things are not working.

02:23:22 Can you elaborate what getting your hands dirty means?

02:23:24 Right, so for example, like if an algorithm,

02:23:27 if you try to train the network and it’s not converging,

02:23:29 whatever, rather than trying to like Google the answer

02:23:32 or trying to do something,

02:23:33 like really spend those like five, eight, 10, 15, 20,

02:23:36 whatever number of hours really trying

02:23:37 to figure it out yourself.

02:23:39 Because in that process, you’ll actually learn a lot more.
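
As one concrete flavor of that kind of hands-on digging, here is a hedged sketch of common sanity checks for a network that will not converge (standard practice, not a recipe prescribed in the conversation): overfit a single tiny batch and watch the gradient norms.

```python
# Sanity checks for a non-converging model: overfit one tiny batch
# and monitor gradient norms to localize the problem.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 2))
opt = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.CrossEntropyLoss()

# One fixed tiny batch: if the loss will not drive toward zero here,
# the bug is in the model/loss/labels, not in the data size or schedule.
x, y = torch.randn(8, 10), torch.randint(0, 2, (8,))
for step in range(200):
    opt.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()

    # All-zero gradients suggest a detached graph or dead ReLUs;
    # huge gradients suggest a bad learning rate or loss scaling.
    grad_norm = sum(p.grad.norm().item() for p in model.parameters())
    if step % 50 == 0:
        print(f"step {step:3d}  loss {loss.item():.4f}  grad norm {grad_norm:.4f}")
    opt.step()
```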

02:23:41 Yeah.

02:23:42 Googling is of course like a good way to solve it

02:23:44 when you need a quick answer.

02:23:45 But I think initially, especially like when you’re starting

02:23:48 out, it’s much nicer to like figure things out by yourself.

02:23:51 And I just say that from experience

02:23:52 because like when I started out,

02:23:54 there were not a lot of resources.

02:23:55 So we would like in the lab, a lot of us,

02:23:57 like we would look up to senior students

02:23:59 and then the senior students were of course busy

02:24:01 and they would be like, hey, why don’t you go figure it out?

02:24:03 Because I just don’t have the time.

02:24:04 I’m working on my dissertation or whatever.

02:24:06 I’ll find a PhD students.

02:24:07 And so then we would sit down

02:24:08 and like just try to figure it out.

02:24:10 And that I think really helped me.

02:24:12 That has really helped me figure a lot of things out.

02:24:15 I think in general, if I were to generalize that,

02:24:18 I feel like persevering through any kind of struggle

02:24:22 on a thing you care about is good.

02:24:25 So you’re basically, you try to make it seem

02:24:27 like it’s good to spend time debugging,

02:24:30 but really any kind of struggle, whatever form that takes,

02:24:33 it could be just Googling a lot.

02:24:36 Just basically anything, just sticking with it

02:24:38 and going through the hard thing that could take a form

02:24:41 of implementing stuff from scratch.

02:24:43 It could take the form of reimplementing

02:24:45 with different libraries

02:24:46 or different programming languages.

02:24:49 It could take a lot of different forms,

02:24:50 but struggle is good for the soul.

02:24:53 So like in Pittsburgh, where I did my PhD,

02:24:55 the thing was it used to snow a lot.

02:24:58 And so when it snowed, you really couldn't do much.

02:25:00 So the thing that a lot of people said

02:25:02 was snow builds character.

02:25:05 Because when it’s snowing, you can’t do anything else.

02:25:07 You focus on work.

02:25:09 Do you have advice in general for people,

02:25:10 you’ve already exceptionally successful, you’re young,

02:25:13 but do you have advice for young people starting out

02:25:15 in college or maybe in high school?

02:25:18 Advice for their career, advice for their life,

02:25:21 how to pave a successful path in career and life?

02:25:25 I would say just be hungry.

02:25:27 Always be hungry for what you want.

02:25:29 And I think I’ve been inspired by a lot of people

02:25:33 who are just driven and who really go for what they want,

02:25:36 no matter what, like you shouldn’t want it,

02:25:39 you should need it.

02:25:40 So if you need something,

02:25:41 you basically go towards the ends to make it work.

02:25:44 How do you know when you come across a thing

02:25:47 that’s like you need?

02:25:51 I think there’s not going to be any single thing

02:25:53 that you’re going to need.

02:25:53 There are going to be different types of things

02:25:54 that you need, but whenever you need something,

02:25:56 you just go push for it.

02:25:57 And of course, once you, you may not get it,

02:26:00 or you may find that this was not even the thing

02:26:01 that you were looking for, it might be a different thing.

02:26:03 But the point is like you’re pushing through things

02:26:06 and that actually brings a lot of skills

02:26:08 and builds a certain kind of attitude

02:26:12 which will probably help you get the other thing

02:26:15 once you figure out what’s really the thing that you want.

02:26:18 Yeah, I think a lot of people are,

02:26:20 I’ve noticed, kind of afraid of that

02:26:22 is because one, it’s a fear of commitment.

02:26:24 And two, there’s so many amazing things in this world,

02:26:26 you almost don’t want to miss out

02:26:28 on all the other amazing things

02:26:29 by committing to this one thing.

02:26:31 So I think a lot of it has to do with just

02:26:32 allowing yourself to notice that thing

02:26:37 and just go all the way with it.

02:26:41 I mean, I also like failure, right?

02:26:43 So I know this is like super cheesy that failure

02:26:47 is something that you should be prepared for and so on,

02:26:49 but I do think, I mean, especially in research,

02:26:52 for example, failure is something that happens

02:26:54 almost every day, like experiments failing

02:26:58 and not working.

02:26:59 And so you really need to be so used to it.

02:27:02 You need to have a thick skin,

02:27:03 but it's only basically

02:27:06 when you get through it that you find

02:27:07 the one thing that's actually working.

02:27:09 So Thomas Edison was like one person like that, right?

02:27:11 So I really, like when I was a kid,

02:27:13 I used to really read about how he found like the filament,

02:27:17 the light bulb filament.

02:27:18 And then he, I think his thing was like,

02:27:20 he tried 990 things that didn’t work

02:27:23 or something of the sort.

02:27:24 And then they asked him like, so what did you learn?

02:27:26 Because all of these were failed experiments.

02:27:28 And then he says, oh, these 990 things don’t work.

02:27:31 And I know that.

02:27:32 Did you know that?

02:27:33 I mean, that’s really inspiring.

02:27:35 So you spent a few years on this earth

02:27:38 performing a self supervised kind of learning process.

02:27:43 Have you figured out the meaning of life yet?

02:27:46 I told you I’m doing this podcast

02:27:47 to try to get the answer.

02:27:49 I’m hoping you could tell me,

02:27:50 what do you think the meaning of it all is?

02:27:54 I don’t think I figured this out.

02:27:55 No, I have no idea.

02:27:57 Do you think AI will help us figure it out

02:28:02 or do you think there’s no answer?

02:28:03 The whole point is to keep searching.

02:28:05 I think, yeah, I think it’s an endless sort of quest for us.

02:28:08 I don’t think AI will help us there.

02:28:10 This is like a very hard, hard, hard question

02:28:13 which so many humans have tried to answer.

02:28:15 Well, that’s the interesting thing

02:28:16 about the difference between AI and humans.

02:28:19 Humans don’t seem to know what the hell they’re doing.

02:28:21 And AI is almost always operating

02:28:23 under well defined objective functions.

02:28:28 And I wonder whether our lack of ability

02:28:33 to define good longterm objective functions

02:28:37 or introspect what is the objective function

02:28:40 under which we operate, if that’s a feature or a bug.

02:28:44 I would say it’s a feature

02:28:45 because then everyone actually has very different kinds

02:28:47 of objective functions that they’re optimizing

02:28:49 and those objective functions evolve

02:28:51 and change dramatically through the course

02:28:53 of their life.

02:28:54 That’s actually what makes us interesting, right?

02:28:56 If otherwise, like if everyone was doing

02:28:58 the exact same thing, that would be pretty boring.

02:29:00 We do want like people with different kinds

02:29:02 of perspectives, also people evolve continuously.

02:29:06 That’s like, I would say the biggest feature of being human.

02:29:09 And then we get to like the ones that die

02:29:11 because they do something stupid.

02:29:12 We get to watch that, see it and learn from it.

02:29:15 And as a species, we take that lesson

02:29:20 and become better and better

02:29:22 because of all the dumb people in the world

02:29:24 that died doing something wild and beautiful.

02:29:29 Ishan, thank you so much for this incredible conversation.

02:29:31 We did a depth first search through the space

02:29:37 of machine learning and it was fun and fascinating.

02:29:41 So it’s really an honor to meet you

02:29:43 and it was a really awesome conversation.

02:29:45 Thanks for coming down today and talking with me.

02:29:48 Thanks Lex, I mean, I’ve listened to you.

02:29:50 I told you it was unreal for me to actually meet you

02:29:52 in person and I’m so happy to be here, thank you.

02:29:55 Thanks man.

02:29:56 Thanks for listening to this conversation

02:29:58 with Ishan Misra and thank you to Onnit,

02:30:01 The Information, Grammarly and Athletic Greens.

02:30:05 Check them out in the description to support this podcast.

02:30:08 And now let me leave you with some words

02:30:10 from Arthur C. Clarke.

02:30:12 Any sufficiently advanced technology

02:30:14 is indistinguishable from magic.

02:30:18 Thank you for listening and hope to see you next time.