Transcript
00:00:00 The following is a conversation with Ishan Misra,
00:00:03 research scientist at Facebook AI Research,
00:00:05 who works on self supervised machine learning
00:00:08 in the domain of computer vision,
00:00:10 or in other words, making AI systems understand
00:00:14 the visual world with minimal help from us humans.
00:00:18 Transformers and self attention have been successfully used
00:00:21 by OpenAI’s GPT-3 and other language models
00:00:25 to do self supervised learning in the domain of language.
00:00:28 Ishan, together with Yann LeCun and others,
00:00:31 is trying to achieve the same success
00:00:33 in the domain of images and video.
00:00:36 The goal is to leave a robot
00:00:38 watching YouTube videos all night,
00:00:40 and in the morning, come back to a much smarter robot.
00:00:43 I read the blog post, Self Supervised Learning,
00:00:46 The Dark Matter of Intelligence by Ishan and Yann LeCun,
00:00:50 and then listened to Ishan’s appearance
00:00:52 on the excellent Machine Learning Street Talk podcast,
00:00:57 and I knew I had to talk to him.
00:00:59 By the way, if you’re interested in machine learning and AI,
00:01:02 I cannot recommend the ML Street Talk podcast highly enough.
00:01:07 Those guys are great.
00:01:09 Quick mention of our sponsors.
00:01:11 Onnit, The Information, Grammarly, and Athletic Greens.
00:01:15 Check them out in the description to support this podcast.
00:01:18 As a side note, let me say that,
00:01:20 for those of you who may have been listening
00:01:22 for quite a while, this podcast used to be called
00:01:24 Artificial Intelligence Podcast,
00:01:27 because my life passion has always been,
00:01:29 will always be artificial intelligence,
00:01:32 both narrowly and broadly defined.
00:01:35 My goal with this podcast is still
00:01:37 to have many conversations with world class researchers
00:01:40 in AI, math, physics, biology, and all the other sciences,
00:01:45 but I also want to talk to historians, musicians, athletes,
00:01:49 and of course, occasionally comedians.
00:01:51 In fact, I’m trying out doing this podcast
00:01:53 three times a week now to give me more freedom
00:01:56 with guest selection and maybe get a chance
00:01:59 to have a bit more fun.
00:02:00 Speaking of fun, in this conversation,
00:02:03 I challenge the listener to count the number of times
00:02:05 the word banana is mentioned.
00:02:08 Ishan and I use the word banana as the canonical example
00:02:12 at the core of the hard problem of computer vision
00:02:15 and maybe the hard problem of consciousness.
00:02:19 This is the Lex Fridman Podcast,
00:02:22 and here is my conversation with Ishan Misra.
00:02:27 What is self supervised learning?
00:02:29 And maybe even give the bigger basics
00:02:32 of what is supervised and semi supervised learning,
00:02:35 and maybe why is self supervised learning
00:02:37 a better term than unsupervised learning?
00:02:40 Let’s start with supervised learning.
00:02:41 So typically for machine learning systems,
00:02:43 the way they’re trained is you get a bunch of humans,
00:02:46 the humans point out particular concepts.
00:02:48 So if it’s in the case of images,
00:02:50 you want the humans to come and tell you
00:02:52 what is present in the image,
00:02:54 draw boxes around them, draw masks of like things,
00:02:57 pixels, which are of particular categories or not.
00:03:00 For NLP, again, there are like lots
00:03:01 of these particular tasks, say about sentiment analysis,
00:03:04 about entailment and so on.
00:03:06 So typically for supervised learning,
00:03:08 we get a big corpus of such annotated or labeled data.
00:03:11 And then we feed that to a system
00:03:12 and the system is really trying to mimic.
00:03:14 So it’s taking this input of the data
00:03:16 and then trying to mimic the output.
00:03:18 So it looks at an image and the human has tagged
00:03:20 that this image contains a banana.
00:03:22 And now the system is basically trying to mimic that.
00:03:24 So that’s its learning signal.
00:03:26 And so for supervised learning,
00:03:28 we try to gather lots of such data
00:03:30 and we train these machine learning models
00:03:31 to imitate the input output.
00:03:33 And the hope is basically by doing so,
00:03:35 now on unseen or like new kinds of data,
00:03:38 this model can automatically learn
00:03:40 to predict these concepts.
00:03:41 So this is a standard sort of supervised setting.
00:03:43 For semi supervised setting,
00:03:45 the idea typically is that you have,
00:03:47 of course, all of the supervised data,
00:03:49 but you have lots of other data,
00:03:50 which is unsupervised or which is like not labeled.
00:03:53 Now, the problem basically with supervised learning
00:03:55 and why you actually have all of these alternate
00:03:57 sort of learning paradigms is,
00:03:59 supervised learning just does not scale.
00:04:01 So if you look at for computer vision,
00:04:03 the sort of largest,
00:04:05 one of the most popular data sets is ImageNet, right?
00:04:07 So the entire ImageNet data set has about 22,000 concepts
00:04:11 and about 14 million images.
00:04:13 So these concepts are basically just nouns
00:04:16 and they’re annotated on images.
00:04:18 And this entire data set was a mammoth data collection
00:04:20 effort that actually gave rise
00:04:22 to a lot of powerful learning algorithms
00:04:23 and is credited with like sort of the rise
00:04:25 of deep learning as well.
00:04:27 But this data set took about 22 human years
00:04:30 to collect, to annotate.
00:04:31 And it’s not even that many concepts, right?
00:04:33 It’s not even that many images,
00:04:34 14 million is nothing really.
00:04:36 Like you have about, I think 400 million images or so,
00:04:39 or even more than that uploaded to most of the popular
00:04:41 sort of social media websites today.
00:04:44 So now supervised learning just doesn’t scale.
00:04:46 If I want to now annotate more concepts,
00:04:48 if I want to have various types of fine grained concepts,
00:04:51 then it won’t really scale.
00:04:53 So now you come up to these sort of different
00:04:54 learning paradigms, for example, semi supervised learning,
00:04:57 where the idea is you, of course,
00:04:58 you have this annotated corpus of supervised data
00:05:01 and you have lots of these unlabeled images.
00:05:03 And the idea is that the algorithm should basically try
00:05:05 to measure some kind of consistency
00:05:08 or really try to measure some kind of signal
00:05:10 on this sort of unlabeled data
00:05:12 to make itself more confident
00:05:14 about what it’s really trying to predict.
00:05:16 So by access to this, lots of unlabeled data,
00:05:19 the idea is that the algorithm actually learns
00:05:22 to be more confident and actually gets better
00:05:24 at predicting these concepts.
00:05:26 And now we come to the other extreme,
00:05:28 which is like self supervised learning.
00:05:30 The idea basically is that the machine or the algorithm
00:05:33 should really discover concepts or discover things
00:05:35 about the world or learn representations about the world
00:05:38 which are useful without access
00:05:40 to explicit human supervision.
00:05:41 So the word supervision is still
00:05:44 in the term self supervised.
00:05:46 So what is the supervision signal?
00:05:48 And maybe that perhaps is when Yann LeCun
00:05:51 and you argue that unsupervised
00:05:52 is the incorrect terminology here.
00:05:55 So what is the supervision signal
00:05:57 when the humans aren’t part of the picture
00:05:59 or not a big part of the picture?
00:06:02 Right, so self supervised,
00:06:04 the reason that it has the term supervised in itself
00:06:06 is because you’re using the data itself as supervision.
00:06:10 So because the data serves as its own source of supervision,
00:06:13 it’s self supervised in that way.
00:06:15 Now, the reason a lot of people,
00:06:16 I mean, we did it in that blog post with Yann,
00:06:18 but a lot of other people have also argued
00:06:20 for using this term self supervised.
00:06:22 So starting from like ’94 from Virginia de Sa’s group,
00:06:25 I think, and now she’s at UCSD.
00:06:28 Jitendra Malik has said this a bunch of times as well.
00:06:31 So you have supervised,
00:06:33 and then unsupervised basically means everything
00:06:35 which is not supervised,
00:06:36 but that includes stuff like semi supervised,
00:06:38 that includes other like transductive learning,
00:06:41 lots of other sort of settings.
00:06:43 So that’s the reason like now people are preferring
00:06:46 this term self supervised
00:06:47 because it explicitly says what’s happening.
00:06:49 The data itself is the source of supervision
00:06:51 and any sort of learning algorithm
00:06:53 which tries to extract just sort of data supervision signals
00:06:56 from the data itself is a self supervised algorithm.
00:06:59 But there is within the data,
00:07:02 a set of tricks which unlock the supervision.
00:07:05 So can you give maybe some examples
00:07:07 and there’s innovation ingenuity required
00:07:11 to unlock that supervision.
00:07:12 The data doesn’t just speak to you some ground truth,
00:07:15 you have to do some kind of trick.
00:07:17 So I don’t know what your favorite domain is.
00:07:19 So you specifically specialize in visual learning,
00:07:23 but is there favorite examples,
00:07:24 maybe in language or other domains?
00:07:26 Perhaps the most successful applications
00:07:28 have been in NLP, natural language processing.
00:07:31 So the idea basically being that you can train models
00:07:34 that can you have a sentence and you mask out certain words.
00:07:37 And now these models learn to predict the masked out words.
00:07:40 So if you have like the cat jumped over the dog,
00:07:44 so you can basically mask out cat.
00:07:45 And now you’re essentially asking the model
00:07:47 to predict what was missing, what did I mask out?
00:07:50 So the model is going to predict basically a distribution
00:07:53 over all the possible words that it knows.
00:07:55 And probably it has like if it’s a well trained model,
00:07:58 it has a sort of higher probability density
00:08:00 for this word cat.
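A minimal sketch of that fill-in-the-blank objective, assuming a toy vocabulary, a tiny GRU standing in for the real encoder, and PyTorch; it only illustrates predicting a distribution over the vocabulary at the masked position, not the actual BERT or GPT-3 setup.

```python
import torch
import torch.nn as nn

# Toy vocabulary; real models use tens of thousands of tokens.
vocab = ["the", "cat", "dog", "sheep", "jumped", "over", "[MASK]"]
idx = {w: i for i, w in enumerate(vocab)}

class TinyMaskedLM(nn.Module):
    def __init__(self, vocab_size, dim=32):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.encoder = nn.GRU(dim, dim, batch_first=True)  # stand-in for a transformer encoder
        self.out = nn.Linear(dim, vocab_size)               # produces a distribution over all words

    def forward(self, tokens):
        hidden, _ = self.encoder(self.embed(tokens))
        return self.out(hidden)                              # logits at every position

sentence = ["the", "[MASK]", "jumped", "over", "the", "dog"]
masked_word = "cat"                                          # the word that was hidden

model = TinyMaskedLM(len(vocab))
tokens = torch.tensor([[idx[w] for w in sentence]])
logits = model(tokens)

# Self-supervision: the sentence itself tells us what belongs at the masked position.
loss = nn.functional.cross_entropy(logits[0, 1:2], torch.tensor([idx[masked_word]]))
loss.backward()
```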
00:08:02 For vision, I would say the sort of more,
00:08:05 I mean, the easier example,
00:08:07 which is not as widely used these days,
00:08:09 is basically say, for example, video prediction.
00:08:12 So video is again, a sequence of things.
00:08:14 So you can ask the model,
00:08:15 so if you have a video of say 10 seconds,
00:08:17 you can feed in the first nine seconds to a model
00:08:19 and then ask it, hey, what happens basically
00:08:21 in the 10 second, can you predict what’s going to happen?
00:08:24 And the idea basically is because the model
00:08:26 is predicting something about the data itself.
00:08:29 Of course, you didn’t need any human
00:08:31 to tell you what was happening
00:08:32 because the 10 second video was naturally captured.
00:08:34 Because the model is predicting what’s happening there,
00:08:36 it’s going to automatically learn something
00:08:39 about the structure of the world, how objects move,
00:08:41 object permanence, and these kinds of things.
00:08:44 So like, if I have something at the edge of the table,
00:08:45 it will fall down.
00:08:47 Things like these, which you really don’t have to sit
00:08:49 and annotate.
00:08:50 In a supervised learning setting,
00:08:51 I would have to sit and annotate.
00:08:52 This is a cup, now I move this cup, this is still a cup,
00:08:55 and now I move this cup, it’s still a cup,
00:08:56 and then it falls down, and this is a fallen down cup.
00:08:58 So I won’t have to annotate all of these things
00:09:00 in a self supervised setting.
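A minimal sketch of that video-prediction setup, assuming random tensors in place of real frames and a toy two-layer ConvNet as the predictor; the point is only that the held-out last frame, captured for free by the camera, serves as the label.

```python
import torch
import torch.nn as nn

frames = torch.rand(1, 10, 3, 64, 64)           # a 10-frame clip: (batch, time, channels, H, W)
context, future = frames[:, :9], frames[:, 9]   # first nine frames in, the tenth held out

class FramePredictor(nn.Module):
    def __init__(self):
        super().__init__()
        # Stack the 9 context frames (9 * 3 channels) and predict the next frame's pixels.
        self.net = nn.Sequential(
            nn.Conv2d(9 * 3, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 3, kernel_size=3, padding=1),
        )

    def forward(self, ctx):
        b, t, c, h, w = ctx.shape
        return self.net(ctx.reshape(b, t * c, h, w))

model = FramePredictor()
predicted = model(context)

# The supervision is the data itself: the frame that actually came next.
loss = nn.functional.mse_loss(predicted, future)
loss.backward()
```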
00:09:02 Isn’t that kind of a brilliant little trick
00:09:05 of taking a series of data that is consistent
00:09:08 and removing one element in that series,
00:09:11 and then teaching the algorithm to predict that element?
00:09:17 Isn’t that, first of all, that’s quite brilliant.
00:09:20 It seems to be applicable in anything
00:09:23 that has the constraint of being a sequence
00:09:27 that is consistent with the physical reality.
00:09:30 The question is, are there other tricks like this
00:09:34 that can generate the self supervision signal?
00:09:37 So sequence is possibly the most widely used one in NLP.
00:09:41 For vision, the one that is actually used for images,
00:09:44 which is very popular these days,
00:09:45 is basically taking an image,
00:09:47 and now taking different crops of that image.
00:09:50 So you can basically decide to crop,
00:09:51 say the top left corner,
00:09:53 and you crop, say the bottom right corner,
00:09:55 and asking a network, basically presenting it with a choice,
00:09:58 saying that, okay, now you have this image,
00:10:01 you have this image, are these the same or not?
00:10:04 And so the idea basically is that because different crop,
00:10:06 like in an image, different parts of the image
00:10:08 are going to be related.
00:10:09 So for example, if you have a chair and a table,
00:10:12 basically these things are going to be close by,
00:10:14 versus if you take, again,
00:10:16 if you have like a zoomed in picture of a chair,
00:10:19 if you’re taking different crops,
00:10:20 it’s going to be different parts of the chair.
00:10:22 So the idea basically is that different crops
00:10:25 of the image are related,
00:10:26 and so the features or the representations
00:10:27 that you get from these different crops
00:10:29 should also be related.
00:10:30 So this is possibly the most like widely used trick
00:10:32 these days for self supervised learning and computer vision.
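A minimal sketch of that two-crops trick, assuming torchvision's RandomResizedCrop as the augmentation and a tiny stand-in encoder instead of a real ResNet; it only shows the part where features of two crops of the same image are pulled together, without the negatives, queues, or momentum encoders that full methods add.

```python
import torch
import torch.nn as nn
import torchvision.transforms as T

image = torch.rand(3, 224, 224)            # one unlabeled image
random_crop = T.RandomResizedCrop(96)      # e.g. a top-left-ish view vs a bottom-right-ish view

crop_a = random_crop(image).unsqueeze(0)
crop_b = random_crop(image).unsqueeze(0)

encoder = nn.Sequential(                   # toy encoder; real methods use a ResNet or ViT
    nn.Conv2d(3, 16, 3, stride=2), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(16, 32),
)

feat_a = nn.functional.normalize(encoder(crop_a), dim=-1)
feat_b = nn.functional.normalize(encoder(crop_b), dim=-1)

# Self-supervised signal: two crops of the same image should have similar features.
loss = 1 - (feat_a * feat_b).sum(dim=-1).mean()   # 1 - cosine similarity
loss.backward()
```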
00:10:35 So again, using the consistency that’s inherent
00:10:39 to physical reality in visual domain,
00:10:42 that’s, you know, parts of an image are consistent,
00:10:45 and then in the language domain,
00:10:48 or anything that has sequences,
00:10:50 like language or something that’s like a time series,
00:10:53 then you can chop up parts in time.
00:10:55 It’s similar to the story of RNNs and CNNs,
00:11:00 of recurrent nets and ConvNets.
00:11:02 You and Yann LeCun wrote the blog post in March, 2021,
00:11:06 titled, Self Supervised Learning,
00:11:08 The Dark Matter of Intelligence.
00:11:11 Can you summarize this blog post
00:11:12 and maybe explain the main idea or set of ideas?
00:11:15 The blog post was mainly about sort of just telling,
00:11:18 I mean, this is really an accepted fact,
00:11:21 I would say for a lot of people now,
00:11:22 that self supervised learning is something
00:11:24 that is going to play an important role
00:11:27 for machine learning algorithms
00:11:28 that come in the future, and even now.
00:11:30 Let me just comment that we don’t yet
00:11:33 have a good understanding of what dark matter is.
00:11:36 That’s true.
00:11:37 So the idea basically being…
00:11:40 So maybe the metaphor doesn’t exactly transfer,
00:11:41 but maybe it actually perfectly transfers,
00:11:44 that we don’t know, we have an inkling
00:11:47 that it’ll be a big part
00:11:49 of whatever solving intelligence looks like.
00:11:51 Right, so I think self supervised learning,
00:11:52 the way it’s done right now is,
00:11:54 I would say like the first step towards
00:11:56 what it probably should end up like learning
00:11:58 or what it should enable us to do.
00:12:00 So the idea for that particular piece was,
00:12:03 self supervised learning is going to be a very powerful way
00:12:06 to learn common sense about the world,
00:12:08 or like stuff that is really hard to label.
00:12:10 For example, like is this piece
00:12:13 over here heavier than the cup?
00:12:15 Now, for all these kinds of things,
00:12:17 you’ll have to sit and label these things.
00:12:18 So supervised learning is clearly not going to scale.
00:12:21 So what is the thing that’s actually going to scale?
00:12:23 It’s probably going to be an agent
00:12:25 that can either actually interact with it to lift it up,
00:12:27 or observe me doing it.
00:12:29 So if I’m basically lifting these things up,
00:12:31 it can probably reason about,
00:12:32 hey, this is taking him more time to lift up,
00:12:34 or the velocity is different,
00:12:36 whereas the velocity for this is different,
00:12:37 probably this one is heavier.
00:12:39 So essentially, by observations of the data,
00:12:42 you should be able to infer a lot of things about the world
00:12:44 without someone explicitly telling you,
00:12:46 this is heavy, this is not,
00:12:48 this is something that can pour,
00:12:50 this is something that cannot pour,
00:12:51 this is somewhere that you can sit,
00:12:52 this is not somewhere that you can sit.
00:12:53 But you just mentioned ability to interact with the world.
00:12:57 There’s so many questions that are yet,
00:13:01 that are still open, which is,
00:13:02 how do you select the set of data
00:13:04 over which the self supervised learning process works?
00:13:08 How much interactivity like in the active learning
00:13:11 or the machine teaching context is there?
00:13:14 What are the reward signals?
00:13:16 Like how much actual interaction there is
00:13:18 with the physical world?
00:13:20 That kind of thing.
00:13:21 So that could be a huge question.
00:13:24 And then on top of that,
00:13:26 which I have a million questions about,
00:13:28 which we don’t know the answers to,
00:13:30 but it’s worth talking about is,
00:13:32 how much reasoning is involved?
00:13:35 How much accumulation of knowledge
00:13:38 versus something that’s more akin to learning
00:13:40 or whether that’s the same thing.
00:13:43 But so we’re like, it is truly dark matter.
00:13:46 We don’t know how exactly to do it.
00:13:49 But we are, I mean, a lot of us are actually convinced
00:13:52 that it’s going to be a sort of major thing
00:13:54 in machine learning.
00:13:55 So let me reframe it then,
00:13:56 that human supervision cannot be at large scale
00:14:01 the source of the solution to intelligence.
00:14:04 So the machines have to discover the supervision
00:14:08 in the natural signal of the world.
00:14:10 I mean, the other thing is also
00:14:11 that humans are not particularly good labelers.
00:14:14 They’re not very consistent.
00:14:16 For example, like what’s the difference
00:14:17 between a dining table and a table?
00:14:19 Is it just the fact that one,
00:14:21 like if you just look at a particular table,
00:14:23 what makes us say one is dining table
00:14:24 and the other is not?
00:14:26 Humans are not particularly consistent.
00:14:28 They’re not like very good sources of supervision
00:14:30 for a lot of these kinds of edge cases.
00:14:32 So it may be also the fact that if we want an algorithm
00:14:37 or want a machine to solve a particular task for us,
00:14:39 we can maybe just specify the end goal
00:14:42 and like the stuff in between,
00:14:44 we really probably should not be specifying
00:14:46 because we’re maybe going to confuse it a lot actually.
00:14:49 Well, humans can’t even answer the meaning of life.
00:14:51 So I’m not sure if we’re good supervisors
00:14:53 of the end goal either.
00:14:55 So let me ask you about categories.
00:14:56 Humans are not very good at telling the difference
00:14:59 between what is and isn’t a table, like you mentioned.
00:15:02 Do you think it’s possible,
00:15:04 let me ask you like pretend you’re Plato.
00:15:10 Is it possible to create a pretty good taxonomy
00:15:14 of objects in the world?
00:15:16 It seems like a lot of approaches in machine learning
00:15:19 kind of assume a hopeful vision
00:15:21 that it’s possible to construct a perfect taxonomy
00:15:24 or it exists perhaps out of our reach,
00:15:26 but we can always get closer and closer to it.
00:15:28 Or is that a hopeless pursuit?
00:15:31 I think it’s hopeless in some way.
00:15:33 So the thing is for any particular categorization
00:15:36 that you create,
00:15:36 if you have a discrete sort of categorization,
00:15:38 I can always take the nearest two concepts
00:15:40 or I can take a third concept and I can blend it in
00:15:42 and I can create a new category.
00:15:44 So if you were to enumerate N categories,
00:15:46 I will always find an N plus one category for you.
00:15:48 That’s not going to be in the N categories.
00:15:50 And I can actually create not just N plus one,
00:15:52 I can very easily create far more than N categories.
00:15:55 The thing is a lot of things we talk about
00:15:57 are actually compositional.
00:15:58 So it’s really hard for us to come and sit
00:16:01 and enumerate all of these out.
00:16:03 And they compose in various weird ways, right?
00:16:05 Like you have like a croissant and a donut come together
00:16:08 to form a cronut.
00:16:09 So if you were to like enumerate all the foods up until,
00:16:12 I don’t know, whenever the cronut was about 10 years ago
00:16:15 or 15 years ago,
00:16:16 then this entire thing called cronut would not exist.
00:16:19 Yeah, I remember there was the most awesome video
00:16:21 of a cat wearing a monkey costume.
00:16:23 Yeah, yes.
00:16:26 People should look it up, it’s great.
00:16:28 So is that a monkey or is that a cat?
00:16:31 It’s a very difficult philosophical question.
00:16:33 So there is a concept of similarity between objects.
00:16:37 So you think that can take us very far?
00:16:39 Just kind of getting a good function,
00:16:43 a good way to tell which parts of things are similar
00:16:47 and which parts of things are very different.
00:16:50 I think so, yeah.
00:16:51 So you don’t necessarily need to name everything
00:16:54 or assign a name to everything to be able to use it, right?
00:16:57 So there are like lots of…
00:16:59 Shakespeare said that, what’s in a name?
00:17:01 What’s in a name, yeah, okay.
00:17:03 And I mean, lots of like, for example, animals, right?
00:17:05 They don’t have necessarily a well formed
00:17:08 like syntactic language,
00:17:09 but they’re able to go about their day perfectly.
00:17:11 The same thing happens for us.
00:17:12 So, I mean, we probably look at things and we figure out,
00:17:17 oh, this is similar to something else that I’ve seen before.
00:17:19 And then I can probably learn how to use it.
00:17:22 So I haven’t seen all the possible doorknobs in the world.
00:17:26 But if you show me,
00:17:27 like I was able to get into this particular place
00:17:29 fairly easily, I’ve never seen that particular doorknob.
00:17:32 So I of course related to all the doorknobs that I’ve seen
00:17:34 and I know exactly how it’s going to open.
00:17:36 I have a pretty good idea of how it’s going to open.
00:17:39 And I think this kind of translation between experiences
00:17:41 only happens because of similarity.
00:17:43 Because I’m able to relate it to a doorknob.
00:17:45 If I related it to a hairdryer,
00:17:46 I would probably be stuck still outside, not able to get in.
00:17:50 Again, a bit of a philosophical question,
00:17:52 but can similarity take us all the way
00:17:55 to understanding a thing?
00:17:58 Can having a good function that compares objects
00:18:01 get us to understand something profound
00:18:04 about singular objects?
00:18:07 I think I’ll ask you a question back.
00:18:08 What does it mean to understand objects?
00:18:11 Well, let me tell you what that’s similar to.
00:18:13 No, so there’s an idea of sort of reasoning
00:18:17 by analogy kind of thing.
00:18:19 I think understanding is the process of placing that thing
00:18:24 in some kind of network of knowledge that you have.
00:18:28 That it perhaps is fundamentally related to other concepts.
00:18:33 So it’s not like understanding is fundamentally related
00:18:36 by composition of other concepts
00:18:39 and maybe in relation to other concepts.
00:18:43 And maybe deeper and deeper understanding
00:18:45 is maybe just adding more edges to that graph somehow.
00:18:51 So maybe it is a composition of similarities.
00:18:55 I mean, ultimately, I suppose it is a kind of embedding
00:18:59 in that wisdom space.
00:19:02 Yeah, okay, wisdom space is good.
00:19:06 I think, I do think, right?
00:19:08 So similarity does get you very, very far.
00:19:10 Is it the answer to everything?
00:19:12 I mean, I don’t even know what everything is,
00:19:14 but it’s going to take us really far.
00:19:16 And I think the thing is things are similar
00:19:19 in very different contexts, right?
00:19:21 So an elephant is similar to, I don’t know,
00:19:24 another sort of wild animal.
00:19:25 Let’s just pick, I don’t know, lion in a different way
00:19:28 because they’re both four legged creatures.
00:19:30 They’re also land animals.
00:19:32 But of course they’re very different
00:19:33 in a lot of different ways.
00:19:33 So elephants are like herbivores, lions are not.
00:19:37 So similarity and particularly dissimilarity
00:19:40 also actually helps us understand a lot about things.
00:19:43 And so that’s actually why I think
00:19:45 discrete categorization is very hard.
00:19:47 Just like forming this particular category of elephant
00:19:50 and a particular category of lion,
00:19:51 maybe it’s good for just like taxonomy,
00:19:54 biological taxonomies.
00:19:55 But when it comes to other things which are not as maybe,
00:19:59 for example, like grilled cheese, right?
00:20:01 I have a grilled cheese,
00:20:02 I dip it in tomato and I keep it outside.
00:20:03 Now, is that still a grilled cheese
00:20:05 or is that something else?
00:20:06 Right, so categorization is still very useful
00:20:09 for solving problems.
00:20:11 But is your intuition then sort of the self supervised
00:20:15 should be the, to borrow Yann LeCun’s terminology,
00:20:20 should be the cake and then categorization,
00:20:23 the classification, maybe the supervised like layer
00:20:27 should be just like the thing on top,
00:20:29 the cherry or the icing or whatever.
00:20:31 So if you make it the cake,
00:20:32 it gets in the way of learning.
00:20:35 If you make it the cake,
00:20:36 then you won’t be able to sit and annotate everything.
00:20:39 That’s as simple as it is.
00:20:40 Like that’s my very practical view on it.
00:20:43 It’s just, I mean, in my PhD,
00:20:44 I sat down and annotated like a bunch of cars
00:20:47 for one of my projects.
00:20:48 And very quickly, I was just like, it was in a video
00:20:50 and I was basically drawing boxes around all these cars.
00:20:53 And I think I spent about a week doing all of that
00:20:55 and I barely got anything done.
00:20:57 And basically this was, I think my first year of my PhD
00:21:00 or like a second year of my master’s.
00:21:02 And then by the end of it, I’m like, okay,
00:21:04 this is just hopeless.
00:21:05 I can keep doing it.
00:21:05 And when I’d done that, someone came up to me
00:21:08 and they basically told me, oh, this is a pickup truck.
00:21:10 This is not a car.
00:21:12 And that’s when like, aha, this actually makes sense
00:21:14 because a pickup truck is not really like,
00:21:16 what was I annotating?
00:21:17 Was I annotating anything that is mobile
00:21:19 or was I annotating particular sedans
00:21:21 or was I annotating SUVs?
00:21:22 What was I doing?
00:21:23 By the way, the annotation was bounding boxes?
00:21:25 Bounding boxes, yeah.
00:21:26 There’s so many deep, profound questions here
00:21:30 that you’re almost cheating your way out of
00:21:32 by doing self supervised learning, by the way,
00:21:34 which is like, what makes for an object?
00:21:37 Whereas to solve intelligence,
00:21:39 maybe you don’t ever need to answer that question.
00:21:42 I mean, this is the question
00:21:43 that anyone that’s ever done annotation
00:21:45 because it’s so painful gets to ask,
00:21:48 like, why am I drawing very careful line around this object?
00:21:55 Like, what is the value?
00:21:57 I remember when I first saw semantic segmentation
00:22:00 where you have like instance segmentation
00:22:03 where you have a very exact line
00:22:06 around the object in a 2D plane
00:22:09 of a fundamentally 3D object projected on a 2D plane.
00:22:13 So you’re drawing a line around a car
00:22:15 that might be occluded.
00:22:16 There might be another thing in front of it,
00:22:18 but you’re still drawing the line
00:22:20 of the part of the car that you see.
00:22:23 How is that the car?
00:22:25 Why is that the car?
00:22:27 Like, I had like an existential crisis every time.
00:22:31 Like, how’s that going to help us understand
00:22:33 or solve computer vision?
00:22:35 I’m not sure I have a good answer to what’s better.
00:22:38 And I’m not sure I share the confidence that you have
00:22:41 that self supervised learning can take us far.
00:22:46 I think I’m more and more convinced
00:22:48 that it’s a very important component,
00:22:50 but I still feel like we need to understand
00:22:52 what makes this dream of, maybe what it’s called,
00:23:00 like symbolic AI, possible, of arriving,
00:23:03 like once you have this common sense base,
00:23:05 be able to play with these concepts and build graphs
00:23:10 or hierarchies of concepts on top
00:23:13 in order to then like form a deep sense
00:23:18 of this three dimensional world or four dimensional world
00:23:22 and be able to reason and then project that onto 2D plane
00:23:25 in order to interpret a 2D image.
00:23:28 Can I ask you just an out there question?
00:23:30 I remember, I think Andrej Karpathy had a blog post
00:23:35 about computer vision, like being really hard.
00:23:39 I forgot what the title was, but it was many, many years ago.
00:23:42 And he had, I think President Obama stepping on a scale
00:23:44 and there was humor and there was a bunch of people laughing
00:23:47 and whatever.
00:23:48 And there’s a lot of interesting things about that image
00:23:52 and I think Andrej highlighted a bunch of things
00:23:55 about the image that us humans are able
00:23:56 to immediately understand.
00:23:59 Like the idea, I think of gravity
00:24:00 and that you have the concept of a weight.
00:24:04 You immediately project because of our knowledge of pose
00:24:08 and how human bodies are constructed,
00:24:10 you understand how the forces are being applied
00:24:13 with the human body.
00:24:14 The really interesting other thing
00:24:16 that you’re able to understand,
00:24:17 there’s multiple people looking at each other in the image.
00:24:20 You’re able to have a mental model
00:24:22 of what the people are thinking about.
00:24:23 You’re able to infer like,
00:24:25 oh, this person probably thinks,
00:24:27 like is laughing at how humorous the situation is.
00:24:31 And this person is confused about what the situation is
00:24:34 because they’re looking this way.
00:24:35 We’re able to infer all of that.
00:24:37 So that’s human vision.
00:24:41 How difficult is computer vision?
00:24:45 Like in order to achieve that level of understanding
00:24:48 and maybe how big of a part
00:24:51 does self supervised learning play in that, do you think?
00:24:54 And do you still, you know, back,
00:24:56 that was like over a decade ago,
00:24:58 I think Andrej and I think a lot of people agreed
00:25:00 that computer vision is really hard.
00:25:03 Do you still think computer vision is really hard?
00:25:06 I think it is, yes.
00:25:07 And getting to that kind of understanding,
00:25:10 I mean, it’s really out there.
00:25:12 So if you ask me to solve just that particular problem,
00:25:15 I can do it the supervised learning route.
00:25:17 I can always construct a data set and basically predict,
00:25:19 oh, is there humor in this or not?
00:25:21 And of course I can do it.
00:25:22 Actually, that’s a good question.
00:25:23 Do you think you can, okay, okay.
00:25:25 Do you think you can do human supervised annotation of humor?
00:25:29 To some extent, yes.
00:25:29 I’m sure it will work.
00:25:30 I mean, it won’t be as bad as like randomly guessing.
00:25:34 I’m sure it can still predict whether it’s humorous or not
00:25:36 in some way.
00:25:37 Yeah, maybe like Reddit upvotes is the signal.
00:25:40 I don’t know.
00:25:41 I mean, it won’t do a great job, but it’ll do something.
00:25:43 It may actually be like, it may find certain things
00:25:46 which are not humorous, humorous as well,
00:25:47 which is going to be bad for us.
00:25:49 But I mean, it’ll do, it won’t be random.
00:25:52 Yeah, kind of like my sense of humor.
00:25:54 Okay, so fine.
00:25:55 So you can, that particular problem, yes.
00:25:57 But the general problem you’re saying is hard.
00:25:59 The general problem is hard.
00:26:00 And I mean, self supervised learning
00:26:02 is not the answer to everything.
00:26:03 Of course it’s not.
00:26:04 I think if you have machines that are going to communicate
00:26:07 with humans at the end of it,
00:26:08 you want to understand what the algorithm is doing, right?
00:26:10 You want it to be able to produce an output
00:26:13 that you can decipher, that you can understand,
00:26:15 or it’s actually useful for something else,
00:26:17 which again is a human.
00:26:19 So at some point in this sort of entire loop,
00:26:22 a human steps in.
00:26:23 And now this human needs to understand what’s going on.
00:26:26 And at that point, this entire notion of language
00:26:28 or semantics really comes in.
00:26:30 If the machine just spits out something
00:26:32 and if we can’t understand it,
00:26:34 then it’s not really that useful for us.
00:26:36 So self supervised learning is probably going to be useful
00:26:38 for a lot of the things before that part,
00:26:40 before the machine really needs to communicate
00:26:42 a particular kind of output with a human.
00:26:46 Because, I mean, otherwise,
00:26:47 how is it going to do that without language?
00:26:49 Or some kind of communication.
00:26:51 But you’re saying that it’s possible to build
00:26:53 a big base of understanding or whatever,
00:26:55 of what’s a better word? Concepts.
00:26:58 Of concepts. Concepts, yeah.
00:26:59 Like common sense concepts. Right.
00:27:02 Self supervised learning in the context of computer vision
00:27:06 is something you’ve focused on,
00:27:07 but that’s a really hard domain.
00:27:09 And it’s kind of the cutting edge
00:27:10 of what we’re, as a community, working on today.
00:27:13 Can we take a little bit of a step back
00:27:14 and look at language?
00:27:16 Can you summarize the history of success
00:27:19 of self supervised learning in natural language processing,
00:27:22 language modeling?
00:27:23 What are transformers?
00:27:25 What is the masking, the sentence completion
00:27:28 that you mentioned before?
00:27:31 How does it lead us to understand anything?
00:27:33 Semantic meaning of words,
00:27:34 syntactic role of words and sentences?
00:27:37 So I’m, of course, not the expert on NLP.
00:27:40 I kind of follow it a little bit from the sides.
00:27:43 So the main sort of reason
00:27:45 why all of this masking stuff works is,
00:27:47 I think it’s called the distributional hypothesis in NLP.
00:27:50 The idea basically being that words
00:27:52 that occur in the same context
00:27:54 should have similar meaning.
00:27:55 So if you have the blank jumped over the blank,
00:27:59 it basically, whatever is like in the first blank
00:28:01 is basically an object that can actually jump,
00:28:04 is going to be something that can jump.
00:28:05 So a cat or a dog, or I don’t know, sheep, something,
00:28:08 all of these things can basically be in that particular context.
00:28:11 And now, so essentially the idea is that
00:28:13 if you have words that are in the same context
00:28:16 and you predict them,
00:28:17 you’re going to learn lots of useful things
00:28:20 about how words are related,
00:28:21 because you’re predicting by looking at their context
00:28:23 where the word is going to be.
00:28:24 So in this particular case, the blank jumped over the fence.
00:28:28 So now if it’s a sheep, the sheep jumped over the fence,
00:28:30 the dog jumped over the fence.
00:28:32 So essentially the algorithm or the representation
00:28:35 basically puts together these two concepts together.
00:28:37 So it says, okay, dogs are going to be kind of related to sheep
00:28:40 because both of them occur in the same context.
00:28:42 Of course, now you can decide
00:28:44 depending on your particular application downstream,
00:28:46 you can say that dogs are absolutely not related to sheep
00:28:49 because well, I don’t, I really care about dog food,
00:28:52 for example, I’m a dog food person
00:28:54 and I really want to give this dog food
00:28:55 to this particular animal.
00:28:57 So depending on what your downstream application is,
00:29:00 of course, this notion of similarity or this notion
00:29:03 or this common sense that you’ve learned
00:29:04 may not be applicable.
00:29:05 But the point is basically that this,
00:29:08 just predicting what the blanks are
00:29:09 is going to take you really, really far.
00:29:11 So there’s a nice feature of language
00:29:14 that the number of words in a particular language
00:29:18 is very large, but it’s finite
00:29:20 and it’s actually not that large
00:29:22 in the grand scheme of things.
00:29:24 I still got it because we take it for granted.
00:29:26 So first of all, when you say masking,
00:29:28 you’re talking about this very process of the blank,
00:29:31 of removing words from a sentence
00:29:33 and then having the knowledge of what word went there
00:29:36 in the initial data set,
00:29:38 that’s the ground truth that you’re training on
00:29:41 and then you’re asking the neural network
00:29:43 to predict what goes there.
00:29:46 That’s like a little trick.
00:29:49 It’s a really powerful trick.
00:29:50 The question is how far that takes us.
00:29:53 And the other question is, is there other tricks?
00:29:56 Because to me, it’s very possible
00:29:58 there’s other very fascinating tricks.
00:30:00 I’ll give you an example in autonomous driving,
00:30:05 there’s a bunch of tricks
00:30:06 that give you the self supervised signal back.
00:30:10 For example, very similar to sentences, but not really,
00:30:16 which is you have signals from humans driving the car
00:30:20 because a lot of us drive cars to places.
00:30:23 And so you can ask the neural network to predict
00:30:27 what’s going to happen the next two seconds
00:30:30 for a safe navigation through the environment.
00:30:33 And the signal comes from the fact
00:30:36 that you also have knowledge of what happened
00:30:38 in the next two seconds, because you have video of the data.
00:30:42 The question in autonomous driving, as it is in language,
00:30:46 can we learn how to drive autonomously
00:30:50 based on that kind of self supervision?
00:30:53 Probably the answer is no.
00:30:55 The question is how good can we get?
00:30:57 And the same with language, how good can we get?
00:31:00 And are there other tricks?
00:31:02 Like we get sometimes super excited by this trick
00:31:04 that works really well.
00:31:05 But I wonder, it’s almost like mining for gold.
00:31:09 I wonder how many signals there are in the data
00:31:12 that could be leveraged that are like there.
00:31:17 I just wanted to kind of linger on that
00:31:18 because sometimes it’s easy to think
00:31:20 that maybe this masking process is self supervised learning.
00:31:24 No, it’s only one method.
00:31:27 So there could be many, many other methods,
00:31:29 many tricky methods, maybe interesting ways
00:31:33 to leverage human computation in very interesting ways
00:31:36 that might actually border on semi supervised learning,
00:31:39 something like that.
00:31:40 Obviously the internet is generated by humans
00:31:43 at the end of the day.
00:31:44 So all that to say is what’s your sense
00:31:48 in this particular context of language,
00:31:50 how far can that masking process take us?
00:31:54 So it has stood the test of time, right?
00:31:56 I mean, so Word2vec, the initial sort of NLP technique
00:31:59 that was using this to now, for example,
00:32:02 like all the BERT and all these big models that we get,
00:32:05 BERT and Roberta, for example,
00:32:07 all of them are still sort of based
00:32:08 on the same principle of masking.
00:32:10 It’s taken us really far.
00:32:12 I mean, you can actually do things like,
00:32:14 oh, these two sentences are similar or not,
00:32:16 whether this particular sentence follows this other sentence
00:32:18 in terms of logic, so entailment,
00:32:20 you can do a lot of these things
00:32:21 with just this masking trick.
00:32:23 So I’m not sure if I can predict how far it can take us,
00:32:28 because when it first came out, when Word2vec was out,
00:32:31 I don’t think a lot of us would have imagined
00:32:33 that this would actually help us do some kind
00:32:35 of entailment problems and really that well.
00:32:38 And so just the fact that by just scaling up
00:32:40 the amount of data that we’re training on
00:32:42 and using better and more powerful neural network
00:32:45 architectures has taken us from that to this,
00:32:47 is just showing you how maybe poor predictors we are,
00:32:52 as humans, how poor we are at predicting
00:32:54 how successful a particular technique is going to be.
00:32:57 So I think I can say something now,
00:32:58 but like 10 years from now,
00:33:00 I’ll look completely stupid basically predicting this.
00:33:02 In the language domain, is there something in your work
00:33:07 that you find useful and insightful
00:33:09 and transferable to computer vision,
00:33:12 but also just, I don’t know, beautiful and profound
00:33:15 that I think carries through to the vision domain?
00:33:18 I mean, the idea of masking has been very powerful.
00:33:21 It has been used in vision as well for predicting,
00:33:23 like you say, the next sort of thing: if you have
00:33:25 a sequence of frames and you predict
00:33:28 what’s going to happen in the next frame.
00:33:29 So that’s been very powerful.
00:33:30 In terms of modeling, like just
00:33:32 in terms of architecture, I think you had asked
00:33:34 about transformers a while back.
00:33:36 That has really become like,
00:33:38 it has become super exciting for computer vision now.
00:33:40 Like in the past, I would say year and a half,
00:33:42 it’s become really powerful.
00:33:44 What’s a transformer?
00:33:45 Right.
00:33:46 I mean, the core part of a transformer
00:33:47 is something called the self attention model.
00:33:49 So it came out of Google
00:33:50 and the idea basically is that if you have N elements,
00:33:53 what you’re creating is a way for all of these N elements
00:33:56 to talk to each other.
00:33:57 So the idea basically is that you are paying attention.
00:34:01 Each element is paying attention
00:34:03 to each of the other element.
00:34:04 And basically by doing this,
00:34:06 it’s really trying to figure out,
00:34:08 you’re basically getting a much better view of the data.
00:34:11 So for example, if you have a sentence of like four words,
00:34:14 the point is if you get a representation
00:34:16 or a feature for this entire sentence,
00:34:18 it’s constructed in a way such that each word
00:34:21 has paid attention to everything else.
00:34:23 Now, the reason it’s like different from say,
00:34:26 what you would do in a ConvNet
00:34:28 is basically that in the ConvNet,
00:34:29 you would only pay attention to a local window.
00:34:31 So each word would only pay attention
00:34:33 to its next neighbor or like one neighbor after that.
00:34:36 And the same thing goes for images.
00:34:37 In images, you would basically pay attention to pixels
00:34:40 in a three cross three or a seven cross seven neighborhood.
00:34:42 And that’s it.
00:34:43 Whereas with the transformer, the self attention mainly,
00:34:46 the sort of idea is that each element
00:34:48 needs to pay attention to each other element.
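A minimal sketch of that self-attention operation, assuming a single head, random projection matrices, and N elements that could be words or image patches; real transformers stack many such layers with multiple heads.

```python
import torch

N, dim = 4, 8                       # e.g. a four-word sentence, one 8-d feature per word
x = torch.rand(N, dim)

Wq, Wk, Wv = (torch.rand(dim, dim) for _ in range(3))
Q, K, V = x @ Wq, x @ Wk, x @ Wv

# Row i of `attn` says how much element i pays attention to every element j.
attn = torch.softmax(Q @ K.T / dim ** 0.5, dim=-1)   # shape (N, N): all pairs talk to each other
output = attn @ V                                     # each output mixes information from all N elements
```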
00:34:50 And when you say attention,
00:34:51 maybe another way to phrase that
00:34:53 is you’re considering a context,
00:34:57 a wide context in terms of the wide context of the sentence
00:35:01 in understanding the meaning of a particular word
00:35:05 and in computer vision that’s understanding
00:35:06 a larger context to understand the local pattern
00:35:10 of a particular local part of an image.
00:35:13 Right, so basically if you have say,
00:35:14 again, a banana in the image,
00:35:16 you’re looking at the full image first.
00:35:18 So whether it’s like, you know,
00:35:19 you’re looking at all the pixels that are of a kitchen
00:35:22 or a dining table and so on.
00:35:23 And then you’re basically looking at the banana also.
00:35:25 Yeah, by the way, in terms of,
00:35:27 if we were to train the funny classifier,
00:35:29 there’s something funny about the word banana.
00:35:32 Just wanted to anticipate that.
00:35:33 I am wearing a banana shirt, so yeah.
00:35:36 Is there bananas on it?
00:35:39 Okay, so masking has worked for the vision context as well.
00:35:42 And so this transformer idea has worked as well.
00:35:44 So basically looking at all the elements
00:35:46 to understand a particular element
00:35:48 has been really powerful in vision.
00:35:49 The reason is like a lot of things are ambiguous
00:35:52 when you’re looking at them in isolation.
00:35:53 So if you look at just a blob of pixels,
00:35:55 so Antonio Torralba at MIT used to have
00:35:57 this like really famous image,
00:35:58 which I looked at when I was a PhD student.
00:36:01 But he would basically have a blob of pixels
00:36:02 and he would ask you, hey, what is this?
00:36:04 And it looked basically like a shoe
00:36:06 or like it could look like a TV remote.
00:36:08 It could look like anything.
00:36:10 And it turns out it was a beer bottle.
00:36:12 But I’m not sure it was one of these three things,
00:36:14 but basically he showed you the full picture
00:36:15 and then it was very obvious what it was.
00:36:17 But the point is just by looking at
00:36:19 that particular local window, you couldn’t figure it out.
00:36:21 Because of resolution, because of other things,
00:36:23 it’s just not easy always to just figure it out
00:36:26 by looking at just the neighborhood of pixels,
00:36:27 what these pixels are.
00:36:29 And the same thing happens for language as well.
00:36:32 For the parameters that have to learn
00:36:33 something about the data,
00:36:35 you need to give it the capacity
00:36:37 to learn the essential things.
00:36:39 Like if it’s not actually able to receive the signal at all,
00:36:42 then it’s not gonna be able to learn that signal.
00:36:44 And in order to understand images, to understand language,
00:36:47 you have to be able to see words in their full context.
00:36:50 Okay, what is harder to solve, vision or language?
00:36:54 Visual intelligence or linguistic intelligence?
00:36:57 So I’m going to say computer vision is harder.
00:36:59 My reason for this is basically that
00:37:02 language of course has a big structure to it
00:37:05 because we developed it.
00:37:06 Whereas vision is something that is common
00:37:08 in a lot of animals.
00:37:09 Everyone is able to get by, a lot of these animals
00:37:12 on earth are actually able to get by without language.
00:37:15 And a lot of these animals we also deem to be intelligent.
00:37:18 So clearly intelligence does have
00:37:20 like a visual component to it.
00:37:22 And yes, of course, in the case of humans,
00:37:24 it of course also has a linguistic component.
00:37:26 But it means that there is something far more fundamental
00:37:28 about vision than there is about language.
00:37:30 And I’m sorry to anyone who disagrees,
00:37:32 but yes, this is what I feel.
00:37:34 So that’s being a little bit reflected in the challenges
00:37:38 that have to do with the progress
00:37:40 of self supervised learning, would you say?
00:37:42 Or is that just a peculiar accident
00:37:45 of the progress of the AI community
00:37:47 that we focused on like,
00:37:48 or we discovered self attention and transformers
00:37:51 in the context of language first?
00:37:53 So like the self supervised learning success
00:37:55 for vision actually has not much to do
00:37:58 with the transformers part.
00:37:59 I would say it’s actually been independent a little bit.
00:38:02 I think it’s just that the signal was a little bit different
00:38:05 for vision than there was for like NLP
00:38:08 and probably NLP folks discovered it before.
00:38:11 So for vision, the main success
00:38:12 has basically been this like crops so far,
00:38:14 like taking different crops of images.
00:38:16 Whereas for NLP, it was this masking thing.
00:38:18 But also the level of success
00:38:20 is still much higher for language.
00:38:22 It has.
00:38:22 So that has a lot to do with,
00:38:24 I mean, I can get into a lot of details.
00:38:26 For this particular question, let’s go for it, okay.
00:38:29 So the first thing is language is very structured.
00:38:32 So you are going to produce a distribution
00:38:34 over a finite vocabulary.
00:38:35 English has a finite number of words.
00:38:37 It’s actually not that large.
00:38:39 And you need to produce basically,
00:38:41 when you’re doing this masking thing,
00:38:42 all you need to do is basically tell me
00:38:44 which one of these like 50,000 words it is.
00:38:46 That’s it.
00:38:47 Now for vision, let’s imagine doing the same thing.
00:38:49 Okay, we’re basically going to blank out
00:38:51 a particular part of the image
00:38:52 and we ask the network or this neural network
00:38:54 to predict what is present in this missing patch.
00:38:58 It’s combinatorially large, right?
00:38:59 You have 256 pixel values.
00:39:02 If you’re even producing basically a seven cross seven
00:39:04 or a 14 cross 14 like window of pixels,
00:39:07 at each of these 196 or each of these 49 locations,
00:39:11 you have 256 values to predict.
00:39:13 And so it’s really, really large.
00:39:15 And very quickly, the kind of like prediction problems
00:39:18 that we’re setting up are going to be extremely
00:39:20 like intractable for us.
00:39:22 And so the thing is for NLP, it has been really successful
00:39:24 because we are very good at predicting,
00:39:27 like doing this like distribution over a finite set.
00:39:30 And the problem is when this set becomes really large,
00:39:33 we are going to become really, really bad
00:39:35 at making these predictions
00:39:36 and at solving basically this particular set of problems.
00:39:41 So if you were to do it exactly in the same way
00:39:44 as NLP for vision, there is very limited success.
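A back-of-the-envelope version of that comparison, using the rough figures mentioned here (a vocabulary of about 50,000 words, 256 values per pixel, a 7x7 masked patch, channels ignored):

```python
vocab_size = 50_000                 # masked-word prediction: one choice out of ~50,000
pixel_values = 256
patch_locations = 7 * 7             # 49 pixel locations in a 7x7 masked patch

word_choices = vocab_size
patch_choices = pixel_values ** patch_locations   # 256**49 distinct possible patches, ~10^118

print(f"{word_choices:,} possible words vs about 10^{len(str(patch_choices)) - 1} possible patches")
```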
00:39:47 The way stuff is working right now
00:39:48 is actually not by predicting these masks.
00:39:51 It’s basically by saying that you take these two
00:39:53 like crops from the image,
00:39:55 you get a feature representation from it.
00:39:57 And just saying that these two features,
00:39:58 so they’re like vectors,
00:40:00 just saying that the distance between these vectors
00:40:02 should be small.
00:40:03 And so it’s a very different way of learning
00:40:06 from the visual signal than there is from NLP.
00:40:09 Okay, the other reason is the distributional hypothesis
00:40:11 that we talked about for NLP, right?
00:40:12 So a word given its context,
00:40:15 basically the context actually supplies
00:40:16 a lot of meaning to the word.
00:40:18 Now, because there are just finite number of words
00:40:22 and there are finite ways in which we compose them.
00:40:25 Of course, the same thing holds for pixels,
00:40:27 but in language, there’s a lot of structure, right?
00:40:29 So I always say whatever,
00:40:31 the dash jumped over the fence, for example.
00:40:33 There are lots of these sentences that you’ll get.
00:40:36 And from this, you can actually look at
00:40:38 this particular sentence might occur
00:40:40 in a lot of different contexts as well.
00:40:41 This exact same sentence
00:40:42 might occur in a different context.
00:40:44 So the sheep jumped over the fence,
00:40:45 the cat jumped over the fence,
00:40:46 the dog jumped over the fence.
00:40:48 So you immediately get a lot of these words,
00:40:50 which are because this particular token itself
00:40:52 has so much meaning,
00:40:53 you get a lot of these tokens or these words,
00:40:55 which are actually going to have sort of
00:40:57 this related meaning across given this context.
00:41:00 Whereas for vision, it’s much harder
00:41:02 because just by like pure,
00:41:04 like the way we capture images,
00:41:05 lighting can be different.
00:41:07 There might be like different noise in the sensor.
00:41:09 So the thing is you’re capturing a physical phenomenon
00:41:12 and then you’re basically going through
00:41:13 a very complicated pipeline of like image processing.
00:41:16 And then you’re translating that into
00:41:18 some kind of like digital signal.
00:41:20 Whereas with language, you write it down
00:41:23 and you transfer it to a digital signal,
00:41:25 almost like it’s a lossless like transfer.
00:41:27 And each of these tokens are very, very well defined.
00:41:30 There could be a little bit of an argument there
00:41:32 because language as written down
00:41:36 is a projection of thought.
00:41:39 This is one of the open questions is
00:41:42 if you perfectly can solve language,
00:41:46 are you getting close to being able to easily pass,
00:41:50 with flying colors, the Turing test kind of thing.
00:41:52 So that’s, it’s similar, but different
00:41:56 and the computer vision problem in the 2D plane
00:41:59 is a projection of the three dimensional world.
00:42:02 So perhaps there are similar problems there.
00:42:05 Maybe this is a good.
00:42:06 I mean, I think what I’m saying is NLP is not easy.
00:42:08 Of course, don’t get me wrong.
00:42:09 Like abstract thought expressed in knowledge
00:42:12 or knowledge basically expressed in language
00:42:14 is really hard to understand, right?
00:42:16 I mean, we’ve been communicating with language for so long
00:42:19 and it is of course a very complicated concept.
00:42:22 The thing is at least getting like somewhat reasonable,
00:42:27 like being able to solve some kind of reasonable tasks
00:42:29 with language, I would say slightly easier
00:42:32 than it is with computer vision.
00:42:33 Yeah, I would say, yeah.
00:42:35 So that’s well put.
00:42:36 I would say getting impressive performance on language
00:42:40 is easier.
00:42:43 I feel like for both language and computer vision,
00:42:45 there’s going to be this wall of like,
00:42:49 like this hump you have to overcome
00:42:52 to achieve superhuman level performance
00:42:54 or human level performance.
00:42:56 And I feel like for language, that wall is farther away.
00:43:00 So you can get pretty nice.
00:43:01 You can do a lot of tricks.
00:43:04 You can show really impressive performance.
00:43:06 You can even fool people that your tweeting
00:43:09 or your blog post writing
00:43:11 or your question answering has intelligence behind it.
00:43:16 But to truly demonstrate understanding of dialogue,
00:43:22 of continuous long form dialogue
00:43:25 that would require perhaps big breakthroughs.
00:43:28 In the same way in computer vision,
00:43:30 I think the big breakthroughs need to happen earlier
00:43:33 to achieve impressive performance.
00:43:36 This might be a good place to, you already mentioned it,
00:43:38 but what is contrastive learning
00:43:41 and what are energy based models?
00:43:43 Contrastive learning is sort of the paradigm of learning
00:43:46 where the idea is that you are learning this embedding space
00:43:50 or so you’re learning this sort of vector space
00:43:52 of all your concepts.
00:43:54 And the way you learn that is basically by contrasting.
00:43:56 So the idea is that you have a sample,
00:43:59 you have another sample that’s related to it.
00:44:01 So that’s called the positive
00:44:02 and you have another sample that’s not related to it.
00:44:05 So that’s negative.
00:44:06 So for example, let’s just take an NLP
00:44:08 or in a simple example in computer vision.
00:44:10 So you have an image of a cat, you have an image of a dog
00:44:14 and for whatever application that you’re doing,
00:44:16 say you’re trying to figure out what the pets are,
00:44:18 you’re saying that these two images are related.
00:44:20 So image of a cat and dog are related,
00:44:22 but now you have another third image of a banana
00:44:25 because you don’t like that word.
00:44:26 So now you basically have this banana.
00:44:28 Thank you for speaking to the crowd.
00:44:30 And so you take both of these images
00:44:32 and you take the image from the cat,
00:44:34 the image from the dog,
00:44:35 you get a feature from both of them.
00:44:36 And now what you’re training the network to do
00:44:38 is basically pull both of these features together
00:44:42 while pushing them away from the feature of a banana.
00:44:44 So this is the contrastive part.
00:44:45 So you’re contrasting against the banana.
00:44:47 So there’s always this notion of a negative and a positive.
00:44:51 Now, energy based models are like one way
00:44:54 that Yann sort of explains a lot of these methods.
00:44:57 So Yann basically, I think a couple of years
00:45:00 or more than that, like when I joined Facebook,
00:45:02 Yann used to keep mentioning this term, energy based models.
00:45:05 And of course I had no idea what he was talking about.
00:45:07 So then one day I caught him in one of the conference rooms
00:45:09 and I’m like, can you please tell me what this is?
00:45:11 So then like very patiently,
00:45:13 he sat down with like a marker and a whiteboard.
00:45:15 And his idea basically is that
00:45:18 rather than talking about probability distributions,
00:45:20 you can talk about energies of models.
00:45:21 So models are trying to minimize certain energies
00:45:24 in certain space,
00:45:24 or they’re trying to maximize a certain kind of energy.
00:45:28 And the idea basically is that
00:45:29 you can explain a lot of the contrastive models,
00:45:32 GANs, for example,
00:45:33 which are like Generative Adversarial Networks.
00:45:36 A lot of these modern learning methods
00:45:37 or VAEs, which are Variational Autoencoders,
00:45:39 you can really explain them very nicely
00:45:41 in terms of an energy function
00:45:43 that they’re trying to minimize or maximize.
00:45:45 And so by putting this common sort of language
00:45:48 for all of these models,
00:45:49 what looks very different in machine learning
00:45:51 that, oh, VAEs are very different from what GANs are,
00:45:54 are very, very different from what contrastive models are,
00:45:56 you actually get a sense of like,
00:45:57 oh, these are actually very, very related.
00:46:00 It’s just that the way or the mechanism
00:46:02 in which they’re sort of maximizing
00:46:04 or minimizing this energy function is slightly different.
00:46:07 It’s revealing the commonalities
00:46:08 between all these approaches
00:46:10 and putting a sexy word on top of it, like energy.
00:46:13 And so similarities,
00:46:14 two things that are similar have low energy.
00:46:16 Like the low energy signifying similarity.
00:46:20 Right, exactly.
00:46:21 So basically the idea is that if you were to imagine
00:46:23 like the embedding as a manifold, a 2D manifold,
00:46:26 you would get a hill or like a high sort of peak
00:46:28 in the energy manifold,
00:46:30 wherever two things are not related.
00:46:32 And basically you would have like a dip
00:46:34 where two things are related.
00:46:35 So you’d get a dip in the manifold.
00:46:37 And in the self supervised context,
00:46:40 how do you know two things are related
00:46:42 and two things are not related?
00:46:44 Right.
00:46:44 So this is where all the sort of ingenuity or tricks
00:46:46 comes in, right?
00:46:47 So for example, like you can take
00:46:50 the fill in the blank problem,
00:46:52 or you can take the context problem.
00:46:54 And what you can say is two words
00:46:55 that are in the same context are related.
00:46:57 Two words that are in different contexts are not related.
00:47:00 For images, basically two crops
00:47:02 from the same image are related.
00:47:03 And whereas a third image is not related at all.
00:47:06 Or for a video, it can be two frames
00:47:08 from that video are related
00:47:09 because they’re likely to contain
00:47:10 the same sort of concepts in them.
00:47:12 Whereas a third frame
00:47:13 from a different video is not related.
00:47:15 So it basically is, it’s a very general term.
00:47:18 Contrastive learning has nothing really
00:47:19 to do with self supervised learning specifically.
00:47:20 It actually is very popular in for example,
00:47:23 like any kind of metric learning
00:47:25 or any kind of embedding learning.
00:47:26 So it’s also used in supervised learning.
00:47:28 And the thing is because we are not really using labels
00:47:32 to get these positive or negative pairs,
00:47:34 it can basically also be used for self supervised learning.
00:47:37 So you mentioned one of the ideas
00:47:39 in the vision context that works
00:47:42 is to have different crops.
00:47:45 So you could think of that as a way
00:47:47 to sort of manipulate the data
00:47:49 to generate examples that are similar.
00:47:53 Obviously, there’s a bunch of other techniques.
00:47:55 You mentioned lighting as one example;
00:47:58 in images, lighting is something that varies a lot
00:48:01 and you can artificially change those kinds of things.
00:48:04 There’s the whole broad field of data augmentation,
00:48:07 which manipulates images in order to increase arbitrarily
00:48:11 the size of the data set.
00:48:13 First of all, what is data augmentation?
00:48:15 And second of all, what’s the role of data augmentation
00:48:18 in self supervised learning and contrastive learning?
00:48:22 So data augmentation is just a way like you said,
00:48:24 it’s basically a way to augment the data.
00:48:26 So you have say n samples.
00:48:28 And what you do is you basically define
00:48:30 some kind of transforms for the sample.
00:48:32 So you take your say image
00:48:33 and then you define a transform
00:48:34 where you can just change, say, the colors,
00:48:37 like the colors or the brightness of the image
00:48:39 or increase or decrease the contrast of the image
00:48:41 for example, or take different crops of it.
00:48:44 So data augmentation is just a process
00:48:46 to like basically perturb the data
00:48:49 or like augment the data, right?
00:48:51 And so it has played a fundamental role
00:48:53 for computer vision for self supervised learning especially.
00:48:56 The way most of the current methods work
00:48:59 contrastive or otherwise,
00:49:02 in the case of images, is by taking an image
00:49:05 and then computing basically two perturbations of it.
00:49:08 So these can be two different crops of the image
00:49:11 with like different types of lighting
00:49:12 or different contrast or different colors.
00:49:15 So you jitter the colors a little bit and so on.
00:49:17 And now the idea is basically because it’s the same object
00:49:21 or because it’s like related concepts
00:49:23 in both of these perturbations,
00:49:25 you want the features from both of these perturbations
00:49:27 to be similar.
00:49:28 So now you can use a variety of different ways
00:49:31 to enforce this constraint,
00:49:32 like these features being similar.
00:49:34 You can do this by contrastive learning.
00:49:36 So basically, both of these things are positives,
00:49:38 a third sort of image is negative.
00:49:40 You can do this basically by like clustering.
00:49:43 For example, you can say that both of these images should,
00:49:46 the features from both of these images
00:49:48 should belong in the same cluster because they’re related,
00:49:50 whereas another image
00:49:52 should belong to a different cluster.
00:49:53 So there’s a variety of different ways
00:49:55 to basically enforce this particular constraint.
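A minimal sketch of the two-perturbation setup just described, assuming PyTorch/torchvision; the particular transforms, parameter values, and the image path are illustrative rather than any specific method's exact recipe:

    from PIL import Image
    import torchvision.transforms as T

    # Each call produces a different random perturbation of the same image:
    # a crop, flip, color jitter, grayscale, blur. The two resulting crops
    # are the "positives" whose features should end up similar.
    augment = T.Compose([
        T.RandomResizedCrop(224, scale=(0.2, 1.0)),
        T.RandomHorizontalFlip(),
        T.RandomApply([T.ColorJitter(brightness=0.4, contrast=0.4,
                                     saturation=0.4, hue=0.1)], p=0.8),
        T.RandomGrayscale(p=0.2),
        T.GaussianBlur(kernel_size=23),
        T.ToTensor(),
    ])

    image = Image.open("example.jpg")  # placeholder path
    crop_a = augment(image)            # first perturbation
    crop_b = augment(image)            # second perturbation, positive for crop_a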
00:49:57 By the way, when you say features,
00:49:59 it means there’s a very large neural network
00:50:01 that's extracting patterns from the image,
00:50:03 and the kind of patterns it extracts
00:50:05 should be either identical or very similar.
00:50:08 That’s what that means.
00:50:09 So the neural network basically takes in the image
00:50:11 and then outputs a set of like,
00:50:14 basically a vector of like numbers,
00:50:16 and that’s the feature.
00:50:17 And you want this feature for both of these
00:50:20 like different crops that you computed to be similar.
00:50:22 So you want this vector to be identical
00:50:24 in its like entries, for example.
00:50:26 Be like literally close
00:50:28 in this multi dimensional space to each other.
00:50:31 And like you said,
00:50:32 close can mean part of the same cluster or something like that
00:50:35 in this large space.
00:50:37 First of all, that,
00:50:38 I wonder if there is a connection
00:50:40 to the way humans learn here,
00:50:43 almost like maybe subconsciously,
00:50:48 in order to understand a thing,
00:50:50 you kind of have to see it from two, three multiple angles.
00:50:54 I wonder, I have a lot of friends
00:50:57 who are neuroscientists maybe and cognitive scientists.
00:51:00 I wonder if that’s in there somewhere.
00:51:03 Like in order for us to place a concept in its proper place,
00:51:08 we have to basically crop it in all kinds of ways,
00:51:12 do basic data augmentation on it
00:51:14 in whatever very clever ways that the brain likes to do.
00:51:17 Right.
00:51:19 Like spinning around in our minds somehow
00:51:21 that that is very effective.
00:51:23 So I think for some of them, we like need to do it.
00:51:25 So like babies, for example, pick up objects,
00:51:27 like move them and put them close to their eye and whatnot.
00:51:30 But for certain other things,
00:51:31 actually we are good at imagining it as well, right?
00:51:33 So if you, I have never seen, for example,
00:51:35 an elephant from the top.
00:51:36 I’ve never basically looked at it from like top down.
00:51:39 But if you showed me a picture of it,
00:51:40 I could very well tell you that that’s an elephant.
00:51:43 So I think some of it, we’re just like,
00:51:45 we naturally build it or transfer it from other objects
00:51:47 that we’ve seen to imagine what it’s going to look like.
00:51:50 Has anyone done that with augmentation?
00:51:53 Like imagine all the possible things
00:51:56 that are occluded or not there,
00:51:59 but not just like normal things, like wild things,
00:52:03 but they’re nevertheless physically consistent.
00:52:06 So, I mean, people do kind of like
00:52:09 occlusion based augmentation as well.
00:52:11 So you place in like a random like box, gray box
00:52:14 to sort of mask out a certain part of the image.
00:52:17 And the thing is basically you’re kind of occluding it.
00:52:20 For example, you place it say on half of a person’s face.
00:52:23 So basically saying that, you know,
00:52:24 something below their nose is occluded
00:52:26 because it’s grayed out.
00:52:28 So, you know, I meant like, you have like, what is it?
00:52:31 A table and you can’t see behind the table.
00:52:33 And you imagine there’s a bunch of elves
00:52:37 with bananas behind the table.
00:52:38 Like, I wonder if there’s useful
00:52:40 to have a wild imagination for the network
00:52:44 because that’s possible or maybe not elves,
00:52:46 but like puppies and kittens or something like that.
00:52:49 Just have a wild imagination
00:52:51 and like constantly be generating that wild imagination.
00:52:55 Because in terms of data augmentation,
00:52:57 as currently applied, it’s super ultra, very boring.
00:53:01 It’s very basic data augmentation.
00:53:02 I wonder if there’s a benefit to being wildly imaginable
00:53:07 while trying to be consistent with physical reality.
00:53:11 I think it’s a kind of a chicken and egg problem, right?
00:53:14 Because to have like amazing data augmentation,
00:53:16 you need to understand what the scene is.
00:53:18 And what we’re trying to do data augmentation
00:53:20 to learn what a scene is anyway.
00:53:22 So it’s basically just keeps going on.
00:53:23 Before you understand it,
00:53:24 just put elves with bananas
00:53:26 until you know it’s not to be true.
00:53:29 Just like children have a wild imagination
00:53:31 until the adults ruin it all.
00:53:33 Okay, so what are the different kinds of data augmentation
00:53:36 that you’ve seen to be effective in visual intelligence?
00:53:40 For like vision,
00:53:42 it’s a lot of these image filtering operations.
00:53:44 So like blurring the image,
00:53:46 you know, all the kind of Instagram filters
00:53:48 that you can think of.
00:53:49 So like arbitrarily like make the red super red,
00:53:52 make the greens super green, like saturate the image.
00:53:55 Rotation, cropping.
00:53:56 Rotation, cropping, exactly.
00:53:58 All of these kinds of things.
00:53:59 Like I said, lighting is a really interesting one to me.
00:54:02 Like that feels like really complicated to do.
00:54:04 I mean, they don’t,
00:54:05 the augmentations that we work on aren’t like
00:54:08 that involved,
00:54:08 they’re not going to be like
00:54:09 physically realistic versions of lighting.
00:54:11 It’s not that you’re assuming
00:54:12 that there’s a light source up
00:54:13 and then you’re moving it to the right
00:54:15 and then what does the thing look like?
00:54:17 It’s really more about like brightness of the image,
00:54:19 overall brightness of the image
00:54:20 or overall contrast of the image and so on.
00:54:22 But this is a really important point to me.
00:54:25 I always thought that data augmentation
00:54:28 holds an important key
00:54:31 to big improvements in machine learning.
00:54:33 And it seems that it is an important aspect
00:54:36 of self supervised learning.
00:54:39 So I wonder if there’s big improvements to be achieved
00:54:42 on much more intelligent kinds of data augmentation.
00:54:46 For example, currently,
00:54:48 maybe you can correct me if I’m wrong,
00:54:50 data augmentation is not parameterized.
00:54:52 Yeah.
00:54:53 You’re not learning.
00:54:54 You’re not learning.
00:54:55 To me, it seems like data augmentation potentially
00:54:59 should involve more learning
00:55:02 than the learning process itself.
00:55:04 Right.
00:55:05 You’re almost like thinking of like generative kind of,
00:55:08 it’s the elves with bananas.
00:55:10 You’re trying to,
00:55:11 it’s like very active imagination
00:55:13 of messing with the world
00:55:14 and teaching that mechanism for messing with the world
00:55:17 to be realistic.
00:55:19 Right.
00:55:20 Because that feels like,
00:55:22 I mean, it’s imagination.
00:55:24 It’s just, as you said,
00:55:25 it feels like us humans are able to,
00:55:29 maybe sometimes subconsciously,
00:55:30 imagine before we see the thing,
00:55:33 imagine what we’re expecting to see,
00:55:35 like maybe several options.
00:55:37 And especially, we probably forgot,
00:55:38 but when we were younger,
00:55:40 probably the possibilities were wilder, more numerous.
00:55:44 And then as we get older,
00:55:45 we come to understand the world
00:55:47 and the possibilities of what we might see
00:55:51 becomes less and less and less.
00:55:53 So I wonder if you think there’s a lot of breakthroughs
00:55:55 yet to be had in data augmentation.
00:55:57 And maybe also can you just comment on the stuff we have,
00:55:59 is that a big part of self supervised learning?
00:56:02 Yes.
00:56:02 So data augmentation is like key to self supervised learning,
00:56:05 like the kind of augmentation that we're using.
00:56:08 And basically the fact that we’re trying to learn
00:56:11 these neural networks that are predicting these features
00:56:13 from images that are robust under data augmentation
00:56:17 has been the key for visual self supervised learning.
00:56:19 And they play a fairly fundamental role to it.
00:56:22 Now, the irony of all of this is that
00:56:24 deep learning purists will say
00:56:26 the entire point of deep learning is that
00:56:28 you feed in the pixels to the neural network
00:56:31 and it should figure out the patterns on its own.
00:56:33 So if it really wants to look at edges,
00:56:34 it should look at edges.
00:56:35 You shouldn’t really like really go
00:56:36 and handcraft these like features, right?
00:56:38 You shouldn’t go tell it that look at edges.
00:56:41 So data augmentation
00:56:42 should basically be in the same category, right?
00:56:44 Why should we tell the network
00:56:46 or tell this entire learning paradigm
00:56:48 what kinds of data augmentation that we’re looking for?
00:56:50 We are encoding a very sort of human specific bias there
00:56:55 that we know things are like,
00:56:57 if you change the contrast of the image,
00:56:59 it should still be an apple
00:57:00 or it should still see apple, not banana.
00:57:02 And basically if we change like colors,
00:57:05 it should still be the same kind of concept.
00:57:08 Of course,
00:57:09 this doesn't feel like super satisfactory
00:57:12 because a lot of our human knowledge
00:57:14 or our human supervision
00:57:15 is actually going into the data augmentation.
00:57:17 So although we are calling it self supervised learning,
00:57:19 a lot of the human knowledge
00:57:21 is actually being encoded in the data augmentation process.
00:57:23 So it’s really like,
00:57:24 we’ve kind of sneaked away the supervision at the input
00:57:27 and we’re like really designing
00:57:28 these nice list of data augmentations
00:57:30 that are working very well.
00:57:31 Of course, the idea is that it’s much easier
00:57:33 to design a list of data augmentations than it is to label data.
00:57:36 So humans are nevertheless doing less and less work
00:57:39 and maybe leveraging their creativity more and more.
00:57:42 And when we say data augmentation is not parameterized,
00:57:45 it means it’s not part of the learning process.
00:57:48 Do you think it’s possible to integrate
00:57:50 some of the data augmentation into the learning process?
00:57:53 I think so.
00:57:54 I think so.
00:57:54 And in fact, it will be really beneficial for us
00:57:57 because a lot of these data augmentations
00:57:59 that we use in vision are very extreme.
00:58:01 For example, like when you have certain concepts,
00:58:05 again, a banana, you take the banana
00:58:08 and then basically you change the color of the banana, right?
00:58:10 So you make it a purple banana.
00:58:12 Now this data augmentation process
00:58:14 is actually independent of the,
00:58:15 like it has no notion of what is present in the image.
00:58:18 So it can change this color arbitrarily.
00:58:20 It can make it a red banana as well.
00:58:22 And now what we’re doing is we’re telling
00:58:24 the neural network that this red banana
00:58:26 and so a crop of this image which has the red banana
00:58:29 and a crop of this image where I changed the color
00:58:30 to a purple banana should be,
00:58:32 the features should be the same.
00:58:34 Now bananas aren’t red or purple mostly.
00:58:36 So really the data augmentation process
00:58:38 should take into account what is present in the image
00:58:41 and what are the kinds of physical realities
00:58:43 that are possible.
00:58:43 It shouldn’t be completely independent of the image.
00:58:45 So you might get big gains if you,
00:58:48 instead of being drastic, do subtle augmentation
00:58:51 but realistic augmentation.
00:58:53 Right, realistic.
00:58:54 I’m not sure if it’s subtle, but like realistic for sure.
00:58:56 If it’s realistic, then even subtle augmentation
00:58:59 will give you big benefits.
00:59:00 Exactly, yeah.
00:59:01 And it will be like for particular domains
00:59:05 you might actually see like,
00:59:06 if for example, now we’re doing medical imaging,
00:59:08 there are going to be certain kinds
00:59:10 of like geometric augmentation
00:59:11 which are not really going to be very valid
00:59:13 for the human body.
00:59:15 So if you were to like actually loop in data augmentation
00:59:18 into the learning process,
00:59:19 it will actually be much more useful.
00:59:21 Now this actually does take us
00:59:23 to maybe a semi supervised kind of a setting
00:59:25 because you do want to understand
00:59:27 what is it that you’re trying to solve.
00:59:29 So currently self supervised learning
00:59:30 kind of operates in the wild, right?
00:59:32 So you do the self supervised learning
00:59:34 and the purists and all of us basically say that,
00:59:37 okay, this should learn useful representations
00:59:39 and they should be useful for any kind of end task,
00:59:42 no matter it’s like banana recognition
00:59:44 or like autonomous driving.
00:59:46 Now it’s a tall order.
00:59:47 Maybe the first baby step for us should be that,
00:59:50 okay, if you’re trying to loop in this data augmentation
00:59:52 into the learning process,
00:59:53 then we at least need to have some sense
00:59:56 of what we’re trying to do.
00:59:56 Are we trying to distinguish
00:59:57 between different types of bananas
00:59:59 or are we trying to distinguish between banana and apple
01:00:02 or are we trying to do all of these things at once?
01:00:04 And so some notion of like what happens at the end
01:00:07 might actually help us do much better at this side.
01:00:10 Let me ask you a ridiculous question.
01:00:14 If I were to give you like a black box,
01:00:16 like a choice to have an arbitrary large data set
01:00:19 of real natural data
01:00:22 versus really good data augmentation algorithms,
01:00:26 which would you like to train in a self supervised way on?
01:00:31 So natural data from the internet are arbitrary large,
01:00:35 so unlimited data,
01:00:37 or it’s like more controlled good data augmentation
01:00:41 on the finite data set.
01:00:43 The thing is like,
01:00:44 because our learning algorithms for vision right now
01:00:47 really rely on data augmentation,
01:00:49 even if you were to give me
01:00:50 like an infinite source of like image data,
01:00:52 I still need a good data augmentation algorithm.
01:00:54 You need something that tells you
01:00:56 that two things are similar.
01:00:57 Right.
01:00:58 And so something,
01:00:59 because you’ve given me an arbitrary large data set,
01:01:01 I still need to use data augmentation
01:01:03 to take that image construct,
01:01:05 like these two perturbations of it,
01:01:06 and then learn from it.
01:01:08 So the thing is our learning paradigm
01:01:09 is very primitive right now.
01:01:11 Yeah.
01:01:12 Even if you were to give me lots of images,
01:01:13 it’s still not really useful.
01:01:15 A good data augmentation algorithm
01:01:16 is actually going to be more useful.
01:01:18 So you can like reduce down the amount of data
01:01:21 that you give me by like 10 times,
01:01:22 but if you were to give me
01:01:23 a good data augmentation algorithm,
01:01:25 that would probably do better
01:01:26 than giving me like 10 times the size of that data,
01:01:29 but me having to rely on
01:01:30 like a very primitive data augmentation algorithm.
01:01:32 Like through tagging and all those kinds of things,
01:01:35 is there a way to discover things
01:01:37 that are semantically similar on the internet?
01:01:39 Obviously there is, but they might be extremely noisy.
01:01:42 And the difference might be farther away
01:01:45 than you would be comfortable with.
01:01:47 So, I mean, yes, tagging will help you a lot.
01:01:49 It’ll actually go a very long way
01:01:51 in figuring out what images are related or not.
01:01:54 And then, so, but then the purists would argue
01:01:57 that when you’re using human tags,
01:01:58 because these tags are like supervision,
01:02:01 is it really self supervised learning now?
01:02:03 Because you’re using human tags
01:02:05 to figure out which images are like similar.
01:02:07 Hashtag no filter means a lot of things.
01:02:10 Yes.
01:02:11 I mean, there are certain tags
01:02:12 which are going to be applicable pretty much to anything.
01:02:15 So they’re pretty useless for learning.
01:02:18 But I mean, certain tags are actually like
01:02:20 the Eiffel Tower, for example,
01:02:22 or the Taj Mahal, for example.
01:02:23 These tags are like very indicative of what’s going on.
01:02:26 And they are, I mean, they are human supervision.
01:02:29 Yeah.
01:02:30 This is one of the tasks of discovering
01:02:31 from human generated data strong signals
01:02:34 that could be leveraged for self supervision.
01:02:39 Like humans are doing so much work already.
01:02:42 Like many years ago, there was something that was called,
01:02:45 I guess, human computation back in the day.
01:02:48 Humans are doing so much work.
01:02:50 It’d be exciting to discover ways to leverage
01:02:53 the work they’re doing to teach machines
01:02:55 without any extra effort from them.
01:02:57 An example could be, like we said, driving,
01:03:00 humans driving and machines can learn from the driving.
01:03:03 I always hope that there could be some supervision signal
01:03:06 discovered in video games,
01:03:08 because there’s so many people that play video games
01:03:10 that it feels like so much effort is put into video games,
01:03:15 into playing video games,
01:03:17 and you can design video games somewhat cheaply
01:03:21 to include whatever signals you want.
01:03:24 It feels like that could be leveraged somehow.
01:03:27 So people are using that.
01:03:28 Like there are actually folks right here in UT Austin,
01:03:30 like Philipp Krähenbühl is a professor at UT Austin.
01:03:33 He’s been like working on video games
01:03:36 as a source of supervision.
01:03:38 I mean, it’s really fun.
01:03:39 Like as a PhD student,
01:03:40 getting to basically play video games all day.
01:03:42 Yeah, but so I do hope that kind of thing scales
01:03:44 and like ultimately boils down to discovering
01:03:48 some undeniably very good signal.
01:03:51 It’s like masking in NLP.
01:03:54 But that said, there’s non contrastive methods.
01:03:57 What do non contrastive energy based
01:04:00 self supervised learning methods look like?
01:04:03 And why are they promising?
01:04:05 So like I said about contrastive learning,
01:04:07 you have this notion of a positive and a negative.
01:04:10 Now, the thing is, this entire learning paradigm
01:04:13 really requires access to a lot of negatives
01:04:17 to learn a good sort of feature space.
01:04:19 The idea is if I tell you, okay,
01:04:21 so a cat and a dog are similar,
01:04:23 and they’re very different from a banana.
01:04:25 The thing is, this is a fairly simple analogy, right?
01:04:28 Because bananas look visually very different
01:04:30 from what cats and dogs do.
01:04:32 So very quickly, if this is the only source
01:04:34 of supervision that I’m giving you,
01:04:36 your learning is not going to be like,
01:04:38 after a point, the neural network
01:04:39 is really not going to learn a lot.
01:04:41 Because the negative that you’re getting
01:04:42 is going to be so random.
01:04:43 So it can be, oh, a cat and a dog are very similar,
01:04:46 but they’re very different from a Volkswagen Beetle.
01:04:49 Now, like this car looks very different
01:04:51 from these animals again.
01:04:52 So the thing is in contrastive learning,
01:04:54 the quality of the negative sample really matters a lot.
01:04:58 And so what has happened is basically that
01:05:00 typically these methods that are contrastive
01:05:02 really require access to lots of negatives,
01:05:04 which becomes harder and harder to sort of scale
01:05:06 when designing a learning algorithm.
01:05:09 So that’s been one of the reasons
01:05:10 why non contrastive methods have become like popular
01:05:13 and why people think that they’re going to be more useful.
01:05:16 So a non contrastive method, for example,
01:05:18 like clustering is one non contrastive method.
01:05:20 The idea basically being that you have
01:05:22 two of these samples, so the cat and dog
01:05:25 or two crops of this image,
01:05:27 they belong to the same cluster.
01:05:30 And so essentially you’re basically doing clustering online
01:05:33 when you’re learning this network,
01:05:35 and which is very different from having access
01:05:36 to a lot of negatives explicitly.
01:05:38 The other way which has become really popular
01:05:40 is something called self distillation.
01:05:43 So the idea basically is that you have a teacher network
01:05:45 and a student network,
01:05:47 and the teacher network produces a feature.
01:05:49 So it takes in the image
01:05:51 and basically the neural network figures out the patterns
01:05:53 gets the feature out.
01:05:55 And there’s another neural network
01:05:56 which is the student neural network
01:05:57 and that also produces a feature.
01:05:59 And now all you’re doing is basically saying
01:06:01 that the features produced by the teacher network
01:06:03 and the student network should be very similar.
01:06:06 That’s it.
01:06:06 There is no notion of a negative anymore.
01:06:09 And that’s it.
01:06:10 So it’s all about similarity maximization
01:06:11 between these two features.
01:06:13 And so all I need to now do is figure out
01:06:16 how to have these two sorts of parallel networks,
01:06:18 a student network and a teacher network.
01:06:20 And basically researchers have figured out
01:06:23 very cheap methods to do this.
01:06:24 So you can actually have for free really
01:06:26 two types of neural networks.
01:06:29 They’re kind of related,
01:06:30 but they’re different enough that you can actually
01:06:32 basically have a learning problem set up.
01:06:34 So you can ensure that they always remain different enough.
01:06:38 So the thing doesn’t collapse into something boring.
01:06:41 Exactly.
01:06:41 So the main sort of enemy of self supervised learning,
01:06:44 any kind of similarity maximization technique is collapse.
01:06:47 It’s a collapse means that you learn the same feature
01:06:50 representation for all the images in the world,
01:06:53 which is completely useless.
01:06:54 Everything’s a banana.
01:06:55 Everything is a banana.
01:06:56 Everything is a cat.
01:06:57 Everything is a car.
01:06:59 And so all we need to do is basically come up with ways
01:07:02 to prevent collapse.
01:07:03 Contrastive learning is one way of doing it.
01:07:05 And then, for example, clustering or self distillation
01:07:07 are other ways of doing it.
01:07:09 We also had a recent paper where we used like
01:07:11 decorrelation between like two sets of features
01:07:15 to prevent collapse.
01:07:16 So that’s inspired a little bit by like Horace Barlow’s
01:07:18 neuroscience principles.
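A sketch of that decorrelation idea, assuming PyTorch; this follows the general Barlow Twins style of objective (cross-correlation of the two views' features pushed toward the identity), with the weighting term and epsilon chosen arbitrarily for illustration:

    import torch

    def decorrelation_loss(z_a, z_b, lam=5e-3):
        # z_a, z_b: features of two augmented views, shape (batch, dim).
        n, d = z_a.shape
        z_a = (z_a - z_a.mean(0)) / (z_a.std(0) + 1e-6)
        z_b = (z_b - z_b.mean(0)) / (z_b.std(0) + 1e-6)

        c = (z_a.T @ z_b) / n                              # cross-correlation, (dim, dim)
        on_diag = (torch.diagonal(c) - 1).pow(2).sum()     # the two views should agree
        off_diag = (c - torch.diag(torch.diagonal(c))).pow(2).sum()  # different features decorrelated
        return on_diag + lam * off_diag                    # no negatives needed; collapse is penalized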
01:07:20 By the way, I should comment that whoever counts
01:07:23 the number of times the word banana, apple, cat and dog
01:07:27 were using this conversation wins the internet.
01:07:30 I wish you luck.
01:07:32 What is SwAV and the main improvement proposed
01:07:36 in the paper, Unsupervised Learning of Visual Features
01:07:40 by Contrasting Cluster Assignments?
01:07:42 SwAV basically is a clustering based technique,
01:07:46 which is for again, the same thing for self supervised
01:07:49 learning in vision where we have two crops.
01:07:52 And the idea basically is that you want the features
01:07:55 from these two crops of an image to lie in the same cluster
01:07:58 and basically crops that are coming from different images
01:08:02 to be in different clusters.
01:08:03 Now, typically in a sort of,
01:08:05 if you were to do this clustering,
01:08:07 you would perform clustering offline.
01:08:09 What that means is you would,
01:08:11 if you have a dataset of N examples,
01:08:13 you would run over all of these N examples,
01:08:15 get features for them, perform clustering.
01:08:17 So basically get some clusters
01:08:19 and then repeat the process again.
01:08:21 So this is offline basically because I need to do one pass
01:08:24 through the data to compute its clusters.
01:08:27 SwAV is basically just a simple way of doing this online.
01:08:30 So as you’re going through the data,
01:08:31 you’re actually computing these clusters online.
01:08:34 And so of course there is like a lot of tricks involved
01:08:37 in how to do this in a robust manner without collapsing,
01:08:40 but this is this sort of key idea to it.
01:08:42 Is there a nice way to say what is the key methodology
01:08:45 of the clustering that enables that?
01:08:47 Right, so the idea basically is that
01:08:51 when you have N samples,
01:08:52 we assume that we have access to,
01:08:54 like there are always K clusters in a dataset.
01:08:57 K is a fixed number.
01:08:57 So for example, K is 3000.
01:09:00 And so if you have any,
01:09:02 when you look at any sort of small number of examples,
01:09:04 all of them must belong to one of these K clusters.
01:09:08 And we impose this equipartition constraint.
01:09:10 What this means is that basically
01:09:15 your entire set of N samples
01:09:16 should be equally partitioned into K clusters.
01:09:19 So all your K clusters are basically equal,
01:09:21 they have equal contribution to these N samples.
01:09:24 And this ensures that we never collapse.
01:09:26 So collapse can be viewed as a way
01:09:28 in which all samples belong to one cluster, right?
01:09:30 So all this, if all features become the same,
01:09:33 then you have basically just one mega cluster.
01:09:35 You don’t even have like 10 clusters or 3000 clusters.
01:09:38 So SwAV basically ensures that at each point,
01:09:40 all these 3000 clusters are being used
01:09:42 in the clustering process.
01:09:45 And that’s it.
01:09:46 Basically just figure out how to do this online.
01:09:48 And again, basically just make sure
01:09:50 that two crops from the same image belong to the same cluster
01:09:54 and others don’t.
01:09:55 And the fact they have a fixed K makes things simpler.
01:09:58 Fixed K makes things simpler.
01:10:00 Our clustering is not like really hard clustering,
01:10:02 it’s soft clustering.
01:10:03 So basically you can be 0.2 to cluster number one
01:10:06 and 0.8 to cluster number two.
01:10:08 So it’s not really hard.
01:10:09 So essentially, even though we have like 3000 clusters,
01:10:12 we can actually represent a lot of clusters.
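A rough sketch of the online, equipartitioned soft assignment step being described, assuming PyTorch; the real SwAV implementation has more machinery (learned prototypes, a queue, temperatures), so the iteration count, epsilon, and sizes here are illustrative:

    import torch

    @torch.no_grad()
    def soft_equal_assignments(scores, n_iters=3, eps=0.05):
        # scores: (batch, K) similarities between features and K cluster prototypes.
        # Sinkhorn-style normalization: alternately normalize over clusters and
        # samples so every cluster gets roughly equal mass, the equipartition
        # constraint that prevents everything collapsing into one cluster.
        q = torch.exp(scores / eps).T       # (K, batch)
        q /= q.sum()
        K, B = q.shape
        for _ in range(n_iters):
            q /= q.sum(dim=1, keepdim=True)
            q /= K
            q /= q.sum(dim=0, keepdim=True)
            q /= B
        return (q * B).T                    # (batch, K) soft cluster assignments

    # Toy usage: 8 crops scored against 3000 prototypes.
    assignments = soft_equal_assignments(torch.randn(8, 3000))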
01:10:15 What is SEER, S E E R?
01:10:19 And what are the key results and insights in the paper,
01:10:23 Self Supervised Pre Training of Visual Features in the Wild?
01:10:27 What is this big, beautiful SEER system?
01:10:30 SEER, so I’ll first go to Suave
01:10:32 because Suave is actually like one
01:10:34 of the key components for SEER.
01:10:35 So Suave was, when we use Suave,
01:10:37 it was demonstrated on ImageNet.
01:10:39 So typically like self supervised methods,
01:10:42 the way we sort of operate is like in the research community,
01:10:46 we kind of cheat.
01:10:47 So we take ImageNet, which of course I talked about
01:10:49 as having lots of labels.
01:10:51 And then we throw away the labels,
01:10:52 like throw away all the hard work that went behind
01:10:54 basically the labeling process.
01:10:56 And we pretend that it is unsupervised.
01:11:00 But the problem here is that we have,
01:11:02 like when we collected these images,
01:11:05 the ImageNet dataset has a particular distribution
01:11:08 of concepts, right?
01:11:09 So these images are very curated.
01:11:11 And what that means is these images, of course,
01:11:15 belong to a certain set of noun concepts.
01:11:17 And also ImageNet has this bias that all images
01:11:20 contain an object, which is like very big
01:11:22 and it’s typically in the center.
01:11:24 So when you’re talking about a dog, it’s a well framed dog,
01:11:26 it’s towards the center of the image.
01:11:28 So a lot of the data augmentation,
01:11:29 a lot of the sort of hidden assumptions
01:11:31 in self supervised learning,
01:11:33 actually really exploit this bias of ImageNet.
01:11:37 And so, I mean, a lot of my work,
01:11:39 a lot of work from other people always uses ImageNet
01:11:42 sort of as the benchmark to show the success
01:11:44 of self supervised learning.
01:11:45 So you’re implying that there’s particular limitations
01:11:47 to this kind of dataset?
01:11:49 Yes, I mean, it’s basically because our data augmentation
01:11:51 that we designed, like all data augmentation
01:11:55 that we designed for self supervised learning in vision
01:11:57 are kind of overfit to ImageNet.
01:11:59 But you’re saying a little bit hard coded
01:12:02 like the cropping.
01:12:03 Exactly, the cropping parameters,
01:12:05 the kind of lighting that we’re using,
01:12:07 the kind of blurring that we’re using.
01:12:08 Yeah, but you would, for more in the wild dataset,
01:12:11 you would need to be clever or more careful
01:12:16 in setting the range of parameters
01:12:17 and those kinds of things.
01:12:18 So for SEER, our main goal was twofold.
01:12:21 One, basically to move away from ImageNet for training.
01:12:24 So the images that we used were like uncurated images.
01:12:27 Now there’s a lot of debate
01:12:28 whether they’re actually curated or not,
01:12:30 but I’ll talk about that later.
01:12:32 But the idea was basically,
01:12:33 these are going to be random internet images
01:12:36 that we’re not going to filter out
01:12:37 based on like particular categories.
01:12:40 So we did not say that, oh, images that belong to dogs
01:12:42 and cats should be the only images
01:12:44 that come in this dataset, banana.
01:12:47 And basically, other images should be thrown out.
01:12:50 So we didn’t do any of that.
01:12:51 So these are random internet images.
01:12:53 And of course, it also goes back to like the problem
01:12:56 of scale that you talked about.
01:12:57 So these were basically about a billion or so images.
01:13:00 And for context ImageNet,
01:13:01 the ImageNet version that we use
01:13:02 was 1 million images earlier.
01:13:04 So this is basically going like
01:13:05 three orders of magnitude more.
01:13:07 The idea was basically to see
01:13:08 if we can train a very large convolutional model
01:13:11 in a self supervised way on this uncurated,
01:13:14 but really large set of images.
01:13:16 And how well would this model do?
01:13:18 So is self supervised learning really overfit to ImageNet
01:13:21 or can it actually work in the wild?
01:13:23 And it was also out of curiosity,
01:13:25 what kind of things will this model learn?
01:13:27 Will it actually be able to still figure out
01:13:30 different types of objects and so on?
01:13:32 Would there be particular kinds of tasks
01:13:33 that would actually do better than an ImageNet trained model?
01:13:38 And so for SEER, one of our main findings was that
01:13:40 we can actually train very large models
01:13:43 in a completely self supervised way
01:13:44 on lots of internet images
01:13:46 without really necessarily filtering them out.
01:13:48 Which was in itself a good thing
01:13:49 because it’s a fairly simple process, right?
01:13:51 So you get images which are uploaded
01:13:54 and you basically can immediately use them
01:13:55 to train a model in an unsupervised way.
01:13:57 You don’t really need to sit and filter them out.
01:13:59 These images can be cartoons, these can be memes,
01:14:02 these can be actual pictures uploaded by people.
01:14:04 And you don’t really care about what these images are.
01:14:06 You don’t even care about what concepts they contain.
01:14:08 So this was a very sort of simple setup.
01:14:10 What image selection mechanism would you say
01:14:12 is there like inherent in some aspect of the process?
01:14:18 So you’re kind of implying that there’s almost none,
01:14:21 but what is there would you say if you were to introspect?
01:14:24 Right, so it’s not like uncurated can basically
01:14:28 like one way of imagining uncurated
01:14:30 is basically you have like cameras
01:14:32 that can take pictures at random viewpoints.
01:14:35 When people upload pictures to the internet,
01:14:37 they are typically going to care about the framing of it.
01:14:40 They’re not going to upload, say,
01:14:41 the picture of a zoomed in wall, for example.
01:14:43 Well, when you say internet, do you mean social networks?
01:14:46 Yes. Okay.
01:14:47 So these are not going to be like pictures
01:14:48 of like a zoomed in table or a zoomed in wall.
01:14:51 So it’s not really completely uncurated
01:14:53 because people do have the like photographer’s bias
01:14:55 where they do want to keep things
01:14:57 towards the center a little bit,
01:14:58 or like really have like nice looking things
01:15:01 and so on in the picture.
01:15:02 So that’s the kind of bias that typically exists
01:15:05 in this data set and also the user base, right?
01:15:07 You’re not going to get lots of pictures
01:15:09 from different parts of the world
01:15:10 because there are certain parts of the world
01:15:12 where people may not actually be uploading
01:15:14 a lot of pictures to the internet
01:15:15 or may not even have access to a lot of internet.
01:15:17 So this is a giant data set and a giant neural network.
01:15:21 I don’t think we’ve talked about what architectures
01:15:24 work well for SSL, for self supervised learning.
01:15:29 For SEER and for SwAV, we were using convolutional networks,
01:15:32 but recently in a work called DINO,
01:15:34 we've basically started using transformers for vision.
01:15:36 Both seem to work really well, ConvNets and transformers.
01:15:39 And depending on what you want to do,
01:15:41 you might choose to use a particular formulation.
01:15:43 So for SEER, it was a ConvNet.
01:15:45 It was particularly a RegNet model,
01:15:47 which was also a work from Facebook.
01:15:49 RegNets are like really good when it comes to compute
01:15:52 versus like accuracy.
01:15:54 So because it was a very efficient model,
01:15:56 compute and memory wise efficient,
01:15:59 and basically it worked really well in terms of scaling.
01:16:02 So we used a very large RegNet model
01:16:04 and trained it on a billion images.
01:16:05 Can you maybe quickly comment on what RegNets are?
01:16:09 It comes from this paper, Designing Network Design Spaces.
01:16:13 This is a super interesting concept
01:16:15 that emphasizes how to create efficient neural networks,
01:16:18 large neural networks.
01:16:19 So one of the sort of key takeaways from this paper,
01:16:21 which the authors, like whenever you hear them
01:16:23 present this work, they keep saying is,
01:16:26 a lot of neural networks are characterized
01:16:27 in terms of flops, right?
01:16:29 Flops basically being the floating point operations.
01:16:31 And people really love to use flops to say,
01:16:33 this model is like really computationally heavy,
01:16:36 or like our model is computationally cheap and so on.
01:16:39 Now it turns out that flops are really not a good indicator
01:16:41 of how well a particular network is,
01:16:43 like how efficient it is really.
01:16:45 And what a better indicator is, is the activation
01:16:49 or the memory that is being used by this particular model.
01:16:52 And so designing, like one of the key findings
01:16:55 from this paper was basically that you need to design
01:16:57 network families or neural network architectures
01:17:00 that are actually very efficient in the memory space as well,
01:17:02 not just in terms of pure flops.
01:17:04 So RegNet is basically a network architecture family
01:17:07 that came out of this paper that is particularly good
01:17:10 at both flops and the sort of memory required for it.
01:17:13 And of course it builds upon like earlier work,
01:17:15 like ResNet being like the sort of more popular inspiration
01:17:18 for it, where you have residual connections.
01:17:20 But one of the things in this work is basically
01:17:22 they also use like squeeze excitation blocks.
01:17:25 So it’s a lot of nice sort of technical innovation
01:17:27 in all of this from prior work,
01:17:28 and a lot of the ingenuity of these particular authors
01:17:31 in how to combine these multiple building blocks.
01:17:34 But the key constraint was optimize for both flops
01:17:36 and memory when you’re basically doing this,
01:17:38 don’t just look at flops.
01:17:39 And that allows you to have,
01:17:42 sort of, very large networks through this process
01:17:47 that are optimized for efficiency, for low memory.
01:17:51 Also in just in terms of pure hardware,
01:17:53 they fit very well on GPU memory.
01:17:55 So they can be like really powerful neural network
01:17:57 architectures with lots of parameters, lots of flops,
01:18:00 but also because they’re like efficient in terms of
01:18:02 the amount of memory that they’re using,
01:18:04 you can actually fit a lot of these on like a,
01:18:06 you can fit a very large model on a single GPU for example.
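As a small illustration, a RegNet trunk can be instantiated from torchvision (assuming a recent version that ships the RegNet family); the SEER models themselves were far larger RegNets than this, so this is only a toy stand-in:

    import torch
    import torchvision.models as models

    backbone = models.regnet_y_16gf()      # randomly initialized, no labels involved
    backbone.fc = torch.nn.Identity()      # drop the classification head, keep the trunk features

    images = torch.randn(2, 3, 224, 224)   # a toy batch of images
    features = backbone(images)            # (batch, feature_dim) vectors for self supervised training
    print(features.shape)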
01:18:09 Would you say that the choice of architecture
01:18:14 matters more than the choice of maybe data augmentation
01:18:17 techniques?
01:18:18 Is there a possibility to say what matters more?
01:18:21 You kind of imply that you can probably go really far
01:18:24 with just using basic ConvNets.
01:18:27 All right, I think like data and data augmentation,
01:18:30 the algorithm being used for the self supervised training
01:18:33 matters a lot more than the particular kind of architecture.
01:18:36 With different types of architecture,
01:18:37 you will get different like properties in the resulting
01:18:40 sort of representation.
01:18:41 But really, I mean, the secret sauce is in the augmentation
01:18:44 and the algorithm being used to train them.
01:18:47 The architectures, I mean, at this point,
01:18:49 a lot of them perform very similarly,
01:18:51 depending on like the particular task that you care about,
01:18:53 they have certain advantages and disadvantages.
01:18:56 Is there something interesting to be said about what it
01:18:58 takes with Sears to train a giant neural network?
01:19:01 You’re talking about a huge amount of data,
01:19:04 a huge neural network.
01:19:05 Is there something interesting to be said of how to
01:19:08 effectively train something like that fast?
01:19:11 Lots of GPUs.
01:19:13 Okay.
01:19:15 I mean, so the model was like a billion parameters.
01:19:18 And it was trained on a billion images.
01:19:20 So if like, basically the same number of parameters
01:19:23 as the number of images, and it took a while.
01:19:26 I don’t remember the exact number, it’s in the paper,
01:19:28 but it took a while.
01:19:31 I guess I’m trying to get at is,
01:19:34 when you’re thinking of scaling this kind of thing,
01:19:38 I mean, one of the exciting possibilities of self
01:19:42 supervised learning is the several orders of magnitude
01:19:45 scaling of everything, both the neural network
01:19:49 and the size of the data.
01:19:50 And so the question is,
01:19:52 do you think there’s some interesting tricks to do large
01:19:56 scale distributed compute,
01:19:57 or is that really outside of even deep learning?
01:20:00 That’s more about like hardware engineering.
01:20:04 I think more and more there is like this,
01:20:07 a lot of like systems are designed,
01:20:10 basically taking into account
01:20:11 the machine learning needs, right?
01:20:12 So because whenever you’re doing this kind of
01:20:14 distributed training, there is a lot of intercommunication
01:20:17 between nodes.
01:20:17 So like gradients or the model parameters are being passed.
01:20:20 So you really want to minimize communication costs
01:20:22 when you really want to scale these models up.
01:20:25 You want basically to be able to do as much,
01:20:29 like as limited amount of communication as possible.
01:20:31 So currently like a dominant paradigm
01:20:33 is synchronized sort of training.
01:20:35 So essentially after every sort of gradient step,
01:20:38 all you basically have like a synchronization step
01:20:41 between all the sort of compute chips
01:20:43 that you’re going on with.
01:20:45 I think asynchronous training was popular,
01:20:47 but it doesn’t seem to perform as well.
01:20:50 But in general, I think that’s sort of the,
01:20:53 I guess it’s outside my scope as well.
01:20:55 But the main thing is like minimize the amount of
01:21:00 synchronization steps that you have.
01:21:01 That has been the key takeaway, at least in my experience.
01:21:04 The others I have no idea about, how to design the chip.
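A bare-bones sketch of that synchronous setup in PyTorch, using DistributedDataParallel, which all-reduces (averages) gradients across workers after every backward pass; the toy model, data, and launch assumptions (e.g. being started via torchrun so the rank and world size are set) are illustrative:

    import torch
    import torch.distributed as dist
    from torch.nn.parallel import DistributedDataParallel as DDP

    def main():
        dist.init_process_group(backend="nccl")   # assumes launch via torchrun
        device = torch.device("cuda", dist.get_rank() % torch.cuda.device_count())

        model = DDP(torch.nn.Linear(128, 10).to(device), device_ids=[device.index])
        optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

        for step in range(100):
            x = torch.randn(32, 128, device=device)   # stand-in for a real data loader
            loss = model(x).pow(2).mean()
            optimizer.zero_grad()
            loss.backward()     # gradients are synchronized across all workers here
            optimizer.step()

        dist.destroy_process_group()

    if __name__ == "__main__":
        main()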
01:21:06 Yeah, there’s very few things that I see Jim Keller’s eyes
01:21:11 light up as much as talking about giant computers doing
01:21:15 like that fast communication that you’re talking to well
01:21:18 when they’re training machine learning systems.
01:21:21 What is VISSL, V I S S L, the PyTorch based SSL library?
01:21:27 What are the use cases that you might have?
01:21:30 VISSL basically was born out of a lot of us at Facebook
01:21:33 doing the self supervised learning research.
01:21:35 So it’s a common framework in which we have like a lot of
01:21:38 self supervised learning methods implemented for vision.
01:21:41 It’s also, it has in itself like a benchmark of tasks
01:21:45 that you can evaluate the self supervised representations on.
01:21:48 So the use case for it is basically for anyone who’s either
01:21:51 trying to evaluate their self supervised model
01:21:53 or train their self supervised model,
01:21:56 or a researcher who’s trying to build
01:21:57 a new self supervised technique.
01:21:59 So it’s basically supposed to be all of these things.
01:22:01 So as a researcher before VISSL, for example,
01:22:04 or like when we started doing this work fairly seriously
01:22:06 at Facebook, it was very hard for us to go and implement
01:22:09 every self supervised learning model,
01:22:11 test it out in a like sort of consistent manner.
01:22:14 The experimental setup was very different
01:22:16 across different groups.
01:22:18 Even when someone said that they were reporting
01:22:20 ImageNet accuracy, it could mean lots of different things.
01:22:23 So with VISSL, we tried to really sort of standardize that
01:22:25 as much as possible.
01:22:26 And there was a paper like we did in 2019
01:22:28 just about benchmarking.
01:22:29 And so VISSL basically builds upon a lot of this kind of work
01:22:32 that we did about like benchmarking.
01:22:35 And then every time we try to like,
01:22:37 we come up with a self supervised learning method,
01:22:39 a lot of us try to push that into VISSL as well,
01:22:41 just so that it basically is like the central piece
01:22:43 where a lot of these methods can reside.
01:22:46 Just out of curiosity, people may be,
01:22:49 so certainly outside of Facebook, but just researchers,
01:22:52 or just even people that know how to program in Python
01:22:54 and know how to use PyTorch, what would be the use case?
01:22:58 What would be a fun thing to play around with VISSL on?
01:23:01 Like what’s a fun thing to play around
01:23:04 with self supervised learning on, would you say?
01:23:07 Is there a good Hello World program?
01:23:09 Like is it always about big size that’s important to have,
01:23:14 or is there fun little smaller case playgrounds
01:23:18 to play around with?
01:23:19 So we’re trying to like push something towards that.
01:23:22 I think there are a few setups out there,
01:23:24 but nothing like super standard on the smaller scale.
01:23:26 I mean, ImageNet in itself is actually pretty big also.
01:23:29 So that is not something
01:23:31 which is like feasible for a lot of people.
01:23:33 But we are trying to like push up
01:23:34 with like smaller sort of use cases.
01:23:36 The thing is, at a smaller scale,
01:23:39 a lot of the observations
01:23:40 or a lot of the algorithms that work
01:23:41 don’t necessarily translate into the medium
01:23:43 or the larger scale.
01:23:45 So it’s really tricky to come up
01:23:46 with a good small scale setup
01:23:47 where a lot of your empirical observations
01:23:49 will really translate to the other setup.
01:23:51 So it’s been really challenging.
01:23:53 I’ve been trying to do that for a little bit as well
01:23:54 because it does take time to train stuff on ImageNet.
01:23:56 It does take time to train on like more images,
01:23:59 but pretty much every time I’ve tried to do that,
01:24:02 it’s been unsuccessful
01:24:03 because all the observations I draw
01:24:04 from my set of experiments on a smaller data set
01:24:07 don’t translate into ImageNet
01:24:09 or like don’t translate into another sort of data set.
01:24:11 So it’s been hard for us to figure this one out,
01:24:14 but it’s an important problem.
01:24:15 So there’s this really interesting idea
01:24:17 of learning across multiple modalities.
01:24:20 You have a CVPR 2021 best paper candidate
01:24:26 titled audio visual instance discrimination
01:24:29 with cross modal agreement.
01:24:31 What are the key results, insights in this paper
01:24:33 and what can you say in general
01:24:35 about the promise and power of multimodal learning?
01:24:37 For this paper, it actually came as a little bit
01:24:40 of a shock to me at how well it worked.
01:24:41 So I can describe what the problem set up was.
01:24:44 So it’s been used in the past by lots of folks
01:24:46 like for example, Andrew Owens from MIT,
01:24:48 Alyosha Efros from Berkeley,
01:24:49 Andrew Zisserman from Oxford.
01:24:51 So a lot of these people have been
01:24:52 sort of showing results in this.
01:24:53 Of course, I was aware of this result,
01:24:55 but I wasn’t really sure how well it would work in practice
01:24:58 for like other sort of downstream tasks.
01:25:00 So the results kept getting better.
01:25:02 And I wasn’t sure if like a lot of our insights
01:25:04 from self supervised learning would translate
01:25:05 into this multimodal learning problem.
01:25:08 So multimodal learning is when you have like,
01:25:12 when you have multiple modalities.
01:25:14 That’s not even cool.
01:25:15 Okay, so the particular modalities
01:25:19 that we worked on in this work were audio and video.
01:25:22 So the idea was basically, if you have a video,
01:25:23 you have its corresponding audio track.
01:25:25 And you want to use both of these signals,
01:25:27 the audio signal and the video signal
01:25:29 to learn a good representation for video
01:25:31 and good representation for audio.
01:25:32 Like this podcast.
01:25:33 Like this podcast, exactly.
01:25:35 So what we did in this work was basically train
01:25:38 two different neural networks,
01:25:39 one on the video signal, one on the audio signal.
01:25:41 And what we wanted is basically the features
01:25:43 that we get from both of these neural networks
01:25:45 should be similar.
01:25:46 So it should basically be able to produce
01:25:48 the same kinds of features from the video
01:25:51 and the same kinds of features from the audio.
01:25:53 Now, why is this useful?
01:25:54 Well, for a lot of these objects that we have,
01:25:56 there is a characteristic sound, right?
01:25:58 So trains, when they go by,
01:25:59 they make a particular kind of sound.
01:26:00 Boats make a particular kind of sound.
01:26:02 People, when they’re jumping around,
01:26:03 will like shout, whatever.
01:26:06 Bananas don’t make a sound.
01:26:07 So where you can’t learn anything about bananas there.
01:26:09 Or when humans mentioned bananas.
01:26:11 Well, yes, when they say the word banana, then.
01:26:13 So you can’t trust basically anything
01:26:15 that comes out of a human’s mouth as a source,
01:26:17 that source of audio is useless.
01:26:19 The typical use case is basically like,
01:26:20 for example, someone playing a musical instrument.
01:26:22 So guitars have a particular kind of sound and so on.
01:26:24 So because a lot of these things are correlated,
01:26:27 the idea in multimodal learning
01:26:28 is to take these two kinds of modalities,
01:26:30 video and audio, and learn a common embedding space,
01:26:33 a common feature space where both of these
01:26:35 related modalities can basically be close together.
01:26:38 And again, you use contrastive learning for this.
01:26:40 So in contrastive learning, basically the video
01:26:43 and the corresponding audio are positives.
01:26:45 And you can take any other video or any other audio
01:26:48 and that becomes a negative.
01:26:49 And so basically that’s it.
01:26:51 It’s just a simple application of contrastive learning.
01:26:53 The main sort of finding from this work for us
01:26:56 was basically that you can actually learn
01:26:58 very, very powerful feature representations,
01:27:00 very, very powerful video representations.
01:27:02 So you can learn the sort of video network
01:27:05 that we ended up learning can actually be used
01:27:07 for downstream, for example, recognizing human actions
01:27:11 or recognizing different types of sounds, for example.
01:27:14 So this was sort of the key finding.
01:27:17 Can you give kind of an example of a human action
01:27:20 or like just so we can build up intuition
01:27:23 of what kind of thing?
01:27:24 Right, so there is this data set called kinetics,
01:27:26 for example, which has like 400 different types
01:27:28 of human actions.
01:27:29 So people jumping, people doing different kinds of sports
01:27:32 or different types of swimming.
01:26:34 So like different strokes in swimming, golf and so on.
01:27:37 So there are like just different types of actions
01:27:39 right there.
01:27:40 And the point is this kind of video network
01:27:42 that you learn in a self supervised way
01:27:44 can be used very easily to kind of recognize
01:27:46 these different types of actions.
01:27:48 It can also be used for recognizing
01:27:50 different types of objects.
01:27:53 And what we did is we tried to visualize
01:27:54 whether the network can figure out
01:27:56 where the sound is coming from.
01:27:57 So basically, give it a video
01:27:59 and basically play, say, a video of a person just strumming a guitar,
01:28:03 but of course, there is no audio in this.
01:28:04 And now you give it this sound of a guitar.
01:28:07 And you ask like basically try to visualize
01:28:08 where the network thinks the sound is coming from.
01:28:12 And it can kind of basically, like
01:28:14 when you visualize it,
01:28:15 you can see that it’s basically focusing on the guitar.
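As a rough sketch of how such a visualization could be produced (illustrative assumptions only, not the exact procedure from the paper): compare the audio embedding against the video network's spatial feature map and look at where the cosine similarity is high.

```python
import torch
import torch.nn.functional as F

# Hypothetical outputs: a spatial feature map from the video network (before
# global pooling) and a single embedding from the audio network, both 128-dim.
video_feature_map = torch.randn(128, 14, 14)   # (channels, height, width)
audio_embedding = torch.randn(128)

# Cosine similarity between the audio embedding and every spatial location.
v = F.normalize(video_feature_map.flatten(1), dim=0)   # (128, 196), unit-norm columns
a = F.normalize(audio_embedding, dim=0)                 # (128,)
heatmap = (a @ v).reshape(14, 14)                       # (14, 14) similarity map

# Upsample to image resolution and overlay on the frame; with guitar audio,
# the guitar region is where you would hope to see the heatmap light up.
heatmap = F.interpolate(heatmap[None, None], size=(224, 224),
                        mode="bilinear", align_corners=False)[0, 0]
print(heatmap.shape)
```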
01:28:17 Yeah, that’s surreal.
01:28:18 And the same thing, for example,
01:28:20 for certain people’s voices,
01:28:21 like famous celebrities voices,
01:28:22 it can actually figure out where their mouth is.
01:28:26 So it can actually distinguish different people’s voices,
01:28:28 for example, a little bit as well.
01:28:30 Without that ever being annotated in any way.
01:28:33 Right, so this is all what it had discovered.
01:28:35 We never pointed out that this is a guitar
01:28:38 and this is the kind of sound it produces.
01:28:40 It can actually naturally figure that out
01:28:41 because it’s seen so many correlations of this sound
01:28:44 coming with this kind of like an object
01:28:46 that it basically learns to associate this sound
01:28:49 with this kind of an object.
01:28:50 Yeah, that’s really fascinating, right?
01:28:52 That’s really interesting.
01:28:53 So the idea with this kind of network
01:28:55 is that you then fine tune it for a particular task.
01:28:57 So this is forming like a really good knowledge base
01:29:01 within a neural network based on which you could then
01:29:04 then train a little bit more to accomplish a specific task.
01:29:07 Well, so you don’t need a lot of videos of humans
01:29:11 doing actions annotated.
01:29:12 You can just use a few of them to basically get your.
01:29:16 How much insight do you draw from the fact
01:29:18 that it can figure out where the sound is coming from?
01:29:23 I’m trying to see, so that’s kind of very,
01:29:26 it’s very CVPR beautiful, right?
01:29:28 It’s a cool little insight.
01:29:30 I wonder how profound that is.
01:29:33 Does it speak to the idea that multiple modalities
01:29:39 are somehow much bigger than the sum of their parts?
01:29:44 Or is it really, really useful to have multiple modalities?
01:29:48 Or is it just that cool thing that there’s parts
01:29:50 of our world that can be revealed like effectively
01:29:57 through multiple modalities,
01:29:58 but most of it is really all about vision
01:30:01 or about one of the modalities.
01:30:03 I would say a little tending more towards the second part.
01:30:07 So most of it can be sort of figured out with one modality,
01:30:10 but having an extra modality always helps you.
01:30:13 So in this case, for example,
01:30:14 like one thing is when you’re,
01:30:17 if you observe someone cutting something
01:30:19 and you don’t have any sort of sound there,
01:30:21 whether it’s an apple or whether it’s an onion,
01:30:25 it’s very hard to figure that out.
01:30:26 But if you hear someone cutting it,
01:30:28 it’s very easy to figure it out because apples and onions
01:30:30 make a very different kind of characteristic sound
01:30:33 when they’re cut.
01:30:34 So to really figure this out, based on audio,
01:30:36 it’s much easier.
01:30:38 So your life will become much easier
01:30:40 when you have access to different kinds of modalities.
01:30:42 And the other thing is, so I like to relate it in this way,
01:30:45 it may be like completely wrong,
01:30:46 but the distributional hypothesis in NLP,
01:30:49 where context basically gives kind of meaning to that word,
01:30:53 sound kind of does that too.
01:30:55 So if you have the same sound,
01:30:57 so that’s the same context across different videos,
01:30:59 you’re very likely to be observing the same kind of concept.
01:31:03 So that’s the kind of reason
01:31:04 why it figures out the guitar thing, right?
01:31:06 It observed the same sound across multiple different videos
01:31:09 and it figures out maybe this is the common factor
01:31:11 that’s actually doing it.
01:31:13 I wonder, I used to have this argument with my dad a bunch
01:31:17 for creating general intelligence,
01:31:19 whether smell is an important,
01:31:22 like if that’s important sensory information,
01:31:25 mostly we’re talking about like falling in love
01:31:27 with an AI system and for him,
01:31:30 smell and touch are important.
01:31:31 And I was arguing that it’s not at all.
01:31:33 It’s important, it’s nice and everything,
01:31:35 but like you can fall in love with just language really,
01:31:38 but a voice is very powerful and vision is next
01:31:41 and smell is not that important.
01:31:43 Can I ask you about this process of active learning?
01:31:46 You mentioned interactivity.
01:31:49 Right.
01:31:50 Is there some value
01:31:52 within the self supervised learning context
01:31:57 to select parts of the data in intelligent ways
01:32:02 such that they would most benefit the learning process?
01:32:06 So I think so.
01:32:07 I mean, I know I’m talking to an active learning fan here,
01:32:10 so of course I know the answer.
01:32:12 First you were talking bananas
01:32:14 and now you’re talking about active learning.
01:32:15 I love it.
01:34:16 I think Yann LeCun told me that active learning
01:32:18 is not that interesting.
01:32:20 I think back then I didn’t want to argue with him too much,
01:32:24 but when we talk again,
01:32:26 we’re gonna spend three hours arguing about active learning.
01:32:28 My sense was you can go extremely far with active learning,
01:32:32 perhaps farther than anything else.
01:32:34 Like to me, there’s this kind of intuition
01:32:37 that similar to data augmentation,
01:32:40 you can get a lot from the data,
01:32:45 from intelligent optimized usage of the data.
01:32:50 I’m trying to speak generally in such a way
01:32:53 that includes data augmentation
01:32:55 and active learning,
01:32:57 that there’s something about maybe interactive exploration
01:32:59 of the data that at least is part
01:33:03 of the solution to intelligence, like an important part.
01:33:07 I don’t know what your thoughts are
01:33:08 on active learning in general.
01:33:09 I actually really like active learning.
01:33:10 So back in the day we did this largely ignored CVPR paper
01:34:14 called Learning by Asking Questions.
01:33:16 So the idea was basically you would train an agent
01:33:18 that would ask a question about the image.
01:33:20 It would get an answer
01:33:21 and basically then it would update itself.
01:33:23 It would see the next image.
01:33:24 It would decide what’s the next hardest question
01:33:26 that I can ask to learn the most.
01:33:28 And the idea was basically because it was being smart
01:33:31 about the kinds of questions it was asking,
01:33:33 it would learn in fewer samples.
01:33:35 It would be more efficient at using data.
01:33:37 And we did find to some extent
01:33:39 that it was actually better than randomly asking questions.
01:33:42 Kind of weird thing about active learning
01:33:43 is it’s also a chicken and egg problem
01:33:45 because when you look at an image,
01:33:47 to ask a good question about the image,
01:33:48 you need to understand something about the image.
01:33:50 You can’t ask a completely arbitrarily random question.
01:33:53 It may not even apply to that particular image.
01:33:55 So there is some amount of understanding or knowledge
01:33:57 that basically keeps getting built
01:33:59 when you’re doing active learning.
01:34:01 So I think active learning by itself is really good.
01:34:04 And the main thing we need to figure out is basically
01:34:07 how do we come up with a technique
01:34:09 to first model what the model knows
01:34:13 and also model what the model does not know.
01:34:16 I think that’s the sort of beauty of it.
01:34:18 Because when you know that there are certain things
01:34:20 that you don’t know anything about,
01:34:22 asking a question about those concepts
01:34:23 is actually going to bring you the most value.
01:34:26 And I think that’s the sort of key challenge.
01:34:28 Now, self supervised learning by itself,
01:34:29 like selecting data for it and so on,
01:34:31 that’s actually really useful.
01:34:32 But I think that’s a very narrow view
01:34:33 of looking at active learning.
01:34:35 If you look at it more broadly,
01:34:36 it is basically about if the model has a knowledge
01:34:40 about N concepts,
01:34:41 and it is weak basically about certain things.
01:34:43 So it needs to ask questions
01:34:45 either to discover new concepts
01:34:46 or to basically increase its knowledge
01:34:49 about these N concepts.
01:34:50 So at that level, it’s a very powerful technique.
01:34:53 I actually do think it’s going to be really useful.
01:34:56 Even in like simple things such as like data labeling,
01:34:59 it’s super useful.
01:35:00 So here is like one simple way
01:35:02 that you can use active learning.
01:35:04 For example, you have your self supervised model,
01:35:06 which is very good at predicting similarities
01:35:08 and dissimilarities between things.
01:35:10 And so if you label a picture as basically say a banana,
01:35:15 now you know that all the images
01:35:17 that are very similar to this image
01:35:19 are also likely to contain bananas.
01:35:21 So probably when you want to understand
01:35:24 what else is a banana,
01:35:25 you’re not going to use these other images.
01:35:26 You’re actually going to use an image
01:35:28 that is not completely dissimilar,
01:35:31 but somewhere in between,
01:35:32 which is not super similar to this image,
01:35:33 but not super dissimilar either.
01:35:35 And that’s going to tell you a lot more
01:35:37 about what this concept of a banana is.
01:35:39 So that’s kind of a heuristic.
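Here is a minimal sketch of that heuristic, assuming you already have unit-normalized features from a self supervised model; the similarity band thresholds are made-up numbers you would tune in practice.

```python
import numpy as np

def pick_next_to_label(labeled_feat, unlabeled_feats, low=0.4, high=0.7):
    """Given the feature of one labeled image (say, a banana) and features of
    unlabeled images, pick an image in an intermediate similarity band:
    not a near-duplicate of the labeled one, not completely unrelated either."""
    sims = unlabeled_feats @ labeled_feat                 # cosine similarities
    in_band = np.where((sims >= low) & (sims <= high))[0]
    if len(in_band) == 0:
        # Fall back to the most ambiguous image overall.
        return int(np.argmin(np.abs(sims - (low + high) / 2)))
    # Among in-band candidates, take the one closest to the middle of the band.
    return int(in_band[np.argmin(np.abs(sims[in_band] - (low + high) / 2))])

# Toy usage with random unit-norm features from a hypothetical model.
rng = np.random.default_rng(0)
feats = rng.normal(size=(100, 64))
feats /= np.linalg.norm(feats, axis=1, keepdims=True)
print(pick_next_to_label(feats[0], feats[1:]))
```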
01:35:41 I wonder if it’s possible to also learn ways
01:35:46 to discover the most likely,
01:35:50 the most beneficial image.
01:35:52 So like, so not just looking at a thing
01:35:54 that’s somewhat similar to a banana,
01:35:58 but not exactly similar,
01:35:59 but have some kind of more complicated learning system,
01:36:03 like learned discovering mechanism
01:36:07 that tells you what image to look for.
01:36:09 Like how, yeah, like actually in a self supervised way,
01:36:14 learning strictly a function that says,
01:36:17 is this image going to be very useful to me
01:36:20 given what I currently know?
01:36:22 I think there’s a lot of synergy there.
01:36:23 It’s just, I think, yeah, it’s going to be explored.
01:36:27 I think very much related to that.
01:36:29 I kind of think of what Tesla Autopilot is doing
01:36:33 currently as kind of active learning.
01:36:36 There’s something that Andrej Karpathy and their team
01:36:39 are calling a data engine.
01:36:41 So you’re basically deploying a bunch of instantiations
01:36:45 of a neural network into the wild,
01:36:47 and they’re collecting a bunch of edge cases
01:36:50 that are then sent back for annotation,
01:36:53 and edge cases are defined as near failure
01:36:56 or some weirdness on a particular task
01:36:59 that’s then sent back.
01:37:01 It’s that not exactly a banana,
01:37:04 but almost the banana cases sent back for annotation.
01:37:07 And then there’s this loop that keeps going
01:37:09 and you keep retraining and retraining.
01:37:11 And the active learning step there,
01:37:13 or whatever you want to call it,
01:37:14 is the cars themselves that are sending you back the data.
01:37:19 Like, what the hell happened here?
01:37:20 This was weird.
01:37:22 What are your thoughts about that sort of deployment
01:37:26 of neural networks in the wild?
01:37:28 Another way to ask the question, but first, just your thoughts.
01:37:31 And maybe if you want to comment,
01:37:33 are there applications for autonomous driving,
01:37:36 like computer vision based autonomous driving,
01:37:40 applications of self supervised learning
01:37:42 in the context of computer vision based autonomous driving?
01:37:47 So I think so.
01:37:48 I think for self supervised learning
01:37:49 to be used in autonomous driving,
01:37:50 there are lots of opportunities.
01:37:51 I mean, just like pure consistency in predictions
01:37:54 is one way, right?
01:37:55 So because you have this nice sequence of data
01:38:00 that is coming in, a video stream of it,
01:38:02 associated of course with the actions
01:38:04 that say the car took,
01:38:05 you can form a very nice predictive model
01:38:07 of what’s happening.
01:38:08 So for example, like all the way,
01:38:11 like one way possibly in which they’re figuring out
01:38:14 what data to get labeled is basically
01:38:15 through prediction uncertainty, right?
01:38:17 So you predict that the car was going to turn right.
01:38:20 So this was the action that was going to happen,
01:38:21 say in the shadow mode.
01:38:23 And now the driver turned left.
01:38:24 And this is a really big surprise.
01:38:27 So basically by forming these good predictive models,
01:38:30 you are, I mean, these are kind of self supervised models.
01:38:32 Prediction models are basically being trained
01:38:34 just by looking at what’s going to happen next
01:38:36 and asking them to predict what’s going to happen next.
01:38:38 So I would say this is really like one use
01:38:40 of self supervised learning.
01:38:42 It’s a predictive model
01:38:43 and you’re learning a predictive model
01:38:44 basically just by looking at what data you have.
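A hedged sketch of that prediction-uncertainty idea in shadow mode: flag the moments where the model assigned low probability to what the driver actually did. The action space and threshold here are illustrative assumptions, not any company's actual pipeline.

```python
import torch
import torch.nn.functional as F

def surprising_frames(action_logits, driver_actions, threshold=2.0):
    """Flag time steps where the predictive model was confidently wrong about
    what the driver would do next. action_logits: (T, num_actions) shadow-mode
    predictions; driver_actions: (T,) what the human actually did.
    Returns indices worth sending back for annotation and retraining."""
    # Surprise = negative log-probability the model assigned to the real action.
    surprise = F.cross_entropy(action_logits, driver_actions, reduction="none")
    return torch.nonzero(surprise > threshold).flatten()

# Toy usage: three actions (left, straight, right) with random "predictions".
logits = torch.randn(1000, 3)
actions = torch.randint(0, 3, (1000,))
print(surprising_frames(logits, actions)[:10])
```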
01:38:46 Is there something about that active learning context
01:38:49 that you find insights from?
01:38:53 Like that kind of deployment of the system,
01:38:54 seeing cases where it doesn’t perform as you expected
01:38:59 and then retraining the system based on that?
01:39:01 I think that, I mean, that really resonates with me.
01:39:03 It’s super smart to do it that way.
01:39:05 Because I mean, the thing is with any kind
01:39:08 of like practical system, like autonomous driving,
01:39:11 there are those edge cases that are the things
01:39:13 that are actually the problem, right?
01:39:14 I mean, highway driving or like freeway driving
01:39:17 has basically been like,
01:39:19 there has been a lot of success in that particular part
01:39:21 of autonomous driving for a long time.
01:39:22 I would say like since the eighties or something.
01:39:25 Now the point is all these failure cases
01:39:28 are the sort of reason why autonomous driving
01:39:30 hasn’t become like super, super mainstream and available
01:39:33 like in every possible car right now.
01:39:35 And so basically by really scaling this problem out
01:39:38 by really trying to get all of these edge cases out
01:39:40 as quickly as possible,
01:39:41 and then just like using those to improve your model,
01:39:43 that’s super smart.
01:39:45 And prediction uncertainty to do that
01:39:47 is like one really nice way of doing it.
01:39:49 Let me put you on the spot.
01:39:52 So we mentioned offline Jitendra,
01:39:55 he thinks that the Tesla computer vision approach
01:39:58 or really any approach for autonomous driving
01:40:00 is very far away.
01:40:02 How many years away,
01:40:05 if you have to bet all your money on it,
01:40:06 are we to solving autonomous driving
01:40:09 with this kind of computer vision only
01:40:12 machine learning based approach?
01:40:13 Okay, so what does solving autonomous driving mean?
01:40:15 Does it mean solving it in the US?
01:40:17 Does it mean solving it in India?
01:40:18 Because I can tell you
01:40:19 that very different types of driving are happening.
01:40:21 Not India, not Russia.
01:40:23 In the United States, autonomous,
01:40:26 so what solving means is when the car says it has control,
01:40:31 it is fully liable.
01:40:34 You can go to sleep, it’s driving by itself.
01:40:37 So this is highway and city driving,
01:40:39 but not everywhere, but mostly everywhere.
01:40:42 And it’s, let’s say significantly better,
01:40:45 like say five times less accidents than humans.
01:40:50 Sufficiently safer such that the public feels
01:40:53 like that transition is enticing and beneficial
01:40:57 both for our safety and financial
01:40:59 and all those kinds of things.
01:41:01 Okay, so first disclaimer,
01:41:02 I’m not an expert in autonomous driving.
01:41:04 So let me put it out there.
01:41:05 I would say like at least five to 10 years.
01:41:09 This would be my guess from now.
01:41:12 Yeah, I’m actually very impressed.
01:41:14 Like when I sat in a friend’s Tesla recently
01:41:16 and of course, like looking on that screen,
01:41:20 it basically shows all the detections and everything
01:41:22 the car is doing as you’re driving by,
01:41:24 and that’s super distracting for me as a person
01:41:26 because all I keep looking at is like the bounding boxes
01:41:29 and the cars it’s tracking, and it’s really impressive.
01:41:31 Like especially when it’s raining and it’s able to do that,
01:41:34 that was the most impressive part for me.
01:41:36 It’s actually able to get through rain and do that.
01:41:38 And one of the reasons why like a lot of us believed
01:41:41 and I would put myself in that category
01:41:44 is LIDAR based sort of technology for autonomous driving
01:41:47 was the key driver, right?
01:41:48 So Waymo was using it for the longest time.
01:41:50 And Tesla then decided to go this completely other route
01:41:53 that we are not going to even use LIDAR.
01:41:55 So their initial system I think was camera and radar based
01:41:58 and now they’re actually moving
01:41:59 to a completely like vision based system.
01:42:02 And so that was just like, it sounded completely crazy.
01:42:04 Like LIDAR is very useful in cases
01:42:07 where you have low visibility.
01:42:09 Of course it comes with its own set of complications.
01:42:11 But now to see that happen live on a Tesla
01:42:15 that basically just proves everyone wrong
01:42:16 I would say in a way.
01:42:18 And that’s just working really well.
01:42:20 I think there were also like a lot of advancements
01:42:22 in camera technology.
01:42:23 Now there were like, I know at CMU when I was there
01:42:26 there was a particular kind of camera
01:42:27 that had been developed that was really good
01:42:30 at basically low visibility settings.
01:42:32 So like lots of snow and lots of rain
01:42:34 it could actually still have a very reasonable visibility.
01:42:37 And I think there are lots of these kinds of innovations
01:42:39 that will happen on the sensor side itself
01:42:40 which is actually going to make this very easy
01:42:42 in the future.
01:42:43 And so maybe that’s actually why I’m more optimistic
01:42:46 about vision based self, like autonomous driving.
01:42:49 I was going to call it self supervised driving, but.
01:42:51 Vision based autonomous driving.
01:42:53 That’s the reason I’m quite optimistic about it
01:42:55 because I think there are going to be lots
01:42:56 of these advances on the sensor side itself.
01:42:58 So acquiring this data
01:43:00 we’re actually going to get much better about it.
01:43:02 And then of course, once we’re able to scale out
01:43:05 and get all of these edge cases in
01:43:06 as like Andrej described,
01:43:08 I think that’s going to take us very far.
01:43:11 Yeah, so it’s funny.
01:43:13 I’m very much with you on the five to 10 years
01:43:16 maybe 10 years
01:43:17 but you made it, I’m not sure how you made it sound
01:43:21 but for some people that seem
01:43:23 that might seem like really far away.
01:43:25 And then for other people, it might seem like very close.
01:43:30 There’s a lot of fundamental questions
01:43:32 about how much game theory is in this whole thing.
01:43:36 So like, how much is this simply a collision avoidance
01:43:41 problem and how much of it is you still interacting
01:43:45 with other humans in the scene
01:43:46 and you’re trying to create an experience
01:43:48 that’s compelling.
01:43:49 So you want to get from point A to point B quickly
01:43:53 you want to navigate the scene in a safe way
01:43:55 but you also want to show some level of aggression
01:43:58 because well, certainly this is why you’re screwed in India
01:44:02 because you have to show aggression.
01:44:03 Or Jersey or New Jersey.
01:44:04 Or Jersey, right.
01:44:05 So like, or New York or basically any major city
01:44:11 but I think it’s probably Elon
01:44:13 that I talked the most about this
01:44:14 which is a surprise, the level to which
01:44:17 they’re not considering human beings
01:44:20 as a huge problem in this, as a source of problems.
01:44:22 Like the driving is fundamentally a robot
01:44:29 versus the environment problem,
01:44:31 where like you can just consider humans
01:44:33 not part of the problem.
01:44:35 I used to think humans are almost certainly
01:44:38 have to be modeled really well.
01:44:41 Pedestrians and cyclists and humans inside other cars
01:44:44 you have to have like mental models for them.
01:44:46 You cannot just see it as objects
01:44:48 but more and more it’s like the
01:44:51 it’s the same kind of intuition breaking thing
01:44:53 that self supervised learning does, which is
01:44:57 well maybe through the learning
01:44:58 you’ll get all the human like human information you need.
01:45:04 Right?
01:45:04 Like maybe you’ll get it just with enough data.
01:45:07 You don’t need to have explicit good models
01:45:09 of human behavior.
01:45:10 Maybe you get it through the data.
01:45:12 So, I mean my skepticism also just knowing
01:45:14 a lot of automotive companies
01:45:16 and how difficult it is to be innovative.
01:45:18 I was skeptical that they would be able at scale
01:45:22 to convert the driving scene across the world
01:45:27 into digital form such that you can create
01:45:30 this data engine at scale.
01:45:33 And the fact that Tesla is at least getting there
01:45:36 or are already there makes me think that
01:45:41 it’s now starting to be coupled
01:45:43 to this self supervised learning vision
01:45:47 which is like if that’s gonna work
01:45:49 if through purely this process you can get really far
01:45:52 then maybe you can solve driving that way.
01:45:54 I don’t know.
01:45:55 I tend to believe we don’t give enough credit
01:46:00 to the how amazing humans are both at driving
01:46:05 and at supervising autonomous systems.
01:46:09 And also we don’t, this is, I wish we were.
01:46:13 I wish there was much more driver sensing inside Teslas
01:46:17 and much deeper consideration of human factors
01:46:21 like understanding psychology and drowsiness
01:46:24 and all those kinds of things
01:46:26 when the car does more and more of the work.
01:46:28 How to keep utilizing the little human supervision
01:46:32 that is needed to keep this whole thing safe.
01:46:35 I mean it’s a fascinating dance of human robot interaction.
01:46:38 To me autonomous driving for a long time
01:46:42 is a human robot interaction problem.
01:46:45 It is not a robotics problem or computer vision problem.
01:46:48 Like you have to have a human in the loop.
01:46:50 But so which is why I think it’s 10 years plus.
01:46:53 But I do think there’ll be a bunch of cities and contexts
01:46:56 where geo restricted it will work really, really damn well.
01:47:02 So I think for me that gets to five if I’m being optimistic
01:47:05 and it’s going to be five for a lot of cases
01:47:07 and 10 plus, yeah, I agree with you.
01:47:09 10 plus basically if we want to cover most of the,
01:47:13 say, contiguous United States or something.
01:47:15 Oh, interesting.
01:47:16 So my optimistic is five and pessimistic is 30.
01:47:20 30.
01:47:21 I have a long tail on this one.
01:47:22 I haven’t watched enough driving videos.
01:47:24 I’ve watched enough pedestrians to think like we may be,
01:47:29 like there’s a small part of me still, not a small,
01:47:31 like a pretty big part of me that thinks
01:47:34 we will have to build AGI to solve driving.
01:47:37 Oh, well.
01:47:38 Like there’s something to me,
01:47:39 like because humans are part of the picture,
01:47:41 deeply part of the picture,
01:47:44 and also human society is part of the picture
01:47:46 in that human life is at stake.
01:47:47 Anytime a robot kills a human,
01:47:50 it’s not clear to me that that’s not a problem
01:47:54 that machine learning will also have to solve.
01:47:56 Like it has to, you have to integrate that
01:47:59 into the whole thing.
01:48:00 Just like Facebook or social networks,
01:48:03 one thing is to say how to make
01:48:04 a really good recommender system.
01:48:06 And then the other thing is to integrate
01:48:08 into that recommender system,
01:48:10 all the journalists that will write articles
01:48:12 about that recommender system.
01:48:13 Like you have to consider the society
01:48:15 within which the AI system operates.
01:48:18 And in order to, and like politicians too,
01:48:21 this is the regulatory stuff for autonomous driving.
01:48:24 It’s kind of fascinating that the more successful
01:48:26 your AI system becomes,
01:48:28 the more it gets integrated in society
01:48:31 and the more precious politicians
01:48:33 and the public and the clickbait journalists
01:48:36 and all the different fascinating forces
01:48:38 of our society start acting on it.
01:48:40 And then it’s no longer how good you are
01:48:42 at doing the initial task.
01:48:43 It’s also how good you are at navigating human nature,
01:48:47 which is a fascinating space.
01:48:49 What do you think are the limits of deep learning?
01:48:52 If you allow me, we’ll zoom out a little bit
01:48:54 into the big question of artificial intelligence.
01:48:58 You said dark matter of intelligence is self supervised
01:49:02 learning, but there could be more.
01:49:04 What do you think the limits of self supervised learning
01:49:07 and just learning in general, deep learning are?
01:49:10 I think like for deep learning in particular,
01:49:12 because self supervised learning is I would say
01:49:14 a little bit more vague right now.
01:49:16 So I wouldn’t, like for something that’s so vague,
01:49:18 it’s hard to predict what its limits are going to be.
01:49:21 But like I said, I think anywhere you want to interact
01:49:25 with human self supervised learning kind of hits a boundary
01:49:27 very quickly because you need to have an interface
01:49:29 to be able to communicate with the human.
01:49:31 So really like if you have just like vacuous concepts
01:49:35 or like just like nebulous concepts discovered
01:49:37 by a network, it’s very hard to communicate those
01:49:39 with the human without like inserting some kind
01:49:41 of human knowledge or some kind of like human bias there.
01:49:45 In general, I think for deep learning,
01:49:47 the biggest challenge is just like data efficiency.
01:49:50 Even with self supervised learning,
01:49:52 even with anything else, if you just see
01:49:54 a single concept once, like one image of like,
01:49:59 I don’t know, whatever you want to call it,
01:50:01 like any concept, it’s really hard for these methods
01:50:03 to generalize by looking at just one or two samples
01:50:07 of things and that has been a real challenge.
01:50:09 I think that’s actually why like these edge cases,
01:50:11 for example, for Tesla are actually that important.
01:50:14 Because if you see just one instance of the car failing
01:50:18 and if you just annotate that and you get that
01:50:20 into your data set, you have like very limited guarantee
01:50:23 that it’s not going to happen again.
01:50:25 And you’re actually going to be able to recognize
01:50:26 this kind of instance in a very different scenario.
01:50:28 So like when it was snowing, so you got that thing labeled
01:50:31 when it was snowing, but now when it’s raining,
01:50:33 you’re actually not able to get it.
01:50:34 Or you basically have the same scenario
01:50:36 in a different part of the world.
01:50:37 So the lighting was different or so on.
01:50:39 So it’s just really hard for these models,
01:50:41 like deep learning especially to do that.
01:50:42 What’s your intuition?
01:50:43 How do we solve the handwritten digit recognition problem
01:50:47 when we only have one example for each number?
01:50:51 It feels like humans are using something like learning.
01:50:54 Right.
01:50:55 I think we are good at transferring knowledge a little bit.
01:50:59 We are just better at like for a lot of these problems
01:51:02 where we are generalizing from a single sample
01:51:04 or recognizing from a single sample,
01:51:06 we are using a lot of our own domain knowledge
01:51:08 and a lot of our like inductive bias
01:51:10 into that one sample to generalize it.
01:51:12 So I’ve never seen you write the number nine, for example.
01:51:15 And if you were to write it, I would still get it.
01:51:17 And if you were to write a different kind of alphabet
01:51:19 and like write it in two different ways,
01:51:20 I would still probably be able to figure out
01:51:22 that these are the same two characters.
01:51:24 It’s just that I have been very used
01:51:26 to seeing handwritten digits in my life.
01:51:29 The other sort of problem with any deep learning system
01:51:31 or any kind of machine learning system is like,
01:51:33 it’s guarantees, right?
01:51:34 There are no guarantees for it.
01:51:35 Now you can argue that humans also don’t have any guarantees.
01:51:38 Like there is no guarantee that I can recognize a cat
01:51:41 in every scenario.
01:51:42 I’m sure there are going to be lots of cats
01:51:43 that I don’t recognize, lots of scenarios
01:51:45 in which I don’t recognize cats in general.
01:51:48 But I think from just a sort of application perspective,
01:51:52 you do need guarantees, right?
01:51:54 We call these things algorithms.
01:51:56 Now algorithms, like traditional CS algorithms
01:51:59 have guarantees.
01:51:59 Sorting is a guarantee.
01:52:01 If you were to call sort on a particular array of numbers,
01:52:05 you are guaranteed that it’s going to be sorted.
01:52:07 Otherwise it’s a bug.
01:52:09 Now for machine learning,
01:52:10 it’s very hard to characterize this.
01:52:12 We know for a fact that a cat recognition model
01:52:15 is not going to recognize cats,
01:52:17 every cat in the world in every circumstance.
01:52:19 I think most people would agree with that statement,
01:52:22 but we are still okay with it.
01:52:23 We still don’t call this as a bug.
01:52:25 Whereas in traditional computer science
01:52:26 or traditional science,
01:52:27 like if you have this kind of failure case existing,
01:52:29 then you think of it as like something is wrong.
01:52:33 I think there is this sort of notion
01:52:34 of nebulous correctness for machine learning.
01:52:37 And that’s something we just need to be very comfortable
01:52:38 with.
01:52:39 And for deep learning,
01:52:40 or like for a lot of these machine learning algorithms,
01:52:42 it’s not clear how do we characterize
01:52:44 this notion of correctness.
01:52:46 I think it’s a limitation in our understanding,
01:52:48 or at least a limitation in our phrasing of this.
01:52:51 And if we were to come up with better ways
01:52:53 to understand this limitation,
01:52:55 then it would actually help us a lot.
01:52:57 Do you think there’s a distinction
01:52:58 between the concept of learning
01:53:01 and the concept of reasoning?
01:53:04 Do you think it’s possible for neural networks to reason?
01:53:10 So I think of it slightly differently.
01:53:11 So for me, learning is whenever
01:53:14 I can like make a snap judgment.
01:53:16 So if you show me a picture of a dog,
01:53:17 I can immediately say it’s a dog.
01:53:18 But if you give me like a puzzle,
01:53:20 like whatever, a Rube Goldberg machine
01:53:23 of like things going to happen,
01:53:24 then I have to reason because I’ve never,
01:53:26 it’s a very complicated setup.
01:53:27 I’ve never seen that particular setup.
01:53:29 And I really need to draw and like imagine in my head
01:53:32 what’s going to happen to figure it out.
01:53:34 So I think, yes, neural networks are really good
01:53:36 at recognition, but they’re not very good at reasoning.
01:53:41 Because they have seen something before
01:53:44 or seen something similar before, they’re very good
01:53:46 at making those sort of snap judgments.
01:53:48 But if you were to give them a very complicated thing
01:53:50 that they’ve not seen before,
01:53:52 they have very limited ability right now
01:53:55 to compose different things.
01:53:56 Like, oh, I’ve seen this particular part before.
01:53:58 I’ve seen this particular part before.
01:54:00 And now probably like this is how
01:54:01 they’re going to work in tandem.
01:54:02 It’s very hard for them to come up
01:54:04 with these kinds of things.
01:54:05 Well, there’s a certain aspect to reasoning
01:54:08 that you can maybe convert into the process of programming.
01:54:11 And so there’s the whole field of program synthesis
01:54:14 and people have been applying machine learning
01:54:17 to the problem of program synthesis.
01:54:18 And the question is, can they, the step of composition,
01:54:22 why can’t that be learned?
01:54:25 You know, this step of like building things on top of you,
01:54:29 like little intuitions, concepts on top of each other,
01:54:33 can that be learnable?
01:54:35 What’s your intuition there?
01:54:36 Or like, I guess similar set of techniques,
01:54:39 do you think that will be applicable?
01:54:42 So I think it is, of course, it is learnable
01:54:44 because like we are prime examples of machines
01:54:47 that have like, or individuals that have learned this, right?
01:54:49 Like humans have learned this.
01:54:51 So it is, of course, it is a technique
01:54:52 that is very easy to learn.
01:54:55 I think where we are kind of hitting a wall
01:54:58 basically with like current machine learning
01:55:00 is the fact that when the network learns
01:55:03 all of this information,
01:55:04 we basically are not able to figure out
01:55:07 how well it’s going to generalize to an unseen thing.
01:55:10 And we have no, like a priori, no way of characterizing that.
01:55:15 And I think that’s basically telling us a lot about,
01:55:18 like a lot about the fact that we really don’t know
01:55:20 what this model has learned and how well it’s basically,
01:55:22 because we don’t know how well it’s going to transfer.
01:55:25 There’s also a sense in which it feels like
01:55:28 we humans may not be aware of how much like background,
01:55:34 how good our background model is,
01:55:36 how much knowledge we just have slowly building
01:55:39 on top of each other.
01:55:41 It feels like neural networks
01:55:42 are constantly throwing stuff out.
01:55:43 Like you’ll do some incredible thing
01:55:45 where you’re learning a particular task in computer vision,
01:55:49 you celebrate your state of the art successes
01:55:51 and you throw that out.
01:55:52 Like, it feels like it’s,
01:55:54 you’re never using stuff you’ve learned
01:55:56 for your future successes in other domains.
01:56:00 And humans are obviously doing that exceptionally well,
01:56:03 still throwing stuff away in their mind,
01:56:05 but keeping certain kernels of truth.
01:56:07 Right, so I think we’re like,
01:56:09 continual learning is sort of the paradigm
01:56:11 for this in machine learning.
01:56:11 And I don’t think it’s a very well explored paradigm.
01:56:15 We have like things in deep learning, for example,
01:56:17 catastrophic forgetting is like one of the standard things.
01:56:20 The thing basically being that if you teach a network
01:56:23 like to recognize dogs,
01:56:24 and now you teach that same network to recognize cats,
01:56:27 it basically forgets how to recognize dogs.
01:56:29 So it forgets very quickly.
01:56:30 I mean, and whereas a human,
01:56:32 if you were to teach someone to recognize dogs
01:56:34 and then to recognize cats,
01:56:35 they don’t forget immediately how to recognize these dogs.
01:56:38 I think that’s basically sort of what you’re trying to get.
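A toy sketch of catastrophic forgetting on synthetic data, purely illustrative: train a small network on task A, then on a conflicting task B, and accuracy on task A typically collapses.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
net = nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, 2))
opt = torch.optim.SGD(net.parameters(), lr=0.1)
loss_fn = nn.CrossEntropyLoss()

def make_task(shift, flip):
    # Synthetic binary task: label depends on whether feature 0 exceeds `shift`.
    x = torch.randn(512, 20) + shift
    y = (x[:, 0] > shift).long()
    return x, (1 - y) if flip else y

def train(x, y, steps=200):
    for _ in range(steps):
        opt.zero_grad()
        loss_fn(net(x), y).backward()
        opt.step()

def accuracy(x, y):
    return (net(x).argmax(dim=1) == y).float().mean().item()

xa, ya = make_task(0.0, flip=False)   # "recognize dogs"
xb, yb = make_task(3.0, flip=True)    # "recognize cats", a conflicting rule
train(xa, ya)
print("task A accuracy after training on A:", accuracy(xa, ya))
train(xb, yb)
print("task A accuracy after training on B:", accuracy(xa, ya))  # usually drops sharply
```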
01:56:40 Yeah, I just, I wonder if like
01:56:42 the long term memory mechanisms
01:56:44 or the mechanisms that store not just memories,
01:56:47 but concepts that allow you to then reason
01:56:54 and compose concepts,
01:56:57 if those things will look very different
01:56:59 than neural networks,
01:56:59 or if you can do that within a single neural network
01:57:02 with some particular sort of architecture quirks,
01:57:06 that seems to be a really open problem.
01:57:07 And of course I go up and down on that
01:57:09 because there’s something so compelling to the symbolic AI
01:57:14 or to the ideas of logic based sort of expert systems.
01:57:20 You have like human interpretable facts
01:57:22 that built on top of each other.
01:57:24 It’s really annoying like with self supervised learning
01:57:27 that the AI is not very explainable.
01:57:31 Like you can’t like understand
01:57:33 all the beautiful things it has learned.
01:57:35 You can’t ask it like questions,
01:57:38 but then again, maybe that’s a stupid thing
01:57:40 for us humans to want.
01:57:42 Right, I think whenever we try to like understand it,
01:57:45 we are putting our own subjective human bias into it.
01:57:47 Yeah.
01:57:48 And I think that’s the sort of problem
01:57:50 with self supervised learning,
01:57:51 the goal is that it should learn naturally from the data.
01:57:54 So now if you try to understand it,
01:57:55 you are using your own preconceived notions
01:57:58 of what this model has learned.
01:58:00 And that’s the problem.
01:58:03 High level question.
01:58:04 What do you think it takes to build a system
01:58:07 with superhuman, maybe let’s say human level
01:58:10 or superhuman level general intelligence?
01:58:13 We’ve already kind of started talking about this,
01:58:15 but what’s your intuition?
01:58:17 Like, does this thing have to have a body?
01:58:20 Does it have to interact richly with the world?
01:58:25 Does it have to have some more human elements
01:58:27 like self awareness?
01:58:30 I think emotion.
01:58:32 I think emotion is something which is like,
01:58:35 it’s not really attributed typically
01:58:37 in standard machine learning.
01:58:38 It’s not something we think about,
01:58:39 like there is NLP, there is vision,
01:58:41 there is no like emotion.
01:58:42 Emotion is never a part of all of this.
01:58:44 And that just seems a little bit weird to me.
01:58:47 I think the reason basically being that there is surprise
01:58:50 and like, basically emotion is like one of the reasons
01:58:53 emotions arise is like what happens
01:58:55 and what do you expect to happen, right?
01:58:57 There is like a mismatch between these things.
01:58:59 And so that gives rise to like,
01:59:01 I can either be surprised or I can be saddened
01:59:03 or I can be happy and all of this.
01:59:05 And so this basically indicates
01:59:07 that I already have a predictive model in my head
01:59:10 and something that I predicted or something
01:59:11 that I thought was likely to happen.
01:59:13 And then there was something that I observed
01:59:15 that happened that there was a disconnect
01:59:16 between these two things.
01:59:18 And that basically is like maybe one of the reasons
01:59:21 like you have a lot of emotions.
01:59:24 Yeah, I think, so I talk to people a lot about them
01:59:26 like Lisa Feldman Barrett.
01:59:29 I think that’s an interesting concept of emotion
01:59:31 but I have a sense that emotion primarily
01:59:36 in the way we think about it,
01:59:38 which is the display of emotion
01:59:40 is a communication mechanism between humans.
01:59:43 So it’s a part of basically human to human interaction,
01:59:48 an important part, but just the part.
01:59:50 So it’s like, I would throw it into the full mix
01:59:55 of communication.
01:59:58 And to me, communication can be done with objects
02:00:01 that don’t look at all like humans.
02:00:04 Okay.
02:00:05 I’ve seen our ability to anthropomorphize,
02:00:07 our ability to connect with things that look like a Roomba,
02:00:10 our ability to connect.
02:00:12 First of all, let’s talk about other biological systems
02:00:14 like dogs, our ability to love things
02:00:17 that are very different than humans.
02:00:19 But they do display emotion, right?
02:00:20 I mean, dogs do display emotion.
02:00:23 So they don’t have to be anthropomorphic
02:00:25 for them to like display the kind of emotions
02:00:27 that we don’t.
02:00:28 Exactly.
02:00:29 So, I mean, but then the word emotion starts to lose.
02:00:33 So then we have to be, I guess specific, but yeah.
02:00:36 So have rich flavorful communication.
02:00:39 Communication, yeah.
02:00:40 Yeah, so like, yes, it’s full of emotion.
02:00:43 It’s full of wit and humor and moods
02:00:49 and all those kinds of things, yeah.
02:00:50 So you’re talking about like flavor.
02:00:53 Flavor, yeah.
02:00:54 Okay, let’s call it that.
02:00:55 So there’s content and then there is flavor
02:00:57 and I’m talking about the flavor.
02:00:58 Do you think it needs to have a body?
02:01:00 Do you think like to interact with the physical world?
02:01:02 Do you think you can understand the physical world
02:01:04 without being able to directly interact with it?
02:01:07 I don’t think so, yeah.
02:01:08 I think at some point we will need to bite the bullet
02:01:10 and actually interact with the physical,
02:01:12 as much as I like working on like passive computer vision
02:01:15 where I just like sit in my arm chair
02:01:17 and look at videos and learn.
02:01:19 I do think that we will need to have some kind of embodiment
02:01:22 or some kind of interaction
02:01:24 to figure out things about the world.
02:01:26 What about consciousness?
02:01:28 Do you think, how often do you think about consciousness
02:01:32 when you think about your work?
02:01:34 You could think of it
02:01:35 as the more simple thing of self awareness,
02:01:38 of being aware that you are a perceiving,
02:01:43 sensing, acting thing in this world.
02:01:46 Or you can think about the bigger version of that,
02:01:50 which is consciousness,
02:01:51 which is having it feel like something to be that entity,
02:01:57 the subjective experience of being in this world.
02:01:59 So I think of self awareness a little bit more
02:02:01 than like the broader goal of it,
02:02:03 because I think self awareness is pretty critical
02:02:06 for like any kind of like any kind of AGI
02:02:09 or whatever you want to call it that we build,
02:02:10 because it needs to contextualize what it is
02:02:13 and what role it’s playing
02:02:15 with respect to all the other things that exist around it.
02:02:17 I think that requires self awareness.
02:02:19 It needs to understand that it’s an autonomous car, right?
02:02:23 And what does that mean?
02:02:24 What are its limitations?
02:02:26 What are the things that it is supposed to do and so on?
02:02:29 What is its role in some way?
02:02:30 Or, I mean, these are the kinds of things
02:02:34 that we kind of expect from it, I would say.
02:02:36 And so that’s the level of self awareness
02:02:39 that’s, I would say, basically required at least,
02:02:42 if not more than that.
02:02:44 Yeah, I tend to, on the emotion side,
02:02:46 believe that it has to have,
02:02:48 it has to be able to display consciousness.
02:02:52 Display consciousness, what do you mean by that?
02:02:54 Meaning like for us humans to connect with each other
02:02:57 or to connect with other living entities,
02:03:01 I think we need to feel,
02:03:04 like in order for us to truly feel
02:03:06 like that there’s another being there,
02:03:09 we have to believe that they’re conscious.
02:03:11 And so we won’t ever connect with something
02:03:14 that doesn’t have elements of consciousness.
02:03:17 Now I tend to think that that’s easier to achieve
02:03:21 than it may sound,
02:03:23 because we anthropomorphize stuff so hard.
02:03:25 Like you have a mug that just like has wheels
02:03:28 and like rotates every once in a while and makes a sound.
02:03:31 I think a couple of days in,
02:03:34 especially if you don’t hang out with humans,
02:03:39 you might start to believe that mug on wheels is conscious.
02:03:42 So I think we anthropomorphize pretty effectively
02:03:44 as human beings.
02:03:46 But I do think that it’s in the same bucket
02:03:49 that we’ll call emotion,
02:03:50 that show that you’re,
02:03:54 I think of consciousness as the capacity to suffer.
02:03:58 And if you’re an entity that’s able to feel things
02:04:02 in the world and to communicate that to others,
02:04:06 I think that’s a really powerful way
02:04:08 to interact with humans.
02:04:10 And in order to create an AGI system,
02:04:13 I believe you should be able to richly interact with humans.
02:04:18 Like humans would need to want to interact with you.
02:04:21 Like it can’t be like,
02:04:22 it’s the self supervised learning versus like,
02:04:27 like the robot shouldn’t have to pay you
02:04:29 to interact with it.
02:04:30 So like it should be a natural fun thing.
02:04:33 And then you’re going to scale up significantly
02:04:36 how much interaction it gets.
02:04:39 It’s the Alexa prize,
02:04:40 which they were trying to get me to be a judge
02:04:43 on their contest.
02:04:44 Let’s see if I want to do that.
02:04:46 But their challenge is to talk to you,
02:04:50 make the human sufficiently interested
02:04:53 that the human keeps talking for 20 minutes.
02:04:56 To Alexa?
02:04:57 To Alexa, yeah.
02:04:58 And right now they’re not even close to that
02:05:00 because it just gets so boring when you’re like,
02:05:02 when the intelligence is not there,
02:05:04 it gets very not interesting to talk to it.
02:05:06 And so the robot needs to be interesting.
02:05:08 And one of the ways it can be interesting
02:05:10 is display the capacity to love, to suffer.
02:05:14 And I would say that essentially means
02:05:17 the capacity to display consciousness.
02:05:20 Like it is an entity, much like a human being.
02:05:25 Of course, what that really means,
02:05:27 I don’t know if that’s fundamentally a robotics problem
02:05:30 or some kind of problem that we’re not yet even aware.
02:05:33 Like if it is truly a hard problem of consciousness,
02:05:36 I tend to maybe optimistically think it’s a,
02:05:38 we can pretty effectively fake it till we make it.
02:05:42 So we can display a lot of human like elements for a while.
02:05:46 And that will be sufficient to form
02:05:49 really close connections with humans.
02:05:52 What’s used the most beautiful idea
02:05:53 in self supervised learning?
02:05:55 Like when you sit back with, I don’t know,
02:05:59 with a glass of wine and an armchair
02:06:03 and just at a fireplace,
02:06:06 just thinking how beautiful this world that you get
02:06:08 to explore is, what do you think
02:06:10 is the especially beautiful idea?
02:06:13 The fact that like object level,
02:06:16 what objects are and some notion of objectness emerges
02:06:19 from these models by just like self supervised learning.
02:06:23 So for example, like one of the things, like the DINO paper
02:06:28 that I was a part of at Facebook, is that the object
02:06:33 boundaries sort of emerge from these representations.
02:06:35 So if you have like a dog running in the field,
02:06:38 the boundaries around the dog,
02:06:39 the network is basically able to figure out
02:06:42 what the boundaries of this dog are automatically.
02:06:45 And it was never trained to do that.
02:06:47 It was never trained to, no one taught it
02:06:50 that this is a dog and these pixels belong to a dog.
02:06:52 It’s able to group these things together automatically.
02:06:55 So that’s one.
02:06:56 I think in general, that entire notion that this dumb idea
02:07:00 that you take like these two crops of an image
02:07:01 and then you say that the features should be similar,
02:07:04 that has resulted in something like this,
02:07:06 like the model is able to figure out
02:07:07 what the dog pixels are and so on.
02:07:10 That just seems like so surprising.
02:07:13 And I mean, I don’t think a lot of us even understand
02:07:16 how that is happening really.
02:07:18 And it’s something we are taking for granted,
02:07:20 maybe like a lot in terms of how we’re setting up
02:07:23 these algorithms, but it’s just,
02:07:24 it’s a very beautiful and powerful idea.
02:07:26 So it’s really fundamentally telling us something about
02:07:30 that there is so much signal in the pixels
02:07:32 that we can be super dumb
02:07:34 about how we are setting up
02:07:35 the self supervised learning problem.
02:07:37 And despite being like super dumb about it,
02:07:39 we’ll actually get very good,
02:07:41 like we’ll actually get something that is able to do
02:07:44 very like surprising things.
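A minimal sketch of that "two crops of an image, features should be similar" idea, assuming a hypothetical encoder and a stand-in image; the real DINO method wraps this in a teacher-student setup with extra machinery to keep the features from collapsing to a trivial solution.

```python
import torch
import torch.nn.functional as F
from PIL import Image
from torchvision import transforms

# Two random crops (plus simple augmentation) of the same image.
augment = transforms.Compose([
    transforms.RandomResizedCrop(224, scale=(0.4, 1.0)),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
])

def two_crop_similarity_loss(encoder, pil_image):
    """Encode two crops of the same image and pull their features together.
    `encoder` is any backbone mapping a (1, 3, 224, 224) tensor to features;
    on its own this objective can collapse, which real methods guard against."""
    crop_a = augment(pil_image).unsqueeze(0)
    crop_b = augment(pil_image).unsqueeze(0)
    feat_a = F.normalize(encoder(crop_a), dim=1)
    feat_b = F.normalize(encoder(crop_b), dim=1)
    # Maximize cosine similarity between the two views.
    return 1 - (feat_a * feat_b).sum(dim=1).mean()

# Toy usage with a stand-in linear backbone and a blank stand-in image.
encoder = torch.nn.Sequential(torch.nn.Flatten(),
                              torch.nn.Linear(3 * 224 * 224, 128))
image = Image.new("RGB", (256, 256))
print(two_crop_similarity_loss(encoder, image).item())
```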
02:07:45 I wonder if there’s other like objectness
02:07:48 of other concepts that can emerge.
02:07:51 I don’t know if you follow Francois Chollet,
02:07:53 he had the competition for intelligence
02:07:56 that’s basically kind of like an IQ test,
02:07:59 but for machines, and for an IQ test,
02:08:02 you have to have a few concepts that you want to apply.
02:08:05 One of them is objectness.
02:08:07 I wonder if those concepts can emerge
02:08:11 through self supervised learning on billions of images.
02:08:14 I think something like object permanence
02:08:16 can definitely emerge, right?
02:08:17 So that’s like a fundamental concept which we have,
02:08:20 maybe not through images, through video,
02:08:21 but that’s another concept that should be emerging from it
02:08:25 because it’s not something that,
02:08:26 like even if we don’t teach humans
02:08:29 about this concept of object permanence,
02:08:31 it actually emerges.
02:08:32 And the same thing for like animals, like dogs,
02:08:34 I think object permanence automatically
02:08:36 is something that they are born with.
02:08:38 So I think it should emerge from the data.
02:08:40 It should emerge basically very quickly.
02:08:42 I wonder if ideas like symmetry, rotation,
02:08:45 these kinds of things might emerge.
02:08:47 So I think rotation, probably yes.
02:08:50 Yeah, rotation, yes.
02:08:51 I mean, there’s some constraints in the architecture itself,
02:08:55 but it’s interesting if all of them could be,
02:08:59 like counting was another one, being able to kind of
02:09:04 understand that there’s multiple objects
02:09:06 of the same kind in the image and be able to count them.
02:09:10 I wonder if all of that could be,
02:09:11 if constructed correctly, they can emerge
02:09:14 because then you can transfer those concepts
02:09:16 to then interpret images at a deeper level.
02:09:20 Right.
02:09:21 Counting, I do believe, I mean, it should be possible.
02:09:24 We don’t know yet,
02:09:25 but I do think it’s not that far in the realm of possibility.
02:09:29 Yeah, that’d be interesting
02:09:30 if using self supervised learning on images
02:09:33 can then be applied to then solving those kinds of IQ tests,
02:09:36 which seem currently to be kind of impossible.
02:09:40 What idea do you believe might be true
02:09:43 that most people think is not true
02:09:46 or don’t agree with you on?
02:09:48 Is there something like that?
02:09:50 So this is going to be a little controversial,
02:09:52 but okay, sure.
02:09:53 I don’t believe in simulation.
02:09:55 Like actually using simulation to do things very much.
02:09:58 Just to clarify, because this is a podcast
02:10:01 where you talk about, are we living in a simulation often?
02:10:03 You’re referring to using simulation to construct worlds
02:10:08 that you then leverage for machine learning.
02:10:10 Right, yeah.
02:10:11 For example, like one example would be like
02:10:13 to train an autonomous car driving system.
02:10:15 You basically first build a simulator,
02:10:17 which builds like the environment of the world.
02:10:19 And then you basically have a lot of like,
02:10:22 you train your machine learning system in that.
02:10:25 So I believe it is possible,
02:10:27 but I think it’s a really expensive way of doing things.
02:10:30 And at the end of it, you do need the real world.
02:10:33 So I’m not sure.
02:10:35 So maybe for certain settings,
02:10:36 like maybe the payout is so large,
02:10:38 like for autonomous driving, the payout is so large
02:10:40 that you can actually invest that much money to build it.
02:10:43 But I think as a general sort of principle,
02:10:45 it does not apply to a lot of concepts.
02:10:47 You can’t really build simulations of everything.
02:10:49 Not only because like one, it’s expensive,
02:10:51 but second, it’s also not possible for a lot of things.
02:10:54 So in general, like there’s a lot of work
02:10:59 on like using synthetic data and like synthetic simulators.
02:11:02 I generally am not very, like I don’t believe in that.
02:11:05 So you’re saying it’s very challenging visually,
02:11:09 like to correctly like simulate the visual,
02:11:11 like the lighting, all those kinds of things.
02:11:13 I mean, all these companies that you have, right?
02:11:15 So like Pixar and like whatever,
02:11:17 all these companies are,
02:11:19 all this like computer graphics stuff
02:11:21 is really about accurately,
02:11:22 a lot of them is about like accurately trying to figure out
02:11:26 how the lighting is and like how things reflect off
02:11:28 of one another and so on,
02:11:30 and like how sparkly things look and so on.
02:11:32 So it’s a very hard problem.
02:11:34 So do we really need to solve that first
02:11:37 to be able to like do computer vision?
02:11:39 Probably not.
02:11:40 And for me, in the context of autonomous driving,
02:11:44 it’s very tempting to be able to use simulation, right?
02:11:48 Because it’s a safety critical application,
02:11:50 but the other limitation of simulation that perhaps
02:11:54 is a bigger one than the visual limitation
02:11:58 is the behavior of objects.
02:12:00 So you’re ultimately interested in edge cases.
02:12:03 And the question is,
02:12:05 how well can you generate edge cases in simulation,
02:12:08 especially with human behavior?
02:12:11 I think another problem is like for autonomous driving,
02:12:13 it’s a constantly changing world.
02:12:15 So say autonomous driving like in 10 years from now,
02:12:18 like there are lots of autonomous cars,
02:12:20 but they’re still going to be humans.
02:12:22 So now there are 50% of the agents say, which are humans,
02:12:25 50% of the agents that are autonomous,
02:12:26 like car driving agents.
02:12:28 So now the mixture has changed.
02:12:30 So now the kinds of behaviors that you actually expect
02:12:32 from the other agents or other cars on the road
02:12:35 are actually going to be very different.
02:12:36 And as the proportion of the number of autonomous cars
02:12:39 to humans keeps changing,
02:12:40 this behavior will actually change a lot.
02:12:42 So now if you were to build a simulator based on
02:12:44 just like right now to build them today,
02:12:46 you don’t have that many autonomous cars on the road.
02:12:48 So you would try to like make all of the other agents
02:12:50 in that simulator behave as humans,
02:12:52 but that’s not really going to hold true 10, 15, 20,
02:12:55 30 years from now.
02:12:57 Do you think we’re living in a simulation?
02:12:59 No.
02:13:01 How hard is it?
02:13:02 This is why I think it’s an interesting question.
02:13:04 How hard is it to build a video game,
02:13:07 like virtual reality game where it is so real,
02:13:12 forget like ultra realistic to where
02:13:15 you can’t tell the difference,
02:13:17 but like it’s so nice that you just want to stay there.
02:13:20 You just want to stay there and you don’t want to come back.
02:13:24 Do you think that’s doable within our lifetime?
02:13:29 Within our lifetime, probably.
02:13:31 Yeah.
02:13:32 I eat healthy, I live long.
02:13:33 Does that make you sad that there’ll be
02:13:39 like a population of kids that basically spend 95%,
02:13:44 99% of their time in a virtual world?
02:13:50 Very, very hard question to answer.
02:13:53 For certain people, it might be something
02:13:55 that they really derive a lot of value out of,
02:13:58 derive a lot of enjoyment and like happiness out of,
02:14:00 and maybe the real world wasn’t giving them that.
02:14:03 That’s why they did that.
02:14:03 So maybe it is good for certain people.
02:14:05 So ultimately, if it maximizes happiness,
02:14:09 Right, I think if.
02:14:10 Who are we to judge?
02:14:11 Yeah, I think if it’s making people happy,
02:14:12 maybe it’s okay.
02:14:14 Again, I think this is a very hard question.
02:14:18 So like you’ve been a part of a lot of amazing papers.
02:14:23 What advice would you give to somebody
02:14:25 on what it takes to write a good paper?
02:14:29 Grad students writing papers now,
02:14:31 is there common things that you’ve learned along the way
02:14:34 that you think it takes,
02:14:35 both for a good idea and a good paper?
02:14:39 Right, so I think both of these have picked up
02:14:44 from like lots of people I’ve worked with in the past.
02:14:46 So one of them is picking the right problem
02:14:48 to work on in research is as important
02:14:51 as like finding the solution to it.
02:14:53 So I mean, there are multiple reasons for this.
02:14:56 So one is that there are certain problems
02:14:59 that can actually be solved in a particular timeframe.
02:15:02 So now say you want to work on finding the meaning of life.
02:15:06 This is a great problem.
02:15:07 I think most people will agree with that.
02:15:09 But do you believe that your talents
02:15:12 and like the energy that you’ll spend on it
02:15:13 will make some kind of meaningful progress
02:15:17 in your lifetime?
02:15:18 If you are optimistic about it, then go ahead.
02:15:21 That’s why I started this podcast.
02:15:22 I keep asking people about the meaning of life.
02:15:24 I’m hoping by episode 220, I’ll figure it out.
02:15:27 Oh, not too many episodes to go.
02:15:30 All right, cool.
02:15:31 Maybe today, I don’t know, but you’re right.
02:15:33 So that seems intractable at the moment.
02:15:36 Right, so I think it’s just the fact of like,
02:15:39 if you’re starting a PhD, for example,
02:15:41 what is one problem that you want to focus on
02:15:43 that you do think is interesting enough,
02:15:45 and you will be able to make a reasonable amount
02:15:47 of headway into it in the time that you’ll be doing a PhD for?
02:15:50 So in that kind of a timeframe.
02:15:53 So that’s one.
02:15:53 Of course, there’s the second part,
02:15:54 which is what excites you genuinely.
02:15:56 So you shouldn’t just pick problems
02:15:57 that you are not excited about,
02:15:59 because as a grad student or as a researcher,
02:16:01 you really need to be passionate about it
02:16:03 to continue doing that,
02:16:04 because there are so many other things
02:16:05 that you could be doing in life.
02:16:07 So you really need to believe in that
02:16:08 to be able to do that for that long.
02:16:10 In terms of papers, I think the one thing
02:16:12 that I’ve learned is,
02:16:15 like in the past, whenever I used to write things,
02:16:17 and even now, whenever I do that,
02:16:18 I try to cram in a lot of things into the paper,
02:16:21 whereas what really matters
02:16:22 is just pushing one simple idea, that’s it.
02:16:25 That’s all because the paper is going to be like,
02:16:29 whatever, eight or nine pages.
02:16:32 If you keep cramming in lots of ideas,
02:16:34 it’s really hard for the single thing
02:16:36 that you believe in to stand out.
02:16:38 So if you really try to just focus,
02:16:40 especially in terms of writing,
02:16:41 really try to focus on one particular idea
02:16:43 and articulate it out in multiple different ways,
02:16:46 it’s far more valuable to the reader as well,
02:16:49 of course,
02:16:51 because they know
02:16:53 that this particular idea
02:16:54 is associated with this paper,
02:16:56 and also for you, because
02:16:59 when you write about a particular idea in different ways,
02:17:01 you think about it more deeply.
02:17:02 So as a grad student, I used to always wait till
02:17:06 maybe the last week or whatever, to write the paper,
02:17:08 because I used to always believe
02:17:10 that doing the experiments
02:17:11 was actually the bigger part of research than writing.
02:17:13 And my advisor always told me
02:17:15 that you should start writing very early on,
02:17:16 and I thought, oh, it doesn’t matter,
02:17:17 I don’t know what he’s talking about.
02:17:19 But I think more and more I realized that’s the case.
02:17:22 Whenever I write something that I’m doing,
02:17:24 I actually think much better about it.
02:17:26 And so if you start writing early on,
02:17:28 you actually, I think, get better ideas,
02:17:31 or at least you figure out holes in your theory,
02:17:33 or particular experiments that you should run
02:17:36 to plug those holes, and so on.
02:17:38 Yeah, I’m continually surprised
02:17:40 how many really good papers throughout history
02:17:43 are quite short and quite simple.
02:17:48 And there’s a lesson to that.
02:17:50 If you want to dream about writing a paper
02:17:52 that changes the world,
02:17:54 and you wanna go by example, they’re usually simple.
02:17:58 And it’s not cramming,
02:18:01 it’s focusing on one idea, and thinking deeply.
02:18:07 And you’re right that the writing process itself
02:18:10 reveals the idea.
02:18:12 It challenges you to really think about what is the idea
02:18:15 that explains it, the thread that ties it all together.
02:18:19 And so a lot of famous researchers I know
02:18:21 actually would start off, like, first they were,
02:18:24 even before the experiments were in,
02:18:27 a lot of them would actually start
02:18:28 with writing the introduction of the paper,
02:18:30 with zero experiments in.
02:18:32 Because that at least helps them figure out
02:18:33 what they’re trying to solve,
02:18:35 and how it fits in the context of things right now.
02:18:38 And that would really guide their entire research.
02:18:40 So a lot of them would actually first write intros
02:18:42 with zero experiments in,
02:18:43 and that’s how they would start projects.
02:18:46 Some basic questions about people maybe
02:18:49 that are more like beginners in this field.
02:18:51 What’s the best programming language to learn
02:18:54 if you’re interested in machine learning?
02:18:56 I would say Python,
02:18:57 just because it’s the easiest one to learn.
02:19:00 And also a lot of like programming
02:19:03 and machine learning happens in Python.
02:19:05 So if you don’t know any other programming language,
02:19:07 Python is actually going to get you a long way.
02:19:09 Yeah, it seems like sort of a,
02:19:11 it’s a toss up question because it seems like Python
02:19:14 is so much dominating the space now.
02:19:16 But I wonder if there’s an interesting alternative.
02:19:18 Obviously there’s like Swift,
02:19:19 and there’s a lot of interesting alternatives popping up,
02:19:22 even JavaScript.
02:19:23 Or R, more for the data science applications.
02:19:28 But it seems like Python more and more
02:19:31 is actually being used to teach like introduction
02:19:34 to programming at universities.
02:19:35 So it just combines everything very nicely.
02:19:39 Even harder question.
02:19:41 What are the pros and cons of PyTorch versus TensorFlow?
02:19:46 I see.
02:19:48 Okay.
02:19:49 You can go with no comment.
02:19:51 So a disclaimer to this is that the last time
02:19:53 I used TensorFlow was probably like four years ago.
02:19:56 And so it was right when it had come out
02:19:58 because I started on deep learning in 2014 or so,
02:20:02 and the dominant sort of framework for us then
02:20:06 for vision was Caffe, which was out of Berkeley.
02:20:09 And we used Caffe a lot, it was really nice.
02:20:12 And then TensorFlow came in,
02:20:13 which was basically like Python first.
02:20:15 So Caffe was mainly C++,
02:20:17 and it had like very loose kind of Python binding.
02:20:19 So Python wasn’t really the first language you would use.
02:20:21 You would really use either MATLAB or C++
02:20:24 to get stuff done in Caffe.
02:20:28 And then Python of course became popular a little bit later.
02:20:30 So TensorFlow was basically around that time.
02:20:32 So 2015, 2016 is when I last used it.
02:20:36 It’s been a while.
02:20:37 And then what, did you use Torch or did you?
02:20:40 So then I moved to LuaTorch, which was Torch in Lua.
02:20:44 And then in 2017, I think, I basically moved pretty much
02:20:46 to PyTorch completely.
02:20:48 Oh, interesting.
02:20:49 So you went to Lua, cool.
02:20:50 Yeah.
02:20:51 Huh, so you were there before it was cool.
02:20:54 Yeah, I mean, so LuaTorch was really good
02:20:56 because it actually allowed you
02:20:59 to do a lot of different kinds of things.
02:21:01 Whereas Caffe was very rigid in terms of its structure.
02:21:03 Like you would create a neural network once and that’s it.
02:21:06 Whereas if you wanted like very dynamic graphs and so on,
02:21:09 it was very hard to do that.
02:21:10 And LuaTorch was much more friendly
02:21:11 for all of these things.
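(As a minimal sketch of the dynamic-graph flexibility he’s describing, here is a toy example in modern PyTorch rather than LuaTorch; the model, dimensions, and the rule for picking the depth are all made up purely for illustration.)

```python
import torch
import torch.nn as nn

class DynamicDepthNet(nn.Module):
    """Toy model whose depth depends on the input at run time,
    the kind of thing a static, define-once framework makes hard."""
    def __init__(self, dim=16):
        super().__init__()
        self.layer = nn.Linear(dim, dim)
        self.head = nn.Linear(dim, 1)

    def forward(self, x):
        # Ordinary Python control flow decides how many times to apply
        # the layer; the computation graph is rebuilt on every call.
        steps = int(x.abs().mean().item() * 10) % 3 + 1
        for _ in range(steps):
            x = torch.relu(self.layer(x))
        return self.head(x)

model = DynamicDepthNet()
out = model(torch.randn(4, 16))
print(out.shape)  # torch.Size([4, 1])
```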
02:21:13 Okay, so in terms of PyTorch and TensorFlow,
02:21:15 my personal bias is PyTorch
02:21:17 just because I’ve been using it longer
02:21:19 and I’m more familiar with it.
02:21:20 And also that PyTorch is much easier to debug
02:21:23 is what I find because it’s imperative in nature
02:21:26 compared to like TensorFlow, which is not imperative.
02:21:28 But that’s telling you a lot that basically
02:21:30 the imperative design is sort of a way
02:21:33 in which a lot of people are taught programming
02:21:35 and that’s what actually makes debugging easier for them.
02:21:38 So like I learned programming in C, C++.
02:21:40 And so for me, imperative way of programming is more natural.
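(A rough illustration of what imperative, eager execution buys you for debugging, assuming PyTorch; the tiny network below is hypothetical, not anything discussed above.)

```python
import torch
import torch.nn as nn

net = nn.Sequential(nn.Linear(8, 8), nn.ReLU(), nn.Linear(8, 2))
x = torch.randn(3, 8)

# Eager / imperative execution: every line runs immediately, so you can
# inspect intermediate values with plain Python tools as you go.
h = net[0](x)
print(h.mean().item(), h.std().item())  # just print an intermediate tensor
# import pdb; pdb.set_trace()           # or drop into a debugger right here
h = net[1](h)
logits = net[2](h)

loss = logits.sum()
loss.backward()
print(net[0].weight.grad.norm())        # gradients are ordinary tensors too
```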
02:21:44 Do you think it’s good to have
02:21:45 kind of these two communities, this kind of competition?
02:21:48 I think PyTorch is kind of more and more
02:21:50 becoming dominant in the research community,
02:21:52 but TensorFlow is still very popular
02:21:54 in the more sort of application machine learning community.
02:21:57 So do you think it’s good to have
02:21:59 that kind of split in code bases?
02:22:02 Or so like the benefit there is the competition challenges
02:22:06 the library developers to step up their game.
02:22:09 But the downside is there’s these code bases
02:22:12 that are in different libraries.
02:22:15 Right, so I think the downside is that,
02:22:17 I mean, a lot of research code
02:22:18 is released in one framework,
02:22:19 and if you’re using the other one,
02:22:20 it’s really hard to really build on top of it.
02:22:23 But thankfully the open source community
02:22:25 in machine learning is amazing.
02:22:27 So whenever like something pops up in TensorFlow,
02:22:30 you wait a few days and someone who’s like super sharp
02:22:33 will actually come and translate that particular code
02:22:35 base into PyTorch and basically figure out
02:22:38 all the nooks and crannies.
02:22:39 So the open source community is amazing
02:22:41 and they really like figure out this gap.
02:22:44 So I think in terms of like having these two frameworks
02:22:47 or multiple, I think of course there are different use cases
02:22:49 so there are going to be benefits
02:22:51 to using one or the other framework.
02:22:52 And like you said, I think competition is just healthy
02:22:54 because both of these frameworks keep
02:22:57 or like all of these frameworks really sort of
02:22:59 keep learning from each other
02:23:00 and keep incorporating different things
02:23:01 to just make them better and better.
02:23:03 What advice would you have for someone
02:23:06 new to machine learning, you know,
02:23:09 maybe just started or haven’t even started
02:23:11 but are curious about it and who want to get in the field?
02:23:14 Don’t be afraid to get your hands dirty.
02:23:16 I think that’s the main thing.
02:23:17 So if something doesn’t work,
02:23:19 like really drill into why things are not working.
02:23:22 Can you elaborate what your hands dirty means?
02:23:24 Right, so for example, like if an algorithm,
02:23:27 if you try to train the network and it’s not converging,
02:23:29 whatever, rather than trying to like Google the answer
02:23:32 or trying to do something,
02:23:33 like really spend those like five, eight, 10, 15, 20,
02:23:36 whatever number of hours really trying
02:23:37 to figure it out yourself.
02:23:39 Because in that process, you’ll actually learn a lot more.
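(For concreteness, here is one common sanity check people reach for when a network won’t converge, sketched in PyTorch; it is a generic debugging habit, not something prescribed in the conversation, and the model and data here are invented.)

```python
import torch
import torch.nn as nn

# Common sanity check when training isn't converging: try to overfit a
# single small batch. If the model can't drive the loss near zero on
# ten fixed examples, the bug is likely in the model, loss, or data
# pipeline, not in how long you trained.
model = nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, 2))
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

x = torch.randn(10, 20)            # one fixed batch
y = torch.randint(0, 2, (10,))     # fixed labels

for step in range(500):
    opt.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    opt.step()
    if step % 100 == 0:
        print(step, loss.item())   # should trend toward ~0
```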
02:23:41 Yeah.
02:23:42 Googling is of course like a good way to solve it
02:23:44 when you need a quick answer.
02:23:45 But I think initially, especially like when you’re starting
02:23:48 out, it’s much nicer to like figure things out by yourself.
02:23:51 And I just say that from experience
02:23:52 because like when I started out,
02:23:54 there were not a lot of resources.
02:23:55 So we would like in the lab, a lot of us,
02:23:57 like we would look up to senior students
02:23:59 and then the senior students were of course busy
02:24:01 and they would be like, hey, why don’t you go figure it out?
02:24:03 Because I just don’t have the time.
02:24:04 I’m working on my dissertation or whatever.
02:24:06 I’m a final year PhD student.
02:24:07 And so then we would sit down
02:24:08 and like just try to figure it out.
02:24:10 And that I think really helped me.
02:24:12 That has really helped me figure a lot of things out.
02:24:15 I think in general, if I were to generalize that,
02:24:18 I feel like persevering through any kind of struggle
02:24:22 on a thing you care about is good.
02:24:25 So you’re basically, you try to make it seem
02:24:27 like it’s good to spend time debugging,
02:24:30 but really any kind of struggle, whatever form that takes,
02:24:33 it could be just Googling a lot.
02:24:36 Just basically anything, just sticking with it
02:24:38 and going through the hard thing that could take a form
02:24:41 of implementing stuff from scratch.
02:24:43 It could take the form of reimplementing
02:24:45 with different libraries
02:24:46 or different programming languages.
02:24:49 It could take a lot of different forms,
02:24:50 but struggle is good for the soul.
02:24:53 So like in Pittsburgh, where I did my PhD,
02:24:55 the thing was it used to snow a lot.
02:24:58 And so when it snowed, you really couldn’t do much.
02:25:00 So the thing that a lot of people said
02:25:02 was snow builds character.
02:25:05 Because when it’s snowing, you can’t do anything else.
02:25:07 You focus on work.
02:25:09 Do you have advice in general for people,
02:25:10 you’ve already exceptionally successful, you’re young,
02:25:13 but do you have advice for young people starting out
02:25:15 in college or maybe in high school?
02:25:18 Advice for their career, advice for their life,
02:25:21 how to pave a successful path in career and life?
02:25:25 I would say just be hungry.
02:25:27 Always be hungry for what you want.
02:25:29 And I think I’ve been inspired by a lot of people
02:25:33 who are just driven and who really go for what they want,
02:25:36 no matter what, like you shouldn’t want it,
02:25:39 you should need it.
02:25:40 So if you need something,
02:25:41 you basically go towards the ends to make it work.
02:25:44 How do you know when you come across a thing
02:25:47 that’s like you need?
02:25:51 I think there’s not going to be any single thing
02:25:53 that you’re going to need.
02:25:53 There are going to be different types of things
02:25:54 that you need, but whenever you need something,
02:25:56 you just go push for it.
02:25:57 And of course, you may not get it,
02:26:00 or you may find that this was not even the thing
02:26:01 that you were looking for, it might be a different thing.
02:26:03 But the point is like you’re pushing through things
02:26:06 and that actually brings a lot of skills
02:26:08 and builds a certain kind of attitude
02:26:12 which will probably help you get the other thing
02:26:15 once you figure out what’s really the thing that you want.
02:26:18 Yeah, I think a lot of people are,
02:26:20 I’ve noticed, kind of afraid of that
02:26:22 is because one, it’s a fear of commitment.
02:26:24 And two, there’s so many amazing things in this world,
02:26:26 you almost don’t want to miss out
02:26:28 on all the other amazing things
02:26:29 by committing to this one thing.
02:26:31 So I think a lot of it has to do with just
02:26:32 allowing yourself to notice that thing
02:26:37 and just go all the way with it.
02:26:41 I mean, I also like failure, right?
02:26:43 So I know this is like super cheesy that failure
02:26:47 is something that you should be prepared for and so on,
02:26:49 but I do think, I mean, especially in research,
02:26:52 for example, failure is something that happens
02:26:54 almost every day is like experiments failing
02:26:58 and not working.
02:26:59 And so you really need to be so used to it.
02:27:02 You need to have a thick skin,
02:27:03 but it’s only basically
02:27:06 when you get through it that you find
02:27:07 the one thing that’s actually working.
02:27:09 So Thomas Edison was like one person like that, right?
02:27:11 So I really, like when I was a kid,
02:27:13 I used to really read about how he found like the filament,
02:27:17 the light bulb filament.
02:27:18 And then he, I think his thing was like,
02:27:20 he tried 990 things that didn’t work
02:27:23 or something of the sort.
02:27:24 And then they asked him like, so what did you learn?
02:27:26 Because all of these were failed experiments.
02:27:28 And then he says, oh, these 990 things don’t work.
02:27:31 And I know that.
02:27:32 Did you know that?
02:27:33 I mean, that’s really inspiring.
02:27:35 So you spent a few years on this earth
02:27:38 performing a self supervised kind of learning process.
02:27:43 Have you figured out the meaning of life yet?
02:27:46 I told you I’m doing this podcast
02:27:47 to try to get the answer.
02:27:49 I’m hoping you could tell me,
02:27:50 what do you think the meaning of it all is?
02:27:54 I don’t think I figured this out.
02:27:55 No, I have no idea.
02:27:57 Do you think AI will help us figure it out
02:28:02 or do you think there’s no answer?
02:28:03 The whole point is to keep searching.
02:28:05 I think, yeah, I think it’s an endless sort of quest for us.
02:28:08 I don’t think AI will help us there.
02:28:10 This is like a very hard, hard, hard question
02:28:13 which so many humans have tried to answer.
02:28:15 Well, that’s the interesting thing
02:28:16 about the difference between AI and humans.
02:28:19 Humans don’t seem to know what the hell they’re doing.
02:28:21 And AI is almost always operating
02:28:23 under well defined objective functions.
02:28:28 And I wonder whether our lack of ability
02:28:33 to define good longterm objective functions
02:28:37 or introspect what is the objective function
02:28:40 under which we operate, if that’s a feature or a bug.
02:28:44 I would say it’s a feature
02:28:45 because then everyone actually has very different kinds
02:28:47 of objective functions that they’re optimizing
02:28:49 and those objective functions evolve
02:28:51 and change dramatically through the course
02:28:53 of their life.
02:28:54 That’s actually what makes us interesting, right?
02:28:56 If otherwise, like if everyone was doing
02:28:58 the exact same thing, that would be pretty boring.
02:29:00 We do want like people with different kinds
02:29:02 of perspectives, also people evolve continuously.
02:29:06 That’s like, I would say the biggest feature of being human.
02:29:09 And then we get to like the ones that die
02:29:11 because they do something stupid.
02:29:12 We get to watch that, see it and learn from it.
02:29:15 And as a species, we take that lesson
02:29:20 and become better and better
02:29:22 because of all the dumb people in the world
02:29:24 that died doing something wild and beautiful.
02:29:29 Ishan, thank you so much for this incredible conversation.
02:29:31 We did a depth first search through the space
02:29:37 of machine learning and it was fun and fascinating.
02:29:41 So it’s really an honor to meet you
02:29:43 and it was a really awesome conversation.
02:29:45 Thanks for coming down today and talking with me.
02:29:48 Thanks Lex, I mean, I’ve listened to you.
02:29:50 I told you it was unreal for me to actually meet you
02:29:52 in person and I’m so happy to be here, thank you.
02:29:55 Thanks man.
02:29:56 Thanks for listening to this conversation
02:29:58 with Ishan Misra and thank you to Onnit,
02:30:01 The Information, Grammarly and Athletic Greens.
02:30:05 Check them out in the description to support this podcast.
02:30:08 And now let me leave you with some words
02:30:10 from Arthur C. Clarke.
02:30:12 Any sufficiently advanced technology
02:30:14 is indistinguishable from magic.
02:30:18 Thank you for listening and hope to see you next time.