Ilya Sutskever: Deep Learning #94

Transcript

00:00:00 The following is a conversation with Ilya Sutskever,

00:00:03 cofounder and chief scientist of OpenAI,

00:00:06 one of the most cited computer scientists in history

00:00:09 with over 165,000 citations,

00:00:13 and to me, one of the most brilliant and insightful minds

00:00:17 ever in the field of deep learning.

00:00:20 There are very few people in this world

00:00:21 who I would rather talk to and brainstorm with

00:00:24 about deep learning, intelligence, and life in general

00:00:27 than Ilya, on and off the mic.

00:00:30 This was an honor and a pleasure.

00:00:33 This conversation was recorded

00:00:35 before the outbreak of the pandemic.

00:00:37 For everyone feeling the medical, psychological,

00:00:39 and financial burden of this crisis,

00:00:41 I’m sending love your way.

00:00:43 Stay strong, we’re in this together, we’ll beat this thing.

00:00:47 This is the Artificial Intelligence Podcast.

00:00:49 If you enjoy it, subscribe on YouTube,

00:00:51 review it with five stars on Apple Podcast,

00:00:54 support it on Patreon,

00:00:55 or simply connect with me on Twitter

00:00:57 at lexfridman, spelled F R I D M A N.

00:01:00 As usual, I’ll do a few minutes of ads now

00:01:03 and never any ads in the middle

00:01:04 that can break the flow of the conversation.

00:01:06 I hope that works for you

00:01:07 and doesn’t hurt the listening experience.

00:01:10 This show is presented by Cash App,

00:01:13 the number one finance app in the App Store.

00:01:15 When you get it, use code LEXPODCAST.

00:01:18 Cash App lets you send money to friends,

00:01:20 buy Bitcoin, invest in the stock market

00:01:23 with as little as $1.

00:01:25 Since Cash App allows you to buy Bitcoin,

00:01:27 let me mention that cryptocurrency

00:01:29 in the context of the history of money is fascinating.

00:01:33 I recommend The Ascent of Money as a great book on this history.

00:01:36 Both the book and audio book are great.

00:01:39 Debits and credits on ledgers

00:01:41 started around 30,000 years ago.

00:01:43 The US dollar was created over 200 years ago,

00:01:47 and Bitcoin, the first decentralized cryptocurrency,

00:01:50 released just over 10 years ago.

00:01:52 So given that history,

00:01:53 cryptocurrency is still very much in its early days

00:01:55 of development, but it’s still aiming to

00:01:58 and just might redefine the nature of money.

00:02:01 So again, if you get Cash App from the App Store

00:02:04 or Google Play and use the code LEXPODCAST,

00:02:08 you get $10 and Cash App will also donate $10 to FIRST,

00:02:12 an organization that is helping advance robotics

00:02:14 and STEM education for young people around the world.

00:02:18 And now here’s my conversation with Ilya Sutskever.

00:02:22 You were one of the three authors with Alex Krizhevsky,

00:02:26 Geoff Hinton of the famed AlexNet paper

00:02:30 that is arguably the paper that marked

00:02:33 the big catalytic moment

00:02:35 that launched the deep learning revolution.

00:02:37 At that time, take us back to that time,

00:02:39 what was your intuition about neural networks,

00:02:42 about the representational power of neural networks?

00:02:46 And maybe you could mention how did that evolve

00:02:48 over the next few years up to today,

00:02:51 over the 10 years?

00:02:53 Yeah, I can answer that question.

00:02:55 At some point in about 2010 or 2011,

00:03:00 I connected two facts in my mind.

00:03:02 Basically, the realization was this,

00:03:07 at some point we realized that we can train very large,

00:03:11 I shouldn’t say very, tiny by today’s standards,

00:03:13 but large and deep neural networks

00:03:16 end to end with backpropagation.

00:03:18 At some point, different people obtained this result.

00:03:22 I obtained this result.

00:03:23 The first moment in which I realized

00:03:26 that deep neural networks are powerful

00:03:28 was when James Martens invented

00:03:30 the Hessian free optimizer in 2010.

00:03:33 And he trained a 10 layer neural network end to end

00:03:37 without pre training from scratch.

00:03:41 And when that happened, I thought this is it.

00:03:43 Because if you can train a big neural network,

00:03:45 a big neural network can represent very complicated functions.

00:03:49 Because if you have a neural network with 10 layers,

00:03:52 it’s as though you allow the human brain

00:03:55 to run for some number of milliseconds.

00:03:58 Neuron firings are slow.

00:04:00 And so in maybe 100 milliseconds,

00:04:03 your neurons only fire 10 times.

00:04:04 So it’s also kind of like 10 layers.

00:04:06 And in 100 milliseconds,

00:04:08 you can perfectly recognize any object.
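
A rough back-of-the-envelope version of this timing argument, assuming a cortical firing rate on the order of 100 Hz (illustrative numbers, not a precise neuroscience claim):

```latex
\text{firing rate} \approx 100\,\mathrm{Hz}
\;\Rightarrow\; \text{roughly one spike every } 10\,\mathrm{ms},
\qquad
\frac{100\,\mathrm{ms}}{10\,\mathrm{ms\ per\ spike}} \approx 10 \text{ sequential firings} \approx 10 \text{ layers of computation}.
```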

00:04:10 So I thought, so I already had the idea then

00:04:13 that we need to train a very big neural network

00:04:16 on lots of supervised data.

00:04:18 And then it must succeed

00:04:19 because we can find the best neural network.

00:04:21 And then there’s also theory

00:04:22 that if you have more data than parameters,

00:04:24 you won’t overfit.

00:04:25 Today, we know that actually this theory is very incomplete

00:04:28 and you won’t overfit even if you have less data

00:04:29 than parameters, but definitely,

00:04:31 if you have more data than parameters, you won’t overfit.
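
The classical theory being referenced is, roughly, a uniform-convergence style bound; a hedged sketch (exact forms and constants vary by analysis):

```latex
\text{test error} \;\lesssim\; \text{training error} \;+\; O\!\left(\sqrt{\frac{d}{n}}\right),
\qquad d = \text{parameter count (capacity)},\quad n = \text{number of training examples},
```

which only gives guarantees when n is much larger than d. The "very incomplete" part is that heavily overparameterized networks, with d greater than n, often generalize well anyway.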

00:04:33 So the fact that neural networks

00:04:34 were heavily overparameterized wasn’t discouraging to you?

00:04:39 So you were thinking about the theory

00:04:41 that the number of parameters,

00:04:43 the fact that there’s a huge number of parameters is okay?

00:04:45 Is it gonna be okay?

00:04:46 I mean, there was some evidence before that it was okayish,

00:04:48 but the theory was most,

00:04:49 the theory was that if you had a big data set

00:04:51 and a big neural net, it was going to work.

00:04:53 The overparameterization just didn’t really

00:04:55 figure much as a problem.

00:04:57 I thought, well, with images,

00:04:57 you’re just gonna add some data augmentation

00:04:59 and it’s gonna be okay.

00:05:00 So where was any doubt coming from?

00:05:02 The main doubt was, can we train a bigger,

00:05:04 will we have enough compute to train

00:05:05 a big enough neural net?

00:05:06 With backpropagation.

00:05:07 Backpropagation I thought would work.

00:05:09 The thing which wasn’t clear

00:05:10 was whether there would be enough compute

00:05:12 to get a very convincing result.

00:05:14 And then at some point, Alex Krizhevsky wrote

00:05:15 these insanely fast CUDA kernels

00:05:17 for training convolutional neural nets.

00:05:19 And that was, bam, let’s do this.

00:05:20 Let’s get ImageNet and it’s gonna be the greatest thing.

00:05:23 Was your intuition, most of your intuition

00:05:25 from empirical results by you and by others?

00:05:29 So like just actually demonstrating

00:05:31 that a piece of program can train

00:05:33 a 10 layer neural network?

00:05:34 Or was there some pen and paper

00:05:37 or marker and whiteboard thinking intuition?

00:05:41 Like, cause you just connected a 10 layer

00:05:43 large neural network to the brain.

00:05:45 So you just mentioned the brain.

00:05:46 So in your intuition about neural networks

00:05:49 does the human brain come into play as a intuition builder?

00:05:53 Definitely.

00:05:54 I mean, you gotta be precise with these analogies

00:05:57 between artificial neural networks and the brain.

00:06:00 But there is no question that the brain is a huge source

00:06:04 of intuition and inspiration for deep learning researchers

00:06:07 since all the way from Rosenblatt in the 60s.

00:06:10 Like if you look at the whole idea of a neural network

00:06:13 is directly inspired by the brain.

00:06:15 You had people like McCulloch and Pitts who were saying,

00:06:18 hey, you got these neurons in the brain.

00:06:22 And hey, we recently learned about the computer

00:06:23 and automata.

00:06:24 Can we use some ideas from the computer and automata

00:06:26 to design some kind of computational object

00:06:28 that’s going to be simple, computational

00:06:31 and kind of like the brain and they invented the neuron.

00:06:34 So they were inspired by it back then.

00:06:35 Then you had the convolutional neural network from Fukushima

00:06:38 and then later Yann LeCun who said, hey,

00:06:40 if you limit the receptive fields of a neural network,

00:06:42 it’s going to be especially suitable for images

00:06:45 as it turned out to be true.

00:06:46 So there was a very small number of examples

00:06:49 where analogies to the brain were successful.

00:06:52 And I thought, well, probably an artificial neuron

00:06:55 is not that different from the brain

00:06:56 if you squint hard enough.

00:06:57 So let’s just assume it is and roll with it.

00:07:00 So now we’re at a time where deep learning

00:07:02 is very successful.

00:07:03 So let us squint less and say, let’s open our eyes

00:07:08 and say, what to you is an interesting difference

00:07:12 between the human brain and artificial neural networks?

00:07:13 Now, I know you’re probably not an expert

00:07:16 neither a neuroscientist nor a biologist,

00:07:18 but loosely speaking, what’s the difference

00:07:20 between the human brain and artificial neural networks?

00:07:22 That’s interesting to you for the next decade or two.

00:07:26 That’s a good question to ask.

00:07:27 What is an interesting difference between the neurons

00:07:29 between the brain and our artificial neural networks?

00:07:32 So I feel like today, artificial neural networks,

00:07:37 so we all agree that there are certain dimensions

00:07:39 in which the human brain vastly outperforms our models.

00:07:43 But I also think that there are some ways

00:07:44 in which our artificial neural networks

00:07:46 have a number of very important advantages over the brain.

00:07:50 Looking at the advantages versus disadvantages

00:07:52 is a good way to figure out what is the important difference.

00:07:55 So the brain uses spikes, which may or may not be important.

00:08:00 Yeah, it’s a really interesting question.

00:08:01 Do you think it’s important or not?

00:08:03 That’s one big architectural difference

00:08:06 between artificial neural networks and the brain.

00:08:08 It’s hard to tell, but my prior is not very high

00:08:11 and I can say why.

00:08:13 There are people who are interested

00:08:14 in spiking neural networks.

00:08:15 And basically what they figured out

00:08:17 is that they need to simulate

00:08:19 the non spiking neural networks in spikes.

00:08:22 And that’s how they’re gonna make them work.

00:08:24 If you don’t simulate the non spiking neural networks

00:08:26 in spikes, it’s not going to work

00:08:27 because the question is why should it work?

00:08:29 And that connects to questions around back propagation

00:08:31 and questions around deep learning.

00:08:34 You’ve got this giant neural network.

00:08:36 Why should it work at all?

00:08:38 Why should the learning rule work at all?

00:08:43 It’s not a self evident question,

00:08:44 especially if you, let’s say if you were just starting

00:08:47 in the field and you read the very early papers,

00:08:49 you can say, hey, people are saying,

00:08:51 let’s build neural networks.

00:08:53 That’s a great idea because the brain is a neural network.

00:08:55 So it would be useful to build neural networks.

00:08:58 Now let’s figure out how to train them.

00:09:00 It should be possible to train them probably, but how?

00:09:03 And so the big idea is the cost function.

00:09:07 That’s the big idea.

00:09:08 The cost function is a way of measuring the performance

00:09:11 of the system according to some measure.

00:09:14 By the way, that is a big, actually let me think,

00:09:17 is that one, a difficult idea to arrive at

00:09:21 and how big of an idea is that?

00:09:22 That there’s a single cost function.

00:09:27 Sorry, let me take a pause.

00:09:28 Is supervised learning a difficult concept to come to?

00:09:33 I don’t know.

00:09:34 All concepts are very easy in retrospect.

00:09:36 Yeah, that’s what it seems trivial now,

00:09:38 but I, because the reason I asked that,

00:09:40 and we’ll talk about it, is there other things?

00:09:43 Is there things that don’t necessarily have a cost function,

00:09:47 maybe have many cost functions

00:09:48 or maybe have dynamic cost functions

00:09:50 or maybe a totally different kind of architectures?

00:09:54 Because we have to think like that

00:09:55 in order to arrive at something new, right?

00:09:57 So the only, so the good examples of things

00:09:59 which don’t have clear cost functions are GANs.

00:10:03 Right. In a GAN, you have a game.

00:10:05 So instead of thinking of a cost function,

00:10:08 where you wanna optimize,

00:10:09 where you know that you have an algorithm gradient descent,

00:10:12 which will optimize the cost function,

00:10:13 and then you can reason about the behavior of your system

00:10:16 in terms of what it optimizes.

00:10:18 With a GAN, you say, I have a game

00:10:20 and I’ll reason about the behavior of the system

00:10:22 in terms of the equilibrium of the game.

00:10:24 But it’s all about coming up with these mathematical objects

00:10:26 that help us reason about the behavior of our system.
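
For concreteness, the original GAN formulation (Goodfellow et al., 2014) replaces a single minimized cost with a two-player minimax game:

```latex
\min_{G}\,\max_{D}\; V(D,G) \;=\;
\mathbb{E}_{x \sim p_{\mathrm{data}}}\!\left[\log D(x)\right]
\;+\;
\mathbb{E}_{z \sim p(z)}\!\left[\log\!\big(1 - D(G(z))\big)\right],
```

and, as described here, the behavior of the system is reasoned about in terms of the equilibrium of this game rather than the minimum of a single cost function.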

00:10:30 Right, that’s really interesting.

00:10:31 Yeah, so GAN is the only one, it’s kind of a,

00:10:33 the cost function is emergent from the comparison.

00:10:36 It’s, I don’t know if it has a cost function.

00:10:38 I don’t know if it’s meaningful

00:10:39 to talk about the cost function of a GAN.

00:10:41 It’s kind of like the cost function of biological evolution

00:10:44 or the cost function of the economy.

00:10:45 It’s, you can talk about regions

00:10:49 to which it will go towards, but I don’t think,

00:10:55 I don’t think the cost function analogy is the most useful.

00:10:57 So if evolution doesn’t, that’s really interesting.

00:11:00 So if evolution doesn’t really have a cost function,

00:11:02 like a cost function based on its,

00:11:06 something akin to our mathematical conception

00:11:09 of a cost function, then do you think cost functions

00:11:12 in deep learning are holding us back?

00:11:15 Yeah, so you just kind of mentioned that cost function

00:11:18 is a nice first profound idea.

00:11:21 Do you think that’s a good idea?

00:11:23 Do you think it’s an idea we’ll go past?

00:11:26 So self play starts to touch on that a little bit

00:11:29 in reinforcement learning systems.

00:11:31 That’s right.

00:11:32 Self play and also ideas around exploration

00:11:34 where you’re trying to take action

00:11:36 that surprise a predictor.

00:11:39 I’m a big fan of cost functions.

00:11:40 I think cost functions are great

00:11:41 and they serve us really well.

00:11:42 And I think that whenever we can do things

00:11:44 with cost functions, we should.

00:11:45 And you know, maybe there is a chance

00:11:47 that we will come up with some,

00:11:49 yet another profound way of looking at things

00:11:51 that will involve cost functions in a less central way.

00:11:54 But I don’t know, I think cost functions are,

00:11:55 I mean, I would not bet against cost functions.

00:12:01 Is there other things about the brain

00:12:04 that pop into your mind that might be different

00:12:06 and interesting for us to consider

00:12:09 in designing artificial neural networks?

00:12:12 So we talked about spiking a little bit.

00:12:14 I mean, one thing which may potentially be useful,

00:12:16 I think people, neuroscientists have figured out

00:12:18 something about the learning rule of the brain

00:12:20 I’m talking about spike-timing-dependent plasticity,

00:12:22 and it would be nice if some people

00:12:24 would just study that in simulation.

00:12:26 Wait, sorry, spike-timing-dependent plasticity?

00:12:28 Yeah, that’s right.

00:12:29 What’s that?

00:12:30 STDP.

00:12:31 It’s a particular learning rule that uses spike timing

00:12:33 to determine how to update the synapses.

00:12:37 So it’s kind of like if a synapse fires into the neuron

00:12:40 before the neuron fires,

00:12:42 then it strengthens the synapse,

00:12:44 and if the synapse fires into the neuron

00:12:46 shortly after the neuron fired,

00:12:47 then it weakens the synapse.

00:12:49 Something along this line.

00:12:50 I’m 90% sure it’s right, so if I said something wrong here,

00:12:54 don’t get too angry.
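
A minimal sketch of a pairwise STDP update along the lines described, using the textbook exponential-window form; the constants are illustrative assumptions, not a claim about the brain's exact rule:

```python
import math

# Illustrative STDP parameters (hypothetical values).
A_PLUS, A_MINUS = 0.01, 0.012     # potentiation / depression amplitudes
TAU_PLUS, TAU_MINUS = 20.0, 20.0  # time constants in milliseconds

def stdp_delta_w(t_pre: float, t_post: float) -> float:
    """Weight change for a single pre/post spike pair.

    If the presynaptic spike arrives before the postsynaptic spike,
    the synapse is strengthened; if it arrives shortly after the
    postsynaptic spike, the synapse is weakened.
    """
    dt = t_post - t_pre
    if dt > 0:    # pre fires before post -> potentiation
        return A_PLUS * math.exp(-dt / TAU_PLUS)
    if dt < 0:    # pre fires after post -> depression
        return -A_MINUS * math.exp(dt / TAU_MINUS)
    return 0.0

# Example: pre fires at 5 ms, post at 12 ms -> small positive weight change.
print(stdp_delta_w(5.0, 12.0))
```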

00:12:57 But you sounded brilliant while saying it.

00:12:59 But the timing, that’s one thing that’s missing.

00:13:02 The temporal dynamics is not captured.

00:13:05 I think that’s like a fundamental property of the brain

00:13:08 is the timing of the signals.

00:13:13 Well, you have recurrent neural networks.

00:13:15 But you think of that as this,

00:13:18 I mean, that’s a very crude, simplified,

00:13:21 what’s that called?

00:13:23 There’s a clock, I guess, to recurrent neural networks.

00:13:27 It’s, this seems like the brain is the general,

00:13:30 the continuous version of that,

00:13:31 the generalization where all possible timings are possible,

00:13:36 and then within those timings is contained some information.

00:13:39 You think recurrent neural networks,

00:13:42 the recurrence in recurrent neural networks

00:13:45 can capture the same kind of phenomena as the timing

00:13:51 that seems to be important for the brain,

00:13:54 in the firing of neurons in the brain?

00:13:56 I mean, I think recurrent neural networks are amazing,

00:14:00 and they can do, I think they can do anything

00:14:03 we’d want them to, we’d want a system to do.

00:14:07 Right now, recurrent neural networks

00:14:09 have been superseded by transformers,

00:14:10 but maybe one day they’ll make a comeback,

00:14:12 maybe they’ll be back, we’ll see.

00:14:15 Let me, on a small tangent, say,

00:14:17 do you think they’ll be back?

00:14:19 So, so much of the breakthroughs recently

00:14:21 that we’ll talk about on natural language processing

00:14:24 and language modeling has been with transformers

00:14:28 that don’t emphasize recurrence.

00:14:30 Do you think recurrence will make a comeback?

00:14:33 Well, some kind of recurrence, I think very likely.

00:14:37 Recurrent neural networks, as they’re typically thought of

00:14:41 for processing sequences, I think it’s also possible.

00:14:44 What is, to you, a recurrent neural network?

00:14:47 In generally speaking, I guess,

00:14:49 what is a recurrent neural network?

00:14:50 You have a neural network which maintains

00:14:52 a high dimensional hidden state,

00:14:54 and then when an observation arrives,

00:14:56 it updates its high dimensional hidden state

00:14:59 through its connections in some way.
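
In code, that definition corresponds roughly to the following minimal vanilla RNN cell (a NumPy sketch; the gating machinery of an LSTM is omitted):

```python
import numpy as np

class VanillaRNNCell:
    """Maintains a high-dimensional hidden state and updates it
    through its connections each time an observation arrives."""

    def __init__(self, input_dim: int, hidden_dim: int, seed: int = 0):
        rng = np.random.default_rng(seed)
        self.W_xh = rng.normal(scale=0.1, size=(hidden_dim, input_dim))
        self.W_hh = rng.normal(scale=0.1, size=(hidden_dim, hidden_dim))
        self.b = np.zeros(hidden_dim)
        self.h = np.zeros(hidden_dim)  # the hidden state

    def step(self, x: np.ndarray) -> np.ndarray:
        # The new hidden state is a function of the observation and the old state.
        self.h = np.tanh(self.W_xh @ x + self.W_hh @ self.h + self.b)
        return self.h

# Process a sequence of observations, updating the hidden state at each step.
cell = VanillaRNNCell(input_dim=3, hidden_dim=8)
for x in np.random.default_rng(1).normal(size=(5, 3)):
    h = cell.step(x)
```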

00:15:03 So do you think, that’s what expert systems did, right?

00:15:08 Symbolic AI, the knowledge based,

00:15:12 growing a knowledge base is maintaining a hidden state,

00:15:17 which is its knowledge base,

00:15:18 and is growing it by sequential processing.

00:15:20 Do you think of it more generally in that way,

00:15:22 or is it simply, is it the more constrained form

00:15:28 of a hidden state with certain kind of gating units

00:15:31 that we think of as today with LSTMs and that?

00:15:34 I mean, the hidden state is technically

00:15:36 what you described there, the hidden state

00:15:37 that goes inside the LSTM or the RNN or something like this.

00:15:41 But then what should be contained,

00:15:43 if you want to make the expert system analogy,

00:15:46 I’m not, I mean, you could say that

00:15:49 the knowledge is stored in the connections,

00:15:51 and then the short term processing

00:15:53 is done in the hidden state.

00:15:56 Yes, could you say that?

00:15:58 So sort of, do you think there’s a future of building

00:16:01 large scale knowledge bases within the neural networks?

00:16:05 Definitely.

00:16:09 So we’re gonna pause on that confidence,

00:16:11 because I want to explore that.

00:16:12 Well, let me zoom back out and ask,

00:16:16 back to the history of ImageNet.

00:16:19 Neural networks have been around for many decades,

00:16:21 as you mentioned.

00:16:22 What do you think were the key ideas

00:16:24 that led to their success,

00:16:25 that ImageNet moment and beyond,

00:16:28 the success in the past 10 years?

00:16:32 Okay, so the question is,

00:16:33 to make sure I didn’t miss anything,

00:16:35 the key ideas that led to the success

00:16:37 of deep learning over the past 10 years.

00:16:39 Exactly, even though the fundamental thing

00:16:42 behind deep learning has been around for much longer.

00:16:45 So the key idea about deep learning,

00:16:51 or rather the key fact about deep learning

00:16:53 before deep learning started to be successful,

00:16:58 is that it was underestimated.

00:17:01 People who worked in machine learning

00:17:02 simply didn’t think that neural networks could do much.

00:17:06 People didn’t believe that large neural networks

00:17:08 could be trained.

00:17:10 People thought that, well, there was lots of,

00:17:13 there was a lot of debate going on in machine learning

00:17:15 about what are the right methods and so on.

00:17:17 And people were arguing because there were no,

00:17:21 there was no way to get hard facts.

00:17:23 And by that, I mean, there were no benchmarks

00:17:25 which were truly hard that if you do really well on them,

00:17:28 then you can say, look, here’s my system.

00:17:32 That’s when you switch from,

00:17:35 that’s when this field becomes a little bit more

00:17:37 of an engineering field.

00:17:38 So in terms of deep learning,

00:17:39 to answer the question directly,

00:17:42 the ideas were all there.

00:17:43 The thing that was missing was a lot of supervised data

00:17:46 and a lot of compute.

00:17:49 Once you have a lot of supervised data and a lot of compute,

00:17:52 then there is a third thing which is needed as well.

00:17:54 And that is conviction.

00:17:56 Conviction that if you take the right stuff,

00:17:59 which already exists, and apply and mix it

00:18:01 with a lot of data and a lot of compute,

00:18:03 that it will in fact work.

00:18:05 And so that was the missing piece.

00:18:07 It was, you had the, you needed the data,

00:18:10 you needed the compute, which showed up in terms of GPUs,

00:18:14 and you needed the conviction to realize

00:18:15 that you need to mix them together.

00:18:18 So that’s really interesting.

00:18:19 So I guess the presence of compute

00:18:23 and the presence of supervised data

00:18:26 allowed the empirical evidence to do the convincing

00:18:29 of the majority of the computer science community.

00:18:32 So I guess there’s a key moment with Jitendra Malik

00:18:36 and Alexei (Alyosha) Efros, who were very skeptical, right?

00:18:42 And then there’s Geoffrey Hinton,

00:18:43 who was the opposite of skeptical.

00:18:46 And there was a convincing moment.

00:18:48 And I think ImageNet had served as that moment.

00:18:50 That’s right.

00:18:51 And they represented this kind of,

00:18:52 were the big pillars of computer vision community,

00:18:55 kind of the wizards got together,

00:18:59 and then all of a sudden there was a shift.

00:19:01 And it’s not enough for the ideas to all be there

00:19:05 and the compute to be there,

00:19:06 it’s for it to convince the cynicism that existed.

00:19:11 It’s interesting that people just didn’t believe

00:19:14 for a couple of decades.

00:19:15 Yeah, well, but it’s more than that.

00:19:18 It’s kind of, when put this way,

00:19:20 it sounds like, well, those silly people

00:19:23 who didn’t believe, what were they missing?

00:19:25 But in reality, things were confusing

00:19:27 because neural networks really did not work on anything.

00:19:30 And they were not the best method

00:19:31 on pretty much anything as well.

00:19:33 And it was pretty rational to say,

00:19:35 yeah, this stuff doesn’t have any traction.

00:19:39 And that’s why you need to have these very hard tasks

00:19:42 which produce undeniable evidence.

00:19:44 And that’s how we make progress.

00:19:46 And that’s why the field is making progress today

00:19:48 because we have these hard benchmarks

00:19:50 which represent true progress.

00:19:52 And so, and this is why we are able to avoid endless debate.

00:19:58 So incredibly you’ve contributed

00:20:00 some of the biggest recent ideas in AI

00:20:03 in computer vision, language, natural language processing,

00:20:07 reinforcement learning, sort of everything in between,

00:20:11 maybe not GANs.

00:20:12 But there may not be a topic you haven’t touched.

00:20:16 And of course, the fundamental science of deep learning.

00:20:19 What is the difference to you between vision, language,

00:20:24 and as in reinforcement learning, action,

00:20:26 as learning problems?

00:20:28 And what are the commonalities?

00:20:29 Do you see them as all interconnected?

00:20:31 Are they fundamentally different domains

00:20:33 that require different approaches?

00:20:38 Okay, that’s a good question.

00:20:39 Machine learning is a field with a lot of unity,

00:20:41 a huge amount of unity.

00:20:44 In fact. What do you mean by unity?

00:20:45 Like overlap of ideas?

00:20:48 Overlap of ideas, overlap of principles.

00:20:50 In fact, there’s only one or two or three principles

00:20:52 which are very, very simple.

00:20:54 And then they apply in almost the same way,

00:20:57 in almost the same way to the different modalities,

00:20:59 to the different problems.

00:21:01 And that’s why today, when someone writes a paper

00:21:04 on improving optimization of deep learning and vision,

00:21:07 it improves the different NLP applications

00:21:09 and it improves the different

00:21:10 reinforcement learning applications.

00:21:12 Reinforcement learning.

00:21:13 So I would say that computer vision

00:21:15 and NLP are very similar to each other.

00:21:18 Today they differ in that they have

00:21:20 slightly different architectures.

00:21:22 We use transformers in NLP

00:21:23 and we use convolutional neural networks in vision.

00:21:26 But it’s also possible that one day this will change

00:21:28 and everything will be unified with a single architecture.

00:21:31 Because if you go back a few years ago

00:21:33 in natural language processing,

00:21:36 there were a huge number of architectures

00:21:39 and every different tiny problem had its own architecture.

00:21:43 Today, there’s just one transformer

00:21:45 for all those different tasks.

00:21:47 And if you go back in time even more,

00:21:49 you had even more and more fragmentation

00:21:51 and every little problem in AI

00:21:53 had its own little subspecialization

00:21:55 and, you know, its own little collection of skills,

00:21:58 people who would know how to engineer the features.

00:22:00 Now it’s all been subsumed by deep learning.

00:22:02 We have this unification.

00:22:04 And so I expect vision to become unified

00:22:06 with natural language as well.

00:22:08 Or rather, I shouldn’t say expect, I think it’s possible.

00:22:10 I don’t wanna be too sure because

00:22:12 I think on the convolutional neural net

00:22:13 is very computationally efficient.

00:22:15 RL is different.

00:22:16 RL does require slightly different techniques

00:22:18 because you really do need to take action.

00:22:20 You really need to do something about exploration.

00:22:23 Your variance is much higher.

00:22:26 But I think there is a lot of unity even there.

00:22:28 And I would expect, for example, that at some point

00:22:29 there will be some broader unification

00:22:33 between RL and supervised learning

00:22:35 where somehow the RL will be making decisions

00:22:37 to make the supervised learning go better.

00:22:38 And it will be, I imagine, one big black box

00:22:41 and you just throw, you know, you shovel things into it

00:22:44 and it just figures out what to do

00:22:46 with whatever you shovel at it.

00:22:48 I mean, reinforcement learning has some aspects

00:22:50 of language and vision combined almost.

00:22:55 There’s elements of a long term memory

00:22:57 that you should be utilizing

00:22:58 and there’s elements of a really rich sensory space.

00:23:03 So it seems like the union of the two or something like that.

00:23:08 I’d say something slightly differently.

00:23:10 I’d say that reinforcement learning is neither,

00:23:12 but it naturally interfaces

00:23:14 and integrates with the two of them.

00:23:17 Do you think action is fundamentally different?

00:23:19 So yeah, what is interesting about,

00:23:21 what is unique about policy of learning to act?

00:23:26 Well, so one example, for instance,

00:23:27 is that when you learn to act,

00:23:29 you are fundamentally in a non stationary world

00:23:33 because as your actions change,

00:23:35 the things you see start changing.

00:23:38 You experience the world in a different way.

00:23:41 And this is not the case for

00:23:43 the more traditional static problem

00:23:44 where you have some distribution

00:23:46 and you just apply a model to that distribution.

00:23:49 You think it’s a fundamentally different problem

00:23:51 or is it just a more difficult generalization

00:23:55 of the problem of understanding?

00:23:57 I mean, it’s a question of definitions almost.

00:23:59 There is a huge amount of commonality for sure.

00:24:02 You take gradients.

00:24:04 We try to approximate gradients in both cases.

00:24:06 In the case of reinforcement learning,

00:24:08 you have some tools to reduce the variance of the gradients.

00:24:11 You do that.

00:24:13 There’s lots of commonality.

00:24:13 Use the same neural net in both cases.

00:24:16 You compute the gradient, you apply Adam in both cases.

00:24:20 So, I mean, there’s lots in common for sure,

00:24:24 but there are some small differences

00:24:26 which are not completely insignificant.

00:24:28 It’s really just a matter of your point of view,

00:24:30 what frame of reference,

00:24:32 how much do you wanna zoom in or out

00:24:35 as you look at these problems?
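
A hedged sketch of that commonality in PyTorch: the same network and the same Adam optimizer in both cases, with the RL case using a REINFORCE-style gradient estimate and a simple baseline for variance reduction. The observations, labels, and returns below are dummy placeholders standing in for real data and a real environment:

```python
import torch
import torch.nn as nn

net = nn.Sequential(nn.Linear(4, 32), nn.Tanh(), nn.Linear(32, 2))
opt = torch.optim.Adam(net.parameters(), lr=1e-3)

# Supervised case: exact gradient of a fixed loss on a fixed data distribution.
x, y = torch.randn(64, 4), torch.randint(0, 2, (64,))
loss = nn.functional.cross_entropy(net(x), y)
opt.zero_grad()
loss.backward()
opt.step()

# RL case: the data depends on the policy's own actions, and the gradient
# is a noisy estimate, so a baseline is subtracted to reduce its variance.
obs = torch.randn(64, 4)           # observations from rollouts (dummy here)
dist = torch.distributions.Categorical(logits=net(obs))
actions = dist.sample()
returns = torch.randn(64)          # returns from the environment (dummy here)
baseline = returns.mean()          # simple variance-reduction baseline
pg_loss = -(dist.log_prob(actions) * (returns - baseline)).mean()
opt.zero_grad()
pg_loss.backward()
opt.step()
```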

00:24:37 Which problem do you think is harder?

00:24:39 So people like Noam Chomsky believe

00:24:41 that language is fundamental to everything.

00:24:43 So it underlies everything.

00:24:45 Do you think language understanding is harder

00:24:48 than visual scene understanding or vice versa?

00:24:52 I think that asking if a problem is hard is slightly wrong.

00:24:56 I think the question is a little bit wrong

00:24:57 and I wanna explain why.

00:24:59 So what does it mean for a problem to be hard?

00:25:04 Okay, the non interesting dumb answer to that

00:25:07 is there’s a benchmark

00:25:10 and there’s a human level performance on that benchmark

00:25:13 and how much effort is required

00:25:16 to reach the human level benchmark.

00:25:19 So from the perspective of how much

00:25:20 until we get to human level on a very good benchmark.

00:25:25 Yeah, I understand what you mean by that.

00:25:28 So what I was going to say that a lot of it depends on,

00:25:32 once you solve a problem, it stops being hard

00:25:34 and that’s always true.

00:25:35 And so whether something is hard or not depends

00:25:38 on what our tools can do today.

00:25:39 So you say today through human level,

00:25:43 language understanding and visual perception are hard

00:25:46 in the sense that there is no way

00:25:48 of solving the problem completely in the next three months.

00:25:52 So I agree with that statement.

00:25:53 Beyond that, my guess would be as good as yours,

00:25:56 I don’t know.

00:25:57 Oh, okay, so you don’t have a fundamental intuition

00:26:00 about how hard language understanding is.

00:26:02 I think, I know I changed my mind.

00:26:04 I’d say language is probably going to be harder.

00:26:06 I mean, it depends on how you define it.

00:26:09 Like if you mean absolute top notch,

00:26:11 100% language understanding, I’ll go with language.

00:26:16 But then if I show you a piece of paper with letters on it,

00:26:18 is that, you see what I mean?

00:26:21 You have a vision system,

00:26:22 you say it’s the best human level vision system.

00:26:25 I show you, I open a book and I show you letters.

00:26:28 Will it understand how these letters form into words

00:26:30 and sentences and meaning?

00:26:32 Is this part of the vision problem?

00:26:33 Where does vision end and language begin?

00:26:36 Yeah, so Chomsky would say it starts at language.

00:26:38 So vision is just a little example of the kind

00:26:40 of a structure and fundamental hierarchy of ideas

00:26:46 that’s already represented in our brains somehow

00:26:49 that’s represented through language.

00:26:51 But where does vision stop and language begin?

00:26:57 That’s a really interesting question.

00:27:07 So one possibility is that it’s impossible

00:27:09 to achieve really deep understanding in either images

00:27:14 or language without basically using the same kind of system.

00:27:18 So you’re going to get the other for free.

00:27:21 I think it’s pretty likely that yes,

00:27:23 if we can get one, our machine learning is probably

00:27:25 that good that we can get the other.

00:27:27 But I’m not 100% sure.

00:27:30 And also, I think a lot of it really does depend

00:27:34 on your definitions.

00:27:36 Definitions of?

00:27:37 Of like perfect vision.

00:27:40 Because reading is vision, but should it count?

00:27:44 Yeah, to me, so my definition is if a system looked

00:27:47 at an image and then a system looked at a piece of text

00:27:52 and then told me something about that

00:27:56 and I was really impressed.

00:27:58 That’s relative.

00:27:59 You’ll be impressed for half an hour

00:28:01 and then you’re gonna say, well, I mean,

00:28:02 all the systems do that, but here’s the thing they don’t do.

00:28:05 Yeah, but I don’t have that with humans.

00:28:07 Humans continue to impress me.

00:28:08 Is that true?

00:28:10 Well, the ones, okay, so I’m a fan of monogamy.

00:28:14 So I like the idea of marrying somebody,

00:28:16 being with them for several decades.

00:28:18 So I believe in the fact that yes, it’s possible

00:28:20 to have somebody continuously giving you

00:28:24 pleasurable, interesting, witty new ideas, friends.

00:28:28 Yeah, I think so.

00:28:29 They continue to surprise you.

00:28:32 The surprise, it’s that injection of randomness.

00:28:37 It seems to be a nice source of, yeah, continued inspiration,

00:28:47 like the wit, the humor.

00:28:48 I think, yeah, that would be,

00:28:53 it’s a very subjective test,

00:28:54 but I think if you have enough humans in the room.

00:28:58 Yeah, I understand what you mean.

00:29:00 Yeah, I feel like I misunderstood

00:29:02 what you meant by impressing you.

00:29:02 I thought you meant to impress you with its intelligence,

00:29:06 with how well it understands an image.

00:29:10 I thought you meant something like,

00:29:11 I’m gonna show it a really complicated image

00:29:13 and it’s gonna get it right.

00:29:14 And you’re gonna say, wow, that’s really cool.

00:29:15 Our systems of January 2020 have not been doing that.

00:29:19 Yeah, no, I think it all boils down to like

00:29:23 the reason people click like on stuff on the internet,

00:29:26 which is like, it makes them laugh.

00:29:28 So it’s like humor or wit or insight.

00:29:32 I’m sure we’ll get that as well.

00:29:35 So forgive the romanticized question,

00:29:38 but looking back to you,

00:29:40 what is the most beautiful or surprising idea

00:29:43 in deep learning or AI in general you’ve come across?

00:29:46 So I think the most beautiful thing about deep learning

00:29:49 is that it actually works.

00:29:51 And I mean it, because you got these ideas,

00:29:53 you got the little neural network,

00:29:54 you got the back propagation algorithm.

00:29:58 And then you’ve got some theories as to,

00:30:00 this is kind of like the brain.

00:30:02 So maybe if you make it large,

00:30:03 if you make the neural network large

00:30:04 and you train it on a lot of data,

00:30:05 then it will do the same function that the brain does.

00:30:09 And it turns out to be true, that’s crazy.

00:30:12 And now we just train these neural networks

00:30:14 and you make them larger and they keep getting better.

00:30:16 And I find it unbelievable.

00:30:17 I find it unbelievable that this whole AI stuff

00:30:20 with neural networks works.

00:30:22 Have you built up an intuition of why?

00:30:24 Are there a lot of bits and pieces of intuitions,

00:30:27 of insights of why this whole thing works?

00:30:31 I mean, some, definitely.

00:30:33 Well, we know that optimization, we now have good,

00:30:37 we’ve had lots of empirical,

00:30:40 huge amounts of empirical reasons

00:30:42 to believe that optimization should work

00:30:44 on most problems we care about.

00:30:47 Do you have insights of why?

00:30:48 So you just said empirical evidence.

00:30:50 Is most of your sort of empirical evidence

00:30:56 kind of convinces you?

00:30:58 It’s like evolution is empirical.

00:31:00 It shows you that, look,

00:31:01 this evolutionary process seems to be a good way

00:31:03 to design organisms that survive in their environment,

00:31:08 but it doesn’t really get you to the insights

00:31:11 of how the whole thing works.

00:31:13 I think a good analogy is physics.

00:31:16 You know how you say, hey, let’s do some physics calculation

00:31:19 and come up with some new physics theory

00:31:20 and make some prediction.

00:31:21 But then you got around the experiment.

00:31:23 You know, you got around the experiment, it’s important.

00:31:26 So it’s a bit the same here,

00:31:27 except that maybe sometimes the experiment

00:31:29 came before the theory.

00:31:31 But it still is the case.

00:31:32 You know, you have some data

00:31:33 and you come up with some prediction.

00:31:35 You say, yeah, let’s make a big neural network.

00:31:36 Let’s train it.

00:31:37 And it’s going to work much better than anything before it.

00:31:39 And it will in fact continue to get better

00:31:41 as you make it larger.

00:31:42 And it turns out to be true.

00:31:43 That’s amazing when a theory is validated like this.

00:31:47 It’s not a mathematical theory.

00:31:48 It’s more of a biological theory almost.

00:31:51 So I think there are not terrible analogies

00:31:53 between deep learning and biology.

00:31:55 I would say it’s like the geometric mean

00:31:57 of biology and physics.

00:31:58 That’s deep learning.

00:32:00 The geometric mean of biology and physics.

00:32:03 I think I’m going to need a few hours

00:32:05 to wrap my head around that.

00:32:07 Because just to find the geometric,

00:32:10 just to find the set of what biology represents.

00:32:16 Well, in biology, things are really complicated.

00:32:19 Theories are really, really,

00:32:21 it’s really hard to have good predictive theory.

00:32:22 And in physics, the theories are too good.

00:32:25 In physics, people make these super precise theories

00:32:27 which make these amazing predictions.

00:32:29 And in machine learning, we’re kind of in between.

00:32:31 Kind of in between, but it’d be nice

00:32:33 if machine learning somehow helped us

00:32:35 discover the unification of the two

00:32:37 as opposed to sort of the in between.

00:32:40 But you’re right.

00:32:41 That’s, you’re kind of trying to juggle both.

00:32:44 So do you think there are still beautiful

00:32:46 and mysterious properties in neural networks

00:32:48 that are yet to be discovered?

00:32:50 Definitely.

00:32:51 I think that we are still massively underestimating

00:32:53 deep learning.

00:32:55 What do you think it will look like?

00:32:56 Like what, if I knew, I would have done it, you know?

00:33:01 So, but if you look at all the progress

00:33:04 from the past 10 years, I would say most of it,

00:33:07 I would say there’ve been a few cases

00:33:08 where some were things that felt like really new ideas

00:33:12 showed up, but by and large it was every year

00:33:15 we thought, okay, deep learning goes this far.

00:33:17 Nope, it actually goes further.

00:33:19 And then the next year, okay, now this is peak deep learning.

00:33:22 We are really done.

00:33:23 Nope, it goes further.

00:33:24 It just keeps going further each year.

00:33:26 So that means that we keep underestimating,

00:33:27 we keep not understanding it.

00:33:29 It has surprising properties all the time.

00:33:31 Do you think it’s getting harder and harder?

00:33:33 To make progress?

00:33:34 Need to make progress?

00:33:36 It depends on what you mean.

00:33:36 I think the field will continue to make very robust progress

00:33:39 for quite a while.

00:33:41 I think for individual researchers,

00:33:42 especially people who are doing research,

00:33:46 it can be harder because there is a very large number

00:33:48 of researchers right now.

00:33:50 I think that if you have a lot of compute,

00:33:51 then you can make a lot of very interesting discoveries,

00:33:54 but then you have to deal with the challenge

00:33:57 of managing a huge compute cluster

00:34:01 to run your experiments.

00:34:02 It’s a little bit harder.

00:34:03 So I’m asking all these questions

00:34:04 that nobody knows the answer to,

00:34:06 but you’re one of the smartest people I know,

00:34:08 so I’m gonna keep asking.

00:34:10 So let’s imagine all the breakthroughs

00:34:12 that happen in the next 30 years in deep learning.

00:34:15 Do you think most of those breakthroughs

00:34:17 can be done by one person with one computer?

00:34:22 Sort of in the space of breakthroughs,

00:34:23 do you think compute will be,

00:34:26 compute and large efforts will be necessary?

00:34:32 I mean, I can’t be sure.

00:34:34 When you say one computer, you mean how large?

00:34:36 You’re clever.

00:34:40 I mean, one GPU.

00:34:42 I see.

00:34:43 I think it’s pretty unlikely.

00:34:47 I think it’s pretty unlikely.

00:34:48 I think that there are many,

00:34:51 the stack of deep learning is starting to be quite deep.

00:34:54 If you look at it, you’ve got all the way from the ideas,

00:34:59 the systems to build the data sets,

00:35:02 the distributed programming,

00:35:04 the building the actual cluster,

00:35:06 the GPU programming, putting it all together.

00:35:09 So now the stack is getting really deep

00:35:10 and I think it becomes,

00:35:12 it can be quite hard for a single person

00:35:14 to become, to be world class

00:35:15 in every single layer of the stack.

00:34:17 What about what, like, Vladimir Vapnik

00:34:21 really insists on, which is taking MNIST

00:35:23 and trying to learn from very few examples.

00:35:26 So being able to learn more efficiently.

00:35:29 Do you think that’s, there’ll be breakthroughs in that space

00:35:32 that would, may not need the huge compute?

00:35:34 I think there will be a large number of breakthroughs

00:35:37 in general that will not need a huge amount of compute.

00:35:40 So maybe I should clarify that.

00:35:42 I think that some breakthroughs will require a lot of compute

00:35:45 and I think building systems which actually do things

00:35:48 will require a huge amount of compute.

00:35:50 That one is pretty obvious.

00:35:51 If you want to do X and X requires a huge neural net,

00:35:54 you gotta get a huge neural net.

00:35:56 But I think there will be lots of,

00:35:59 I think there is lots of room for very important work

00:36:02 being done by small groups and individuals.

00:36:05 Can you maybe sort of on the topic

00:36:07 of the science of deep learning,

00:36:10 talk about one of the recent papers

00:36:12 that you released, the Deep Double Descent,

00:36:15 where bigger models and more data hurt.

00:36:18 I think it’s a really interesting paper.

00:36:19 Can you describe the main idea?

00:36:22 Yeah, definitely.

00:36:23 So what happened is that some,

00:36:25 over the years, some small number of researchers noticed

00:36:28 that it is kind of weird that when you make

00:36:30 the neural network larger, it works better

00:36:32 and it seems to go in contradiction

00:36:33 with statistical ideas.

00:36:34 And then some people made an analysis showing

00:36:36 that actually you got this double descent bump.

00:36:38 And what we’ve done was to show that double descent occurs

00:36:42 for pretty much all practical deep learning systems.

00:36:46 And that it’ll be also, so can you step back?

00:36:51 What’s the X axis and the Y axis of a double descent plot?

00:36:55 Okay, great.

00:36:57 So you can look, you can do things like,

00:37:02 you can take your neural network

00:37:04 and you can start increasing its size slowly

00:37:07 while keeping your data set fixed.

00:37:10 So if you increase the size of the neural network slowly,

00:37:14 and if you don’t do early stopping,

00:37:16 that’s a pretty important detail,

00:37:20 then when the neural network is really small,

00:37:22 you make it larger,

00:37:23 you get a very rapid increase in performance.

00:37:26 Then you continue to make it larger.

00:37:27 And at some point performance will get worse.

00:37:30 And it’s at its worst exactly at the point

00:37:33 at which it achieves zero training error,

00:37:36 precisely zero training loss.

00:37:38 And then as you make it larger,

00:37:39 it starts to get better again.

00:37:41 And it’s kind of counterintuitive

00:37:42 because you’d expect deep learning phenomena

00:37:44 to be monotonic.

00:37:46 And it’s hard to be sure what it means,

00:37:50 but it also occurs in the case of linear classifiers.

00:37:53 And the intuition basically boils down to the following.

00:37:57 When you have a large data set and a small model,

00:38:03 then small, tiny random,

00:38:05 so basically what is overfitting?

00:38:07 Overfitting is when your model is somehow very sensitive

00:38:12 to the small random unimportant stuff in your data set.

00:38:16 In the training data.

00:38:17 In the training data set, precisely.

00:38:19 So if you have a small model and you have a big data set,

00:38:23 and there may be some random thing,

00:38:24 some training cases are randomly in the data set

00:38:27 and others may not be there,

00:38:29 but the small model is kind of insensitive

00:38:31 to this randomness because it’s the same,

00:38:34 there is pretty much no uncertainty about the model

00:38:37 when the data set is large.

00:38:38 So, okay.

00:38:39 So at the very basic level to me,

00:38:41 it is the most surprising thing

00:38:43 that neural networks don’t overfit every time very quickly

00:38:51 before ever being able to learn anything.

00:38:54 The huge number of parameters.

00:38:56 So here is, so there is one way, okay.

00:38:57 So maybe, so let me try to give the explanation

00:39:00 and maybe that will be, that will work.

00:39:02 So you’ve got a huge neural network.

00:39:03 Let’s suppose you’ve got, you have a huge neural network,

00:39:07 you have a huge number of parameters.

00:39:09 And now let’s pretend everything is linear,

00:39:11 which is not, let’s just pretend.

00:39:13 Then there is this big subspace

00:39:15 where your neural network achieves zero error.

00:39:18 And SGD is going to find approximately the point.

00:39:21 That’s right.

00:39:22 Approximately the point with the smallest norm

00:39:24 in that subspace.

00:39:26 Okay.

00:39:27 And that can also be proven to be insensitive

00:39:30 to the small randomness in the data

00:39:33 when the dimensionality is high.

00:39:35 But when the dimensionality of the data

00:39:37 is equal to the dimensionality of the model,

00:39:39 then there is a one to one correspondence

00:39:41 between all the data sets and the models.

00:39:44 So small changes in the data set

00:39:45 actually lead to large changes in the model.

00:39:47 And that’s why performance gets worse.

00:39:48 So this is the best explanation more or less.
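
A toy illustration of that linear-case intuition, using minimum-norm least squares on random features (an assumed setup for illustration, not the experiments from the paper). The pseudoinverse returns the minimum-norm solution, analogous to what SGD finds in the linearized picture, and the test error typically peaks near the interpolation threshold, where the number of features equals the number of training points, before improving again:

```python
import numpy as np

rng = np.random.default_rng(0)
n_train, n_test, d_input = 40, 500, 15
true_w = rng.normal(size=d_input)

def make_data(n):
    x = rng.normal(size=(n, d_input))
    y = x @ true_w + 0.1 * rng.normal(size=n)
    return x, y

X_train, y_train = make_data(n_train)
X_test, y_test = make_data(n_test)

def random_features(X, n_features, seed=1):
    proj = np.random.default_rng(seed).normal(size=(X.shape[1], n_features))
    return np.tanh(X @ proj)

# Sweep the model size while keeping the data set fixed, with no early stopping
# (the fit always goes to the least-squares / interpolating solution).
for n_features in [5, 20, 38, 40, 42, 80, 400]:
    Phi_tr = random_features(X_train, n_features)
    Phi_te = random_features(X_test, n_features)
    w = np.linalg.pinv(Phi_tr) @ y_train          # minimum-norm fit
    test_mse = np.mean((Phi_te @ w - y_test) ** 2)
    print(f"n_features={n_features:4d}  test_mse={test_mse:.3f}")
```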

00:39:52 So then it would be good for the model

00:39:54 to have more parameters, so to be bigger than the data.

00:39:58 That’s right.

00:39:59 But only if you don’t early stop.

00:40:00 If you introduce early stop in your regularization,

00:40:02 you can make the double descent bump

00:40:04 almost completely disappear.

00:40:06 What is early stop?

00:40:07 Early stopping is when you train your model

00:40:09 and you monitor your validation performance.

00:40:13 And then if at some point validation performance

00:40:15 starts to get worse, you say, okay, let’s stop training.

00:40:17 We are good enough.
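
A minimal sketch of early stopping as just described; train_one_epoch and validation_loss are hypothetical callables standing in for a real training loop:

```python
def train_with_early_stopping(model, train_one_epoch, validation_loss,
                              max_epochs=1000, patience=5):
    """Stop training once validation performance stops improving."""
    best_val = float("inf")
    epochs_without_improvement = 0
    for epoch in range(max_epochs):
        train_one_epoch(model)
        val = validation_loss(model)
        if val < best_val:
            best_val = val
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                # Validation performance has started to get worse: stop here.
                break
    return model
```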

00:40:20 So the magic happens after that moment.

00:40:23 So you don’t want to do the early stopping.

00:40:25 Well, if you don’t do the early stopping,

00:40:26 you get the very pronounced double descent.

00:40:29 Do you have any intuition why this happens?

00:40:31 Double descent?

00:40:32 Oh, sorry, early stopping?

00:40:33 No, the double descent.

00:40:34 So the…

00:40:35 Well, yeah, so I try…

00:40:36 Let’s see.

00:40:37 The intuition is basically is this,

00:40:39 that when the data set has as many degrees of freedom

00:40:44 as the model, then there is a one to one correspondence

00:40:47 between them.

00:40:48 And so small changes to the data set

00:40:50 lead to noticeable changes in the model.

00:40:53 So your model is very sensitive to all the randomness.

00:40:55 It is unable to discard it.

00:40:57 Whereas it turns out that when you have

00:41:01 a lot more data than parameters

00:41:03 or a lot more parameters than data,

00:41:05 the resulting solution will be insensitive

00:41:07 to small changes in the data set.

00:41:09 Oh, so it’s able to, let’s nicely put,

00:41:12 discard the small changes, the randomness.

00:41:14 The randomness, exactly.

00:41:15 The spurious correlation which you don’t want.

00:41:19 Jeff Hinton suggested we need to throw away back propagation.

00:41:22 We already kind of talked about this a little bit,

00:41:23 but he suggested that we need to throw away

00:41:25 back propagation and start over.

00:41:28 I mean, of course some of that is a little bit

00:41:32 wit and humor, but what do you think?

00:41:34 What could be an alternative method

00:41:36 of training neural networks?

00:41:37 Well, the thing that he said precisely is that

00:41:40 to the extent that you can’t find back propagation

00:41:42 in the brain, it’s worth seeing if we can learn something

00:41:45 from how the brain learns.

00:41:47 But back propagation is very useful

00:41:48 and we should keep using it.

00:41:50 Oh, you’re saying that once we discover

00:41:52 the mechanism of learning in the brain,

00:41:54 or any aspects of that mechanism,

00:41:56 we should also try to implement that in neural networks?

00:41:59 If it turns out that we can’t find

00:42:00 back propagation in the brain.

00:42:01 If we can’t find back propagation in the brain.

00:42:06 Well, so I guess your answer to that is

00:42:10 back propagation is pretty damn useful.

00:42:12 So why are we complaining?

00:42:14 I mean, I personally am a big fan of back propagation.

00:42:16 I think it’s a great algorithm because it solves

00:42:18 an extremely fundamental problem,

00:42:20 which is finding a neural circuit

00:42:24 subject to some constraints.

00:42:27 And I don’t see that problem going away.

00:42:28 So that’s why I really, I think it’s pretty unlikely

00:42:33 that we’ll have anything which is going to be

00:42:35 dramatically different.

00:42:37 It could happen, but I wouldn’t bet on it right now.

00:42:41 So let me ask a sort of big picture question.

00:42:45 Do you think neural networks can be made

00:42:49 to reason?

00:42:50 Why not?

00:42:52 Well, if you look, for example, at AlphaGo or AlphaZero,

00:42:57 the neural network of AlphaZero plays Go,

00:43:00 which we all agree is a game that requires reasoning,

00:43:04 better than 99.9% of all humans.

00:43:07 Just the neural network, without the search,

00:43:09 just the neural network itself.

00:43:11 Doesn’t that give us an existence proof

00:43:14 that neural networks can reason?

00:43:15 To push back and disagree a little bit,

00:43:18 we all agree that Go is reasoning.

00:43:20 I think I agree, I don’t think it’s a trivial,

00:43:24 so obviously reasoning like intelligence

00:43:27 is a loose gray area term a little bit.

00:43:31 Maybe you disagree with that.

00:43:32 But yes, I think it has some of the same elements

00:43:36 of reasoning.

00:43:37 Reasoning is almost like akin to search, right?

00:43:41 There’s a sequential element of reasoning

00:43:45 of stepwise consideration of possibilities

00:43:51 and sort of building on top of those possibilities

00:43:54 in a sequential manner until you arrive at some insight.

00:43:57 So yeah, I guess playing Go is kind of like that.

00:44:00 And when you have a single neural network

00:44:02 doing that without search, it’s kind of like that.

00:44:04 So there’s an existence proof

00:44:06 in a particular constrained environment

00:44:08 that a process akin to what many people call reasoning

00:44:13 exists, but more general kind of reasoning.

00:44:17 So off the board.

00:44:18 There is one other existence proof.

00:44:20 Oh boy, which one?

00:44:22 Us humans?

00:44:23 Yes.

00:44:23 Okay, all right, so do you think the architecture

00:44:29 that will allow neural networks to reason

00:44:33 will look similar to the neural network architectures

00:44:37 we have today?

00:44:38 I think it will.

00:44:39 I think, well, I don’t wanna make

00:44:41 two overly definitive statements.

00:44:44 I think it’s definitely possible

00:44:45 that the neural networks that will produce

00:44:48 the reasoning breakthroughs of the future

00:44:50 will be very similar to the architectures that exist today.

00:44:53 Maybe a little bit more recurrent,

00:44:55 maybe a little bit deeper.

00:44:57 But these neural nets are so insanely powerful.

00:45:02 Why wouldn’t they be able to learn to reason?

00:45:05 Humans can reason.

00:45:07 So why can’t neural networks?

00:45:09 So do you think the kind of stuff we’ve seen

00:45:11 neural networks do is a kind of just weak reasoning?

00:45:14 So it’s not a fundamentally different process.

00:45:16 Again, this is stuff nobody knows the answer to.

00:45:19 So when it comes to our neural networks,

00:45:23 the thing which I would say is that neural networks

00:45:25 are capable of reasoning.

00:45:28 But if you train a neural network on a task

00:45:30 which doesn’t require reasoning, it’s not going to reason.

00:45:34 This is a well known effect where the neural network

00:45:36 will solve the problem that you pose in front of it

00:45:41 in the easiest way possible.

00:45:44 Right, that takes us to one of the brilliant sort of ways

00:45:51 you’ve described neural networks,

00:45:52 which is you’ve referred to neural networks

00:45:55 as the search for small circuits

00:45:57 and maybe general intelligence

00:46:01 as the search for small programs,

00:46:04 which I found as a metaphor very compelling.

00:46:06 Can you elaborate on that difference?

00:46:09 Yeah, so the thing which I said precisely was that

00:46:13 if you can find the shortest program

00:46:17 that outputs the data at your disposal,

00:46:20 then you will be able to use it

00:46:22 to make the best prediction possible.

00:46:25 And that’s a theoretical statement

00:46:27 which can be proved mathematically.

00:46:29 Now, you can also prove mathematically

00:46:31 that finding the shortest program

00:46:33 which generates some data is not a computable operation.

00:46:38 No finite amount of compute can do this.
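
In symbols, this is roughly the Kolmogorov/Solomonoff statement: the length of the shortest program that outputs the data x on a universal machine U,

```latex
K_U(x) \;=\; \min_{p}\,\{\, |p| \;:\; U(p) = x \,\},
```

supports optimal prediction in a precise sense, but K_U is not a computable function, so no algorithm, given any finite amount of compute, can find the shortest program in general.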

00:46:42 So then with neural networks,

00:46:46 neural networks are the next best thing

00:46:47 that actually works in practice.

00:46:50 We are not able to find the best,

00:46:52 the shortest program which generates our data,

00:46:55 but we are able to find a small,

00:46:58 but now that statement should be amended,

00:47:01 even a large circuit which fits our data in some way.

00:47:05 Well, I think what you meant by the small circuit

00:47:07 is the smallest needed circuit.

00:47:10 Well, the thing which I would change now,

00:47:12 back then I hadn’t really fully internalized

00:47:14 the overparameterized results.

00:47:17 The things we know about overparameterized neural nets,

00:47:20 now I would phrase it as a large circuit

00:47:24 whose weights contain a small amount of information,

00:47:27 which I think is what’s going on.

00:47:29 If you imagine the training process of a neural network

00:47:31 as you slowly transmit entropy

00:47:33 from the dataset to the parameters,

00:47:37 then somehow the amount of information in the weights

00:47:41 ends up being not very large,

00:47:42 which would explain why they generalize so well.

00:47:45 So the large circuit might be one that’s helpful

00:47:49 for the generalization.

00:47:51 Yeah, something like this.
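
One loose way to get intuition for "a large circuit whose weights contain a small amount of information" is that trained weights can usually be stated far more coarsely than their raw floating-point form without hurting the fit. The sketch below, on synthetic data and assuming scikit-learn is available, is only an analogy, not the information-theoretic argument itself:

```python
# Loose illustration: coarsely quantizing trained weights barely changes accuracy,
# hinting that the weights carry far fewer bits than their raw storage suggests.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=2000, n_features=50, n_informative=10, random_state=0)
X_train, y_train, X_test, y_test = X[:1000], y[:1000], X[1000:], y[1000:]

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("full-precision test accuracy:", clf.score(X_test, y_test))

# Quantize every coefficient to roughly 4 bits (integer levels in [-7, 7]).
w = clf.coef_.copy()
scale = np.abs(w).max() / 7.0
clf.coef_ = np.round(w / scale) * scale
print("quantized test accuracy:", clf.score(X_test, y_test))
```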

00:47:54 But do you see it important to be able to try

00:48:00 to learn something like programs?

00:48:02 I mean, if we can, definitely.

00:48:04 I think it’s kind of, the answer is kind of yes,

00:48:08 if we can do it, we should do it.

00:48:11 It’s the reason we are pushing on deep learning,

00:48:15 the fundamental reason, the root cause

00:48:18 is that we are able to train them.

00:48:21 So in other words, training comes first.

00:48:23 We’ve got our pillar, which is the training pillar.

00:48:27 And now we’re trying to contort our neural networks

00:48:30 around the training pillar.

00:48:30 We gotta stay trainable.

00:48:31 This is an invariant we cannot violate.

00:48:36 And so being trainable means starting from scratch,

00:48:40 knowing nothing, you can actually pretty quickly

00:48:42 converge towards knowing a lot.

00:48:44 Or even slowly.

00:48:45 But it means that given the resources at your disposal,

00:48:50 you can train the neural net

00:48:52 and get it to achieve useful performance.

00:48:55 Yeah, that’s a pillar we can’t move away from.

00:48:57 That’s right.

00:48:58 Because if you say, hey, let’s find the shortest program,

00:49:01 well, we can’t do that.

00:49:02 So it doesn’t matter how useful that would be.

00:49:06 We can’t do it.

00:49:07 So we won’t.

00:49:08 So do you think, you kind of mentioned

00:49:09 that the neural networks are good at finding small circuits

00:49:12 or large circuits.

00:49:14 Do you think then the matter of finding small programs

00:49:17 is just the data?

00:49:19 No.

00:49:20 So the, sorry, not the size or the type of data.

00:49:25 Sort of, asking about giving it programs.

00:49:28 Well, I think the thing is that right now,

00:49:31 finding, there are no good precedents

00:49:34 of people successfully finding programs really well.

00:49:38 And so the way you’d find programs

00:49:40 is you’d train a deep neural network to do it basically.

00:49:44 Right.

00:49:45 Which is the right way to go about it.

00:49:48 But there’s not good illustrations of that.

00:49:50 It hasn’t been done yet.

00:49:51 But in principle, it should be possible.

00:49:56 Can you elaborate a little bit,

00:49:58 what’s your answer in principle?

00:50:00 Put another way, you don’t see why it’s not possible.

00:50:04 Well, it’s more a statement of,

00:50:09 I think that it’s unwise

00:50:12 to bet against deep learning.

00:50:13 And if it’s a cognitive function

00:50:16 that humans seem to be able to do,

00:50:18 then it doesn’t take too long

00:50:21 for some deep neural net to pop up that can do it too.

00:50:25 Yeah, I’m there with you.

00:50:27 I’ve stopped betting against neural networks at this point

00:50:33 because they continue to surprise us.

00:50:35 What about long term memory?

00:50:37 Can neural networks have long term memory?

00:50:39 Something like knowledge bases.

00:50:42 So being able to aggregate important information

00:50:45 over long periods of time that would then serve

00:50:49 as useful sort of representations of state

00:50:54 that you can make decisions by,

00:50:57 so have a long term context

00:50:59 based on which you’re making the decision.

00:51:01 So in some sense, the parameters already do that.

00:51:06 The parameters are an aggregation of the neural,

00:51:09 of the entirety of the neural net’s experience,

00:51:10 and so they count as long term knowledge.

00:51:15 And people have trained various neural nets

00:51:17 to act as knowledge bases and, you know,

00:51:20 investigated with, people have investigated

00:51:22 language models as knowledge bases.

00:51:23 So there is work there.

00:51:27 Yeah, but in some sense, do you think in every sense,

00:51:29 do you think it’s all just a matter of coming up

00:51:35 with a better mechanism of forgetting the useless stuff

00:51:38 and remembering the useful stuff?

00:51:40 Because right now, I mean, there haven’t been mechanisms

00:51:43 that do remember really long term information.

00:51:46 What do you mean by that precisely?

00:51:48 Precisely, I like the word precisely.

00:51:51 So I’m thinking of the kind of compression of information

00:51:58 the knowledge bases represent.

00:52:00 Sort of creating a, now I apologize for my sort of

00:52:05 human centric thinking about what knowledge is,

00:52:08 because neural networks aren’t interpretable necessarily

00:52:12 with the kind of knowledge they have discovered.

00:52:15 But a good example for me is knowledge bases,

00:52:18 being able to build up over time something like

00:52:21 the knowledge that Wikipedia represents.

00:52:24 It’s a really compressed, structured knowledge base.

00:52:30 Obviously not the actual Wikipedia or the language,

00:52:34 but like a semantic web, the dream that semantic web

00:52:37 represented, so it’s a really nice compressed knowledge base

00:52:40 or something akin to that in the noninterpretable sense

00:52:44 as neural networks would have.

00:52:46 Well, the neural networks would be noninterpretable

00:52:48 if you look at their weights, but their outputs

00:52:50 should be very interpretable.

00:52:52 Okay, so yeah, how do you make very smart neural networks

00:52:55 like language models interpretable?

00:52:58 Well, you ask them to generate some text

00:53:00 and the text will generally be interpretable.

00:53:02 Do you find that the epitome of interpretability,

00:53:04 like can you do better?

00:53:06 Like can you add, because you can’t, okay,

00:53:08 I’d like to know what does it know and what doesn’t it know?

00:53:12 I would like the neural network to come up with examples

00:53:15 where it’s completely dumb and examples

00:53:18 where it’s completely brilliant.

00:53:20 And the only way I know how to do that now

00:53:22 is to generate a lot of examples and use my human judgment.

00:53:26 But it would be nice if a neural network

00:53:28 had some self awareness about it.

00:53:31 Yeah, 100%, I’m a big believer in self awareness

00:53:34 and I think that, I think neural net self awareness

00:53:39 will allow for things like the capabilities,

00:53:42 like the ones you described, like for them to know

00:53:44 what they know and what they don’t know

00:53:47 and for them to know where to invest

00:53:48 to increase their skills most optimally.

00:53:50 And to your question of interpretability,

00:53:52 there are actually two answers to that question.

00:53:54 One answer is, you know, we have the neural net

00:53:56 so we can analyze the neurons and we can try to understand

00:53:59 what the different neurons and different layers mean.

00:54:02 And you can actually do that

00:54:03 and OpenAI has done some work on that.

00:54:05 But there is a different answer, which is that,

00:54:10 I would say that’s the human centric answer where you say,

00:54:13 you know, you look at a human being, you can’t read their mind,

00:54:16 how do you know what a human being is thinking?

00:54:18 You ask them, you say, hey, what do you think about this?

00:54:20 What do you think about that?

00:54:22 And you get some answers.

00:54:24 The answers you get are sticky in the sense

00:54:26 you already have a mental model.

00:54:28 You already have a mental model of that human being.

00:54:32 You already have an understanding of like a big conception

00:54:37 of that human being, how they think, what they know,

00:54:40 how they see the world and then everything you ask,

00:54:42 you’re adding onto that.

00:54:45 And that stickiness seems to be,

00:54:49 that’s one of the really interesting qualities

00:54:51 of the human being is that information is sticky.

00:54:55 You don’t, you seem to remember the useful stuff,

00:54:57 aggregate it well and forget most of the information

00:55:00 that’s not useful, that process.

00:55:02 But that’s also pretty similar to the process

00:55:05 that neural networks do.

00:55:06 It’s just that neural networks are much crappier

00:55:09 at this time.

00:55:10 It doesn’t seem to be fundamentally that different.

00:55:13 But just to stick on reasoning for a little longer,

00:55:17 you said, why not?

00:55:18 Why can’t they reason?

00:55:19 What’s a good impressive feat, benchmark to you

00:55:23 of reasoning that you’ll be impressed by

00:55:28 if neural networks were able to do?

00:55:30 Is that something you already have in mind?

00:55:32 Well, I think writing really good code,

00:55:36 I think proving really hard theorems,

00:55:39 solving open ended problems with out of the box solutions.

00:55:45 And sort of theorem type, mathematical problems.

00:55:49 Yeah, I think those ones are a very natural example

00:55:52 as well.

00:55:52 If you can prove an unproven theorem,

00:55:54 then it’s hard to argue you don’t reason.

00:55:57 And so by the way, and this comes back to the point

00:55:59 about the hard results, if you have machine learning,

00:56:04 deep learning as a field is very fortunate

00:56:06 because we have the ability to sometimes produce

00:56:08 these unambiguous results.

00:56:10 And when they happen, the debate changes,

00:56:13 the conversation changes.

00:56:14 We have the ability

00:56:16 to produce conversation changing results.

00:56:19 Conversation, and then of course, just like you said,

00:56:21 people kind of take that for granted

00:56:23 and say that wasn’t actually a hard problem.

00:56:25 Well, I mean, at some point we’ll probably run out

00:56:27 of hard problems.

00:56:29 Yeah, that whole mortality thing is kind of a sticky problem

00:56:33 that we haven’t quite figured out.

00:56:35 Maybe we’ll solve that one.

00:56:37 I think one of the fascinating things

00:56:39 in your entire body of work,

00:56:40 but also the work at OpenAI recently,

00:56:43 one of the conversation changes has been

00:56:44 in the world of language models.

00:56:47 Can you briefly kind of try to describe

00:56:50 the recent history of using neural networks

00:56:52 in the domain of language and text?

00:56:54 Well, there’s been lots of history.

00:56:56 I think the Elman network was a small,

00:57:00 tiny recurrent neural network applied to language

00:57:02 back in the 80s.

00:57:03 So the history is really, you know, fairly long at least.

00:57:08 And the thing that started,

00:57:10 the thing that changed the trajectory

00:57:13 of neural networks and language

00:57:14 is the thing that changed the trajectory

00:57:17 of all deep learning and that’s data and compute.

00:57:19 So suddenly you move from small language models,

00:57:22 which learn a little bit,

00:57:24 and with language models in particular,

00:57:26 there’s a very clear explanation

00:57:28 for why they need to be large to be good,

00:57:31 because they’re trying to predict the next word.

00:57:34 So when you don’t know anything,

00:57:36 you’ll notice very, very broad strokes,

00:57:40 surface level patterns,

00:57:41 like sometimes there are characters

00:57:44 and there is a space between those characters.

00:57:46 You’ll notice this pattern.

00:57:47 And you’ll notice that sometimes there is a comma

00:57:50 and then the next character is a capital letter.

00:57:51 You’ll notice that pattern.

00:57:53 Eventually you may start to notice

00:57:55 that certain words occur often.

00:57:57 You may notice that spellings are a thing.

00:57:59 You may notice syntax.

00:58:00 And when you get really good at all these,

00:58:03 you start to notice the semantics.

00:58:05 You start to notice the facts.

00:58:07 But for that to happen,

00:58:08 the language model needs to be larger.
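
Even the very bottom of the ladder he describes can be made concrete: a character bigram counter already picks up the surface patterns mentioned here, spaces between words and what tends to follow punctuation, while knowing nothing about meaning. A toy sketch on a made-up snippet of text:

```python
# A character-bigram counter picks up exactly the kind of surface-level patterns
# described above (spaces, what follows punctuation), and nothing deeper.
from collections import Counter, defaultdict

text = ("The cat sat on the mat. It was a sunny day. "
        "The dog, however, preferred the shade. It slept all afternoon.")

bigrams = defaultdict(Counter)
for a, b in zip(text, text[1:]):
    bigrams[a][b] += 1

def most_likely_next(ch):
    return bigrams[ch].most_common(1)[0][0] if bigrams[ch] else None

print("after '.', most likely next char:", repr(most_likely_next('.')))   # a space
print("after ' ', frequent next chars:", bigrams[' '].most_common(3))     # starts of words
print("after ',', most likely next char:", repr(most_likely_next(',')))   # a space
```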

00:58:11 So that’s, let’s linger on that,

00:58:14 because that’s where you and Noam Chomsky disagree.

00:58:18 So you think we’re actually taking incremental steps,

00:58:23 a sort of larger network, larger compute

00:58:25 will be able to get to the semantics,

00:58:29 to be able to understand language

00:58:32 without what Noam likes to sort of think of

00:58:35 as a fundamental understandings

00:58:38 of the structure of language,

00:58:40 like imposing your theory of language

00:58:43 onto the learning mechanism.

00:58:45 So you’re saying the learning,

00:58:48 you can learn from raw data,

00:58:50 the mechanism that underlies language.

00:58:53 Well, I think it’s pretty likely,

00:58:56 but I also want to say that I don’t really know precisely

00:59:01 what Chomsky means when he talks about it.

00:59:05 You said something about imposing your structural language.

00:59:08 I’m not 100% sure what he means,

00:59:10 but empirically it seems that

00:59:12 when you inspect those larger language models,

00:59:14 they exhibit signs of understanding the semantics

00:59:16 whereas the smaller language models do not.

00:59:18 We’ve seen that a few years ago

00:59:19 when we did work on the sentiment neuron.

00:59:21 We trained a small, you know,

00:59:24 smallish LSTM to predict the next character

00:59:27 in Amazon reviews.

00:59:28 And we noticed that when you increase the size of the LSTM

00:59:31 from 500 LSTM cells to 4,000 LSTM cells,

00:59:35 then one of the neurons starts to represent the sentiment

00:59:38 of the article, sorry, of the review.

00:59:42 Now, why is that?

00:59:42 Sentiment is a pretty semantic attribute.

00:59:45 It’s not a syntactic attribute.

00:59:46 And for people who might not know,

00:59:48 I don’t know if that’s a standard term,

00:59:49 but sentiment is whether it’s a positive

00:59:51 or a negative review.

00:59:52 That’s right.

00:59:52 Is the person happy with something

00:59:54 or is the person unhappy with something?

00:59:55 And so here we had very clear evidence

00:59:58 that a small neural net does not capture sentiment

01:00:01 while a large neural net does.

01:00:03 And why is that?

01:00:04 Well, our theory is that at some point

01:00:07 you run out of syntax to model,

01:00:08 you gotta start to focus on something else.

01:00:11 And with size, you quickly run out of syntax to model

01:00:15 and then you really start to focus on the semantics

01:00:18 would be the idea.

01:00:19 That’s right.

01:00:20 And so I don’t wanna imply that our models

01:00:22 have complete semantic understanding

01:00:23 because that’s not true,

01:00:25 but they definitely are showing signs

01:00:28 of semantic understanding,

01:00:29 partial semantic understanding,

01:00:30 but the smaller models do not show those signs.
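
The sentiment-neuron result came from probing individual units of a character-level model that was trained only to predict the next character. Below is a minimal sketch of the probing step itself; the hidden states are synthetic stand-ins for activations from a trained LSTM (one unit is artificially made to carry the signal) so the script runs on its own and only illustrates the procedure:

```python
# Sketch of the probing procedure behind the "sentiment neuron" result:
# given per-review hidden states from a character-level model, check whether any
# single unit predicts sentiment. The hidden states here are synthetic stand-ins
# (unit 1337 is made to correlate with the label), purely to show the procedure.
import numpy as np

rng = np.random.default_rng(0)
n_reviews, n_units = 500, 4096
labels = rng.integers(0, 2, n_reviews)          # 0 = negative, 1 = positive review

hidden = rng.normal(size=(n_reviews, n_units))  # stand-in for real LSTM final states
hidden[:, 1337] += 2.0 * (labels - 0.5)         # one unit carries the sentiment signal

def single_unit_accuracy(acts, y):
    """How well a simple threshold on one unit separates the two classes."""
    pred = (acts > np.median(acts)).astype(int)
    acc = (pred == y).mean()
    return max(acc, 1.0 - acc)                  # allow either sign of the correlation

scores = np.array([single_unit_accuracy(hidden[:, j], labels) for j in range(n_units)])
best = int(scores.argmax())
print(f"most sentiment-predictive unit: {best}, accuracy {scores[best]:.2f}")
```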

01:00:34 Can you take a step back and say,

01:00:36 what is GPT2, which is one of the big language models

01:00:40 that was the conversation changer

01:00:42 in the past couple of years?

01:00:43 Yeah, so GPT2 is a transformer

01:00:48 with one and a half billion parameters

01:00:50 that was trained on about 40 billion tokens of text

01:00:56 which were obtained from web pages

01:00:58 that were linked to from Reddit articles

01:01:01 with more than three upvotes.

01:01:02 And what’s a transformer?

01:01:03 The transformer, it’s the most important advance

01:01:06 in neural network architectures in recent history.

01:01:09 What is attention maybe too?

01:01:11 Cause I think that’s an interesting idea,

01:01:13 not necessarily sort of technically speaking,

01:01:15 but the idea of attention versus maybe

01:01:18 what recurrent neural networks represent.

01:01:21 Yeah, so the thing is the transformer

01:01:23 is a combination of multiple ideas simultaneously

01:01:25 of which attention is one.

01:01:28 Do you think attention is the key?

01:01:29 No, it’s a key, but it’s not the key.

01:01:32 The transformer is successful

01:01:34 because it is the simultaneous combination

01:01:36 of multiple ideas.

01:01:37 And if you were to remove either idea,

01:01:39 it would be much less successful.

01:01:41 So the transformer uses a lot of attention,

01:01:43 but attention existed for a few years.

01:01:45 So that can’t be the main innovation.

01:01:48 The transformer is designed in such a way

01:01:53 that it runs really fast on the GPU.

01:01:56 And that makes a huge amount of difference.

01:01:58 This is one thing.

01:01:59 The second thing is that transformer is not recurrent.

01:02:02 And that is really important too,

01:02:04 because it is more shallow

01:02:06 and therefore much easier to optimize.

01:02:08 So in other words, it uses attention,

01:02:10 it is a really great fit to the GPU

01:02:14 and it is not recurrent,

01:02:15 so therefore less deep and easier to optimize.

01:02:17 And the combination of those factors make it successful.

01:02:20 So now it makes great use of your GPU.

01:02:24 It allows you to achieve better results

01:02:26 for the same amount of compute.

01:02:28 And that’s why it’s successful.
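
For reference, the core operation being described, scaled dot-product attention, processes the whole sequence in one batched matrix product with no recurrence, which is why it maps so well onto a GPU. A minimal numpy sketch of a single causally masked head, leaving out the multi-head projections, residual connections, and feed-forward layers of a full transformer:

```python
# Minimal single-head, causally-masked scaled dot-product attention in numpy.
# The whole sequence is processed in one batched matrix product -- no recurrence.
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def causal_self_attention(X, Wq, Wk, Wv):
    """X: (seq_len, d_model). Returns (seq_len, d_model) of attention outputs."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # (seq_len, seq_len)
    mask = np.triu(np.ones_like(scores), k=1) == 1  # True above the diagonal
    scores = np.where(mask, -1e9, scores)           # each position sees only the past
    return softmax(scores) @ V

rng = np.random.default_rng(0)
seq_len, d_model = 8, 16
X = rng.normal(size=(seq_len, d_model))
Wq, Wk, Wv = (rng.normal(size=(d_model, d_model)) * 0.1 for _ in range(3))
out = causal_self_attention(X, Wq, Wk, Wv)
print(out.shape)  # (8, 16)
```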

01:02:31 Were you surprised how well transformers worked

01:02:34 and GPT2 worked?

01:02:36 So you worked on language.

01:02:37 You’ve had a lot of great ideas

01:02:39 before transformers came about in language.

01:02:42 So you got to see the whole set of revolutions

01:02:44 before and after.

01:02:46 Were you surprised?

01:02:47 Yeah, a little.

01:02:48 A little?

01:02:50 I mean, it’s hard to remember

01:02:51 because you adapt really quickly,

01:02:54 but it definitely was surprising.

01:02:55 It definitely was.

01:02:56 In fact, you know what?

01:02:59 I’ll retract my statement.

01:03:00 It was pretty amazing.

01:03:02 It was just amazing to see it generate this text.

01:03:06 And you know, you gotta keep in mind

01:03:07 that at that time we’d seen all this progress in GANs,

01:03:10 and the samples produced by GANs

01:03:13 were just amazing.

01:03:14 You have these realistic faces,

01:03:15 but text hasn’t really moved that much.

01:03:17 And suddenly we moved from, you know,

01:03:20 whatever GANs were in 2015

01:03:23 to the best, most amazing GANs in one step.

01:03:26 And that was really stunning.

01:03:27 Even though theory predicted,

01:03:29 yeah, you train a big language model,

01:03:30 of course you should get this,

01:03:31 but then to see it with your own eyes,

01:03:33 it’s something else.

01:03:34 And yet we adapt really quickly.

01:03:37 And now there are sort of some cognitive scientists

01:03:42 writing articles saying that GPT2 models

01:03:47 don’t truly understand language.

01:03:49 So we adapt quickly to how amazing

01:03:51 the fact that they’re able to model the language so well is.

01:03:55 So what do you think is the bar?

01:03:58 For what?

01:03:59 For impressing us that it…

01:04:02 I don’t know.

01:04:03 Do you think that bar will continuously be moved?

01:04:06 Definitely.

01:04:06 I think when you start to see

01:04:08 really dramatic economic impact,

01:04:11 that’s when I think that’s in some sense the next barrier.

01:04:13 Because right now, if you think about the work in AI,

01:04:16 it’s really confusing.

01:04:18 It’s really hard to know what to make of all these advances.

01:04:22 It’s kind of like, okay, you got an advance

01:04:25 and now you can do more things

01:04:26 and you’ve got another improvement

01:04:29 and you’ve got another cool demo.

01:04:30 At some point, I think people who are outside of AI,

01:04:36 they can no longer distinguish this progress anymore.

01:04:38 So we were talking offline

01:04:40 about translating Russian to English

01:04:41 and how there’s a lot of brilliant work in Russian

01:04:44 that the rest of the world doesn’t know about.

01:04:46 That’s true for Chinese,

01:04:47 it’s true for a lot of scientists

01:04:50 and just artistic work in general.

01:04:52 Do you think translation is the place

01:04:53 where we’re going to see sort of economic big impact?

01:04:57 I don’t know.

01:04:57 I think there is a huge number of…

01:05:00 I mean, first of all,

01:05:01 I wanna point out that translation already today is huge.

01:05:05 I think billions of people interact

01:05:07 with big chunks of the internet primarily through translation.

01:05:11 So translation is already huge

01:05:13 and it’s hugely positive too.

01:05:16 I think self driving is going to be hugely impactful

01:05:20 and that’s, it’s unknown exactly when it happens,

01:05:24 but again, I would not bet against deep learning, so I…

01:05:27 So there’s deep learning in general,

01:05:29 but you think this…

01:05:30 Deep learning for self driving.

01:05:31 Yes, deep learning for self driving.

01:05:33 But I was talking about sort of language models.

01:05:35 I see.

01:05:36 Just to check.

01:05:36 We veered off a little bit.

01:05:38 Just to check,

01:05:38 you’re not seeing a connection between driving and language.

01:05:41 No, no.

01:05:41 Okay.

01:05:42 Or rather both use neural nets.

01:05:44 That’d be a poetic connection.

01:05:45 I think there might be some,

01:05:47 like you said, there might be some kind of unification

01:05:49 towards a kind of multitask transformers

01:05:54 that can take on both language and vision tasks.

01:05:58 That’d be an interesting unification.

01:06:01 Now let’s see, what more can I ask about GPT2?

01:06:04 It’s simple.

01:06:05 There’s not much to ask.

01:06:06 It’s, you take a transformer, you make it bigger,

01:06:09 you give it more data,

01:06:10 and suddenly it does all those amazing things.

01:06:12 Yeah, one of the beautiful things is that GPT,

01:06:14 the transformers are fundamentally simple to explain,

01:06:17 to train.

01:06:20 Do you think bigger will continue

01:06:23 to show better results in language?

01:06:27 Probably.

01:06:28 Sort of like what are the next steps

01:06:29 with GPT2, do you think?

01:06:31 I mean, I think for sure seeing

01:06:34 what larger versions can do is one direction.

01:06:37 Also, I mean, there are many questions.

01:06:41 There’s one question which I’m curious about

01:06:42 and that’s the following.

01:06:43 So right now GPT two,

01:06:45 so we feed it all this data from the internet,

01:06:46 which means that it needs to memorize

01:06:48 all those random facts about everything in the internet.

01:06:51 And it would be nice if the model could somehow

01:06:56 use its own intelligence to decide

01:06:59 what data it wants to accept

01:07:01 and what data it wants to reject.

01:07:03 Just like people.

01:07:04 People don’t learn all data indiscriminately.

01:07:07 We are super selective about what we learn.

01:07:09 And I think this kind of active learning,

01:07:11 I think would be very nice to have.

01:07:14 Yeah, listen, I love active learning.

01:07:16 So let me ask, does the selection of data,

01:07:21 can you just elaborate that a little bit more?

01:07:23 Do you think the selection of data is,

01:07:28 like I have this kind of sense

01:07:29 that the optimization of how you select data,

01:07:33 so the active learning process is going to be a place

01:07:38 for a lot of breakthroughs, even in the near future?

01:07:42 Because there hasn’t been many breakthroughs there

01:07:44 that are public.

01:07:45 I feel like there might be private breakthroughs

01:07:47 that companies keep to themselves

01:07:49 because the fundamental problem has to be solved

01:07:51 if you want to solve self driving,

01:07:52 if you want to solve a particular task.

01:07:55 What do you think about the space in general?

01:07:57 Yeah, so I think that for something like active learning,

01:08:00 or in fact, for any kind of capability, like active learning,

01:08:03 the thing that it really needs is a problem.

01:08:05 It needs a problem that requires it.

01:08:09 It’s very hard to do research about the capability

01:08:12 if you don’t have a task,

01:08:12 because then what’s going to happen

01:08:14 is that you will come up with an artificial task,

01:08:16 get good results, but not really convince anyone.

01:08:20 Right, like we’re now past the stage

01:08:22 where getting a result on MNIST, some clever formulation

01:08:28 of MNIST will convince people.

01:08:30 That’s right, in fact, you could quite easily

01:08:33 come up with a simple active learning scheme on MNIST

01:08:35 and get a 10x speed up, but then, so what?

01:08:39 And I think that with active learning,

01:08:41 the need, active learning will naturally arise

01:08:45 as problems that require it pop up.

01:08:49 That’s how I would, that’s my take on it.
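
For context, the "simple active learning scheme" he alludes to usually means pool-based uncertainty sampling: repeatedly label the examples the current model is least sure about. A toy sketch on synthetic data, assuming scikit-learn is available, of exactly the kind of small demo he argues is easy to produce but not persuasive on its own:

```python
# Pool-based uncertainty sampling vs. random sampling on synthetic data.
# The kind of small-scale active learning demo discussed above.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=4000, n_features=20, n_informative=5, random_state=0)
X_pool, y_pool, X_test, y_test = X[:3000], y[:3000], X[3000:], y[3000:]

def run(strategy, seed_size=20, rounds=15, batch=10):
    rng = np.random.default_rng(0)
    labeled = list(rng.choice(len(X_pool), seed_size, replace=False))
    for _ in range(rounds):
        clf = LogisticRegression(max_iter=1000).fit(X_pool[labeled], y_pool[labeled])
        unlabeled = np.setdiff1d(np.arange(len(X_pool)), labeled)
        if strategy == "uncertainty":
            probs = clf.predict_proba(X_pool[unlabeled])[:, 1]
            pick = unlabeled[np.argsort(np.abs(probs - 0.5))[:batch]]  # closest to 0.5
        else:
            pick = rng.choice(unlabeled, batch, replace=False)
        labeled.extend(pick.tolist())
    clf = LogisticRegression(max_iter=1000).fit(X_pool[labeled], y_pool[labeled])
    return clf.score(X_test, y_test)

print("random sampling:     ", run("random"))
print("uncertainty sampling:", run("uncertainty"))
```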

01:08:51 There’s another interesting thing

01:08:54 that OpenAI has brought up with GPT2,

01:08:56 which is when you create a powerful

01:09:00 artificial intelligence system,

01:09:01 and it was unclear,

01:09:04 once you release GPT2,

01:09:07 what kind of detrimental effect it would have.

01:09:09 Because if you have a model

01:09:11 that can generate a pretty realistic text,

01:09:14 you can start to imagine that it would be used by bots

01:09:18 in some way that we can’t even imagine.

01:09:21 So there’s this nervousness about what is possible to do.

01:09:24 So you did a really kind of brave

01:09:27 and I think profound thing,

01:09:28 which is start a conversation about this.

01:09:30 How do we release powerful artificial intelligence models

01:09:34 to the public?

01:09:36 If we do it at all, how do we privately discuss

01:09:39 with other, even competitors,

01:09:42 about how we manage the use of the systems and so on?

01:09:46 So from this whole experience,

01:09:47 you released a report on it,

01:09:49 but in general, are there any insights

01:09:51 that you’ve gathered from just thinking about this,

01:09:55 about how you release models like this?

01:09:57 I mean, I think that my take on this

01:10:00 is that the field of AI has been in a state of childhood.

01:10:05 And now it’s exiting that state

01:10:06 and it’s entering a state of maturity.

01:10:09 What that means is that AI is very successful

01:10:12 and also very impactful.

01:10:14 And its impact is not only large, but it’s also growing.

01:10:16 And so for that reason, it seems wise to start thinking

01:10:21 about the impact of our systems before releasing them,

01:10:24 maybe a little bit too soon, rather than a little bit too late.

01:10:28 And with the case of GPT2, like I mentioned earlier,

01:10:31 the results really were stunning.

01:10:34 And it seemed plausible, it didn’t seem certain,

01:10:37 it seemed plausible that something like GPT2

01:10:40 could easily be used to reduce the cost of disinformation.

01:10:44 And so there was a question of what’s the best way

01:10:47 to release it, and a staged release seemed logical.

01:10:49 A small model was released,

01:10:51 and there was time to see.

01:10:54 Many people used these models in lots of cool ways.

01:10:57 There’ve been lots of really cool applications.

01:10:59 There haven’t been any negative applications that we know of.

01:11:03 And so eventually it was released,

01:11:05 but also other people replicated similar models.

01:11:07 That’s an interesting caveat though, that we know of.

01:11:10 So in your view, staged release,

01:11:12 is at least part of the answer to the question of how do we,

01:11:20 what do we do once we create a system like this?

01:11:22 It’s part of the answer, yes.

01:11:24 Is there any other insights?

01:11:26 Like say you don’t wanna release the model at all,

01:11:29 because it’s useful to you for whatever the business is.

01:11:32 Well, plenty of people don’t release models already.

01:11:36 Right, of course, but is there some moral,

01:11:39 ethical responsibility when you have a very powerful model

01:11:43 to sort of communicate?

01:11:44 Like, just as you said, when you had GPT2,

01:11:48 it was unclear how much it could be used for misinformation.

01:11:51 It’s an open question, and getting an answer to that

01:11:54 might require that you talk to other really smart people

01:11:57 that are outside of your particular group.

01:12:00 Have you, please tell me there’s some optimistic pathway

01:12:05 for people to be able to use this model

01:12:08 for people across the world to collaborate

01:12:11 on these kinds of cases?

01:12:14 Or is it still really difficult from one company

01:12:17 to talk to another company?

01:12:19 So it’s definitely possible.

01:12:21 It’s definitely possible to discuss these kind of models

01:12:26 with colleagues elsewhere,

01:12:28 and to get their take on what to do.

01:12:32 How hard is it though?

01:12:33 I mean.

01:12:36 Do you see that happening?

01:12:38 I think that’s a place where it’s important

01:12:40 to gradually build trust between companies.

01:12:43 Because ultimately, all the AI developers

01:12:47 are building technology which is going to be

01:12:48 increasingly more powerful.

01:12:50 And so it’s,

01:12:54 the way to think about it is that ultimately

01:12:56 we’re all in it together.

01:12:58 Yeah, I tend to believe in the better angels of our nature,

01:13:03 but I do hope that when you build a really powerful

01:13:09 AI system in a particular domain,

01:13:11 that you also think about the potential

01:13:14 negative consequences of, yeah.

01:13:21 It’s an interesting and scary possibility

01:13:23 that there will be a race for AI development

01:13:26 that would push people to close that development,

01:13:29 and not share ideas with others.

01:13:31 I don’t love this.

01:13:32 I’ve been a pure academic for 10 years.

01:13:34 I really like sharing ideas and it’s fun, it’s exciting.

01:13:39 What do you think it takes to,

01:13:40 let’s talk about AGI a little bit.

01:13:42 What do you think it takes to build a system

01:13:44 of human level intelligence?

01:13:45 We talked about reasoning,

01:13:47 we talked about long term memory, but in general,

01:13:50 what does it take, do you think?

01:13:51 Well, I can’t be sure.

01:13:55 But I think the deep learning,

01:13:57 plus maybe another,

01:13:58 plus maybe another small idea.

01:14:03 Do you think self play will be involved?

01:14:05 So you’ve spoken about the powerful mechanism of self play

01:14:09 where systems learn by sort of exploring the world

01:14:15 in a competitive setting against other entities

01:14:18 that are similarly skilled as them,

01:14:20 and so incrementally improve in this way.

01:14:23 Do you think self play will be a component

01:14:24 of building an AGI system?

01:14:26 Yeah, so what I would say, to build AGI,

01:14:29 I think it’s going to be deep learning plus some ideas.

01:14:34 And I think self play will be one of those ideas.

01:14:37 I think that that is a very,

01:14:41 self play has this amazing property

01:14:43 that it can surprise us in truly novel ways.

01:14:48 For example, like we, I mean,

01:14:53 pretty much every self play system,

01:14:55 both our Dota bot,

01:14:58 I don’t know if, OpenAI had a release about multi agent

01:15:02 where you had two little agents

01:15:04 who were playing hide and seek.

01:15:06 And of course, also AlphaZero.

01:15:08 They all produced surprising behaviors.

01:15:11 They all produce behaviors that we didn’t expect.

01:15:13 They are creative solutions to problems.

01:15:15 And that seems like an important part of AGI

01:15:18 that our systems don’t exhibit routinely right now.

01:15:22 And so that’s why I like this area.

01:15:24 I like this direction because of its ability to surprise us.

01:15:27 To surprise us.

01:15:28 And an AGI system would surprise us fundamentally.

01:15:31 Yes.

01:15:32 And to be precise, not just a random surprise,

01:15:34 but to find the surprising solution to a problem

01:15:37 that’s also useful.

01:15:39 Right.

01:15:39 Now, a lot of the self play mechanisms

01:15:42 have been used in the game context

01:15:45 or at least in the simulation context.

01:15:48 How far along the path to AGI

01:15:55 do you think will be done in simulation?

01:15:56 How much faith, promise do you have in simulation

01:16:01 versus having to have a system

01:16:03 that operates in the real world?

01:16:05 Whether it’s the real world of digital real world data

01:16:09 or real world like actual physical world of robotics.

01:16:13 I don’t think it’s an either or.

01:16:15 I think simulation is a tool and it helps.

01:16:17 It has certain strengths and certain weaknesses

01:16:19 and we should use it.

01:16:21 Yeah, but okay, I understand that.

01:16:24 That’s true, but one of the criticisms of self play,

01:16:32 one of the criticisms of reinforcement learning

01:16:34 is one of the, its current power, its current results,

01:16:41 while amazing, have been demonstrated

01:16:42 in simulated environments

01:16:44 or very constrained physical environments.

01:16:46 Do you think it’s possible to escape them,

01:16:49 escape the simulator environments

01:16:50 and be able to learn in non simulated environments?

01:16:53 Or do you think it’s possible to also just simulate

01:16:57 in a photo realistic and physics realistic way,

01:17:01 the real world in a way that we can solve real problems

01:17:03 with self play in simulation?

01:17:06 So I think that transfer from simulation to the real world

01:17:10 is definitely possible and has been exhibited many times

01:17:14 by many different groups.

01:17:16 It’s been especially successful in vision.

01:17:18 Also, OpenAI in the summer demonstrated a robot hand

01:17:22 which was trained entirely in simulation

01:17:25 in a certain way that allowed for sim-to-real transfer

01:17:27 to occur.

01:17:29 Is this for the Rubik’s cube?

01:17:31 Yeah, that’s right.

01:17:32 I wasn’t aware that was trained in simulation.

01:17:34 It was trained in simulation entirely.

01:17:37 Really, so it wasn’t in the physical,

01:17:39 the hand wasn’t trained?

01:17:40 No, 100% of the training was done in simulation

01:17:44 and the policy that was learned in simulation

01:17:46 was trained to be very adaptive.

01:17:48 So adaptive that when you transfer it,

01:17:50 it could very quickly adapt to the physical world.

01:17:53 So the kind of perturbations with the giraffe

01:17:57 or whatever the heck it was,

01:17:58 those weren’t, were those part of the simulation?

01:18:01 Well, the simulation was generally,

01:18:04 so the simulation was trained to be robust

01:18:07 to many different things,

01:18:08 but not the kind of perturbations we’ve had in the video.

01:18:10 So it’s never been trained with a glove.

01:18:12 It’s never been trained with a stuffed giraffe.

01:18:17 So in theory, these are novel perturbations.

01:18:19 Correct, it’s not in theory, in practice.

01:18:22 Those are novel perturbations?

01:18:23 Well, that’s okay.

01:18:26 That’s a clean, small scale,

01:18:28 but clean example of a transfer

01:18:29 from the simulated world to the physical world.

01:18:32 Yeah, and I will also say

01:18:33 that I expect the transfer capabilities

01:18:35 of deep learning to increase in general.

01:18:38 And the better the transfer capabilities are,

01:18:40 the more useful simulation will become.

01:18:43 Because then you could take,

01:18:45 you could experience something in simulation

01:18:48 and then learn a moral of the story,

01:18:50 which you could then carry with you to the real world.

01:18:53 As humans do all the time when they play computer games.
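
The recipe described for the robot hand, trained entirely in simulation yet adaptive enough to transfer, is usually presented as domain randomization: the simulator's physics parameters are resampled every episode, so a single policy has to cope with a whole family of simulators rather than one. The sketch below is a schematic toy version (a made-up one-dimensional pushing task and a trivial random-search trainer), not OpenAI's actual training setup:

```python
# Schematic domain randomization: physics parameters are resampled each episode,
# so a single policy has to work across the whole randomized family of simulators.
# Toy 1D task and a trivial random-search trainer -- illustrative only.
import numpy as np

rng = np.random.default_rng(0)

def episode(policy, mass, friction, steps=50):
    """Push a block from 0 toward position 1; reward is negative final distance."""
    pos, vel = 0.0, 0.0
    for _ in range(steps):
        force = float(np.clip(policy @ np.array([pos, vel, 1.0]), -1.0, 1.0))
        vel += (force - friction * vel) / mass * 0.1
        pos += vel * 0.1
    return -abs(pos - 1.0)

def randomized_return(policy, n_envs=20):
    # Resample the "simulator" every episode: mass and friction are randomized.
    total = 0.0
    for _ in range(n_envs):
        mass = rng.uniform(0.5, 2.0)
        friction = rng.uniform(0.1, 1.0)
        total += episode(policy, mass, friction)
    return total / n_envs

policy = np.zeros(3)
best = randomized_return(policy)
for _ in range(300):                       # simple random-search "training"
    candidate = policy + rng.normal(0, 0.1, size=3)
    score = randomized_return(candidate)
    if score > best:
        policy, best = candidate, score

print("average return across randomized sims:", round(best, 3))
print("return in an unseen, heavier sim:", round(episode(policy, mass=3.0, friction=0.3), 3))
```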

01:18:56 So let me ask sort of a embodied question,

01:19:01 staying on AGI for a sec.

01:19:04 Do you think AGI system would need to have a body?

01:19:07 We need to have some of those human elements

01:19:09 of self awareness, consciousness,

01:19:13 sort of fear of mortality,

01:19:15 sort of self preservation in the physical space,

01:19:18 which comes with having a body.

01:19:20 I think having a body will be useful.

01:19:22 I don’t think it’s necessary,

01:19:24 but I think it’s very useful to have a body for sure,

01:19:26 because you can learn a whole new,

01:19:28 you can learn things which cannot be learned without a body.

01:19:32 But at the same time, I think that if you don’t have a body,

01:19:35 you could compensate for it and still succeed.

01:19:38 You think so?

01:19:39 Yes.

01:19:40 Well, there is evidence for this.

01:19:41 For example, there are many people who were born deaf

01:19:43 and blind and they were able to compensate

01:19:46 for the lack of modalities.

01:19:48 I’m thinking about Helen Keller specifically.

01:19:51 So even if you’re not able to physically interact

01:19:53 with the world, and if you’re not able to,

01:19:56 I mean, I actually was getting at,

01:19:59 maybe let me ask on the more particular,

01:20:02 I’m not sure if it’s connected to having a body or not,

01:20:05 but the idea of consciousness

01:20:07 and a more constrained version of that is self awareness.

01:20:11 Do you think an AGI system should have consciousness?

01:20:16 We can’t define it, whatever the heck you think consciousness is.

01:20:19 Yeah, hard question to answer,

01:20:21 given how hard it is to define it.

01:20:24 Do you think it’s useful to think about?

01:20:26 I mean, it’s definitely interesting.

01:20:28 It’s fascinating.

01:20:29 I think it’s definitely possible

01:20:31 that our systems will be conscious.

01:20:33 Do you think that’s an emergent thing that just comes from,

01:20:36 do you think consciousness could emerge

01:20:37 from the representation that’s stored within neural networks?

01:20:40 So like that it naturally just emerges

01:20:42 when you become more and more,

01:20:45 you’re able to represent more and more of the world?

01:20:47 Well, I’d say I’d make the following argument,

01:20:48 which is humans are conscious.

01:20:53 And if you believe that artificial neural nets

01:20:56 are sufficiently similar to the brain,

01:20:59 then there should at least exist artificial neural nets

01:21:02 that should be conscious too.

01:21:04 You’re leaning on that existence proof pretty heavily.

01:21:06 Okay, so that’s the best answer I can give.

01:21:12 No, I know, I know, I know.

01:21:15 There’s still an open question

01:21:17 if there’s not some magic in the brain that we’re not,

01:21:20 I mean, I don’t mean a non materialistic magic,

01:21:23 but that the brain might be a lot more complicated

01:21:27 and interesting than we give it credit for.

01:21:29 If that’s the case, then it should show up.

01:21:32 And at some point we will find out

01:21:35 that we can’t continue to make progress.

01:21:36 But I think it’s unlikely.

01:21:38 So we talk about consciousness,

01:21:40 but let me talk about another poorly defined concept

01:21:42 of intelligence.

01:21:44 Again, we’ve talked about reasoning,

01:21:46 we’ve talked about memory.

01:21:48 What do you think is a good test of intelligence for you?

01:21:51 Are you impressed by the test that Alan Turing formulated

01:21:55 with the imitation game with natural language?

01:21:58 Is there something in your mind

01:22:01 that you will be deeply impressed by

01:22:04 if a system was able to do?

01:22:06 I mean, lots of things.

01:22:07 There’s a certain frontier of capabilities today.

01:22:13 And there exist things outside of that frontier.

01:22:16 And I would be impressed by any such thing.

01:22:18 For example, I would be impressed by a deep learning system

01:22:24 which solves a very pedestrian task,

01:22:27 like machine translation or computer vision task

01:22:29 or something which never makes a mistake

01:22:33 a human wouldn’t make under any circumstances.

01:22:37 I think that is something

01:22:38 which has not yet been demonstrated

01:22:40 and I would find it very impressive.

01:22:42 Yeah, so right now they make mistakes in different,

01:22:44 they might be more accurate than human beings,

01:22:46 but they still, they make a different set of mistakes.

01:22:49 So my, I would guess that a lot of the skepticism

01:22:53 that some people have about deep learning

01:22:55 is when they look at their mistakes and they say,

01:22:57 well, those mistakes, they make no sense.

01:23:00 Like if you understood the concept,

01:23:01 you wouldn’t make that mistake.

01:23:04 And I think that changing that would be,

01:23:07 that would inspire me.

01:23:09 That would be, yes, this is progress.

01:23:12 Yeah, that’s a really nice way to put it.

01:23:15 But I also just don’t like that human instinct

01:23:18 to criticize a model as not intelligent.

01:23:21 That’s the same instinct as we do

01:23:23 when we criticize any group of creatures as the other.

01:23:28 Because it’s very possible that GPT2

01:23:33 is much smarter than human beings at many things.

01:23:36 That’s definitely true.

01:23:37 It has a lot more breadth of knowledge.

01:23:39 Yes, breadth of knowledge

01:23:41 and even perhaps depth on certain topics.

01:23:46 It’s kind of hard to judge what depth means,

01:23:48 but there’s definitely a sense in which

01:23:51 humans don’t make mistakes that these models do.

01:23:54 The same is applied to autonomous vehicles.

01:23:57 The same is probably gonna continue being applied

01:23:59 to a lot of artificial intelligence systems.

01:24:01 We find, this is the annoying thing.

01:24:04 This is the process of, in the 21st century,

01:24:06 the process of analyzing the progress of AI

01:24:09 is the search for one case where the system fails

01:24:13 in a big way where humans would not.

01:24:17 And then many people writing articles about it.

01:24:20 And then broadly, the public generally gets convinced

01:24:24 that the system is not intelligent.

01:24:26 And we pacify ourselves by thinking it’s not intelligent

01:24:29 because of this one anecdotal case.

01:24:31 And this seems to continue happening.

01:24:34 Yeah, I mean, there is truth to that.

01:24:36 Although I’m sure that plenty of people

01:24:38 are also extremely impressed

01:24:39 by the system that exists today.

01:24:40 But I think this connects to the earlier point

01:24:42 we discussed that it’s just confusing

01:24:45 to judge progress in AI.

01:24:47 Yeah.

01:24:47 And you have a new robot demonstrating something.

01:24:50 How impressed should you be?

01:24:52 And I think that people will start to be impressed

01:24:55 once AI starts to really move the needle on the GDP.

01:25:00 So you’re one of the people that might be able

01:25:02 to create an AGI system here.

01:25:03 Not you, but you and OpenAI.

01:25:06 If you do create an AGI system

01:25:09 and you get to spend sort of the evening

01:25:11 with it, him, her, what would you talk about, do you think?

01:25:17 The very first time?

01:25:19 First time.

01:25:19 Well, the first time I would just ask all kinds of questions

01:25:23 and try to get it to make a mistake.

01:25:25 And I would be amazed that it doesn’t make mistakes

01:25:28 and just keep asking broad questions.

01:25:33 What kind of questions do you think?

01:25:34 Would they be factual or would they be personal,

01:25:39 emotional, psychological?

01:25:40 What do you think?

01:25:42 All of the above.

01:25:46 Would you ask for advice?

01:25:47 Definitely.

01:25:49 I mean, why would I limit myself

01:25:51 talking to a system like this?

01:25:53 Now, again, let me emphasize the fact

01:25:56 that you truly are one of the people

01:25:57 that might be in the room where this happens.

01:26:01 So let me ask sort of a profound question about,

01:26:06 I’ve just talked to a Stalin historian.

01:26:08 I’ve been talking to a lot of people who are studying power.

01:26:13 Abraham Lincoln said,

01:26:14 “Nearly all men can stand adversity,

01:26:17 but if you want to test a man’s character, give him power.”

01:26:21 I would say the power of the 21st century,

01:26:24 maybe the 22nd, but hopefully the 21st,

01:26:28 would be the creation of an AGI system

01:26:30 and the people who have control,

01:26:33 direct possession and control of the AGI system.

01:26:36 So what do you think, after spending that evening

01:26:39 having a discussion with the AGI system,

01:26:42 what do you think you would do?

01:26:45 Well, the ideal world I’d like to imagine

01:26:50 is one where humanity,

01:26:52 is like the board members of a company

01:26:57 where the AGI is the CEO.

01:26:59 So it would be, I would like,

01:27:04 the picture which I would imagine

01:27:05 is you have some kind of different entities,

01:27:09 different countries or cities,

01:27:11 and the people that live there vote

01:27:13 for what the AGI that represents them should do,

01:27:16 and the AGI that represents them goes and does it.

01:27:18 I think a picture like that, I find very appealing.

01:27:23 You could have multiple AGIs,

01:27:24 an AGI for a city, for a country,

01:27:30 and it would be trying to, in effect,

01:27:33 take the democratic process to the next level.

01:27:36 And the board can always fire the CEO.

01:27:38 Essentially, press the reset button, say.

01:27:40 Press the reset button.

01:27:41 Rerandomize the parameters.

01:27:42 But let me sort of, that’s actually,

01:27:45 okay, that’s a beautiful vision, I think,

01:27:49 as long as it’s possible to press the reset button.

01:27:53 Do you think it will always be possible

01:27:54 to press the reset button?

01:27:56 So I think that it definitely will be possible to build.

01:28:02 So you’re talking, so the question

01:28:03 that I really understand from you is,

01:28:06 will humans, will people, have control

01:28:12 over the AI systems that they build?

01:28:14 Yes.

01:28:15 And my answer is, it’s definitely possible

01:28:17 to build AI systems which will want

01:28:19 to be controlled by their humans.

01:28:21 Wow, that’s part of their,

01:28:24 so it’s not just that they can’t help but be controlled,

01:28:26 but rather,

01:28:31 one of the objectives of their existence

01:28:33 is to be controlled.

01:28:34 In the same way that human parents

01:28:39 generally want to help their children,

01:28:42 they want their children to succeed.

01:28:44 It’s not a burden for them.

01:28:46 They are excited to help children and to feed them

01:28:49 and to dress them and to take care of them.

01:28:52 And I believe with high conviction

01:28:56 that the same will be possible for an AGI.

01:28:58 It will be possible to program an AGI,

01:29:00 to design it in such a way

01:29:01 that it will have a similar deep drive

01:29:04 that it will be delighted to fulfill.

01:29:07 And the drive will be to help humans flourish.

01:29:11 But let me take a step back to that moment

01:29:13 where you create the AGI system.

01:29:15 I think this is a really crucial moment.

01:29:17 And between that moment

01:29:21 and the Democratic board members with the AGI at the head,

01:29:28 there has to be a relinquishing of power.

01:29:31 So as George Washington, despite all the bad things he did,

01:29:36 one of the big things he did is he relinquished power.

01:29:39 He, first of all, didn’t want to be president.

01:29:42 And even when he became president,

01:29:43 he gave it up, he didn’t just keep serving

01:29:45 indefinitely as most dictators do.

01:29:49 Do you see yourself being able to relinquish control

01:29:55 over an AGI system,

01:29:56 given how much power you can have over the world,

01:29:59 at first financial, just make a lot of money, right?

01:30:02 And then control by having possession of the AGI system.

01:30:07 I’d find it trivial to do that.

01:30:09 I’d find it trivial to relinquish this kind of power.

01:30:11 I mean, the kind of scenario you are describing

01:30:15 sounds terrifying to me.

01:30:17 That’s all.

01:30:19 I would absolutely not want to be in that position.

01:30:22 Do you think you represent the majority

01:30:25 or the minority of people in the AI community?

01:30:29 Well, I mean.

01:30:30 It’s an open question, an important one.

01:30:33 Are most people good is another way to ask it.

01:30:36 So I don’t know if most people are good,

01:30:39 but I think that when it really counts,

01:30:44 people can be better than we think.

01:30:47 That’s beautifully put, yeah.

01:30:49 Are there specific mechanism you can think of

01:30:51 of aligning AI values to human values?

01:30:54 Is that, do you think about these problems

01:30:56 of continued alignment as we develop the AI systems?

01:31:00 Yeah, definitely.

01:31:02 In some sense, the kind of question which you are asking is,

01:31:07 so if I were to translate the question to today’s terms,

01:31:10 it would be a question about how to get an RL agent

01:31:17 that’s optimizing a value function which itself is learned.

01:31:21 And if you look at humans, humans are like that

01:31:23 because the reward function, the value function of humans

01:31:26 is not external, it is internal.

01:31:28 That’s right.

01:31:30 And there are definite ideas

01:31:33 of how to train a value function.

01:31:36 Basically an objective, you know,

01:31:39 and as objective as possible perception system

01:31:42 that will be trained separately to recognize,

01:31:47 to internalize human judgments on different situations.

01:31:51 And then that component would then be integrated

01:31:54 as the base value function

01:31:56 for some more capable RL system.

01:31:59 You could imagine a process like this.

01:32:00 I’m not saying this is the process,

01:32:02 I’m saying this is an example

01:32:03 of the kind of thing you could do.
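
One concrete instance of the "perception system trained to internalize human judgments, then used as the value function" idea is a reward model fit to pairwise preferences. The sketch below shows only that first component, with synthetic comparisons standing in for real human judgments; in the fuller picture described here, the learned reward would then be handed to a more capable RL system as its objective:

```python
# Sketch of a reward model learned from pairwise preferences (Bradley-Terry style),
# the kind of learned value function described above. Synthetic "human" judgments
# stand in for real ones; a downstream RL agent would optimize the learned reward.
import numpy as np

rng = np.random.default_rng(0)
dim = 8
true_pref = rng.normal(size=dim)            # hidden "human values" generating judgments

# Collect pairwise comparisons: the human prefers whichever outcome scores higher.
def sample_comparison():
    a, b = rng.normal(size=dim), rng.normal(size=dim)
    return (a, b) if true_pref @ a > true_pref @ b else (b, a)   # (preferred, rejected)

comparisons = [sample_comparison() for _ in range(2000)]

# Fit a linear reward model r(x) = w @ x by maximizing the Bradley-Terry likelihood:
# P(a preferred over b) = sigmoid(r(a) - r(b)).
w = np.zeros(dim)
lr = 0.05
for _ in range(200):
    grad = np.zeros(dim)
    for a, b in comparisons:
        p = 1.0 / (1.0 + np.exp(-(w @ a - w @ b)))
        grad += (1.0 - p) * (a - b)          # gradient of the log-likelihood
    w += lr * grad / len(comparisons)

agreement = np.mean([(w @ a > w @ b) for a, b in comparisons])
print(f"learned reward agrees with the human comparisons {agreement:.0%} of the time")
```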

01:32:05 So on that topic of the objective functions

01:32:11 of human existence,

01:32:12 what do you think is the objective function

01:32:15 that’s implicit in human existence?

01:32:17 What’s the meaning of life?

01:32:18 Oh.

01:32:28 I think the question is wrong in some way.

01:32:31 I think that the question implies

01:32:33 that there is an objective answer

01:32:35 which is an external answer,

01:32:36 you know, your meaning of life is X.

01:32:38 I think what’s going on is that we exist

01:32:40 and that’s amazing.

01:32:44 And we should try to make the most of it

01:32:45 and try to maximize our own value

01:32:48 and enjoyment of a very short time while we do exist.

01:32:53 It’s funny,

01:32:54 because action does require an objective function,

01:32:56 which is definitely there in some form,

01:32:58 but it’s difficult to make it explicit

01:33:01 and maybe impossible to make it explicit,

01:33:02 I guess is what you’re getting at.

01:33:03 And that’s an interesting fact of an RL environment.

01:33:08 Well, but I was making a slightly different point

01:33:10 is that humans want things

01:33:13 and their wants create the drives that cause them to,

01:33:16 you know, our wants are our objective functions,

01:33:19 our individual objective functions.

01:33:21 We can later decide that we want to change,

01:33:24 that what we wanted before is no longer good

01:33:26 and we want something else.

01:33:27 Yeah, but they’re so dynamic,

01:33:29 there’s gotta be some underlying sort of Freudian thing,

01:33:32 there’s like sexual stuff,

01:33:33 there’s people who think it’s the fear of death

01:33:37 and there’s also the desire for knowledge

01:33:40 and you know, all these kinds of things,

01:33:42 procreation, sort of all the evolutionary arguments,

01:33:46 it seems to be,

01:33:47 there might be some kind of fundamental objective function

01:33:49 from which everything else emerges,

01:33:54 but it seems like it’s very difficult to make it explicit.

01:33:56 I think there probably is an evolutionary objective function

01:33:58 which is to survive and procreate

01:34:00 and make sure you make your children succeed.

01:34:02 That would be my guess,

01:34:04 but it doesn’t give an answer to the question

01:34:06 of what’s the meaning of life.

01:34:08 I think you can see how humans are part of this big process,

01:34:13 this ancient process.

01:34:14 We exist on a small planet and that’s it.

01:34:20 So given that we exist, try to make the most of it

01:34:24 and try to enjoy more and suffer less as much as we can.

01:34:28 Let me ask two silly questions about life.

01:34:32 One, do you have regrets?

01:34:34 Moments that if you went back, you would do differently.

01:34:39 And two, are there moments that you’re especially proud of

01:34:42 that made you truly happy?

01:34:44 So I can answer that, I can answer both questions.

01:34:47 Of course, there’s a huge number of choices

01:34:51 and decisions that I’ve made

01:34:52 that with the benefit of hindsight,

01:34:54 I wouldn’t have made them.

01:34:55 And I do experience some regret,

01:34:56 but I try to take solace in the knowledge

01:35:00 that at the time I did the best I could.

01:35:02 And in terms of things that I’m proud of,

01:35:04 I’m very fortunate to have done things I’m proud of

01:35:08 and they made me happy for some time,

01:35:10 but I don’t think that that is the source of happiness.

01:35:14 So your academic accomplishments, all the papers,

01:35:17 you’re one of the most cited people in the world.

01:35:19 All of the breakthroughs I mentioned

01:35:21 in computer vision and language and so on,

01:35:23 what is the source of happiness and pride for you?

01:35:29 I mean, all those things are a source of pride for sure.

01:35:31 I’m very grateful for having done all those things

01:35:35 and it was very fun to do them.

01:35:37 But happiness comes, but you know, happiness,

01:35:40 well, my current view is that happiness comes

01:35:42 from our, to a very large degree,

01:35:45 from the way we look at things.

01:35:47 You know, you can have a simple meal

01:35:49 and be quite happy as a result,

01:35:51 or you can talk to someone and be happy as a result as well.

01:35:54 Or conversely, you can have a meal and be disappointed

01:35:58 that the meal wasn’t a better meal.

01:36:00 So I think a lot of happiness comes from that,

01:36:02 but I’m not sure, I don’t want to be too confident.

01:36:05 Being humble in the face of the uncertainty

01:36:07 seems to be also a part of this whole happiness thing.

01:36:12 Well, I don’t think there’s a better way to end it

01:36:14 than meaning of life and discussions of happiness.

01:36:17 So Ilya, thank you so much.

01:36:19 You’ve given me a few incredible ideas.

01:36:22 You’ve given the world many incredible ideas.

01:36:24 I really appreciate it and thanks for talking today.

01:36:27 Yeah, thanks for stopping by, I really enjoyed it.

01:36:30 Thanks for listening to this conversation

01:36:32 with Ilya Sutskever and thank you

01:36:33 to our presenting sponsor, Cash App.

01:36:36 Please consider supporting the podcast

01:36:38 by downloading Cash App and using the code LEXPodcast.

01:36:42 If you enjoy this podcast, subscribe on YouTube,

01:36:45 review it with five stars on Apple Podcast,

01:36:47 support on Patreon, or simply connect with me on Twitter

01:36:51 at Lex Fridman.

01:36:54 And now let me leave you with some words

01:36:56 from Alan Turing on machine learning.

01:37:00 Instead of trying to produce a program

01:37:01 to simulate the adult mind,

01:37:03 why not rather try to produce one

01:37:06 which simulates the child?

01:37:08 If this were then subjected

01:37:10 to an appropriate course of education,

01:37:12 one would obtain the adult brain.

01:37:15 Thank you for listening and hope to see you next time.