Transcript
00:00:00 at which point is the neural network a being versus a tool?
00:00:08 The following is a conversation with Oriol Vinyals,
00:00:11 his second time on the podcast.
00:00:13 Oriel is the research director
00:00:15 and deep learning lead at DeepMind
00:00:18 and one of the most brilliant thinkers and researchers
00:00:20 in the history of artificial intelligence.
00:00:24 This is the Lex Fridman podcast.
00:00:26 To support it, please check out our sponsors
00:00:28 in the description.
00:00:30 And now, dear friends, here’s Oriol Vinyals.
00:00:34 You are one of the most brilliant researchers
00:00:37 in the history of AI,
00:00:38 working across all kinds of modalities.
00:00:40 Probably the one common theme is
00:00:42 it’s always sequences of data.
00:00:45 So we’re talking about languages, images,
00:00:46 even biology and games, as we talked about last time.
00:00:50 So you’re a good person to ask this.
00:00:53 In your lifetime, will we be able to build an AI system
00:00:57 that’s able to replace me as the interviewer
00:01:00 in this conversation,
00:01:02 in terms of ability to ask questions
00:01:04 that are compelling to somebody listening?
00:01:06 And then further question is, are we close?
00:01:10 Will we be able to build a system that replaces you
00:01:13 as the interviewee
00:01:16 in order to create a compelling conversation?
00:01:18 How far away are we, do you think?
00:01:20 It’s a good question.
00:01:21 I think partly I would say, do we want that?
00:01:24 I really like when we start now with very powerful models,
00:01:29 interacting with them and thinking of them
00:01:32 more closer to us.
00:01:34 The question is, if you remove the human side
00:01:37 of the conversation, is that an interesting artifact?
00:01:42 And I would say, probably not.
00:01:44 I’ve seen, for instance, last time we spoke,
00:01:47 like we were talking about StarCraft,
00:01:50 and creating agents that play games involves self play,
00:01:54 but ultimately what people care about was,
00:01:57 how does this agent behave
00:01:59 when the opposite side is a human?
00:02:02 So without a doubt,
00:02:04 we will probably be more empowered by AI.
00:02:08 Maybe you can source some questions from an AI system.
00:02:12 I mean, that even today, I would say it’s quite plausible
00:02:15 that with your creativity,
00:02:17 you might actually find very interesting questions
00:02:19 that you can filter.
00:02:20 We call this cherry picking sometimes
00:02:22 in the field of language.
00:02:24 And likewise, if I had now the tools on my side,
00:02:27 I could say, look, you’re asking this interesting question.
00:02:30 From this answer, I like the words chosen
00:02:33 by this particular system that created a few words.
00:02:36 Completely replacing it feels not exactly exciting to me.
00:02:41 Although in my lifetime, I think way,
00:02:43 I mean, given the trajectory,
00:02:45 I think it’s possible that perhaps
00:02:48 there could be interesting,
00:02:49 maybe self play interviews as you’re suggesting
00:02:53 that would look or sound quite interesting
00:02:56 and probably would educate
00:02:57 or you could learn a topic through listening
00:03:00 to one of these interviews at a basic level at least.
00:03:03 So you said it doesn’t seem exciting to you,
00:03:04 but what if exciting is part of the objective function
00:03:07 the thing is optimized over?
00:03:09 So there’s probably a huge amount of data of humans
00:03:12 if you look correctly, of humans communicating online,
00:03:16 and there’s probably ways to measure the degree of,
00:03:19 you know, as they talk about engagement.
00:03:21 So you can probably optimize the question
00:03:24 that’s most created an engaging conversation in the past.
00:03:28 So actually, if you strictly use the word exciting,
00:03:33 there is probably a way to create
00:03:37 optimally exciting conversations
00:03:40 that involve AI systems.
00:03:42 At least one side is AI.
00:03:44 Yeah, that makes sense, I think,
00:03:46 maybe looping back a bit to games and the game industry,
00:03:50 when you design algorithms,
00:03:53 you’re thinking about winning as the objective, right?
00:03:55 Or the reward function.
00:03:57 But in fact, when we discussed this with Blizzard,
00:04:00 the creators of StarCraft in this case,
00:04:02 I think what’s exciting, fun,
00:04:05 if you could measure that and optimize for that,
00:04:09 that’s probably why we play video games
00:04:11 or why we interact or listen or look at cat videos
00:04:14 or whatever on the internet.
00:04:16 So it’s true that modeling reward
00:04:19 beyond the obvious reward functions
00:04:21 we’re used to in reinforcement learning
00:04:23 is definitely very exciting.
00:04:25 And again, there is some progress actually
00:04:28 into a particular aspect of AI, which is quite critical,
00:04:32 which is, for instance, is a conversation
00:04:36 or is the information truthful, right?
00:04:38 So you could start trying to evaluate these
00:04:41 from excerpts from the internet, right?
00:04:44 That has lots of information.
00:04:45 And then if you can learn a function automated ideally,
00:04:50 so you can also optimize it more easily,
00:04:52 then you could actually have conversations
00:04:54 that optimize for non obvious things such as excitement.
00:04:59 So yeah, that’s quite possible.
00:05:01 And then I would say in that case,
00:05:03 it would definitely be a fun exercise
00:05:05 and quite unique to have at least one side
00:05:08 that is fully driven by an excitement reward function.
00:05:12 But obviously, there would be still quite a lot of humanity
00:05:16 in the system, both from who is building the system,
00:05:20 of course, and also, ultimately,
00:05:23 if we think of labeling for excitement,
00:05:26 that those labels must come from us
00:05:28 because it’s just hard to have a computational measure
00:05:32 of excitement.
00:05:33 As far as I understand, there’s no such thing.
00:05:36 Well, as you mentioned truth also,
00:05:39 I would actually venture to say that excitement
00:05:41 is easier to label than truth,
00:05:44 or perhaps has lower consequences of failure.
00:05:49 But there is perhaps the humanness that you mentioned,
00:05:55 that’s perhaps part of a thing that could be labeled.
00:05:58 And that could mean an AI system that’s doing dialogue,
00:06:02 that’s doing conversations should be flawed, for example.
00:06:07 Like that’s the thing you optimize for,
00:06:09 which is have inherent contradictions by design,
00:06:13 have flaws by design.
00:06:15 Maybe it also needs to have a strong sense of identity.
00:06:18 So it has a backstory it told itself that it sticks to.
00:06:22 It has memories, not in terms of how the system is designed,
00:06:26 but it’s able to tell stories about its past.
00:06:30 It’s able to have mortality and fear of mortality
00:06:36 in the following way that it has an identity.
00:06:39 And if it says something stupid
00:06:41 and gets canceled on Twitter, that’s the end of that system.
00:06:44 So it’s not like you get to rebrand yourself.
00:06:47 That system is, that’s it.
00:06:49 So maybe the high stakes nature of it,
00:06:52 because you can’t say anything stupid now,
00:06:54 or because you’d be canceled on Twitter.
00:06:57 And there’s stakes to that.
00:06:59 And that I think is part of the reason
00:07:01 that makes it interesting.
00:07:03 And then you have a perspective,
00:07:04 like you’ve built up over time that you stick with,
00:07:07 and then people can disagree with you.
00:07:09 So holding that perspective strongly,
00:07:11 holding sort of maybe a controversial,
00:07:14 at least a strong opinion.
00:07:16 All of those elements, it feels like they can be learned
00:07:18 because it feels like there’s a lot of data
00:07:21 on the internet of people having an opinion.
00:07:24 And then combine that with a metric of excitement,
00:07:27 you can start to create something that,
00:07:30 as opposed to trying to optimize
00:07:31 for sort of grammatical clarity and truthfulness,
00:07:38 the factual consistency over many sentences,
00:07:42 you optimize for the humanness.
00:07:45 And there’s obviously data for humanness on the internet.
00:07:48 So I wonder if there’s a future where that’s part,
00:07:53 or I mean, I sometimes wonder that about myself.
00:07:56 I’m a huge fan of podcasts,
00:07:58 and I listen to some podcasts,
00:08:00 and I think like, what is interesting about this?
00:08:03 What is compelling?
00:08:05 The same way you watch other games.
00:08:07 Like you said, watch, play StarCraft,
00:08:09 or have Magnus Carlsen play chess.
00:08:13 So I’m not a chess player,
00:08:14 but it’s still interesting to me.
00:08:16 What is that?
00:08:16 That’s the stakes of it,
00:08:19 maybe the end of a domination of a series of wins.
00:08:23 I don’t know, there’s all those elements
00:08:25 somehow connect to a compelling conversation.
00:08:28 And I wonder how hard is that to replace,
00:08:30 because ultimately all of that connects
00:08:31 to the initial proposition of how to test,
00:08:35 whether an AI is intelligent or not with the Turing test,
00:08:38 which I guess my question comes from a place
00:08:41 of the spirit of that test.
00:08:43 Yes, I actually recall,
00:08:45 I was just listening to our first podcast
00:08:47 where we discussed Turing test.
00:08:50 So I would say from a neural network,
00:08:54 AI builder perspective,
00:08:57 there’s usually you try to map
00:09:01 many of these interesting topics you discuss to benchmarks,
00:09:05 and then also to actual architectures
00:09:08 on the how these systems are currently built,
00:09:10 how they learn, what data they learn from,
00:09:13 what are they learning, right?
00:09:14 We’re talking about weights of a mathematical function,
00:09:17 and then looking at the current state of the game,
00:09:21 maybe what do we need leaps forward
00:09:26 to get to the ultimate stage of all these experiences,
00:09:30 lifetime experience, fears,
00:09:32 like words that currently,
00:09:34 barely we’re seeing progress
00:09:38 just because what’s happening today
00:09:40 is you take all these human interactions,
00:09:44 it’s a large vast variety of human interactions online,
00:09:47 and then you’re distilling these sequences, right?
00:09:51 Going back to my passion,
00:09:53 like sequences of words, letters, images, sound,
00:09:56 there’s more modalities here to be at play.
00:09:59 And then you’re trying to just learn a function
00:10:03 that will be happy,
00:10:04 that maximizes the likelihood of seeing all these
00:10:08 through a neural network.
00:10:10 Now, I think there’s a few places
00:10:14 where the way currently we train these models
00:10:17 would clearly lack to be able to develop
00:10:20 the kinds of capabilities you describe.
00:10:22 I’ll tell you maybe a couple.
00:10:23 One is the lifetime of an agent or a model.
00:10:27 So you learn from this data offline, right?
00:10:30 So you’re just passively observing and maximizing these,
00:10:33 it’s almost like a mountains,
00:10:35 like a landscape of mountains,
00:10:37 and then everywhere there’s data
00:10:39 that humans interacted in this way,
00:10:41 you’re trying to make that higher
00:10:43 and then lower where there’s no data.
00:10:45 And then these models generally
00:10:48 don’t then experience themselves.
00:10:51 They just are observers, right?
00:10:52 They’re passive observers of the data.
00:10:54 And then we’re putting them to then generate data
00:10:57 when we interact with them,
00:10:59 but that’s very limiting.
00:11:00 The experience they actually experience
00:11:03 when they could maybe be optimizing
00:11:05 or further optimizing the weights,
00:11:07 we’re not even doing that.
00:11:08 So to be clear, and again, mapping to AlphaGo, AlphaStar,
00:11:14 we train the model.
00:11:15 And when we deploy it to play against humans,
00:11:18 or in this case interact with humans,
00:11:20 like language models,
00:11:21 they don’t even keep training, right?
00:11:23 They’re not learning in the sense of the weights
00:11:26 that you’ve learned from the data,
00:11:28 they don’t keep changing.
00:11:29 Now, there’s something a bit more feels magical,
00:11:33 but it’s understandable if you’re into neural nets,
00:11:36 which is, well, they might not learn
00:11:39 in the strict sense of the word,
00:11:40 the weights changing,
00:11:41 maybe that’s mapping to how neurons interconnect
00:11:44 and how we learn over our lifetime.
00:11:46 But it’s true that the context of the conversation
00:11:50 that takes place when you talk to these systems,
00:11:55 it’s held in their working memory, right?
00:11:57 It’s almost like you start the computer,
00:12:00 it has a hard drive that has a lot of information,
00:12:02 you have access to the internet,
00:12:04 which has probably all the information,
00:12:06 but there’s also a working memory
00:12:08 where these agents, as we call them,
00:12:11 or start calling them, build upon.
00:12:13 Now, this memory is very limited.
00:12:16 I mean, right now we’re talking, to be concrete,
00:12:19 about 2,000 words that we hold,
00:12:21 and then beyond that, we start forgetting what we’ve seen.
00:12:24 So you can see that there’s some short term coherence
00:12:28 already, right, with what you said.
00:12:29 I mean, it’s a very interesting topic.
00:12:32 Having sort of a mapping, an agent to have consistency,
00:12:37 then if you say, oh, what’s your name,
00:12:40 it could remember that,
00:12:42 but then it might forget beyond 2,000 words,
00:12:45 which is not that long of context
00:12:47 if we think even these podcasts or books are much longer.
00:12:51 So technically speaking, there’s a limitation there,
00:12:55 super exciting from people that work on deep learning
00:12:58 to be working on, but I would say we lack maybe benchmarks
00:13:03 and the technology to have this lifetime like experience
00:13:07 of memory that keeps building up.
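A rough sketch of the fixed working memory being described, assuming a budget of about 2,048 tokens (standing in for the "2,000 words" mentioned): the model conditions only on the most recent tokens, and everything earlier is simply dropped.

```python
# Fixed context window: the model only "remembers" the most recent tokens.
MAX_CONTEXT = 2048  # assumed budget, roughly the "2,000 words" mentioned above

def build_model_input(token_ids, max_context=MAX_CONTEXT):
    """Keep only the most recent tokens; anything older is forgotten."""
    return token_ids[-max_context:]

conversation = list(range(5000))           # pretend these are token ids so far
visible = build_model_input(conversation)  # the model sees only the tail
print(len(visible), visible[0])            # 2048 2952 -> early turns are gone
```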
00:13:10 However, the way it learns offline
00:13:13 is clearly very powerful, right?
00:13:14 So you asked me three years ago, I would say,
00:13:17 oh, we’re very far.
00:13:18 I think we’ve seen the power of this imitation,
00:13:22 again, on the internet scale that has enabled this
00:13:26 to feel like at least the knowledge,
00:13:28 the basic knowledge about the world now
00:13:30 is incorporated into the weights,
00:13:33 but then this experience is lacking.
00:13:36 And in fact, as I said, we don’t even train them
00:13:39 when we’re talking to them,
00:13:41 other than their working memory, of course, is affected.
00:13:44 So that’s the dynamic part,
00:13:46 but they don’t learn in the same way
00:13:48 that you and I have learned, right?
00:13:50 From basically when we were born and probably before.
00:13:54 So lots of fascinating, interesting questions you asked there.
00:13:57 I think the one I mentioned is this idea of memory
00:14:01 and experience versus just kind of observe the world
00:14:05 and learn its knowledge, which I think for that,
00:14:08 I would argue lots of recent advancements
00:14:10 that make me very excited about the field.
00:14:13 And then the second maybe issue that I see is
00:14:18 all these models, we train them from scratch.
00:14:21 That’s something I would have complained three years ago
00:14:24 or six years ago or 10 years ago.
00:14:26 And it feels if we take inspiration from how we got here,
00:14:31 how the universe evolved us and we keep evolving,
00:14:35 it feels that is a missing piece,
00:14:37 that we should not be training models from scratch
00:14:41 every few months,
00:14:42 that there should be some sort of way
00:14:45 in which we can grow models much like as a species
00:14:49 and many other elements in the universe
00:14:51 is building from the previous sort of iterations.
00:14:55 And that from a just purely neural network perspective,
00:14:59 even though we would like to make it work,
00:15:02 it’s proven very hard to not throw away
00:15:06 the previous weights, right?
00:15:07 This landscape we learn from the data
00:15:09 and refresh it with a brand new set of weights,
00:15:13 given maybe a recent snapshot of these data sets
00:15:17 we train on, et cetera, or even a new game we’re learning.
00:15:20 So that feels like something is missing fundamentally.
00:15:24 We might find it, but it’s not very clear
00:15:27 what it will look like.
00:15:28 There’s many ideas and it’s super exciting as well.
00:15:30 Yes, just for people who don’t know,
00:15:32 when you’re approaching a new problem in machine learning,
00:15:35 you’re going to come up with an architecture
00:15:38 that has a bunch of weights
00:15:41 and then you initialize them somehow,
00:15:43 which in most cases is some version of random.
00:15:47 So that’s what you mean by starting from scratch.
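To make "starting from scratch" concrete, here is a tiny PyTorch sketch: a fresh layer begins with randomly drawn weights, and the commented line marks the alternative of reusing a previously trained checkpoint (the path is hypothetical).

```python
import torch
import torch.nn as nn

# "From scratch": parameters begin as random numbers, carrying no prior knowledge.
layer = nn.Linear(512, 512)          # weights are randomly initialized by default
print(layer.weight.mean().item())    # roughly zero; nothing has been learned yet

# The alternative being discussed: start from weights saved by a previous run.
# (Hypothetical checkpoint path, shown only for contrast.)
# layer.load_state_dict(torch.load("previous_task_checkpoint.pt"))
```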
00:15:49 And it seems like it’s a waste every time you solve
00:15:54 the game of Go and chess, StarCraft, protein folding,
00:15:59 like surely there’s some way to reuse the weights
00:16:03 as we grow this giant database of neural networks
00:16:08 that have solved some of the toughest problems in the world.
00:16:10 And so some of that is, what is that?
00:16:15 Methods, how to reuse weights,
00:16:19 how to learn, extract what’s generalizable
00:16:22 or at least has a chance to be
00:16:25 and throw away the other stuff.
00:16:27 And maybe the neural network itself
00:16:29 should be able to tell you that.
00:16:31 Like what, yeah, how do you,
00:16:34 what ideas do you have for better initialization of weights?
00:16:37 Maybe stepping back,
00:16:38 if we look at the field of machine learning,
00:16:41 but especially deep learning, right?
00:16:44 At the core of deep learning,
00:16:45 there’s this beautiful idea that is a single algorithm
00:16:49 can solve any task, right?
00:16:50 So it’s been proven over and over
00:16:54 with more increasing set of benchmarks
00:16:56 and things that were thought impossible
00:16:58 that are being cracked by this basic principle
00:17:01 that is you take a neural network of uninitialized weights,
00:17:05 so like a blank computational brain,
00:17:09 then you give it, in the case of supervised learning,
00:17:12 a lot ideally of examples of,
00:17:14 hey, here is what the input looks like
00:17:17 and the desired output should look like this.
00:17:19 I mean, image classification is very clear example,
00:17:22 images to maybe one of a thousand categories,
00:17:25 that’s what ImageNet is like,
00:17:26 but many, many, if not all problems can be mapped this way.
00:17:30 And then there’s a generic recipe, right?
00:17:33 That you can use.
00:17:35 And this recipe with very little change,
00:17:38 and I think that’s the core of deep learning research, right?
00:17:41 That what is the recipe that is universal?
00:17:44 That for any new given task,
00:17:46 I’ll be able to use without thinking,
00:17:48 without having to work very hard on the problem at stake.
00:17:52 We have not found this recipe,
00:17:54 but I think the field is excited to find less tweaks
00:18:00 or tricks that people find when they work
00:18:02 on important problems specific to those
00:18:05 and more of a general algorithm, right?
00:18:07 So at an algorithmic level,
00:18:09 I would say we have something general already,
00:18:11 which is this formula of training a very powerful model,
00:18:14 a neural network on a lot of data.
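As a hedged sketch of that formula, a powerful model trained on lots of input-output pairs, with made-up shapes and a 1,000-class output echoing the ImageNet example above:

```python
import torch
import torch.nn as nn

# Generic supervised recipe: (input, desired output) pairs, a network, a loss,
# and gradient descent. Shapes here are small, made-up placeholders.
model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 1000))
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.CrossEntropyLoss()

images = torch.randn(8, 3, 32, 32)       # a batch of inputs
labels = torch.randint(0, 1000, (8,))    # desired outputs (one of 1,000 classes)

logits = model(images)                   # predict
loss = loss_fn(logits, labels)           # compare prediction to desired output
loss.backward()                          # compute gradients
optimizer.step()                         # nudge the weights
```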
00:18:17 And in many cases, you need some specificity
00:18:21 to the actual problem you’re solving,
00:18:23 protein folding being such an important problem
00:18:26 has some basic recipe that is learned from before, right?
00:18:30 Like transformer models, graph neural networks,
00:18:34 ideas coming from NLP, like something called BERT,
00:18:38 that is a kind of loss that you can put in place
00:18:41 to help; knowledge distillation is another technique,
00:18:45 right?
00:18:46 So this is the formula.
00:18:47 We still had to find some particular things
00:18:50 that were specific to AlphaFold, right?
00:18:53 That’s very important because protein folding
00:18:55 is such a high value problem that as humans,
00:18:59 we should solve it no matter
00:19:00 if we need to be a bit specific.
00:19:02 And it’s possible that some of these learnings
00:19:04 will apply then to the next iteration of this recipe
00:19:07 that deep learners are about.
00:19:09 But it is true that so far, the recipe is what’s common,
00:19:13 but the weights you generally throw away,
00:19:15 which feels very sad.
00:19:17 Although, maybe in the last,
00:19:20 especially in the last two, three years,
00:19:22 and when we last spoke,
00:19:23 I mentioned this area of meta learning,
00:19:25 which is the idea of learning to learn.
00:19:28 That idea and some progress has been had starting,
00:19:32 I would say, mostly from GPT-3 on the language domain only,
00:19:36 in which you could conceive a model that is trained once.
00:19:41 And then this model is not narrow in that it only knows
00:19:44 how to translate a pair of languages or even a set of
00:19:47 or it only knows how to assign sentiment to a sentence.
00:19:51 These actually, you could teach it by prompting,
00:19:55 as it’s called, and this prompting is essentially
00:19:56 just showing it a few more examples,
00:19:59 almost like you do show examples, input, output examples,
00:20:03 algorithmically speaking to the process
00:20:04 of creating this model.
00:20:06 But now you’re doing it through language,
00:20:07 which is very natural way for us to learn from one another.
00:20:11 I tell you, hey, you should do this new task.
00:20:13 I’ll tell you a bit more.
00:20:14 Maybe you ask me some questions
00:20:16 and now you know the task, right?
00:20:17 You didn’t need to retrain it from scratch.
00:20:20 And we’ve seen these magical moments almost
00:20:24 in this way to do few shot promptings through language
00:20:26 on language only domain.
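A minimal sketch of what that few-shot prompting looks like; the sentiment task and wording are invented for illustration, and no particular model or API is implied.

```python
# Few-shot prompting: instead of retraining, show the frozen model a few
# input -> output examples in its context, then ask for a new one.
prompt = """Review: The food was wonderful. Sentiment: positive
Review: Two hours of my life I won't get back. Sentiment: negative
Review: The staff could not have been friendlier. Sentiment:"""

# A language model conditioned on this prompt will tend to continue with
# " positive": the task was taught through the prompt, not by changing weights.
```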
00:20:28 And then in the last two years,
00:20:30 we’ve seen these expanded to beyond language,
00:20:34 adding vision, adding actions and games,
00:20:38 lots of progress to be had.
00:20:39 But this is maybe, if you ask me like about
00:20:42 how are we gonna crack this problem?
00:20:43 This is perhaps one way in which you have a single model.
00:20:48 The problem of this model is it’s hard to grow
00:20:52 in weights or capacity,
00:20:54 but the model is certainly so powerful
00:20:56 that you can teach it some tasks, right?
00:20:58 In this way that I teach you,
00:21:00 I could teach you a new task now,
00:21:02 if it were a text based task
00:21:05 or a classification of vision style task.
00:21:08 But it still feels like more breakthroughs should be had,
00:21:12 but it’s a great beginning, right?
00:21:14 We have a good baseline.
00:21:15 We have an idea that this maybe is the way we want
00:21:18 to benchmark progress towards AGI.
00:21:20 And I think in my view, that’s critical
00:21:22 to always have a way to benchmark the community
00:21:25 sort of converging to these overall,
00:21:27 which is good to see.
00:21:29 And then this is actually what excites me
00:21:33 in terms of also next steps for deep learning
00:21:36 is how to make these models more powerful,
00:21:39 how do you train them, how to grow them
00:21:41 if they must grow, should they change their weights
00:21:44 as you teach it task or not?
00:21:46 There’s some interesting questions, many to be answered.
00:21:48 Yeah, you’ve opened the door
00:21:49 to a bunch of questions I want to ask,
00:21:52 but let’s first return to your tweet
00:21:55 and read it like a Shakespeare.
00:21:57 You wrote, Gato is not the end, it’s the beginning.
00:22:01 And then you wrote meow and then an emoji of a cat.
00:22:06 So first two questions.
00:22:07 First, can you explain the meow and the cat emoji?
00:22:10 And second, can you explain what Gato is and how it works?
00:22:13 Right, indeed.
00:22:14 I mean, thanks for reminding me
00:22:16 that we’re all exposing on Twitter and.
00:22:19 Permanently there.
00:22:20 Yes, permanently there.
00:22:21 One of the greatest AI researchers of all time,
00:22:25 meow and cat emoji.
00:22:27 Yes. There you go.
00:22:28 Right, so.
00:22:29 Can you imagine like Turing tweeting meow and cat,
00:22:32 probably he would, probably would.
00:22:34 Probably.
00:22:35 So yeah, the tweet is important actually.
00:22:38 You know, I put thought on the tweets, I hope people.
00:22:40 Which part do you think?
00:22:41 Okay, so there’s three sentences.
00:22:44 Gato is not the end, Gato is the beginning,
00:22:48 meow, cat emoji.
00:22:50 Okay, which is the important part?
00:22:51 The meow, no, no.
00:22:53 Definitely that it is the beginning.
00:22:56 I mean, I probably was just explaining a bit
00:23:00 where the field is going, but let me tell you about Gato.
00:23:03 So first the name Gato comes from maybe a sequence
00:23:08 of releases that DeepMind had that named,
00:23:11 like used animal names to name some of their models
00:23:15 that are based on this idea of large sequence models.
00:23:19 Initially they’re only language,
00:23:20 but we are expanding to other modalities.
00:23:23 So we had, you know, we had Gopher, Chinchilla,
00:23:28 these were language only.
00:23:29 And then more recently we released Flamingo,
00:23:32 which adds vision to the equation.
00:23:35 And then Gato, which adds vision
00:23:38 and then also actions in the mix, right?
00:23:41 As we discuss actually actions,
00:23:44 especially discrete actions like up, down, left, right.
00:23:47 I just told you the actions, but they’re words.
00:23:49 So you can kind of see how actions naturally map
00:23:52 to sequence modeling of words,
00:23:54 which these models are very powerful.
00:23:57 So Gato was named after, I believe,
00:24:01 I can only from memory, right?
00:24:03 These, you know, these things always happen
00:24:06 with an amazing team of researchers behind.
00:24:08 So before the release, we had a discussion
00:24:12 about which animal would we pick, right?
00:24:14 And I think because of the word general agent, right?
00:24:18 And this is a property quite unique to Gato.
00:24:21 We kind of were playing with the GA words
00:24:24 and then, you know, Gato.
00:24:26 Rhymes with cat.
00:24:26 Yes.
00:24:28 And gato is obviously a Spanish version of cat.
00:24:30 I had nothing to do with it, although I’m from Spain.
00:24:32 Oh, how do you, wait, sorry.
00:24:33 How do you say cat in Spanish?
00:24:34 Gato.
00:24:35 Oh, gato, okay.
00:24:36 Now it all makes sense.
00:24:37 Okay, okay, I see, I see, I see.
00:24:38 Now it all makes sense.
00:24:39 Okay, so.
00:24:39 How do you say meow in Spanish?
00:24:40 No, that’s probably the same.
00:24:41 I think you say it the same way,
00:24:44 but you write it as M, I, A, U.
00:24:48 Okay, it’s universal.
00:24:49 Yes.
00:24:50 All right, so then how does the thing work?
00:24:51 So you said general is, so you said language, vision.
00:24:57 And action. Action.
00:24:59 How does this, can you explain
00:25:01 what kind of neural networks are involved?
00:25:04 What does the training look like?
00:25:06 And maybe what do you,
00:25:09 are some beautiful ideas within the system?
00:25:11 Yeah, so maybe the basics of Gato
00:25:16 are not that dissimilar from many, many work that come.
00:25:19 So here is where the sort of the recipe,
00:25:22 I mean, hasn’t changed too much.
00:25:24 There is a transformer model
00:25:25 that’s the kind of recurrent neural network
00:25:28 that essentially takes a sequence of modalities,
00:25:33 observations that could be words,
00:25:36 could be vision or could be actions.
00:25:38 And then its own objective that you train it to do
00:25:42 when you train it is to predict what the next anything is.
00:25:46 And anything means what’s the next action.
00:25:48 If this sequence that I’m showing you to train
00:25:51 is a sequence of actions and observations,
00:25:53 then you’re predicting what’s the next action
00:25:55 and the next observation, right?
00:25:57 So you think of these really as a sequence of bytes, right?
00:26:00 So take any sequence of words,
00:26:04 a sequence of interleaved words and images,
00:26:07 a sequence of maybe observations that are images
00:26:11 and moves in Atari up, down, left, right.
00:26:14 And these you just think of them as bytes
00:26:17 and you’re modeling what’s the next byte gonna be like.
00:26:20 And you might interpret that as an action
00:26:23 and then play it in a game,
00:26:25 or you could interpret it as a word
00:26:27 and then write it down
00:26:29 if you’re chatting with the system and so on.
00:26:32 So Gato basically can be thought as inputs,
00:26:36 images, text, video, actions.
00:26:41 It also actually inputs some sort of proprioception sensors
00:26:45 from robotics because robotics is one of the tasks
00:26:48 that it’s been trained to do.
00:26:49 And then at the output, similarly,
00:26:51 it outputs words, actions.
00:26:53 It does not output images, that’s just by design,
00:26:57 we decided not to go that way for now.
00:27:00 That’s also in part why it’s the beginning
00:27:02 because there’s more to do clearly.
00:27:04 But that’s kind of what the Gato is,
00:27:06 is this brain that essentially you give it any sequence
00:27:09 of these observations and modalities
00:27:11 and it outputs the next step.
00:27:13 And then off you go, you feed the next step into
00:27:17 and predict the next one and so on.
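A hedged sketch of that loop, treating observations and actions as one token stream; `model`, `env`, and the tokenize/detokenize helpers are hypothetical stand-ins rather than Gato's actual interfaces.

```python
# Interleaved sequence view: observations and actions are all just tokens,
# and the model repeatedly predicts the next one.
def rollout(model, env, tokenize_obs, detokenize_action, steps=100):
    tokens = []
    obs = env.reset()
    for _ in range(steps):
        tokens.extend(tokenize_obs(obs))           # observation -> tokens
        action_token = model.predict_next(tokens)  # next token in the sequence
        tokens.append(action_token)
        obs = env.step(detokenize_action(action_token))  # interpret it as an action
    return tokens
```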
00:27:20 Now, it is more than a language model
00:27:24 because even though you can chat with Gato,
00:27:26 like you can chat with Chinchilla or Flamingo,
00:27:30 it also is an agent, right?
00:27:33 So that’s why we call it A of Gato,
00:27:37 like the letter A and also it’s general.
00:27:41 It’s not an agent that’s been trained to be good
00:27:43 at only StarCraft or only Atari or only Go.
00:27:47 It’s been trained on a vast variety of datasets.
00:27:51 What makes it an agent, if I may interrupt,
00:27:53 the fact that it can generate actions?
00:27:56 Yes, so when we call it, I mean, it’s a good question, right?
00:28:00 When do we call a model?
00:28:02 I mean, everything is a model,
00:28:03 but what is an agent in my view is indeed the capacity
00:28:07 to take actions in an environment that you then send to it
00:28:11 and then the environment might return
00:28:13 with a new observation
00:28:15 and then you generate the next action.
00:28:17 This actually, this reminds me of the question
00:28:20 from the side of biology, what is life?
00:28:23 Which is actually a very difficult question as well.
00:28:25 What is living, what is living when you think about life
00:28:29 here on this planet Earth?
00:28:31 And a question interesting to me about aliens,
00:28:33 what is life when we visit another planet?
00:28:35 Would we be able to recognize it?
00:28:37 And this feels like, it sounds perhaps silly,
00:28:40 but I don’t think it is.
00:28:41 At which point is the neural network a being versus a tool?
00:28:48 And it feels like action, ability to modify its environment
00:28:52 is that fundamental leap.
00:28:54 Yeah, I think it certainly feels like action
00:28:57 is a necessary condition to be more alive,
00:29:01 but probably not sufficient either.
00:29:04 So sadly I…
00:29:05 It’s a soul consciousness thing, whatever.
00:29:06 Yeah, yeah, we can get back to that later.
00:29:09 But anyways, going back to the meow and the gato, right?
00:29:12 So one of the leaps forward and what took the team a lot
00:29:17 of effort and time was, as you were asking,
00:29:21 how has gato been trained?
00:29:23 So I told you gato is this transformer neural network,
00:29:26 models actions, sequences of actions, words, et cetera.
00:29:30 And then the way we train it is by essentially pulling
00:29:35 data sets of observations, right?
00:29:39 So it’s a massive imitation learning algorithm
00:29:42 that it imitates obviously to what
00:29:45 is the next word that comes next from the usual data
00:29:48 sets we use before, right?
00:29:50 So these are these web scale style data sets of people
00:29:54 writing on webs or chatting or whatnot, right?
00:29:58 So that’s an obvious source that we use on all language work.
00:30:02 But then we also took a lot of agents
00:30:05 that we have at DeepMind.
00:30:06 I mean, as you know, DeepMind, we’re quite interested
00:30:10 in learning reinforcement learning and learning agents
00:30:14 that play in different environments.
00:30:17 So we kind of created a data set of these trajectories,
00:30:20 as we call them, or agent experiences.
00:30:23 So in a way, there are other agents
00:30:25 we train for a single mind purpose to, let’s say,
00:30:29 control a 3D game environment and navigate a maze.
00:30:33 So we had all the experience that
00:30:35 was created through one agent interacting
00:30:38 with that environment.
00:30:39 And we added this to the data set, right?
00:30:41 And as I said, we just see all the data,
00:30:44 all these sequences of words or sequences
00:30:46 of this agent interacting with that environment or agents
00:30:51 playing Atari and so on.
00:30:52 We see it as the same kind of data.
00:30:54 And so we mix these data sets together.
00:30:57 And we train Gato.
00:31:00 That’s the G part, right?
00:31:01 It’s general because it really has mixed.
00:31:05 It doesn’t have different brains for each modality
00:31:07 or each narrow task.
00:31:09 It has a single brain.
00:31:10 It’s not that big of a brain compared
00:31:12 to most of the neural networks we see these days.
00:31:14 It has 1 billion parameters.
00:31:18 Some models we’re seeing get in the trillions these days.
00:31:21 And certainly, 100 billion feels like a size
00:31:25 that is very common from when you train these jobs.
00:31:29 So the actual agent is relatively small.
00:31:32 But it’s been trained on a very challenging, diverse data set,
00:31:36 not only containing all of the internet
00:31:38 but containing all these agent experience playing
00:31:40 very different, distinct environments.
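A small sketch of that mixture idea: web text and agent trajectories are all just token sequences drawn from one pool during training. The dataset names, weights, and loader interface below are assumptions for illustration.

```python
import random

# One pool of sequence data, sampled with some mixture weights (made up here).
MIXTURE = {
    "web_text": 0.6,            # tokenized pages, dialogue, etc.
    "atari_trajectories": 0.2,  # interleaved observation/action tokens
    "robotics_episodes": 0.2,   # proprioception + action tokens
}

def sample_training_sequence(loaders):
    name = random.choices(list(MIXTURE), weights=list(MIXTURE.values()), k=1)[0]
    return loaders[name].next_sequence()  # hypothetical loader method
```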
00:31:43 So this brings us to the part of the tweet of this
00:31:46 is not the end, it’s the beginning.
00:31:48 It feels very cool to see Gato, in principle,
00:31:53 is able to control any sort of environments, especially
00:31:57 the ones that it’s been trained to do, these 3D games, Atari
00:32:00 games, all sorts of robotics tasks, and so on.
00:32:04 But obviously, it’s not as proficient
00:32:07 as the teachers it learned from on these environments.
00:32:10 Not obvious.
00:32:11 It’s not obvious that it wouldn’t be more proficient.
00:32:15 It’s just the current beginning part
00:32:17 is that the performance is such that it’s not as good
00:32:21 as if it’s specialized to that task.
00:32:23 Right.
00:32:23 So it’s not as good, although I would argue size matters here.
00:32:28 So the fact that I would argue always size always matters.
00:32:31 That’s a different conversation.
00:32:33 But for neural networks, certainly size does matter.
00:32:36 So it’s the beginning because it’s relatively small.
00:32:39 So obviously, scaling this idea up
00:32:42 might make the connections that exist between text
00:32:48 on the internet and playing Atari and so on more
00:32:51 synergistic with one another.
00:32:53 And you might gain.
00:32:54 And that moment, we didn’t quite see.
00:32:56 But obviously, that’s why it’s the beginning.
00:32:58 That synergy might emerge with scale.
00:33:00 Right, might emerge with scale.
00:33:02 And also, I believe there’s some new research or ways
00:33:05 in which you prepare the data that you
00:33:08 might need to make it more clear to the model
00:33:11 that you’re not only playing Atari,
00:33:14 and you start from a screen.
00:33:16 And here is up and a screen and down.
00:33:18 Maybe you can think of playing Atari
00:33:20 as there’s some sort of context that is needed for the agent
00:33:23 before it starts seeing, oh, this is an Atari screen.
00:33:26 I’m going to start playing.
00:33:28 You might require, for instance, to be told in words,
00:33:33 hey, in this sequence that I’m showing,
00:33:36 you’re going to be playing an Atari game.
00:33:39 So text might actually be a good driver to enhance the data.
00:33:44 So then these connections might be made more easily.
00:33:46 So that’s an idea that we start seeing in language.
00:33:51 But obviously, beyond, this is going to be effective.
00:33:55 It’s not like I don’t show you a screen,
00:33:57 and you, from scratch, you’re supposed to learn a game.
00:34:01 There is a lot of context we might set.
00:34:03 So there might be some work needed as well
00:34:05 to set that context.
00:34:07 But anyways, there’s a lot of work.
00:34:10 So that context puts all the different modalities
00:34:13 on the same level ground to provide the context best.
00:34:16 So maybe on that point, so there’s
00:34:19 this task, which may not seem trivial, of tokenizing the data,
00:34:25 of converting the data into pieces,
00:34:28 into basic atomic elements that then could cross modalities
00:34:34 somehow.
00:34:35 So what’s tokenization?
00:34:37 How do you tokenize text?
00:34:39 How do you tokenize images?
00:34:42 How do you tokenize games and actions and robotics tasks?
00:34:47 Yeah, that’s a great question.
00:34:48 So tokenization is the entry point
00:34:52 to actually make all the data look like a sequence,
00:34:55 because tokens then are just these little puzzle pieces.
00:34:59 We break down anything into these puzzle pieces,
00:35:01 and then we just model, what’s this puzzle look like when
00:35:05 you make it lay down in a line, so to speak, in a sequence?
00:35:09 So in Gato, the text, there’s a lot of work.
00:35:15 You tokenize text usually by looking
00:35:17 at commonly used substrings, right?
00:35:20 So there’s ING in English is a very common substring,
00:35:23 so that becomes a token.
00:35:25 There’s quite a well studied problem on tokenizing text.
00:35:29 And Gato just used the standard techniques
00:35:31 that have been developed from many years,
00:35:34 even starting from n-gram models in the 1950s and so on.
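A toy version of tokenization by common substrings: greedily match the longest known piece, falling back to single characters. The vocabulary here is hand-picked purely for illustration; real tokenizers (BPE, SentencePiece) learn these pieces from corpus statistics.

```python
# Toy subword tokenizer: longest-match against a tiny, made-up vocabulary.
VOCAB = {"play", "ing", "walk", "ed", "the", " "}

def tokenize(text):
    tokens, i = [], 0
    while i < len(text):
        piece = next((text[i:j] for j in range(len(text), i, -1)
                      if text[i:j] in VOCAB), text[i])  # fall back to one character
        tokens.append(piece)
        i += len(piece)
    return tokens

print(tokenize("the walking"))  # ['the', ' ', 'walk', 'ing']
```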
00:35:38 Just for context, how many tokens,
00:35:40 what order, magnitude, number of tokens
00:35:42 is required for a word, usually?
00:35:45 What are we talking about?
00:35:46 Yeah, for a word in English, I mean,
00:35:48 every language is very different.
00:35:51 The current level or granularity of tokenization
00:35:53 generally means it’s maybe two to five.
00:35:57 I mean, I don’t know the statistics exactly,
00:36:00 but to give you an idea, we don’t tokenize
00:36:03 at the level of letters.
00:36:04 Then it would probably be, I don’t
00:36:05 know what the average length of a word is in English,
00:36:08 but that would be the minimum set of tokens you could use.
00:36:11 It was bigger than letters, smaller than words.
00:36:13 Yes, yes.
00:36:13 And you could think of very, very common words like the.
00:36:16 I mean, that would be a single token,
00:36:18 but very quickly you’re talking two, three, four tokens or so.
00:36:22 Have you ever tried to tokenize emojis?
00:36:24 Emojis are actually just sequences of letters, so.
00:36:30 Maybe to you, but to me they mean so much more.
00:36:33 Yeah, you can render the emoji, but you
00:36:35 might if you actually just.
00:36:36 Yeah, this is a philosophical question.
00:36:39 Is emojis an image or a text?
00:36:43 The way we do these things is they’re actually
00:36:46 mapped to small sequences of characters.
00:36:49 So you can actually play with these models
00:36:52 and input emojis, it will output emojis back,
00:36:55 which is actually quite a fun exercise.
00:36:57 You probably can find other tweets about these out there.
00:37:02 But yeah, so anyways, text.
00:37:04 It’s very clear how this is done.
00:37:06 And then in Gato, what we did for images
00:37:10 is we map images to essentially we compressed images,
00:37:14 so to speak, into something that looks more like less
00:37:19 like every pixel with every intensity.
00:37:21 That would mean we have a very long sequence, right?
00:37:23 Like if we were talking about 100 by 100 pixel images,
00:37:27 that would make the sequences far too long.
00:37:30 So what was done there is you just
00:37:32 use a technique that essentially compresses an image
00:37:35 into maybe 16 by 16 patches of pixels,
00:37:40 and then that is mapped, again, tokenized.
00:37:42 You just essentially quantize this space
00:37:45 into a special word that actually
00:37:48 maps to these little sequence of pixels.
00:37:51 And then you put the pixels together in some raster order,
00:37:55 and then that’s how you get out or in the image
00:37:59 that you’re processing.
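A rough NumPy sketch of that patch-and-quantize idea: cut the image into 16x16 patches in raster order and map each patch to the index of its nearest codebook entry. The random codebook and sizes are assumptions; the exact scheme used in Gato differs in detail.

```python
import numpy as np

PATCH, CODEBOOK_SIZE = 16, 512
codebook = np.random.randn(CODEBOOK_SIZE, PATCH * PATCH * 3)  # learned in practice

def image_to_tokens(image):
    """image: (H, W, 3) array with H and W divisible by 16."""
    h, w, _ = image.shape
    tokens = []
    for y in range(0, h, PATCH):          # raster order: row by row
        for x in range(0, w, PATCH):
            patch = image[y:y + PATCH, x:x + PATCH].reshape(-1)
            nearest = np.argmin(np.linalg.norm(codebook - patch, axis=1))
            tokens.append(int(nearest))   # one "visual word" per patch
    return tokens

print(len(image_to_tokens(np.random.rand(64, 64, 3))))  # 16 tokens for a 64x64 image
```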
00:38:00 But there’s no semantic aspect to that,
00:38:04 so you’re doing some kind of,
00:38:05 you don’t need to understand anything about the image
00:38:07 in order to tokenize it currently.
00:38:09 No, you’re only using this notion of compression.
00:38:12 So you’re trying to find common,
00:38:15 it’s like JPEG or all these algorithms.
00:38:17 It’s actually very similar at the tokenization level.
00:38:20 All we’re doing is finding common patterns
00:38:23 and then making sure in a lossy way we compress these images
00:38:27 given the statistics of the images
00:38:29 that are contained in all the data we deal with.
00:38:31 Although you could probably argue that JPEG
00:38:34 does have some understanding of images.
00:38:38 Because visual information, maybe color,
00:38:44 compressing crudely based on color
00:38:46 does capture something important about an image
00:38:51 that’s about its meaning, not just about some statistics.
00:38:54 Yeah, I mean, JPEG, as I said,
00:38:56 the algorithms look actually very similar,
00:38:58 they use the cosine transform in JPEG.
00:39:04 The approach we usually do in machine learning
00:39:07 when we deal with images and we do this quantization step
00:39:10 is a bit more data driven.
00:39:11 So rather than have some sort of Fourier basis
00:39:14 for how frequencies appear in the natural world,
00:39:18 we actually just use the statistics of the images
00:39:23 and then quantize them based on the statistics,
00:39:27 much like you do in words, right?
00:39:28 So common substrings are allocated a token
00:39:32 and images is very similar.
00:39:34 But there’s no connection.
00:39:36 The token space, if you think of,
00:39:39 oh, like the tokens are an integer
00:39:41 and in the end of the day.
00:39:42 So now like we work on, maybe we have about,
00:39:46 let’s say, I don’t know the exact numbers,
00:39:48 but let’s say 10,000 tokens for text, right?
00:39:51 Certainly more than characters
00:39:52 because we have groups of characters and so on.
00:39:55 So from one to 10,000, those are representing
00:39:58 all the language and the words we’ll see.
00:40:01 And then images occupy the next set of integers.
00:40:04 So they’re completely independent, right?
00:40:05 So from 10,001 to 20,000,
00:40:08 those are the tokens that represent
00:40:10 these other modality images.
00:40:12 And that is an interesting aspect that makes it orthogonal.
00:40:18 So what connects these concepts is the data, right?
00:40:21 Once you have a data set,
00:40:23 for instance, that captions images that tells you,
00:40:26 oh, this is someone playing a frisbee on a green field.
00:40:30 Now the model will need to predict the tokens
00:40:34 from the text green field to then the pixels.
00:40:37 And that will start making the connections
00:40:39 between the tokens.
00:40:40 So these connections happen as the algorithm learns.
00:40:43 And then the last, if we think of these integers,
00:40:45 the first few are words, the next few are images.
00:40:48 In Gato, we also allocated the highest order of integers
00:40:55 to actions, right?
00:40:56 Which we discretize and actions are very diverse, right?
00:40:59 In Atari, there’s, I don’t know, 17 discrete actions.
00:41:04 In robotics, actions might be torques
00:41:06 and forces that we apply.
00:41:08 So we just use kind of similar ideas
00:41:11 to compress these actions into tokens.
00:41:14 And then we just, that’s how we map now
00:41:18 all the space to these sequence of integers.
00:41:20 But they occupy different space
00:41:22 and what connects them is then the learning algorithm.
00:41:24 That’s where the magic happens.
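A minimal sketch of those disjoint integer ranges; the vocabulary sizes are assumptions, chosen only to mirror the text-then-images-then-actions layout described.

```python
# Each modality gets its own disjoint range of integer token ids.
TEXT_VOCAB, IMAGE_VOCAB, ACTION_VOCAB = 10_000, 10_000, 1_000

def text_token(i):   return i                             # 0 .. 9,999
def image_token(i):  return TEXT_VOCAB + i                # 10,000 .. 19,999
def action_token(i): return TEXT_VOCAB + IMAGE_VOCAB + i  # 20,000 .. 20,999

# One flat sequence can then interleave all three kinds of token;
# only the training data ties the ranges together.
sequence = [text_token(42), image_token(7), action_token(3)]
print(sequence)  # [42, 10007, 20003]
```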
00:41:26 So the modalities are orthogonal
00:41:28 to each other in token space.
00:41:30 So in the input, everything you add, you add extra tokens.
00:41:35 And then you’re shoving all of that into one place.
00:41:40 Yes, the transformer.
00:41:41 And that transformer, that transformer tries
00:41:46 to look at this gigantic token space
00:41:49 and tries to form some kind of representation,
00:41:52 some kind of unique wisdom
00:41:56 about all of these different modalities.
00:41:59 How’s that possible?
00:42:02 If you were to sort of like put your psychoanalysis hat on
00:42:06 and try to psychoanalyze this neural network,
00:42:09 is it schizophrenic?
00:42:11 Does it try to, given this very few weights,
00:42:17 represent multiple disjoint things
00:42:19 and somehow have them not interfere with each other?
00:42:22 Or is it somehow building on the joint strength,
00:42:27 on whatever is common to all the different modalities?
00:42:31 Like what, if you were to ask a question,
00:42:34 is it schizophrenic or is it of one mind?
00:42:38 I mean, it is one mind and it’s actually
00:42:42 the simplest algorithm, which that’s kind of in a way
00:42:46 how it feels like the field hasn’t changed
00:42:49 since back propagation and gradient descent
00:42:52 was proposed for learning neural networks.
00:42:55 So there is obviously details on the architecture.
00:42:58 This has evolved.
00:42:59 The current iteration is still the transformer,
00:43:03 which is a powerful sequence modeling architecture.
00:43:07 But then the goal of this, you know,
00:43:11 setting these weights to predict the data
00:43:13 is essentially the same as basically I could describe.
00:43:17 I mean, we described a few years ago,
00:43:18 AlphaStar, language modeling and so on, right?
00:43:21 We take, let’s say an Atari game,
00:43:24 we map it to a string of numbers
00:43:27 that will all be probably image space
00:43:30 and action space interleaved.
00:43:32 And all we’re gonna do is say, okay, given the numbers,
00:43:37 you know, 10,001, 10,004, 10,005,
00:43:40 the next number that comes is 20,006,
00:43:43 which is in the action space.
00:43:45 And you’re just optimizing these weights
00:43:48 via very simple gradients.
00:43:51 Like, you know, mathematically it’s almost
00:43:53 the most boring algorithm you could imagine.
00:43:55 We set the weights so that
00:43:57 given this particular instance,
00:44:00 these weights are set to maximize the probability
00:44:04 of having seen this particular sequence of integers
00:44:07 for this particular game.
00:44:09 And then the algorithm does this
00:44:11 for many, many, many iterations,
00:44:14 looking at different modalities, different games, right?
00:44:17 That’s the mixture of the dataset we discussed.
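In code, that objective is just next-token cross-entropy over the integer sequence. The random logits below stand in for the transformer's output and exist only to show the shape of the loss.

```python
import torch
import torch.nn.functional as F

vocab_size, seq_len = 21_000, 8
sequence = torch.randint(0, vocab_size, (seq_len,))   # interleaved token ids
logits = torch.randn(seq_len - 1, vocab_size, requires_grad=True)  # stand-in predictions

# Maximize the probability of the observed next token at every position:
loss = F.cross_entropy(logits, sequence[1:])  # prediction at t vs. token at t+1
loss.backward()                               # gradient descent then nudges the weights
```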
00:44:20 So in a way, it’s a very simple algorithm
00:44:24 and the weights, right, they’re all shared, right?
00:44:27 So in terms of, is it focusing on one modality or not?
00:44:30 The intermediate weights that are converting
00:44:33 from these input of integers
00:44:35 to the target integer you’re predicting next,
00:44:37 those weights certainly are common.
00:44:40 And then the way that tokenization happens,
00:44:43 there is a special place in the neural network,
00:44:45 which is we map this integer, like number 10,001,
00:44:49 to a vector of real numbers.
00:44:51 Like real numbers, we can optimize them
00:44:54 with gradient descent, right?
00:44:56 The functions we learn
00:44:57 are actually surprisingly differentiable.
00:44:59 That’s why we compute gradients.
00:45:01 So this step is the only one
00:45:03 that this orthogonality you mentioned applies.
00:45:06 So mapping a certain token for text or image or actions,
00:45:12 each of these tokens gets its own little vector
00:45:15 of real numbers that represents this.
00:45:17 If you look at the field back many years ago,
00:45:19 people were talking about word vectors or word embeddings.
00:45:23 These are the same.
00:45:24 We have word vectors or embeddings.
00:45:26 We have image vectors or embeddings
00:45:28 and action vectors or embeddings.
00:45:30 And the beauty here is that as you train this model,
00:45:33 if you visualize these little vectors,
00:45:36 it might be that they start aligning
00:45:38 even though they’re independent parameters.
00:45:41 There could be anything,
00:45:42 but then it might be that you take the word gato or cat,
00:45:47 which maybe is common enough
00:45:48 that it actually has its own token.
00:45:50 And then you take pixels that have a cat
00:45:52 and you might start seeing
00:45:53 that these vectors look like they align, right?
00:45:57 So by learning from this vast amount of data,
00:46:00 the model is realizing the potential connections
00:46:03 between these modalities.
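A sketch of those per-token embedding vectors and of how one might probe for the alignment being described; the token ids for the word "cat" and a cat-image patch are made up.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Every integer token (word piece, image patch, or action) gets its own vector.
embedding = nn.Embedding(num_embeddings=21_000, embedding_dim=512)

word_cat_id, image_cat_patch_id = 1_234, 10_567  # hypothetical ids
v_word = embedding(torch.tensor(word_cat_id))
v_image = embedding(torch.tensor(image_cat_patch_id))

# Before training this is near zero; after training on paired data it may grow.
print(F.cosine_similarity(v_word, v_image, dim=0).item())
```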
00:46:05 Now, I will say there will be another way,
00:46:07 at least in part, to not have these different vectors
00:46:13 for each different modality.
00:46:15 For instance, when I tell you about actions
00:46:18 in certain space, I’m defining actions by words, right?
00:46:22 So you could imagine a world in which I’m not learning
00:46:26 that the action up in Atari is its own number.
00:46:31 The action up in Atari maybe is literally the word
00:46:34 or the sentence up in Atari, right?
00:46:37 And that would mean we now leverage
00:46:39 much more from the language.
00:46:41 This is not what we did here,
00:46:42 but certainly it might make these connections
00:46:45 much easier to learn and also to teach the model
00:46:49 to correct its own actions and so on, right?
00:46:51 So all these to say that gato is indeed the beginning,
00:46:55 that it is a radical idea to do this this way,
00:46:59 but there’s probably a lot more to be done
00:47:02 and the results to be more impressive,
00:47:04 not only through scale, but also through some new research
00:47:07 that will come hopefully in the years to come.
00:47:10 So just to elaborate quickly,
00:47:12 you mean one possible next step
00:47:16 or one of the paths that you might take next
00:47:20 is doing the tokenization fundamentally
00:47:25 as a kind of linguistic communication.
00:47:28 So like you convert even images into language.
00:47:31 So doing something like a crude semantic segmentation,
00:47:35 trying to just assign a bunch of words to an image
00:47:38 that like have almost like a dumb entity
00:47:42 explaining as much as it can about the image.
00:47:45 And so you convert that into words
00:47:46 and then you convert games into words
00:47:49 and then you provide the context in words and all of it.
00:47:52 And eventually getting to a point
00:47:56 where everybody agrees with Noam Chomsky
00:47:58 that language is actually at the core of everything.
00:48:00 That it’s the base layer of intelligence
00:48:04 and consciousness and all that kind of stuff, okay.
00:48:07 You mentioned early on like size, it’s hard to grow.
00:48:11 What did you mean by that?
00:48:12 Because we’re talking about scale might change.
00:48:17 There might be, and we’ll talk about this too,
00:48:18 like there’s a emergent, there’s certain things
00:48:23 about these neural networks that are emergent.
00:48:25 So certain like performance we can see only with scale
00:48:28 and there’s some kind of threshold of scale.
00:48:30 So why is it hard to grow something like this Meow network?
00:48:36 So the Meow network, it’s not hard to grow
00:48:41 if you retrain it.
00:48:42 What’s hard is, well, we have now 1 billion parameters.
00:48:46 We train them for a while.
00:48:48 We spend some amount of work towards building these weights
00:48:53 that are an amazing initial brain
00:48:55 for doing these kinds of tasks we care about.
00:48:58 Could we reuse the weights and expand to a larger brain?
00:49:03 And that is extraordinarily hard,
00:49:06 but also exciting from a research perspective
00:49:10 and a practical perspective point of view, right?
00:49:12 So there’s this notion of modularity in software engineering
00:50:17 and we’re starting to see some examples
00:49:20 and work that leverages modularity.
00:49:23 In fact, if we go back one step from Gato
00:49:26 to a work that I would say train much larger,
00:49:29 much more capable network called Flamingo.
00:49:32 Flamingo did not deal with actions,
00:49:34 but it definitely dealt with images in an interesting way,
00:49:38 kind of akin to what Gato did,
00:49:40 but slightly different technique for tokenizing,
00:49:42 but we don’t need to go into that detail.
00:49:45 But what Flamingo also did, which Gato didn’t do,
00:49:49 and that just happens because these projects,
00:49:51 they’re different, it’s a bit of like the exploratory nature
00:49:55 of research, which is great.
00:49:57 The research behind these projects is also modular.
00:50:00 Yes, exactly.
00:50:01 And it has to be, right?
00:50:02 We need to have creativity
00:50:05 and sometimes you need to protect pockets of people,
00:50:09 researchers and so on.
00:50:10 By we, you mean humans.
00:50:11 Yes.
00:50:12 And also in particular researchers
00:50:14 and maybe even further DeepMind or other such labs.
00:50:18 And then the neural networks themselves.
00:50:20 So it’s modularity all the way down.
00:50:23 All the way down.
00:50:24 So the way that we did modularity very beautifully
00:50:27 in Flamingo is we took Chinchilla,
00:50:30 which is a language only model, not an agent,
00:50:33 if we think of actions being necessary for agency.
00:50:36 So we took Chinchilla, we took the weights of Chinchilla
00:50:40 and then we froze them.
00:50:42 We said, these don’t change.
00:50:44 We train them to be very good at predicting the next word.
00:50:47 It’s a very good language model, state of the art
00:50:50 at the time you release it, et cetera, et cetera.
00:50:52 We’re going to add a capability to see, right?
00:50:55 We are going to add the ability to see
00:50:56 to this language model.
00:50:58 So we’re going to attach small pieces of neural networks
00:51:01 at the right places in the model.
00:51:03 It’s almost like I’m injecting the network
00:51:07 with some weights and some substructures
00:51:10 in a good way, right?
00:51:12 So you need the research to say, what is effective?
00:51:15 How do you add this capability
00:51:16 without destroying others, et cetera.
00:51:18 So we created a small sub network initialized,
00:51:24 not from random, but actually from self supervised learning,
00:51:28 a model that understands vision in general.
00:51:32 And then we took data sets that connect the two modalities,
00:51:37 vision and language.
00:51:38 And then we froze the main part,
00:51:41 the largest portion of the network, which was Chinchilla,
00:51:43 that is 70 billion parameters.
00:51:45 And then we added a few more parameters on top,
00:51:49 trained from scratch, and then some others
00:51:51 that were pre trained with the capacity to see,
00:51:55 like it was not tokenization
00:51:57 in the way I described for Gato, but it’s a similar idea.
00:52:01 And then we trained the whole system.
00:52:03 Parts of it were frozen, parts of it were new.
00:52:06 And all of a sudden, we developed Flamingo,
00:52:09 which is an amazing model that is essentially,
00:52:12 I mean, describing it is a chatbot
00:52:14 where you can also upload images
00:52:16 and start conversing about images.
00:52:19 But it’s also kind of a dialogue style chatbot.
00:52:23 So the input is images and text and the output is text.
00:52:26 Exactly.
00:52:28 How many parameters, you said 70 billion for Chinchilla?
00:52:31 Yeah, Chinchilla is 70 billion.
00:52:33 And then the ones we add on top,
00:52:34 which is almost like a way
00:52:38 to overwrite its little activations
00:52:40 so that when it sees vision,
00:52:42 it does kind of a correct computation of what it’s seeing,
00:52:45 mapping it back to words, so to speak.
00:52:47 That adds an extra 10 billion parameters, right?
00:52:50 So it’s total 80 billion, the largest one we released.
00:52:53 And then you train it on a few datasets
00:52:57 that contain vision and language.
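A minimal sketch of that freeze-and-extend recipe; the tiny transformer and the single linear "adapter" are placeholders, not Chinchilla or Flamingo's actual architecture.

```python
import torch.nn as nn

# Freeze a pretrained language backbone and train only the new vision pieces.
backbone = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=512, nhead=8), num_layers=4)
vision_adapter = nn.Linear(512, 512)   # new, trainable piece "injected" alongside

for p in backbone.parameters():
    p.requires_grad = False            # the large language part stays frozen

trainable = list(vision_adapter.parameters())
# An optimizer is given only `trainable`, so gradient steps never touch the
# frozen language weights while the new pieces learn to "see".
```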
00:52:59 And once you interact with the model,
00:53:01 you start seeing that you can upload an image
00:53:04 and start sort of having a dialogue about the image,
00:53:07 which is actually
00:53:09 very similar and akin to what we saw in language only models.
00:53:12 These prompting abilities that it has,
00:53:15 you can teach it a new vision task, right?
00:53:17 It does things beyond the capabilities
00:53:20 that in theory the datasets provided in themselves,
00:53:24 but because it leverages a lot of the language knowledge
00:53:27 acquired from Chinchilla,
00:53:28 it actually has this few shot learning ability
00:53:31 and these emerging abilities
00:53:33 that we didn’t even measure
00:53:34 once we were developing the model,
00:53:36 but once developed, then as you play with the interface,
00:53:40 you can start seeing, wow, okay, yeah, it’s cool.
00:53:42 We can upload, I think one of the tweets
00:53:45 about it on Twitter showed this image of Obama
00:53:47 who is putting extra weight on a scale
00:53:49 while someone is weighing themselves,
00:53:52 and it's kind of a joke style image.
00:53:54 And it's notable because I think Andrej Karpathy
00:53:57 a few years ago said,
00:53:59 no computer vision system can understand
00:54:02 the subtlety of this joke in this image,
00:54:04 all the things that go on.
00:54:06 And so what we tried to do, and it's very anecdotal,
00:54:09 I mean, this is not a proof that we solved this issue,
00:54:12 but it just shows that you can upload now this image
00:54:15 and start conversing with the model,
00:54:17 trying to make out if it gets that there’s a joke
00:54:21 because the person weighing themselves
00:54:23 doesn’t see that someone behind
00:54:25 is making the weight higher and so on and so forth.
00:54:27 So it’s a fascinating capability
00:54:30 and it comes from this key idea of modularity
00:54:33 where we took a frozen brain
00:54:34 and we just added a new capability.
00:54:37 So the question is, should we?
00:54:40 So in a way you can see even from DeepMind,
00:54:42 we have Flamingo, which took this modular approach
00:54:46 and thus could leverage the scale a bit more reasonably
00:54:49 because we didn't need to retrain a system from scratch.
00:54:52 And on the other hand, we had Gato,
00:54:54 which used the same datasets,
00:54:55 but then we trained it from scratch, right?
00:54:57 And so I guess big question for the community
00:55:00 is should we train from scratch
00:55:02 or should we embrace modularity?
00:55:04 And this goes back to modularity
00:55:08 as a way to grow, but reuse seems natural
00:55:12 and it was certainly very effective.
00:55:14 The next question is, if you go the way of modularity,
00:55:18 is there a systematic way of freezing weights
00:55:22 and joining different modalities across,
00:55:27 you know, not just two or three or four networks,
00:55:29 but hundreds of networks from all different kinds of places,
00:55:32 maybe open source network that looks at weather patterns
00:55:36 and you shove that in somehow
00:55:37 and then you have networks that, I don’t know,
00:55:40 do all kinds of stuff, play StarCraft
00:55:42 and play all the other video games
00:55:43 and you can keep adding them in without significant effort,
00:55:49 like maybe the effort scales linearly or something like that
00:55:53 as opposed to like the more networks you add,
00:55:54 the more you have to worry about the instabilities created.
00:55:57 Yeah, so that vision is beautiful.
00:55:59 I think there’s still the question
00:56:03 about within single modalities, like Chinchilla was reused,
00:56:06 but now if we train a next iteration of language models,
00:56:10 are we gonna use Chinchilla or not?
00:56:11 Yeah, how do you swap out Chinchilla?
00:56:13 Right, so there’s still big questions,
00:56:15 but that idea is actually really akin to software engineering,
00:56:19 in which we're not reimplementing libraries from scratch,
00:56:22 we're reusing them and then building ever more amazing things,
00:56:25 including neural networks, with software that we're reusing.
00:56:28 So I think this idea of modularity, I like it,
00:56:32 I think it’s here to stay
00:56:33 and that’s also why I mentioned
00:56:35 it’s just the beginning, not the end.
00:56:38 You’ve mentioned meta learning,
00:56:39 so given this promise of Gato,
00:56:42 can we try to redefine this term
00:56:45 that’s almost akin to consciousness
00:56:47 because it means different things to different people
00:56:50 throughout the history of artificial intelligence,
00:56:52 but what do you think meta learning is
00:56:56 and looks like now in the five years, 10 years,
00:57:00 will it look like the system like Gato, but scaled?
00:57:03 What’s your sense of, what does meta learning look like?
00:57:07 Do you think with all the wisdom we’ve learned so far?
00:57:10 Yeah, great question.
00:57:11 Maybe it’s good to give another data point
00:57:14 looking backwards rather than forward.
00:57:16 So when we talked in 2019,
00:57:22 meta learning meant something that has since changed,
00:57:26 mostly through the revolution of GPT-3 and beyond.
00:57:31 So what meta learning meant at the time
00:57:35 was driven by what benchmarks people care about
00:57:37 in meta learning.
00:57:38 And the benchmarks were about a capability
00:57:42 to learn about object identities.
00:57:44 So it was very much overfitted to vision
00:57:48 and object classification.
00:57:50 And the part that was meta about that was that,
00:57:52 oh, we’re not just learning a thousand categories
00:57:55 that ImageNet tells us to learn.
00:57:57 We’re going to learn object categories
00:57:59 that can be defined when we interact with the model.
00:58:03 So it’s interesting to see the evolution, right?
00:58:06 The way this started was we have a special language
00:58:10 that was a data set, a small data set
00:58:13 that we prompted the model with saying,
00:58:15 hey, here is a new classification task.
00:58:18 I’ll give you one image and the name,
00:58:21 which was an integer at the time of the image
00:58:24 and a different image and so on.
00:58:25 So you have a small prompt in the form of a data set,
00:58:30 a machine learning data set.
00:58:31 And then you got a system that could then predict
00:58:35 or classify these objects that you just
00:58:37 defined kind of on the fly.
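As an illustration of that older episodic setup, here is a minimal sketch; the Episode structure and build_prompt helper are hypothetical names, just to show that the "prompt" was literally a tiny labeled dataset followed by a query to classify on the fly.

```python
# A minimal sketch of the episodic few-shot setup described above: the "prompt"
# is itself a tiny labeled dataset, and the model must classify a new example
# on the fly. Names here are illustrative, not from any specific codebase.
from dataclasses import dataclass
from typing import List

@dataclass
class Episode:
    support_images: List        # a handful of example images
    support_labels: List[int]   # their labels, just integers like 0, 1, 2
    query_image: object         # the new image to classify

def build_prompt(episode: Episode):
    """Interleave (image, label) pairs followed by the query image.
    A meta-learned model conditions on this sequence and predicts the
    query's label, without any gradient update."""
    prompt = []
    for img, lbl in zip(episode.support_images, episode.support_labels):
        prompt.append(("image", img))
        prompt.append(("label", lbl))
    prompt.append(("image", episode.query_image))
    return prompt   # the model outputs a distribution over the support labels
```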
00:58:40 So fast forward, it was revealed that language models
00:58:46 are few shot learners.
00:58:47 That’s the title of the paper.
00:58:49 So very good title.
00:58:50 Sometimes titles are really good.
00:58:51 So this one is really, really good.
00:58:53 Because that’s the point of GPT3 that showed that, look, sure,
00:58:58 we can focus on object classification
00:59:00 and what meta learning means within the space of learning
00:59:04 object categories.
00:59:05 This goes beyond, or before rather,
00:59:07 to also Omniglot, before ImageNet and so on.
00:59:10 So there’s a few benchmarks.
00:59:11 To now, all of a sudden, we’re a bit unlocked from benchmarks.
00:59:15 And through language, we can define tasks.
00:59:17 So we’re literally telling the model
00:59:20 some logical task or a little thing that we wanted to do.
00:59:23 We prompt it much like we did before,
00:59:26 but now we prompt it through natural language.
00:59:28 And then not perfectly, I mean, these models have failure modes
00:59:32 and that’s fine, but these models then
00:59:35 are now doing a new task.
00:59:37 And so they meta learn this new capability.
00:59:40 Now, that’s where we are now.
00:59:43 Flamingo expanded this to visual and language,
00:59:47 but it basically has the same abilities.
00:59:49 You can teach it, for instance, an emergent property
00:59:52 was that you can take pictures of numbers
00:59:55 and then do arithmetic with the numbers just by teaching it,
00:59:59 oh, when I show you 3 plus 6, I want you to output 9.
01:00:03 And you show it a few examples, and now it does that.
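Here is a minimal text-only sketch of that few-shot prompting idea (in Flamingo the examples would include images of the numbers); the generate function is a stand-in for whatever language model you query, not a specific API.

```python
# A minimal sketch of in-context, few-shot prompting: the task is defined by
# a handful of examples in the prompt, and the model completes the pattern
# with no weight updates. `generate` is a stand-in for any language model call.
few_shot_prompt = (
    "3 + 6 = 9\n"
    "2 + 5 = 7\n"
    "4 + 4 = 8\n"
    "7 + 1 ="          # the model is expected to continue with " 8"
)

def solve_with_prompt(generate, prompt: str) -> str:
    """`generate` is any text-completion function; the few-shot examples
    teach the task in context."""
    return generate(prompt)
```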
01:00:06 So it went way beyond the ImageNet categorization of images
01:00:12 where we were maybe a bit stuck before this revelation
01:00:17 moment that happened in, I believe, 2019 or 2020;
01:00:19 it was after we last spoke, we checked.
01:00:21 In that way, it has solved meta learning
01:00:24 as was previously defined.
01:00:26 Yes, it expanded what it meant.
01:00:27 So that’s what you say, what does it mean?
01:00:29 So it’s an evolving term.
01:00:31 But here is maybe now looking forward,
01:00:35 looking at what’s happening, obviously,
01:00:38 in the community with more modalities, what we can expect.
01:00:42 And I would certainly hope to see the following.
01:00:45 And this is a pretty drastic hope.
01:00:48 But in five years, maybe we chat again.
01:00:51 And we have a system, a set of weights
01:00:55 that we can teach it to play StarCraft.
01:00:59 Maybe not at the level of AlphaStar,
01:01:01 but play StarCraft, a complex game,
01:01:03 we teach it through interactions to prompting.
01:01:06 You can certainly prompt a system.
01:01:08 That’s what Gata shows to play some simple Atari games.
01:01:11 So imagine if you start talking to a system,
01:01:15 teaching it a new game, showing it
01:01:17 examples of in this particular game,
01:01:20 this user did something good.
01:01:22 Maybe the system can even play and ask you questions.
01:01:25 Say, hey, I played this game.
01:01:27 I just played this game.
01:01:28 Did I do well?
01:01:29 Can you teach me more?
01:01:30 So five, maybe to 10 years, these capabilities,
01:01:34 or what meta learning means, will
01:01:36 be much more interactive, much richer,
01:01:38 and across domains that we used to specialize in.
01:01:41 So you see the difference.
01:01:42 We built AlphaStar specialized to play StarCraft.
01:01:47 The algorithms were general, but the weights were specialized.
01:01:50 And what we’re hoping is that we can teach a network
01:01:54 to play games, to play any game, just using games as an example,
01:01:58 through interacting with it, teaching it,
01:02:01 uploading the Wikipedia page of StarCraft.
01:02:04 This is in the horizon.
01:02:06 And obviously, there are details that need to be filled
01:02:09 and research needs to be done.
01:02:11 But that’s how I see meta learning above,
01:02:13 which is going to be beyond prompting.
01:02:15 It’s going to be a bit more interactive.
01:02:18 The system might tell us to give it feedback
01:02:20 after it maybe makes mistakes or it loses a game.
01:02:24 But it’s nonetheless very exciting
01:02:26 because if you think about this this way,
01:02:28 the benchmarks are already there.
01:02:30 We just repurposed the benchmarks.
01:02:33 So in a way, I like to map the space of what
01:02:38 maybe AGI means to say, OK, we went 101% performance in Go,
01:02:45 in Chess, in StarCraft.
01:02:47 The next iteration might be 20% performance
01:02:51 across, quote unquote, all tasks.
01:02:54 And even if it’s not as good, it’s fine.
01:02:57 We have ways to also measure progress
01:02:59 because we have those specialized agents and so on.
01:03:04 So this is, to me, very exciting.
01:03:06 And these next iteration models are definitely
01:03:10 hinting at that direction of progress,
01:03:13 which hopefully we can have.
01:03:14 There are obviously some things that
01:03:16 could go wrong in terms of we might not have the tools.
01:03:20 Maybe transformers are not enough.
01:03:22 There are some breakthroughs to come, which
01:03:24 makes the field more exciting to people like me as well,
01:03:27 of course.
01:03:28 But that’s, if you ask me, five to 10 years,
01:03:32 you might see these models that start
01:03:33 to look more like weights that are already trained.
01:03:36 And then it’s more about teaching or make
01:03:40 their meta learn what you’re trying
01:03:44 to induce in terms of tasks and so on,
01:03:47 well beyond the simple now tasks we’re
01:03:49 starting to see emerge like small arithmetic tasks
01:03:53 and so on.
01:03:54 So a few questions around that.
01:03:55 This is fascinating.
01:03:57 So that kind of teaching, interactive,
01:04:01 so it’s beyond prompting.
01:04:02 So it’s interacting with the neural network.
01:04:05 That’s different than the training process.
01:04:08 So it’s different than the optimization
01:04:12 over differentiable functions.
01:04:15 This is already trained.
01:04:17 And now you’re teaching, I mean, it’s
01:04:21 almost akin to the brain, the neurons already
01:04:25 set with their connections.
01:04:26 On top of that, you’re now using that infrastructure
01:04:30 to build up further knowledge.
01:04:33 So that’s a really interesting distinction that’s actually
01:04:37 not obvious from a software engineering perspective,
01:04:40 that there’s a line to be drawn.
01:04:42 Because you always think for a neural network to learn,
01:04:44 it has to be retrained, trained and retrained.
01:04:49 And prompting is a way of teaching.
01:04:54 And you’ll now work a little bit of context
01:04:55 about whatever the heck you’re trying it to do.
01:04:57 So you can maybe expand this prompting capability
01:05:00 by making it interact.
01:05:03 That’s really, really interesting.
01:05:04 By the way, this is not new if you look way back
01:05:08 at different ways to tackle even classification tasks.
01:05:11 So this comes from longstanding literature
01:05:16 in machine learning.
01:05:18 What I’m suggesting could sound to some
01:05:20 like a bit like nearest neighbor.
01:05:23 So nearest neighbor is almost the simplest algorithm
01:05:27 that does not require learning.
01:05:30 So it has this interesting, you don’t
01:05:32 need to compute gradients.
01:05:34 And what nearest neighbor does is you, quote unquote,
01:05:37 have a data set or upload a data set.
01:05:39 And then all you need to do is a way
01:05:42 to measure distance between points.
01:05:44 And then to classify a new point,
01:05:46 you’re just simply computing, what’s
01:05:48 the closest point in this massive amount of data?
01:05:51 And that’s my answer.
01:05:52 So you can think of prompting in a way
01:05:55 as you’re uploading not just simple points.
01:05:58 And the metric is not the distance between the images
01:06:02 or something simple.
01:06:03 It’s something that you compute that’s much more advanced.
01:06:06 But in a way, it’s very similar.
01:06:09 You simply are uploading some knowledge
01:06:12 to this pretrained system, as in nearest neighbor.
01:06:15 Maybe the metric is learned or not,
01:06:17 but you don't need to further train it.
01:06:19 And then now you immediately get a classifier out of this.
01:06:23 Now it’s just an evolution of that concept,
01:06:25 very classical concept in machine learning, which
01:06:28 is just learning through what’s the closest point, closest
01:06:32 by some distance, and that’s it.
01:06:34 It’s an evolution of that.
01:06:36 And I will say how I saw meta learning when
01:06:39 we worked on a few ideas in 2016 was precisely
01:06:44 through the lens of nearest neighbor, which
01:06:47 is very common in computer vision community.
01:06:50 There’s a very active area of research
01:06:52 about how do you compute the distance between two images.
01:06:55 But if you have a good distance metric,
01:06:57 you also have a good classifier.
01:06:59 All I’m saying is now these distances and the points
01:07:02 are not just images.
01:07:03 They’re like words or sequences of words and images
01:07:08 and actions that teach you something new.
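A minimal sketch of that nearest-neighbor view, assuming some embedding function; no gradients are computed, the "uploaded" examples plus a distance are the whole classifier. For prompting, the points would be sequences of words, images, and actions, and the metric something the pretrained model itself computes.

```python
# A minimal sketch of nearest-neighbor classification as described above:
# no learning step, just a distance between an embedded query and the
# embedded "uploaded" examples. `embed` is an assumed helper (fixed or learned).
import numpy as np

def nearest_neighbor_predict(embed, support_x, support_y, query):
    """support_x, support_y: the uploaded examples and their labels.
    query: the new point to classify."""
    support_vecs = np.stack([embed(x) for x in support_x])
    q = embed(query)
    dists = np.linalg.norm(support_vecs - q, axis=1)   # Euclidean distance
    return support_y[int(np.argmin(dists))]            # label of the closest point
```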
01:07:10 But it might be that technique wise those come back.
01:07:14 And I will say that it’s not necessarily true
01:07:18 that you might not ever train the weights a bit further.
01:07:21 Some aspect of meta learning, some techniques
01:07:24 in meta learning do actually do a bit of fine tuning
01:07:28 as it’s called.
01:07:29 They train the weights a little bit when they get a new task.
01:07:32 So as for what I call the how, how we're going to achieve this,
01:07:37 as a deep learner, I'm very skeptical.
01:07:39 We’re going to try a few things, whether it’s
01:07:41 a bit of training, adding a few parameters,
01:07:44 thinking of these as nearest neighbor,
01:07:45 or just simply thinking of there’s a sequence of words,
01:07:49 it’s a prefix.
01:07:50 And that’s the new classifier.
01:07:53 We’ll see.
01:07:53 There’s the beauty of research.
01:07:55 But what’s important is that is a good goal in itself
01:08:00 that I see as very worthwhile pursuing for the next stages
01:08:03 of not only meta learning.
01:08:05 I think this is basically what’s exciting about machine learning
01:08:10 period to me.
01:08:11 Well, and the interactive aspect of that
01:08:13 is also very interesting, the interactive version
01:08:16 of nearest neighbor to help you pull out the classifier
01:08:22 from this giant thing.
01:08:23 OK, is this the way we can go in 5, 10 plus years
01:08:31 from any task, sorry, from many tasks to any task?
01:08:38 And what does that mean?
01:08:39 What does it need to be actually trained on?
01:08:42 At which point has the network had enough?
01:08:45 So what does a network need to learn about this world
01:08:50 in order to be able to perform any task?
01:08:52 Is it just as simple as language, image, and action?
01:08:57 Or do you need some set of representative images?
01:09:02 Like if you only see land images,
01:09:05 will you know anything about underwater?
01:09:06 Is that somehow fundamentally different?
01:09:08 I don’t know.
01:09:09 I mean, those are open questions, I would say.
01:09:12 I mean, the way you put it, let me maybe further your example.
01:09:15 If all you see is land images but you’re
01:09:18 reading all about land and water worlds
01:09:21 but in books, imagine, would that be enough?
01:09:25 Good question.
01:09:26 We don’t know.
01:09:27 But I guess maybe you can join us
01:09:30 if you want in our quest to find this.
01:09:32 That’s precisely.
01:09:33 Water world, yeah.
01:09:34 Yes, that’s precisely, I mean, the beauty of research.
01:09:37 And that’s the research business we’re in,
01:09:42 I guess, is to figure this out and ask the right questions
01:09:46 and then iterate with the whole community,
01:09:49 publishing findings and so on.
01:09:52 But yeah, this is a question.
01:09:55 It’s not the only question, but it’s certainly, as you ask,
01:09:58 on my mind constantly.
01:10:00 And so we’ll need to wait for maybe the, let’s say, five
01:10:03 years, let’s hope it’s not 10, to see what are the answers.
01:10:09 Some people will largely believe in unsupervised or
01:10:12 self supervised learning of single modalities
01:10:15 and then crossing them.
01:10:18 Some people might think end to end learning is the answer.
01:10:21 Modularity is maybe the answer.
01:10:23 So we don’t know, but we’re just definitely excited
01:10:27 to find out.
01:10:27 But it feels like this is the right time
01:10:29 and we’re at the beginning of this journey.
01:10:31 We’re finally ready to do these kind of general big models
01:10:36 and agents.
01:10:37 Is there some sort of specific technical thing
01:10:42 about Gato, Flamingo, Chinchilla, Gopher, any of these,
01:10:48 that is especially beautiful, that was surprising, maybe?
01:10:51 Is there something that just jumps out at you?
01:10:55 Of course, there’s the general thing of like,
01:10:57 you didn’t think it was possible and then you
01:11:00 realize it’s possible in terms of the generalizability
01:11:03 across modalities and all that kind of stuff.
01:11:05 Or maybe how small of a network, relatively speaking,
01:11:08 Gato is, all that kind of stuff.
01:11:10 But is there some weird little things that were surprising?
01:11:15 Look, I’ll give you an answer that’s very important
01:11:18 because maybe people don’t quite realize this,
01:11:22 but the teams behind these efforts, the actual humans,
01:11:27 that’s maybe the surprising in an obviously positive way.
01:11:31 So anytime you see these breakthroughs,
01:11:34 I mean, it’s easy to map it to a few people.
01:11:37 There’s people that are great at explaining things and so on.
01:11:39 And that’s very nice.
01:11:40 But maybe the learnings or the meta learnings
01:11:44 that I get as a human about this are, sure, we can move forward.
01:11:50 But the surprising bit is how important
01:11:55 all the pieces of these projects are,
01:11:58 and how they come together.
01:12:00 So I’ll give you maybe some of the ingredients of success
01:12:04 that are common across these, but not the obvious ones
01:12:07 on machine learning.
01:12:08 I can always also give you those.
01:12:11 But basically, engineering is critical.
01:12:17 Very good engineering, because ultimately we're
01:12:21 collecting data sets, right?
01:12:23 So the engineering of data and then
01:12:26 of deploying the models at scale into some compute cluster
01:12:31 that cannot be overstated; it is a huge factor of success.
01:12:36 And it’s hard to believe that details matter so much.
01:12:41 We would like to believe that it’s
01:12:43 true that there is more and more of a standard formula,
01:12:47 as I was saying, like this recipe that
01:12:49 works for everything.
01:12:50 But then when you zoom into each of these projects,
01:12:53 then you realize the devil is indeed in the details.
01:12:57 And then the teams have to work together towards these goals.
01:13:03 So engineering of data and obviously clusters
01:13:07 and large scale is very important.
01:13:09 And then one that is often underappreciated, though maybe nowadays it is more clear,
01:13:15 is benchmark progress, right?
01:13:17 So we’re talking here about multiple months of tens
01:13:20 of researchers and people that are
01:13:24 trying to organize the research and so on working together.
01:13:28 And you don’t know that you can get there.
01:13:32 I mean, this is the beauty.
01:13:34 If you’re not risking to trying to do something
01:13:37 that feels impossible, you’re not going to get there.
01:13:41 But you need a way to measure progress.
01:13:43 So the benchmarks that you build are critical.
01:13:47 I’ve seen this beautifully play out in many projects.
01:13:50 I mean, maybe the one I’ve seen it more consistently,
01:13:53 which means we establish the metric,
01:13:56 actually the community did.
01:13:58 And then we leverage that massively is alpha fold.
01:14:01 This is a project where the data, the metrics
01:14:05 were all there.
01:14:06 And all it took was, and it’s easier said than done,
01:14:09 an amazing team working not to try
01:14:12 to find some incremental improvement
01:14:14 and publish, which is one way to do research that is valid,
01:14:17 but aim very high and work literally for years
01:14:22 to iterate over that process.
01:14:24 And working for years with the team,
01:14:25 I mean, it is tricky that also happened to happen partly
01:14:30 during a pandemic and so on.
01:14:32 So I think my meta learning from all this
01:14:34 is the teams are critical to the success.
01:14:37 And then, now going to the machine learning part,
01:14:40 what's surprising is, so we like architectures,
01:14:46 like neural networks.
01:14:48 And I would say this was a very rapidly evolving field
01:14:53 until the transformer came.
01:14:54 So attention might indeed be all you need,
01:14:58 which is the title, also a good title,
01:15:00 although it's only in hindsight that it's good.
01:15:02 I don’t think at the time I thought
01:15:03 this is a great title for a paper.
01:15:05 But that architecture is proving that, in the dream of modeling
01:15:10 sequences of any bytes, there is something there that will stick.
01:15:15 And with these advances in architectures,
01:15:18 in how neural networks are architected
01:15:21 to do what they do,
01:15:23 It’s been hard to find one that has been so stable
01:15:26 and relatively has changed very little
01:15:28 since it was invented five or so years ago.
01:15:33 So that is a surprise
01:15:35 that keeps recurring in other projects.
01:15:38 Try to, on a philosophical or technical level, introspect,
01:15:43 what is the magic of attention?
01:15:45 What is attention?
01:15:47 That’s attention in people that study cognition,
01:15:50 so human attention.
01:15:52 I think there’s giant wars over what attention means,
01:15:55 how it works in the human mind.
01:15:57 So there’s very simple looks at what
01:16:00 attention is in a neural network from the days of attention
01:16:03 is all you need.
01:16:04 But do you think there’s a general principle that’s
01:16:07 really powerful here?
01:16:08 Yeah, so a distinction between transformers and LSTMs,
01:16:13 which were what came before.
01:16:15 And there was a transitional period
01:16:17 where you could use both.
01:16:19 In fact, when we talked about AlphaStar,
01:16:22 we used transformers and LSTMs.
01:16:24 So it was still the beginning of transformers.
01:16:26 They were very powerful.
01:16:27 But LSTMs were also very powerful sequence models.
01:16:31 So the power of the transformer is
01:16:35 that it has built in what we call
01:16:38 an inductive bias of attention that makes the model.
01:16:43 When you think of a sequence of integers,
01:16:45 like we discussed this before, this is a sequence of words.
01:16:50 When you have to do very hard tasks over these words,
01:16:54 this could be we’re going to translate a whole paragraph
01:16:57 or we’re going to predict the next paragraph given
01:17:00 10 paragraphs before.
01:17:04 There’s some loose intuition from how we do it as a human
01:17:10 that is very nicely mimicked and replicated structurally
01:17:15 speaking in the transformer, which
01:17:16 is this idea of you’re looking for something.
01:17:21 So you’re sort of when you just read a piece of text,
01:17:25 now you’re thinking what comes next.
01:17:27 You might want to relook at the text or look at it from scratch.
01:17:31 I mean, literally, because there's no recurrence.
01:17:35 You’re just thinking what comes next.
01:17:37 And it’s almost hypothesis driven.
01:17:40 So if I’m thinking the next word that I write is cat or dog,
01:17:46 the way the transformer works almost philosophically
01:17:49 is it has these two hypotheses.
01:17:52 Is it going to be cat or is it going to be dog?
01:17:55 And then it says, OK, if it’s cat,
01:17:58 I’m going to look for certain words.
01:17:59 Not necessarily cat, although cat is an obvious word
01:18:01 you would look in the past to see
01:18:03 whether it makes more sense to output cat or dog.
01:18:05 And then it does some very deep computation
01:18:09 over the words and beyond.
01:18:11 So it combines the words, but it has the query
01:18:16 as we call it that is cat.
01:18:18 And then similarly for dog.
01:18:20 And so it’s a very computational way to think about, look,
01:18:24 if I’m thinking deeply about text,
01:18:27 I need to go back to look at all of the text, attend over it.
01:18:30 But it’s not just attention.
01:18:32 What is guiding the attention?
01:18:34 And that was the key insight from an earlier paper:
01:18:36 it's not how far away it is.
01:18:39 I mean, how far away it is, is important.
01:18:40 What did I just write about?
01:18:42 That’s critical.
01:18:44 But what you wrote about 10 pages ago
01:18:46 might also be critical.
01:18:48 So you’re looking not positionally, but content wise.
01:18:53 And transformers have this beautiful way
01:18:56 to query for certain content and pull it out
01:18:59 in a compressed way.
01:19:00 So then you can make a more informed decision.
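As a concrete reference point, here is a minimal sketch of the content-based attention being described: standard scaled dot-product attention over queries, keys, and values, written in NumPy with illustrative shapes.

```python
# A minimal sketch of content-based attention: each query scores all positions
# by content (not by position), and pulls back a compressed, weighted summary
# of what it found. Shapes are [sequence_length, dimension] for simplicity.
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Q: what I'm looking for (e.g. evidence for 'cat' vs 'dog').
    K: what each past token offers. V: the content pulled back if it matches."""
    scores = Q @ K.T / np.sqrt(Q.shape[-1])   # content similarity, any distance away
    weights = softmax(scores, axis=-1)        # how much to attend to each position
    return weights @ V                        # compressed, content-addressed summary
```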
01:19:02 I mean, that’s one way to explain transformers.
01:19:05 But I think it’s a very powerful inductive bias.
01:19:10 There might be some details that might change over time,
01:19:12 but I think that is what makes transformers so much more
01:19:17 powerful than the recurrent networks that
01:19:20 were more recency bias based, which obviously works
01:19:23 in some tasks, but it has major flaws.
01:19:26 Transformer itself has flaws.
01:19:29 And I think the main one, the main challenge
01:19:31 is these prompts that we just were talking about,
01:19:35 they can be 1,000 words long.
01:19:38 But if I’m teaching you StarGraph,
01:19:40 I’ll have to show you videos.
01:19:41 I’ll have to point you to whole Wikipedia articles
01:19:44 about the game.
01:19:46 We’ll have to interact probably as you play.
01:19:48 You’ll ask me questions.
01:19:49 The context required for us to achieve
01:19:52 me being a good teacher to you on the game
01:19:54 as you would want to do it with a model, I think
01:19:58 goes well beyond the current capabilities.
01:20:01 So the question is, how do we benchmark this?
01:20:03 And then how do we change the structure of the architectures?
01:20:07 I think there’s ideas on both sides,
01:20:08 but we’ll have to see empirically, obviously,
01:20:11 what ends up working.
01:20:13 And as you talked about, some of the ideas
01:20:15 could be keeping the constraint of that length in place,
01:20:19 but then forming hierarchical representations
01:20:23 to where you can start being much clever in how
01:20:26 you use those 1,000 tokens.
01:20:28 Indeed.
01:20:30 Yeah, that’s really interesting.
01:20:32 But it also is possible that this attentional mechanism
01:20:34 where you basically, you don’t have a recency bias,
01:20:37 but you look more generally, you make it learnable.
01:20:42 The mechanism in which way you look back into the past,
01:20:45 you make that learnable.
01:20:46 It’s also possible we’re at the very beginning of that
01:20:50 because that, you might become smarter and smarter
01:20:54 in the way you query the past.
01:20:58 So recent past and distant past and maybe very, very distant
01:21:01 past.
01:21:02 So almost like the attention mechanism
01:21:04 will have to improve and evolve as good as the tokenization
01:21:11 mechanism so you can represent long term memory somehow.
01:21:14 Yes.
01:21:16 And I mean, hierarchies are very,
01:21:18 I mean, it’s a very nice word that sounds appealing.
01:21:22 There’s lots of work adding hierarchy to the memories.
01:21:25 In practice, it does seem like we keep coming back
01:21:29 to the main formula or main architecture.
01:21:33 That sometimes tells us something.
01:21:35 There is this sentence that a friend of mine told me,
01:21:38 like, an idea either wants to work or it doesn't.
01:21:41 So Transformer was clearly an idea that wanted to work.
01:21:44 And then I think there’s some principles
01:21:47 we believe will be needed.
01:21:49 But finding the exact details, details matter so much.
01:21:52 That’s going to be tricky.
01:21:54 I love the idea that there’s like you as a human being,
01:21:59 you want some ideas to work.
01:22:01 And then there’s the model that wants some ideas
01:22:03 to work and you get to have a conversation
01:22:05 to see which wins, and more likely the model will win in the end.
01:22:10 Because it’s the one, you don’t have to do any work.
01:22:12 The model is the one that has to do the work.
01:22:14 So you should listen to the model.
01:22:15 And I really love this idea that you
01:22:17 talked about the humans in this picture.
01:22:19 If I could just briefly ask, one is you’re
01:22:21 saying the benchmarks about the modular humans working on this,
01:22:28 the benchmarks providing a sturdy ground for the wish
01:22:32 to do these things that seem impossible.
01:22:34 They give you, in the darkest of times,
01:22:37 hope, because of little signs of improvement.
01:22:41 Yes.
01:22:42 Like somehow you’re not lost if you have metrics
01:22:46 to measure your improvement.
01:22:48 And then there’s other aspect.
01:22:50 You said elsewhere and here today, like titles matter.
01:22:56 I wonder how much humans matter in the evolution
01:23:01 of all of this, meaning individual humans.
01:23:06 Something about their interactions,
01:23:08 something about their ideas, how much they change
01:23:11 the direction of all of this.
01:23:12 Like if you change the humans in this picture,
01:23:15 is it that the model is sitting there
01:23:18 and it wants some idea to work?
01:23:22 Or is it the humans, or maybe the model
01:23:25 is providing you 20 ideas that could work.
01:23:27 And depending on the humans you pick,
01:23:29 they’re going to be able to hear some of those ideas.
01:23:33 Because you’re now directing all of deep learning and deep mind,
01:23:35 you get to interact with a lot of projects,
01:23:37 a lot of brilliant researchers.
01:23:40 How much variability is created by the humans in all of this?
01:23:44 Yeah, I mean, I do believe humans matter a lot,
01:23:47 at the very least at the time scale of years
01:23:53 on when things are happening and what’s the sequencing of it.
01:23:56 So you get to interact with people that, I mean,
01:24:00 you mentioned this.
01:24:02 Some people really want some idea to work
01:24:05 and they’ll persist.
01:24:07 And then some other people might be more practical,
01:24:09 like I don’t care what idea works.
01:24:12 I care about cracking protein folding.
01:24:16 And at least these two kind of seem opposite sides.
01:24:21 We need both.
01:24:22 And we’ve clearly had both historically,
01:24:25 and that made certain things happen earlier or later.
01:24:28 So definitely humans involved in all of this endeavor
01:24:33 have had, I would say, years of change or of ordering
01:24:38 how things have happened, which breakthroughs came before,
01:24:41 which other breakthroughs, and so on.
01:24:43 So certainly that does happen.
01:24:45 And so one other, maybe one other axis of distinction
01:24:50 is what I'd call, and this is most commonly used
01:24:53 in reinforcement learning, the exploration exploitation
01:24:56 trade off as well.
01:24:57 It’s not exactly what I meant, although quite related.
01:25:00 So when you start trying to help others,
01:25:07 like you become a bit more of a mentor
01:25:11 to a large group of people, be it a project or the deep
01:25:14 learning team or something, or even in the community
01:25:17 when you interact with people in conferences and so on,
01:25:20 you’re identifying quickly some things that are explorative
01:25:26 or exploitative.
01:25:27 And it’s tempting to try to guide people, obviously.
01:25:30 I mean, that’s what makes our experience.
01:25:33 We bring it, and we try to shape things sometimes wrongly.
01:25:36 And there’s many times that I’ve been wrong in the past.
01:25:39 That’s great.
01:25:40 But it would be wrong to dismiss any sort of the research
01:25:47 styles that I’m observing.
01:25:49 And I often get asked, well, you’re in industry, right?
01:25:52 So we do have access to large compute scale and so on.
01:25:55 So there are certain kinds of research
01:25:57 I almost feel like we need to do responsibly and so on.
01:26:01 But it is as if we have the particle accelerator here,
01:26:05 so to speak, of physics.
01:26:06 So we need to use it.
01:26:07 We need to answer the questions that we
01:26:09 should be answering right now for the scientific progress.
01:26:12 But then at the same time, I look at many advances,
01:26:15 including attention, which was discovered in Montreal
01:26:19 initially because of lack of compute, right?
01:26:22 So we were working on sequence to sequence
01:26:24 with my friends over at Google Brain at the time.
01:26:27 And we were using, I think, eight GPUs,
01:26:30 which was somehow a lot at the time.
01:26:32 And then I think Montreal was a bit more limited in the scale.
01:26:36 But then they discovered this content based attention
01:26:38 concept that then has obviously triggered things
01:26:42 like Transformer.
01:26:43 Not everything starts at the Transformer, obviously.
01:26:46 There’s always a history that is important to recognize
01:26:49 because then you can make sure that those who might feel
01:26:53 now, well, we don't have so much compute,
01:26:56 you can then help them optimize
01:27:00 the kind of research that might actually
01:27:02 produce amazing change.
01:27:04 Perhaps it’s not as short term as some of these advancements
01:27:07 or perhaps it’s a different time scale.
01:27:09 But the people and the diversity of the field
01:27:13 is quite critical that we maintain it.
01:27:15 And at times, especially mixed a bit with hype or other things,
01:27:19 it’s a bit tricky to be observing maybe
01:27:23 too much of the same thinking across the board.
01:27:27 But the humans definitely are critical.
01:27:30 And I can think of quite a few personal examples
01:27:33 where also someone told me something
01:27:36 that had a huge effect onto some idea.
01:27:40 And then that’s why I’m saying at least in terms of years,
01:27:43 probably some things do happen.
01:27:44 Yeah, it’s fascinating.
01:27:46 And it’s also fascinating how constraints somehow
01:27:48 are essential for innovation.
01:27:51 And the other thing you mentioned about engineering,
01:27:53 I have a sneaking suspicion.
01:27:54 Maybe I overstate it; my love is with engineering.
01:28:00 So I have a sneaky suspicion that all the genius,
01:28:04 a large percentage of the genius is
01:28:06 in the tiny details of engineering.
01:28:09 So I think we like to think
01:28:14 the genius is in the big ideas.
01:28:17 I have a sneaking suspicion that because I’ve
01:28:20 seen the genius of details, of engineering details,
01:28:24 make the night and day difference.
01:28:28 And I wonder if those kind of have a ripple effect over time.
01:28:32 So that too, so that’s sort of taking the engineering
01:28:36 perspective that sometimes that quiet innovation
01:28:39 at the level of an individual engineer
01:28:41 or maybe at the small scale of a few engineers
01:28:44 can make all the difference.
01:28:46 Because we’re working on computers that
01:28:50 are scaled across large groups, that one engineering decision
01:28:55 can lead to ripple effects.
01:28:57 It’s interesting to think about.
01:28:59 Yeah, I mean, engineering, there’s
01:29:01 also kind of a historical, it might be a bit random.
01:29:06 Because if you think of the history of how especially
01:29:10 deep learning and neural networks took off,
01:29:12 feels like a bit random because GPUs happened
01:29:16 to be there at the right time for a different purpose, which
01:29:19 was to play video games.
01:29:20 So even the engineering that goes into the hardware
01:29:24 and it might have a time, the time frame
01:29:27 might be very different.
01:29:28 I mean, the GPUs evolved throughout many years
01:29:31 when we weren't even looking at that.
01:29:33 So even at that level, that revolution, so to speak,
01:29:38 the ripples are like, we’ll see when they stop.
01:29:42 But in terms of thinking of why is this happening,
01:29:46 I think that when I try to categorize it
01:29:49 in sort of things that might not be so obvious,
01:29:52 I mean, clearly there’s a hardware revolution.
01:29:54 We are surfing thanks to that.
01:29:58 Data centers as well.
01:29:59 I mean, data centers are like, I mean, at Google,
01:30:02 for instance, obviously they’re serving Google.
01:30:04 But there’s also now thanks to that
01:30:06 and to have built such amazing data centers,
01:30:09 we can train these models.
01:30:11 Software is an important one.
01:30:13 I think if I look at the state of how
01:30:16 I had to implement things to implement my ideas,
01:30:20 how I discarded ideas because they were too hard
01:30:22 to implement.
01:30:23 Yeah, clearly the times have changed.
01:30:25 And thankfully, we are in a much better software position
01:30:28 as well.
01:30:29 And then, I mean, obviously there’s
01:30:31 research that happens at scale and more people
01:30:34 enter the field.
01:30:35 That’s great to see.
01:30:35 But it’s almost enabled by these other things.
01:30:38 And last but not least is also data, right?
01:30:40 Curating data sets, labeling data sets,
01:30:43 these benchmarks we think about.
01:30:44 Maybe we’ll want to have all the benchmarks in one system.
01:30:48 But it’s still very valuable that someone
01:30:51 put the thought and the time and the vision
01:30:53 to build certain benchmarks.
01:30:54 We’ve seen progress thanks to.
01:30:56 But we’re going to repurpose the benchmarks.
01:30:59 That’s the beauty of Atari is like we solved it in a way.
01:31:04 But we use it in Gato.
01:31:06 It was critical.
01:31:06 And I’m sure there’s still a lot more
01:31:09 to do thanks to that amazing benchmark
01:31:10 that someone took the time to put,
01:31:13 even though at the time maybe, oh, you
01:31:15 have to think what’s the next iteration of architectures.
01:31:19 That’s what maybe the field recognizes.
01:31:21 But that’s another thing we need to balance
01:31:24 in terms of the humans behind it.
01:31:25 We need to recognize all these aspects
01:31:27 because they’re all critical.
01:31:29 And we tend to think of the genius, the scientist,
01:31:33 and so on.
01:31:34 But I’m glad I know you have a strong engineering background.
01:31:38 But also, I’m a lover of data.
01:31:40 And the pushback on the engineering comment
01:31:43 ultimately could be the creators of benchmarks
01:31:46 who have the most impact.
01:31:47 Andrej Karpathy, who you mentioned,
01:31:49 has recently been talking a lot of trash about ImageNet, which
01:31:52 he has the right to do because of how critical he was to
01:31:54 ImageNet, how essential he was to the development
01:31:57 and the success of deep learning around ImageNet.
01:32:01 And he’s saying that that’s actually
01:32:02 that benchmark is holding back the field.
01:32:05 Because I mean, especially in his context on Tesla Autopilot,
01:32:09 that’s looking at real world behavior of a system.
01:32:14 There’s something fundamentally missing
01:32:16 about ImageNet that doesn’t capture
01:32:17 the real worldness of things.
01:32:20 That we need to have data sets, benchmarks that
01:32:23 have the unpredictability, the edge cases, whatever
01:32:27 the heck it is that makes the real world so
01:32:30 difficult to operate in.
01:32:32 We need to have benchmarks of that.
01:32:34 But just to think about the impact of ImageNet
01:32:37 as a benchmark, and that really puts a lot of emphasis
01:32:42 on the importance of a benchmark,
01:32:43 both sort of internally at DeepMind and as a community.
01:32:46 So one is coming in from within, like,
01:32:50 how do I create a benchmark for me to mark and make progress?
01:32:55 And how do I make a benchmark for the community
01:32:58 to mark and push progress?
01:33:02 You have this amazing paper you coauthored,
01:33:05 a survey paper called Emergent Abilities
01:33:08 of Large Language Models.
01:33:10 It has, again, the philosophy here
01:33:12 that I’d love to ask you about.
01:33:14 What’s the intuition about the phenomena of emergence
01:33:17 in neural networks transformed as language models?
01:33:20 Is there a magic threshold beyond which
01:33:24 we start to see certain performance?
01:33:27 And is that different from task to task?
01:33:29 Is that us humans just being poetic and romantic?
01:33:32 Or is there literally some level at which we start
01:33:36 to see breakthrough performance?
01:33:38 Yeah, I mean, this is a property that we start seeing in these systems.
01:33:43 So in machine learning,
01:33:48 traditionally, again, going back to benchmarks,
01:33:51 I mean, if you have some input, output, right,
01:33:54 like that is just a single input and a single output,
01:33:58 you generally, when you train these systems,
01:34:01 you see reasonably smooth curves when
01:34:04 you analyze how much the data set size affects
01:34:10 the performance, or how the model size affects
01:34:12 the performance, or how long you train the system for affects
01:34:18 the performance, right?
01:34:19 So if we think of ImageNet, the training curves
01:34:23 look fairly smooth and predictable in a way.
01:34:28 And I would say that’s probably because it’s
01:34:31 kind of a one hop reasoning task, right?
01:34:36 It’s like, here is an input, and you
01:34:39 think for a few milliseconds or 100 milliseconds, 300
01:34:42 as a human, and then you tell me,
01:34:44 yeah, there’s an alpaca in this image.
01:34:47 So in language, we are seeing benchmarks that require more
01:34:55 pondering and more thought in a way, right?
01:34:58 This is just kind of you need to look for some subtleties.
01:35:02 It involves inputs that you might think of,
01:35:05 even if the input is a sentence describing
01:35:08 a mathematical problem, there is a bit more processing
01:35:13 required as a human and more introspection.
01:35:15 So I think how these benchmarks work
01:35:20 means that there is actually a threshold.
01:35:24 Just going back to how transformers
01:35:26 work in this way of querying for the right questions
01:35:29 to get the right answers, that might
01:35:31 mean that performance becomes random
01:35:35 until the right question is asked
01:35:37 by the querying system of a transformer or of a language
01:35:40 model like a transformer.
01:35:42 And then only then you might start
01:35:46 seeing performance going from random to nonrandom.
01:35:50 And this is more empirical.
01:35:53 There’s no formalism or theory behind this yet,
01:35:56 although it might be quite important.
01:35:57 But we are seeing these phase transitions
01:36:00 of random performance until some,
01:36:03 let’s say, scale of a model.
01:36:04 And then it goes beyond that.
01:36:06 And it might be that you need to fit
01:36:10 a few low order bits of thought before you can make progress
01:36:16 on the whole task.
01:36:17 And if you could measure, actually,
01:36:19 those breakdown of the task, maybe you
01:36:22 would see more smooth, like, yeah,
01:36:25 once you get these and these and these and these and these,
01:36:27 then you start making progress in the task.
01:36:30 But it’s somehow a bit annoying because then it
01:36:35 means that certain questions we might ask about architectures
01:36:40 possibly can only be answered at a certain scale.
01:36:42 And one thing that, conversely, I’ve
01:36:46 seen great progress on in the last couple of years
01:36:49 is this notion of science of deep learning and science
01:36:53 of scale in particular.
01:36:55 So on the negative is that there are
01:36:57 some benchmarks for which progress might
01:37:01 need to be measured at minimum at a certain scale
01:37:04 until you see then what details of the model
01:37:07 matter to make that performance better.
01:37:09 So that’s a bit of a con.
01:37:11 But what we’ve also seen is that you can empirically
01:37:17 analyze behavior of models at scales that are smaller.
01:37:22 So let’s say, to put an example, we
01:37:25 had this Chinchilla paper that revised the so called scaling
01:37:30 laws of models.
01:37:31 And that whole study is done at a reasonably small scale,
01:37:35 that may be hundreds of millions up to 1 billion parameters.
01:37:38 And then the cool thing is that you create some laws,
01:37:41 some scaling laws; you extract trends from the data
01:37:45 you see, so that, OK, it looks like the amount of data required
01:37:49 to train a 10x larger model would be this.
01:37:52 And these laws so far, these extrapolations
01:37:55 have helped us save compute and just get to a better place
01:37:59 in terms of the science of how should we
01:38:02 run these models at scale, how much data, how much depth,
01:38:05 and all sorts of questions we start
01:38:07 asking extrapolating from a small scale.
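A minimal sketch of that empirical extrapolation (not the actual Chinchilla analysis, and the numbers below are made up for illustration): fit a power law to losses measured at small scales in log-log space, then read off a prediction for a 10x larger model.

```python
# An illustrative sketch of extrapolating an empirical scaling law from small
# runs; the measurements below are hypothetical, purely for demonstration.
import numpy as np

# Hypothetical measurements at small scale: (parameter count, final loss).
params = np.array([1e8, 2e8, 4e8, 8e8, 1.6e9])
losses = np.array([3.10, 2.95, 2.82, 2.70, 2.59])

# Fit loss ~ a * N^slope in log-log space (the slope comes out negative).
slope, log_a = np.polyfit(np.log(params), np.log(losses), deg=1)
predict = lambda n: np.exp(log_a) * n ** slope

# Extrapolated loss for a model 10x larger than the biggest measured point.
print(predict(1.6e10))
```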
01:38:10 But then this emergence sadly means that not everything
01:38:13 can be extrapolated from scale, depending on the benchmark.
01:38:16 And maybe the harder benchmarks are not
01:38:19 so good for extracting these laws.
01:38:21 But we have a variety of benchmarks at least.
01:38:24 So I wonder to which degree the threshold, the phase shift
01:38:29 scale is a function of the benchmark.
01:38:32 So some of the science of scale might
01:38:35 be engineering benchmarks where that threshold is low,
01:38:40 sort of taking a main benchmark and reducing it somehow
01:38:46 where the essential difficulty is left
01:38:48 but the scale at which the emergence happens
01:38:51 is lower just for the science aspect of it
01:38:54 versus the actual real world aspect.
01:38:56 Yeah, so luckily we have quite a few benchmarks, some of which
01:38:59 are simpler or maybe they’re more like I think people might
01:39:02 call these system one versus system two style.
01:39:05 So I think what we’re not seeing luckily
01:39:09 is that extrapolations from maybe slightly more smooth
01:39:14 or simpler benchmarks are translating to the harder ones.
01:39:18 But that is not to say that this extrapolation will
01:39:21 hit its limits.
01:39:22 And when it does, then how much we scale or how we scale
01:39:27 will sadly be a bit suboptimal until we find better laws.
01:39:31 And these laws, again, are very empirical laws.
01:39:33 They’re not like physical laws of models,
01:39:35 although I wish there would be better theory about these
01:39:39 things as well.
01:39:40 But so far, I would say empirical theory,
01:39:43 as I call it, is way ahead than actual theory
01:39:46 of machine learning.
01:39:47 Let me ask you almost for fun.
01:39:50 So this is not, Oriol, as a deep mind person or anything
01:39:55 to do with deep mind or Google, just as a human being,
01:39:59 looking at these news of a Google engineer who claimed
01:40:04 that, I guess, the LaMDA language model was sentient.
01:40:11 And you still need to look into the details of this.
01:40:14 But making an official report and the claim
01:40:19 that he believes there’s evidence that this system has
01:40:23 achieved sentience.
01:40:25 And I think this is a really interesting case
01:40:29 on a human level, on a psychological level,
01:40:31 on a technical machine learning level of how language models
01:40:37 transform our world, and also just philosophical level
01:40:39 of the role of AI systems in a human world.
01:40:44 So what do you find interesting?
01:40:48 What’s your take on all of this as a machine learning
01:40:51 engineer and a researcher and also as a human being?
01:40:54 Yeah, I mean, a few reactions.
01:40:57 Quite a few, actually.
01:40:58 Have you ever briefly thought, is this thing sentient?
01:41:02 Right, so never, absolutely never.
01:41:04 Like even with AlphaStar?
01:41:06 Wait a minute.
01:41:08 Sadly, though, I think, yeah, sadly, I have not.
01:41:11 Yeah, I think the current, any of the current models,
01:41:15 although very useful and very good,
01:41:18 yeah, I think we’re quite far from that.
01:41:22 And there’s kind of a converse side story.
01:41:25 So one of my passions is about science in general.
01:41:30 And I think I feel I’m a bit of a failed scientist.
01:41:34 That’s why I came to machine learning,
01:41:36 because you always feel, and you start seeing this,
01:41:40 that machine learning is maybe the science that
01:41:43 can help other sciences, as we’ve seen.
01:41:46 It’s such a powerful tool.
01:41:48 So thanks to that angle, that, OK, I love science.
01:41:52 I love, I mean, I love astronomy.
01:41:53 I love biology.
01:41:54 But I’m not an expert.
01:41:56 And I decided, well, the thing I can do better
01:41:58 at is computers.
01:41:59 But having, especially with when I was a bit more involved
01:42:04 in AlphaFold, learning a bit about proteins
01:42:07 and about biology and about life,
01:42:11 the complexity, it feels like it really is.
01:42:14 I mean, if you start looking at the things that are going on
01:42:19 at the atomic level, and also, I mean, there’s obviously the,
01:42:26 we are maybe inclined to try to think of neural networks
01:42:29 as like the brain.
01:42:30 But the complexities and the amount of magic
01:42:33 that it feels when, I mean, I’m not an expert,
01:42:37 so it naturally feels more magic.
01:42:38 But looking at biological systems,
01:42:40 as opposed to these computational brains,
01:42:46 just makes me like, wow, there’s such a level of complexity
01:42:50 difference still, like orders of magnitude complexity that,
01:42:54 sure, these weights, I mean, we train them
01:42:56 and they do nice things.
01:42:58 But they’re not at the level of biological entities, brains,
01:43:04 cells.
01:43:06 It just feels like it’s just not possible to achieve
01:43:09 the same level of complexity behavior.
01:43:12 And my belief, when I talk to other beings,
01:43:16 is certainly shaped by this amazement of biology
01:43:20 that, maybe because I know too much,
01:43:22 I don’t have about machine learning,
01:43:23 but I certainly feel it’s very far fetched and far
01:43:28 in the future to be calling or to be thinking,
01:43:31 well, this mathematical function that is differentiable
01:43:35 is, in fact, sentient and so on.
01:43:39 There’s something on that point that is very interesting.
01:43:42 So you know enough about machines and enough
01:43:46 about biology to know that there’s
01:43:47 many orders of magnitude of difference and complexity.
01:43:51 But you know how machine learning works.
01:43:56 So the interesting question for human beings
01:43:58 that are interacting with a system that don’t know
01:44:00 about the underlying complexity.
01:44:02 And I’ve seen people, probably including myself,
01:44:05 that have fallen in love with things that are quite simple.
01:44:08 And so maybe the complexity is one part of the picture,
01:44:11 but maybe that’s not a necessary condition for sentience,
01:44:18 for perception or emulation of sentience.
01:44:24 Right.
01:44:25 So I mean, I guess the other side of this
01:44:27 is that’s how I feel personally.
01:44:29 I mean, you asked me about the person, right?
01:44:32 Now, it’s very interesting to see how other humans feel
01:44:35 about things, right?
01:44:37 We are, again, I’m not as amazed about things
01:44:41 that I feel this is not as magical as this other thing
01:44:44 because of maybe how I got to learn about it
01:44:48 and how I see the curve a bit more smooth
01:44:50 because I’ve just seen the progress of language models
01:44:54 since Shannon in the 50s.
01:44:56 And actually looking at that time scale,
01:44:58 we’re not that fast progress, right?
01:45:00 I mean, what we were thinking at the time almost 100 years ago
01:45:06 is not that dissimilar to what we’re doing now.
01:45:08 But at the same time, yeah, obviously others,
01:45:11 my experience, the personal experience,
01:45:14 I think no one should tell others how they should feel.
01:45:20 I mean, the feelings are very personal, right?
01:45:22 So how others might feel about the models and so on.
01:45:26 That’s one part of the story that
01:45:27 is important to understand for me personally as a researcher.
01:45:31 And then when I maybe disagree or I
01:45:35 don’t understand or see that, yeah, maybe this is not
01:45:38 something I think right now is reasonable,
01:45:39 knowing all that I know, one of the other things
01:45:42 and perhaps partly why it’s great to be talking to you
01:45:46 and reaching out to the world about machine learning
01:45:49 is, hey, let’s demystify a bit the magic
01:45:53 and try to see a bit more of the math
01:45:56 and the fact that literally to create these models,
01:45:59 if we had the right software, it would be 10 lines of code
01:46:03 and then just a dump of the internet.
01:46:06 Versus then the complexity of the creation of humans
01:46:11 from their inception, right?
01:46:13 And also the complexity of evolution of the whole universe
01:46:17 to where we are that feels orders of magnitude
01:46:21 more complex and fascinating to me.
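To make the "a few lines of code plus a dump of text" remark concrete, here is a toy character-level bigram language model; it is obviously not how the large models are built (they use transformers and vastly more data), but the conceptual recipe of predicting the next token from raw text really is this compact.

```python
# A toy character-level bigram language model: "train" by counting which
# character tends to follow which, then generate by sampling from those counts.
from collections import Counter, defaultdict
import random

text = "the cat sat on the mat. the dog sat on the log. "  # stand-in for "a dump of the internet"

counts = defaultdict(Counter)
for prev, nxt in zip(text, text[1:]):
    counts[prev][nxt] += 1          # learn next-character statistics

def sample(start: str, length: int = 40) -> str:
    """Generate text by repeatedly sampling the next character."""
    out = start
    for _ in range(length):
        followers = counts[out[-1]]
        if not followers:           # unseen character: fall back to a space
            out += " "
            continue
        chars = list(followers.keys())
        weights = list(followers.values())
        out += random.choices(chars, weights=weights)[0]
    return out

print(sample("t"))
```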
01:46:23 So I think, yeah, maybe part of the only thing
01:46:26 I’m thinking about trying to tell you is, yeah, I think
01:46:30 explaining a bit of the magic.
01:46:32 There is a bit of magic.
01:46:33 It’s good to be in love, obviously,
01:46:35 with what you do at work.
01:46:36 And I’m certainly fascinated and surprised quite often as well.
01:46:41 But I think, hopefully, as experts in biology,
01:46:45 hopefully will tell me this is not as magic.
01:46:47 And I’m happy to learn that through interactions
01:46:50 with the larger community, we can also
01:46:54 have a certain level of education
01:46:56 that in practice also will matter because, I mean,
01:46:58 one question is how you feel about this.
01:47:00 But then the other very important is
01:47:03 you starting to interact with these in products and so on.
01:47:06 It’s good to understand a bit what’s going on,
01:47:09 what’s not going on, what’s safe, what’s not safe,
01:47:12 and so on, right?
01:47:13 Otherwise, the technology will not
01:47:15 be used properly for good, which is obviously
01:47:18 the goal of all of us, I hope.
01:47:20 So let me then ask the next question.
01:47:22 Do you think in order to solve intelligence
01:47:25 or to replace the Lex bot that does interviews,
01:47:29 as we started this conversation with,
01:47:31 do you think the system needs to be sentient?
01:47:34 Do you think it needs to achieve something like consciousness?
01:47:38 And do you think about what consciousness
01:47:41 is in the human mind that could be instructive for creating AI
01:47:45 systems?
01:47:46 Yeah.
01:47:47 Honestly, I think probably not, to get to the degree of intelligence
01:47:53 where there's this brain that can learn,
01:47:58 can be extremely useful, can challenge you, can teach you.
01:48:02 Conversely, you can teach it to do things.
01:48:05 I’m not sure it’s necessary, personally speaking.
01:48:09 But if consciousness or any other biological or evolutionary
01:48:15 lesson can be repurposed to then influence
01:48:20 our next set of algorithms, that is a great way
01:48:24 to actually make progress, right?
01:48:25 And in the same way I tried to explain transformers a bit
01:48:28 through how it feels we operate when we look at text specifically,
01:48:33 these insights are very important, right?
01:48:36 So there’s a distinction between details of how the brain might
01:48:41 be doing computation.
01:48:43 I think my understanding is, sure, there’s neurons
01:48:46 and there’s some resemblance to neural networks,
01:48:48 but we don’t quite understand enough of the brain in detail,
01:48:52 right, to be able to replicate it.
01:48:55 But then if you zoom out a bit, our thought process,
01:49:01 how memory works, maybe even how evolution got us here,
01:49:05 what’s exploration, exploitation,
01:49:07 like how these things happen, I think
01:49:09 these clearly can inform algorithmic level research.
01:49:12 And I’ve seen some examples of this
01:49:17 being quite useful to then guide the research,
01:49:19 even if it might be for the wrong reasons, right?
01:49:21 So I think biology and what we know about ourselves
01:49:26 can help a whole lot to build, essentially,
01:49:30 what we call AGI, this general one, the real deal, right?
01:49:34 The last step of the chain, hopefully.
01:49:36 But consciousness in particular, I don’t myself
01:49:40 at least think too hard about how to add that to the system.
01:49:44 But maybe my understanding is also very personal
01:49:47 about what it means, right?
01:49:48 I think even that in itself is a long debate
01:49:51 that I know people have often.
01:49:55 And maybe I should learn more about this.
01:49:57 Yeah, and I personally, I notice the magic often
01:50:01 on a personal level, especially with physical systems
01:50:04 like robots.
01:50:06 I have a lot of legged robots now in Austin
01:50:10 that I play with.
01:50:11 And even when you program them, when
01:50:13 they do things you didn’t expect,
01:50:15 there’s an immediate anthropomorphization.
01:50:18 And you notice the magic, and you
01:50:19 start to think about things like sentience
01:50:22 that has to do more with effective communication
01:50:26 and less with any of these kind of dramatic things.
01:50:30 It seems like a useful part of communication.
01:50:32 Having the perception of consciousness
01:50:36 seems like useful for us humans.
01:50:38 We treat each other more seriously.
01:50:40 We're able to do a nearest-neighbor shoving of that entity
01:50:46 into our memory correctly, all that kind of stuff.
01:50:48 It seems useful, at least to fake it,
01:50:50 even if you never make it.
01:50:52 So maybe, like, yeah, mirroring the question,
01:50:55 and since you've talked to a few people,
01:50:57 you do think that we'll need
01:50:59 to figure something out in order to achieve intelligence
01:51:04 in a grander sense of the word.
01:51:06 Yeah, I personally believe yes, but I don’t even
01:51:09 think it’ll be like a separate island we’ll have to travel to.
01:51:14 I think it will emerge quite naturally.
01:51:16 OK, that’s easier for us then.
01:51:19 Thank you.
01:51:20 But the reason I think it’s important to think about
01:51:22 is you will start, I believe, like with this Google
01:51:25 engineer, you will start seeing this a lot more, especially
01:51:29 when you have AI systems that are actually interacting
01:51:31 with human beings that don’t have an engineering background.
01:51:35 And we have to prepare for that.
01:51:38 Because I do believe there will be a civil rights
01:51:41 movement for robots, as silly as it is to say.
01:51:44 There’s going to be a large number of people
01:51:46 that realize there’s these intelligent entities with whom
01:51:49 I have a deep relationship, and I don’t want to lose them.
01:51:53 They’ve come to be a part of my life, and they mean a lot.
01:51:55 They have a name.
01:51:57 They have a story.
01:51:58 They have a memory.
01:51:59 And we start to ask questions about ourselves.
01:52:01 Well, this thing sure seems like it’s capable of suffering,
01:52:07 because it tells all these stories of suffering.
01:52:09 It doesn’t want to die and all those kinds of things.
01:52:11 And we have to start to ask ourselves questions.
01:52:14 What is the difference between a human being and this thing?
01:52:16 And so when you engineer, I believe,
01:52:20 from an engineering perspective, from DeepMind or anybody
01:52:23 that builds systems, there might be laws in the future
01:52:26 where you’re not allowed to engineer systems
01:52:29 with displays of sentience, unless they’re explicitly
01:52:35 designed to be that, unless it’s a pet.
01:52:37 So if you have a system that’s just doing customer support,
01:52:41 you’re legally not allowed to display sentience.
01:52:44 We’ll start to ask ourselves that question.
01:52:47 And then so that’s going to be part of the software
01:52:49 engineering process.
01:52:52 Which features do we have?
01:52:53 And one of them is communication of sentience.
01:52:56 But it’s important to start thinking about that stuff,
01:52:58 especially how much it captivates public attention.
01:53:01 Yeah, absolutely.
01:53:03 It's definitely an important topic
01:53:06 that we think about.
01:53:07 And I think, in a way, I always see that not every movie
01:53:12 is equally on point with certain things,
01:53:16 but certainly science fiction in this sense
01:53:19 at least has prepared society to start
01:53:22 thinking about certain topics, and even if it's
01:53:25 too early to talk about them, as long as we are reasonable,
01:53:29 it's certainly going to prepare us for both the research
01:53:33 to come and how to...
01:53:34 I mean, there’s many important challenges and topics
01:53:38 that come with building an intelligent system, many of
01:53:43 which you just mentioned.
01:53:44 So I think we’re never going to be fully ready
01:53:49 unless we talk about these.
01:53:51 And we should also start, as I said, expanding the people
01:53:58 we talk to, not including only our own researchers and so on.
01:54:03 And in fact, at places like DeepMind and elsewhere,
01:54:06 there are more interdisciplinary groups forming
01:54:10 to start asking and really working
01:54:12 with us on these questions.
01:54:14 Because obviously, this is not initially
01:54:17 what your passion is when you do your PhD,
01:54:19 but certainly it is coming.
01:54:21 So it’s fascinating.
01:54:23 It brings me to one of my passions,
01:54:27 which is learning.
01:54:28 So in this sense, this is a new area
01:54:31 that, as a learning system myself,
01:54:35 I want to keep exploring.
01:54:36 And I think it’s great to see parts of the debate.
01:54:41 And even I’ve seen a level of maturity
01:54:43 in the conferences that deal with AI.
01:54:46 If you look from five years ago to now,
01:54:49 just the number of workshops and so on has changed so much.
01:54:53 It’s impressive to see how much topics of safety, ethics,
01:54:58 and so on come to the surface, which is great.
01:55:01 And if it were too early, clearly it’s fine.
01:55:03 I mean, it's a big field, and there's
01:55:05 lots of people with lots of interests
01:55:09 who will make progress.
01:55:11 And obviously, I don’t believe we’re too late.
01:55:14 So in that sense, I think it’s great
01:55:16 that we’re doing this already.
01:55:18 Better to be too early than too late
01:55:20 when it comes to super intelligent AI systems.
01:55:22 Let me ask, speaking of sentient AIs,
01:55:25 you gave props to your friend Ilya Sutskever
01:55:28 for being elected a Fellow of the Royal Society.
01:55:31 So just as a shout-out to a fellow researcher
01:55:34 and a friend, what's the secret to the genius of Ilya
01:55:38 Sutskever?
01:55:39 And also, do you believe that his tweets,
01:55:42 as you’ve hypothesized and Andrej Karpathy did as well,
01:55:46 are generated by a language model?
01:55:48 Yeah.
01:55:49 So I strongly believe Ilya is going to visit in a few weeks,
01:55:54 actually.
01:55:54 So I’ll ask him in person.
01:55:58 Will he tell you the truth?
01:55:59 Yes, of course, hopefully.
01:56:00 I mean, ultimately, we all have shared paths,
01:56:04 and there’s friendships that go beyond, obviously,
01:56:08 institutions and so on.
01:56:09 So I hope he tells me the truth.
01:56:11 Well, maybe the AI system is holding him hostage somehow.
01:56:14 Maybe he has some videos that he doesn’t want to release.
01:56:16 So maybe it has taken control over him.
01:56:19 So he can’t tell the truth.
01:56:20 Well, if I see him in person, then I think he will know.
01:56:23 But I think, with Ilya's personality, just knowing him for a while,
01:56:33 everyone on Twitter, I guess, gets a different persona,
01:56:36 and I think Ilya's does not surprise me.
01:56:40 So I think knowing Ilya from before social media
01:56:43 and before AI was so prevalent, I
01:56:46 recognize a lot of his character.
01:56:47 So that’s something for me that I
01:56:49 feel good about a friend that hasn’t changed
01:56:52 or is still true to himself.
01:56:55 Obviously, there is, though, the fact
01:56:58 that the field has become more popular,
01:57:02 and he is obviously one of the main figures in the field,
01:57:05 having driven a lot of the advances.
01:57:07 So I think that the tricky bit here
01:57:09 is how to balance your true self with the responsibility
01:57:12 that your words carry.
01:57:13 So in this sense, I appreciate the style, and I understand it.
01:57:19 But it created debates on some of his tweets
01:57:24 that maybe it’s good we have them early anyways.
01:57:27 But yeah, then the reactions are usually polarizing.
01:57:31 I think we’re just seeing the reality of social media
01:57:34 be there as well, reflected on that particular topic
01:57:38 or set of topics he’s tweeting about.
01:57:40 Yeah, I mean, it's funny. To speak to this tension,
01:57:42 he was one of the early seminal figures
01:57:46 in the field of deep learning, so there’s
01:57:47 a responsibility with that.
01:57:48 But he’s also, from having interacted with him quite a bit,
01:57:53 he's just a brilliant thinker about ideas, as are you.
01:58:01 And there’s a tension between becoming
01:58:03 the manager versus the actual thinking
01:58:06 through very novel ideas, the scientist versus the manager.
01:58:13 And he’s one of the great scientists of our time.
01:58:17 So this was quite interesting.
01:58:18 And also, people tell me he's quite silly,
01:58:20 which I haven't quite detected yet.
01:58:23 But in private, we’ll have to see about that.
01:58:26 Yeah, yeah.
01:58:27 I mean, just on the point of, I mean,
01:58:30 Ilya has been an inspiration.
01:58:33 I mean, I can think of quite a few colleagues
01:58:35 who shaped the person you are.
01:58:38 Like, Ilya certainly gets probably the top spot,
01:58:42 if not close to the top.
01:58:43 And if we go back to the question about people in the field,
01:58:47 like how their role would have changed the field or not,
01:58:51 I think Ilya’s case is interesting
01:58:54 because he really has a deep belief in the scaling up
01:58:58 of neural networks.
01:58:59 There was a talk that is still famous to this day
01:59:03 from the Sequence to Sequence paper, where he was just
01:59:07 claiming, just give me supervised data
01:59:10 and a large neural network, and then you’ll
01:59:12 solve basically all the problems.
01:59:16 That vision was already there many years ago.
01:59:19 So it’s good to see someone who is, in this case,
01:59:22 very deeply into this style of research
01:59:27 and clearly has had a tremendous track record of successes
01:59:32 and so on.
01:59:34 The funny bit about that talk is that we rehearsed the talk
01:59:37 in a hotel room before, and the original version of that talk
01:59:42 would have been even more controversial.
01:59:44 So maybe I’m the only person that
01:59:46 has seen the unfiltered version of the talk.
01:59:49 And maybe when the time comes, maybe we
01:59:52 should revisit some of the skipped slides
01:59:55 from Ilya's talk.
01:59:57 But I really think the deep belief
02:00:01 in a certain style of research
02:00:03 pays off. It's good to be practical sometimes,
02:00:06 and I actually think Ilya and myself are practical,
02:00:09 but it's also good to have
02:00:10 some sort of long-term belief and trajectory.
02:00:14 Obviously, there's a bit of luck involved,
02:00:16 but if it turns out that that's the right path,
02:00:18 then you are clearly ahead of and hugely influential on the field,
02:00:22 as he has been.
02:00:23 Do you agree with that intuition that maybe
02:00:26 was written about by Rich Sutton in The Bitter Lesson,
02:00:33 that the biggest lesson that can be read from 70 years of AI
02:00:36 research is that general methods that leverage computation
02:00:40 are ultimately the most effective?
02:00:42 Do you think that intuition is ultimately correct?
02:00:48 General methods that leverage computation,
02:00:52 allowing the scaling of computation
02:00:54 to do a lot of the work.
02:00:56 And so the basic task of us humans
02:00:59 is to design methods that are more
02:01:01 and more general versus more and more specific to the tasks
02:01:05 at hand.
02:01:07 I certainly think this essentially mirrors
02:01:10 a bit of the deep learning research,
02:01:14 almost like a philosophy, that on the one hand,
02:01:18 we want to be data agnostic.
02:01:20 We don’t want to preprocess data sets.
02:01:22 We want to see the bytes, the true data as it is,
02:01:25 and then learn everything on top.
02:01:27 So very much agree with that.
02:01:30 And I think scaling up feels, at the very least, again,
02:01:33 necessary for building incredibly complex systems.
02:01:38 It's possibly not sufficient, in that we might
02:01:42 still need a couple of breakthroughs.
02:01:45 I think Rich Sutton mentioned search
02:01:47 being part of the equation of scale and search.
02:01:52 With search, from what I've seen, it's
02:01:55 been more mixed in my experience.
02:01:57 So from that lesson in particular,
02:01:59 search is a bit more tricky because it
02:02:02 is very appealing to search in domains like Go,
02:02:05 where you have a clear reward function with which you can then
02:02:08 discard some search traces.
02:02:10 But then in some other tasks, it’s
02:02:13 not very clear how you would do that,
02:02:15 although recently one of our works, AlphaCode,
02:02:19 which was largely a continuation of AlphaStar,
02:02:22 and whose team and people involved overlapped
02:02:25 quite a bit with AlphaStar's,
02:02:28 is one where we actually saw the bitter lesson: how
02:02:31 the scale of the models and then a massive amount of search
02:02:34 yielded this very interesting result
02:02:36 of reaching human-level performance in code competitions.
02:02:41 So I’ve seen examples of it being
02:02:43 literally mapped to search and scale.
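As a rough illustration of that scale-plus-search recipe, here is a hedged sketch of sampling many candidate programs and filtering them against example tests. The names (sample_program, run, solve_by_sampling) are hypothetical placeholders, not the actual AlphaCode pipeline.

```python
# Sketch of "scale plus search" for code generation: draw many samples from
# a large model, keep only the candidates that pass the public tests.
import subprocess
from typing import Callable

def solve_by_sampling(
    sample_program: Callable[[str], str],   # prompt -> candidate source code (assumed model API)
    problem_statement: str,
    public_tests: list[tuple[str, str]],    # (stdin, expected stdout) pairs
    num_samples: int = 1000,
) -> list[str]:
    survivors = []
    for _ in range(num_samples):             # "search" here is massive sampling...
        candidate = sample_program(problem_statement)
        if all(run(candidate, stdin) == expected for stdin, expected in public_tests):
            survivors.append(candidate)       # ...filtered by the example tests
    return survivors

def run(source: str, stdin: str, timeout: float = 2.0) -> str:
    # Execute a candidate Python program in a subprocess and capture its stdout.
    try:
        result = subprocess.run(
            ["python3", "-c", source], input=stdin,
            capture_output=True, text=True, timeout=timeout,
        )
        return result.stdout.strip()
    except subprocess.TimeoutExpired:
        return ""
```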
02:02:46 I’m not so convinced about the search bit,
02:02:48 but certainly I’m convinced scale will be needed.
02:02:51 So we need general methods.
02:02:52 We need to test them, and maybe we
02:02:54 need to make sure that we can scale them given the hardware
02:02:57 that we have in practice.
02:02:59 But then maybe we should also shape what the hardware looks
02:03:01 like based on which methods might need to scale.
02:03:05 And that's the interesting contrast with these GPUs,
02:03:11 that is, we got them for free, almost, because games
02:03:14 were using them.
02:03:15 But maybe now if sparsity is required,
02:03:19 we don’t have the hardware.
02:03:20 Although in theory, many people are
02:03:22 building different kinds of hardware these days.
02:03:24 But there’s a bit of this notion of hardware lottery
02:03:27 for scale that might actually have an impact at least
02:03:31 on the scale of years on how fast we will make progress
02:03:35 to maybe a version of neural nets
02:03:37 or whatever comes next that might enable
02:03:41 truly intelligent agents.
02:03:44 Do you think in your lifetime we will build an AGI system that
02:03:50 would undeniably be a thing that achieves human level
02:03:55 intelligence and goes far beyond?
02:03:58 I definitely think it’s possible that it will go far beyond.
02:04:03 But I'm definitely convinced that it will
02:04:05 reach human-level intelligence.
02:04:08 And I’m hypothesizing about the beyond
02:04:11 because the beyond bit is a bit tricky to define,
02:04:16 especially when we look at the current formula of starting
02:04:21 from this imitation learning standpoint.
02:04:23 So we can certainly imitate humans at language and beyond.
02:04:30 So getting at human level through imitation
02:04:33 feels very possible.
02:04:34 Going beyond will require reinforcement learning
02:04:39 and other things.
02:04:39 And I think in some areas that certainly already has paid out.
02:04:43 I mean, Go being an example that’s
02:04:46 my favorite so far in terms of going
02:04:48 beyond human capabilities.
02:04:50 But in general, I'm not sure we can define reward functions
02:04:55 that, starting from a seed of imitating general human-level
02:04:59 intelligence, take us beyond it.
02:05:02 That bit is not so clear, within my lifetime.
02:05:05 But certainly, human level, yes.
02:05:08 And I mean, that in itself is already quite powerful,
02:05:11 I think.
02:05:11 So going beyond, obviously,
02:05:14 we're not going to not try it, if it then
02:05:17 gets us to superhuman scientists and discovery
02:05:20 and advancing the world.
02:05:22 But at least human level in general
02:05:25 is also very, very powerful.
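As a loose sketch of the two-stage formula described above, imitation first and then reinforcement learning on a reward to try to go beyond it, here is an illustrative pair of training steps. It assumes PyTorch; the policy, data, and reward_fn are placeholders, not any particular DeepMind system.

```python
# Conceptual sketch: (1) imitate human data with supervised learning,
# (2) push beyond it with a REINFORCE-style update once a reward exists.
import torch
import torch.nn.functional as F

def imitation_step(policy, optimizer, states, human_actions):
    # Stage 1: supervised imitation -- maximize likelihood of human actions.
    logits = policy(states)
    loss = F.cross_entropy(logits, human_actions)
    optimizer.zero_grad(); loss.backward(); optimizer.step()
    return loss.item()

def reinforce_step(policy, optimizer, states, reward_fn):
    # Stage 2: policy-gradient update -- only possible once a reward function
    # exists, which is the hard part for open-ended "beyond human" behavior.
    logits = policy(states)
    dist = torch.distributions.Categorical(logits=logits)
    actions = dist.sample()
    rewards = reward_fn(states, actions)          # assumed to return a tensor of scalars
    loss = -(dist.log_prob(actions) * rewards).mean()
    optimizer.zero_grad(); loss.backward(); optimizer.step()
    return loss.item()
```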
02:05:27 Well, especially if human level or slightly beyond
02:05:31 is integrated deeply with human society
02:05:33 and there’s billions of agents like that,
02:05:36 do you think there’s a singularity moment beyond which
02:05:39 our world will be just very deeply transformed
02:05:44 by these kinds of systems?
02:05:45 Because now you're talking about intelligent systems
02:05:47 that are just, I mean, this is no longer just going
02:05:53 from horse and buggy to the car.
02:05:56 It feels like a very different kind of shift
02:05:59 in what it means to be a living entity on Earth.
02:06:03 Are you afraid?
02:06:04 Are you excited of this world?
02:06:06 I'm afraid only if there are a lot more of them than of us.
02:06:09 So I think maybe we'll need to think about, if we truly
02:06:13 get there, just the question of limited resources.
02:06:18 Humanity has clearly hit some limits,
02:06:21 and then there's some balance, hopefully,
02:06:23 that biologically the planet is imposing.
02:06:26 And we should actually try to get better at this.
02:06:28 As we know, there’s quite a few issues
02:06:31 with having too many people coexisting
02:06:35 in a resource limited way.
02:06:37 So for digital entities, it’s an interesting question.
02:06:40 I think such a limit maybe should exist.
02:06:43 But maybe it’s going to be imposed by energy availability
02:06:47 because this also consumes energy.
02:06:49 In fact, these systems are less efficient
02:06:53 than we are in terms of the energy required.
02:06:56 But definitely, I think as a society,
02:06:59 we’ll need to just work together to find
02:07:03 what would be reasonable in terms of growth
02:07:06 or how we coexist if that is to happen.
02:07:11 I am very excited about, obviously,
02:07:14 the aspects of automation that allow people
02:07:17 who obviously don't have access to certain resources
02:07:20 or knowledge to have that access.
02:07:23 I think those are the applications in a way
02:07:26 that I’m most excited to see and to personally work towards.
02:07:30 Yeah, there’s going to be significant improvements
02:07:32 in productivity and the quality of life
02:07:34 across the whole population, which is very interesting.
02:07:36 But I'm looking even further ahead,
02:07:39 to us becoming a multiplanetary species.
02:07:42 And just as a quick bet, last question.
02:07:45 Do you think as humans become multiplanetary species,
02:07:49 go outside our solar system, all that kind of stuff,
02:07:52 do you think there will be more humans
02:07:54 or more robots in that future world?
02:07:57 So will humans be the quirky, intelligent being of the past
02:08:04 or is there something deeply fundamental
02:08:07 to human intelligence that’s truly special,
02:08:09 where we will be part of those other planets,
02:08:12 not just AI systems?
02:08:13 I think we're all excited to build AGI
02:08:18 to empower us, or make us more powerful, as a human species.
02:08:25 That's not to say there might not be some hybridization.
02:08:27 I mean, this is obviously speculation,
02:08:29 but there are companies also trying that,
02:08:32 in the same way medicine is making us better.
02:08:35 Maybe there are other things yet to happen on that front.
02:08:39 But if the ratio is not at most one to one,
02:08:43 I would not be happy.
02:08:44 So I would hope that we are part of the equation.
02:08:49 Maybe a one-to-one ratio feels
02:08:53 possible, constructive, and so on,
02:08:56 but it would not be good to have an imbalance,
02:08:59 at least from my core beliefs and from why I'm doing
02:09:03 what I'm doing when I go to work and research
02:09:05 what I research.
02:09:07 Well, this is how I know you’re human
02:09:09 and this is how you’ve passed the Turing test.
02:09:12 And you are one of the special humans, Oriol.
02:09:15 It’s a huge honor that you would talk with me
02:09:17 and I hope we get the chance to speak again,
02:09:19 maybe once before the singularity, once after
02:09:23 and see how our view of the world changes.
02:09:25 Thank you again for talking today.
02:09:26 Thank you for the amazing work you do.
02:09:28 You're a shining example of a researcher
02:09:31 and a human being in this community.
02:09:32 Thanks a lot.
02:09:33 Likewise, yeah, looking forward to before the singularity
02:09:36 certainly and maybe after.
02:09:39 Thanks for listening to this conversation
02:09:41 with Oriol Vinyals.
02:09:43 To support this podcast, please check out our sponsors
02:09:45 in the description.
02:09:46 And now let me leave you with some words from Alan Turing.
02:09:51 Those who can imagine anything can create the impossible.
02:09:55 Thank you for listening and hope to see you next time.