Oriol Vinyals: Deep Learning and Artificial General Intelligence #306

Transcript

00:00:00 at which point is the neural network a being versus a tool?

00:00:08 The following is a conversation with Oriol Vinyals,

00:00:11 his second time on the podcast.

00:00:13 Oriel is the research director

00:00:15 and deep learning lead at DeepMind

00:00:18 and one of the most brilliant thinkers and researchers

00:00:20 in the history of artificial intelligence.

00:00:24 This is the Lex Fridman podcast.

00:00:26 To support it, please check out our sponsors

00:00:28 in the description.

00:00:30 And now, dear friends, here’s Oriol Vinyals.

00:00:34 You are one of the most brilliant researchers

00:00:37 in the history of AI,

00:00:38 working across all kinds of modalities.

00:00:40 Probably the one common theme is

00:00:42 it’s always sequences of data.

00:00:45 So we’re talking about languages, images,

00:00:46 even biology and games, as we talked about last time.

00:00:50 So you’re a good person to ask this.

00:00:53 In your lifetime, will we be able to build an AI system

00:00:57 that’s able to replace me as the interviewer

00:01:00 in this conversation,

00:01:02 in terms of ability to ask questions

00:01:04 that are compelling to somebody listening?

00:01:06 And then further question is, are we close?

00:01:10 Will we be able to build a system that replaces you

00:01:13 as the interviewee

00:01:16 in order to create a compelling conversation?

00:01:18 How far away are we, do you think?

00:01:20 It’s a good question.

00:01:21 I think partly I would say, do we want that?

00:01:24 I really like when we start now with very powerful models,

00:01:29 interacting with them and thinking of them

00:01:32 as closer to us.

00:01:34 The question is, if you remove the human side

00:01:37 of the conversation, is that an interesting artifact?

00:01:42 And I would say, probably not.

00:01:44 I’ve seen, for instance, last time we spoke,

00:01:47 like we were talking about StarCraft,

00:01:50 and creating agents that play games involves self play,

00:01:54 but ultimately what people care about was,

00:01:57 how does this agent behave

00:01:59 when the opposite side is a human?

00:02:02 So without a doubt,

00:02:04 we will probably be more empowered by AI.

00:02:08 Maybe you can source some questions from an AI system.

00:02:12 I mean, that even today, I would say it’s quite plausible

00:02:15 that with your creativity,

00:02:17 you might actually find very interesting questions

00:02:19 that you can filter.

00:02:20 We call this cherry picking sometimes

00:02:22 in the field of language.

00:02:24 And likewise, if I had now the tools on my side,

00:02:27 I could say, look, you’re asking this interesting question.

00:02:30 From this answer, I like the words chosen

00:02:33 by this particular system that created a few words.

00:02:36 Completely replacing it feels not exactly exciting to me.

00:02:41 Although in my lifetime, I think way,

00:02:43 I mean, given the trajectory,

00:02:45 I think it’s possible that perhaps

00:02:48 there could be interesting,

00:02:49 maybe self play interviews as you’re suggesting

00:02:53 that would look or sound quite interesting

00:02:56 and probably would educate

00:02:57 or you could learn a topic through listening

00:03:00 to one of these interviews at a basic level at least.

00:03:03 So you said it doesn’t seem exciting to you,

00:03:04 but what if exciting is part of the objective function

00:03:07 the thing is optimized over?

00:03:09 So there’s probably a huge amount of data of humans

00:03:12 if you look correctly, of humans communicating online,

00:03:16 and there’s probably ways to measure the degree of,

00:03:19 you know, as they talk about engagement.

00:03:21 So you can probably optimize the question

00:03:24 that’s most often created an engaging conversation in the past.

00:03:28 So actually, if you strictly use the word exciting,

00:03:33 there is probably a way to create

00:03:37 optimally exciting conversations

00:03:40 that involve AI systems.

00:03:42 At least one side is AI.

00:03:44 Yeah, that makes sense, I think,

00:03:46 maybe looping back a bit to games and the game industry,

00:03:50 when you design algorithms,

00:03:53 you’re thinking about winning as the objective, right?

00:03:55 Or the reward function.

00:03:57 But in fact, when we discussed this with Blizzard,

00:04:00 the creators of StarCraft in this case,

00:04:02 I think what’s exciting, fun,

00:04:05 if you could measure that and optimize for that,

00:04:09 that’s probably why we play video games

00:04:11 or why we interact or listen or look at cat videos

00:04:14 or whatever on the internet.

00:04:16 So it’s true that modeling reward

00:04:19 beyond the obvious reward functions

00:04:21 we’re used to in reinforcement learning

00:04:23 is definitely very exciting.

00:04:25 And again, there is some progress actually

00:04:28 into a particular aspect of AI, which is quite critical,

00:04:32 which is, for instance, is a conversation

00:04:36 or is the information truthful, right?

00:04:38 So you could start trying to evaluate these

00:04:41 from excerpts from the internet, right?

00:04:44 That has lots of information.

00:04:45 And then if you can learn a function automated ideally,

00:04:50 so you can also optimize it more easily,

00:04:52 then you could actually have conversations

00:04:54 that optimize for non obvious things such as excitement.

00:04:59 So yeah, that’s quite possible.

00:05:01 And then I would say in that case,

00:05:03 it would definitely be fun exercise

00:05:05 and quite unique to have at least one side

00:05:08 that is fully driven by an excitement reward function.

00:05:12 But obviously, there would be still quite a lot of humanity

00:05:16 in the system, both from who is building the system,

00:05:20 of course, and also, ultimately,

00:05:23 if we think of labeling for excitement,

00:05:26 that those labels must come from us

00:05:28 because it’s just hard to have a computational measure

00:05:32 of excitement.

00:05:33 As far as I understand, there’s no such thing.

00:05:36 Well, as you mentioned truth also,

00:05:39 I would actually venture to say that excitement

00:05:41 is easier to label than truth,

00:05:44 or is perhaps has lower consequences of failure.

00:05:49 But there is perhaps the humanness that you mentioned,

00:05:55 that’s perhaps part of a thing that could be labeled.

00:05:58 And that could mean an AI system that’s doing dialogue,

00:06:02 that’s doing conversations should be flawed, for example.

00:06:07 Like that’s the thing you optimize for,

00:06:09 which is have inherent contradictions by design,

00:06:13 have flaws by design.

00:06:15 Maybe it also needs to have a strong sense of identity.

00:06:18 So it has a backstory it told itself that it sticks to.

00:06:22 It has memories, not in terms of how the system is designed,

00:06:26 but it’s able to tell stories about its past.

00:06:30 It’s able to have mortality and fear of mortality

00:06:36 in the following way that it has an identity.

00:06:39 And if it says something stupid

00:06:41 and gets canceled on Twitter, that’s the end of that system.

00:06:44 So it’s not like you get to rebrand yourself.

00:06:47 That system is, that’s it.

00:06:49 So maybe the high stakes nature of it,

00:06:52 because you can’t say anything stupid now,

00:06:54 or because you’d be canceled on Twitter.

00:06:57 And there’s stakes to that.

00:06:59 And that I think part of the reason

00:07:01 that makes it interesting.

00:07:03 And then you have a perspective,

00:07:04 like you’ve built up over time that you stick with,

00:07:07 and then people can disagree with you.

00:07:09 So holding that perspective strongly,

00:07:11 holding sort of maybe a controversial,

00:07:14 at least a strong opinion.

00:07:16 All of those elements, it feels like they can be learned

00:07:18 because it feels like there’s a lot of data

00:07:21 on the internet of people having an opinion.

00:07:24 And then combine that with a metric of excitement,

00:07:27 you can start to create something that,

00:07:30 as opposed to trying to optimize

00:07:31 for sort of grammatical clarity and truthfulness,

00:07:38 the factual consistency over many sentences,

00:07:42 you optimize for the humanness.

00:07:45 And there’s obviously data for humanness on the internet.

00:07:48 So I wonder if there’s a future where that’s part,

00:07:53 or I mean, I sometimes wonder that about myself.

00:07:56 I’m a huge fan of podcasts,

00:07:58 and I listen to some podcasts,

00:08:00 and I think like, what is interesting about this?

00:08:03 What is compelling?

00:08:05 The same way you watch other games.

00:08:07 Like you said, watch, play StarCraft,

00:08:09 or have Magnus Carlsen play chess.

00:08:13 So I’m not a chess player,

00:08:14 but it’s still interesting to me.

00:08:16 What is that?

00:08:16 That’s the stakes of it,

00:08:19 maybe the end of a domination of a series of wins.

00:08:23 I don’t know, there’s all those elements

00:08:25 somehow connect to a compelling conversation.

00:08:28 And I wonder how hard is that to replace,

00:08:30 because ultimately all of that connects

00:08:31 to the initial proposition of how to test,

00:08:35 whether an AI is intelligent or not with the Turing test,

00:08:38 which I guess my question comes from a place

00:08:41 of the spirit of that test.

00:08:43 Yes, I actually recall,

00:08:45 I was just listening to our first podcast

00:08:47 where we discussed the Turing test.

00:08:50 So I would say from a neural network,

00:08:54 AI builder perspective,

00:08:57 usually you try to map

00:09:01 many of these interesting topics you discuss to benchmarks,

00:09:05 and then also to actual architectures

00:09:08 on how these systems are currently built,

00:09:10 how they learn, what data they learn from,

00:09:13 what are they learning, right?

00:09:14 We’re talking about weights of a mathematical function,

00:09:17 and then looking at the current state of the game,

00:09:21 maybe what do we need leaps forward

00:09:26 to get to the ultimate stage of all these experiences,

00:09:30 lifetime experience, fears,

00:09:32 like words that currently,

00:09:34 we’re barely seeing progress

00:09:38 just because what’s happening today

00:09:40 is you take all these human interactions,

00:09:44 it’s a large vast variety of human interactions online,

00:09:47 and then you’re distilling these sequences, right?

00:09:51 Going back to my passion,

00:09:53 like sequences of words, letters, images, sound,

00:09:56 there’s more modalities here to be at play.

00:09:59 And then you’re trying to just learn a function

00:10:03 that will be happy,

00:10:04 that maximizes the likelihood of seeing all these

00:10:08 through a neural network.

00:10:10 Now, I think there’s a few places

00:10:14 where the way currently we train these models

00:10:17 would clearly lack to be able to develop

00:10:20 the kinds of capabilities you describe.

00:10:22 I’ll tell you maybe a couple.

00:10:23 One is the lifetime of an agent or a model.

00:10:27 So you learn from this data offline, right?

00:10:30 So you’re just passively observing and maximizing these,

00:10:33 it’s almost like mountains,

00:10:35 like a landscape of mountains,

00:10:37 and then everywhere there’s data

00:10:39 that humans interacted in this way,

00:10:41 you’re trying to make that higher

00:10:43 and then lower where there’s no data.

00:10:45 And then these models generally

00:10:48 don’t then experience themselves.

00:10:51 They just are observers, right?

00:10:52 They’re passive observers of the data.

00:10:54 And then we’re putting them to then generate data

00:10:57 when we interact with them,

00:10:59 but that’s very limiting.

00:11:00 The experience they actually experience

00:11:03 when they could maybe be optimizing

00:11:05 or further optimizing the weights,

00:11:07 we’re not even doing that.

00:11:08 So to be clear, and again, mapping to AlphaGo, AlphaStar,

00:11:14 we train the model.

00:11:15 And when we deploy it to play against humans,

00:11:18 or in this case interact with humans,

00:11:20 like language models,

00:11:21 they don’t even keep training, right?

00:11:23 They’re not learning in the sense of the weights

00:11:26 that you’ve learned from the data,

00:11:28 they don’t keep changing.

00:11:29 Now, there’s something a bit more feels magical,

00:11:33 but it’s understandable if you’re into neural nets,

00:11:36 which is, well, they might not learn

00:11:39 in the strict sense of the word,

00:11:40 the weights changing,

00:11:41 maybe that’s mapping to how neurons interconnect

00:11:44 and how we learn over our lifetime.

00:11:46 But it’s true that the context of the conversation

00:11:50 that takes place when you talk to these systems,

00:11:55 it’s held in their working memory, right?

00:11:57 It’s almost like you start the computer,

00:12:00 it has a hard drive that has a lot of information,

00:12:02 you have access to the internet,

00:12:04 which has probably all the information,

00:12:06 but there’s also a working memory

00:12:08 where these agents, as we call them,

00:12:11 or start calling them, build upon.

00:12:13 Now, this memory is very limited.

00:12:16 I mean, right now we’re talking, to be concrete,

00:12:19 about 2,000 words that we hold,

00:12:21 and then beyond that, we start forgetting what we’ve seen.

00:12:24 So you can see that there’s some short term coherence

00:12:28 already, right, with what you said.

00:12:29 I mean, it’s a very interesting topic.

00:12:32 Having sort of a mapping, an agent to have consistency,

00:12:37 then if you say, oh, what’s your name,

00:12:40 it could remember that,

00:12:42 but then it might forget beyond 2,000 words,

00:12:45 which is not that long of context

00:12:47 if we think even of these podcasts; books are much longer.

00:12:51 So technically speaking, there’s a limitation there,

00:12:55 super exciting from people that work on deep learning

00:12:58 to be working on, but I would say we lack maybe benchmarks

00:13:03 and the technology to have this lifetime like experience

00:13:07 of memory that keeps building up.
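
A minimal sketch of the fixed working-memory behavior described here, assuming the roughly 2,000-token figure mentioned; anything older than the limit is simply dropped (illustrative only, not how any particular system is implemented):

```python
# Minimal sketch: a fixed-size "working memory" that forgets the oldest tokens
# once a hypothetical 2,000-token limit is reached.
CONTEXT_LIMIT = 2000  # assumed limit, taken from the figure mentioned in the conversation

def update_context(context: list[int], new_tokens: list[int]) -> list[int]:
    """Append new tokens and keep only the most recent CONTEXT_LIMIT of them."""
    context = context + new_tokens
    return context[-CONTEXT_LIMIT:]  # everything earlier is effectively forgotten
```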

00:13:10 However, the way it learns offline

00:13:13 is clearly very powerful, right?

00:13:14 So you asked me three years ago, I would say,

00:13:17 oh, we’re very far.

00:13:18 I think we’ve seen the power of this imitation,

00:13:22 again, on the internet scale that has enabled this

00:13:26 to feel like at least the knowledge,

00:13:28 the basic knowledge about the world now

00:13:30 is incorporated into the weights,

00:13:33 but then this experience is lacking.

00:13:36 And in fact, as I said, we don’t even train them

00:13:39 when we’re talking to them,

00:13:41 other than their working memory, of course, is affected.

00:13:44 So that’s the dynamic part,

00:13:46 but they don’t learn in the same way

00:13:48 that you and I have learned, right?

00:13:50 From basically when we were born and probably before.

00:13:54 So lots of fascinating, interesting questions you asked there.

00:13:57 I think the one I mentioned is this idea of memory

00:14:01 and experience versus just kind of observe the world

00:14:05 and learn its knowledge, which I think for that,

00:14:08 I would argue lots of recent advancements

00:14:10 that make me very excited about the field.

00:14:13 And then the second maybe issue that I see is

00:14:18 all these models, we train them from scratch.

00:14:21 That’s something I would have complained three years ago

00:14:24 or six years ago or 10 years ago.

00:14:26 And it feels if we take inspiration from how we got here,

00:14:31 how the universe evolved us and we keep evolving,

00:14:35 it feels that is a missing piece,

00:14:37 that we should not be training models from scratch

00:14:41 every few months,

00:14:42 that there should be some sort of way

00:14:45 in which we can grow models much like as a species

00:14:49 and many other elements in the universe

00:14:51 is building from the previous sort of iterations.

00:14:55 And that from a just purely neural network perspective,

00:14:59 even though we would like to make it work,

00:15:02 it’s proven very hard to not throw away

00:15:06 the previous weights, right?

00:15:07 This landscape we learn from the data

00:15:09 and refresh it with a brand new set of weights,

00:15:13 given maybe a recent snapshot of these data sets

00:15:17 we train on, et cetera, or even a new game we’re learning.

00:15:20 So that feels like something is missing fundamentally.

00:15:24 We might find it, but it’s not very clear

00:15:27 what it will look like.

00:15:28 There’s many ideas and it’s super exciting as well.

00:15:30 Yes, just for people who don’t know,

00:15:32 when you’re approaching a new problem in machine learning,

00:15:35 you’re going to come up with an architecture

00:15:38 that has a bunch of weights

00:15:41 and then you initialize them somehow,

00:15:43 which in most cases is some version of random.

00:15:47 So that’s what you mean by starting from scratch.

00:15:49 And it seems like it’s a waste every time you solve

00:15:54 the game of Go and chess, StarCraft, protein folding,

00:15:59 like surely there’s some way to reuse the weights

00:16:03 as we grow this giant database of neural networks

00:16:08 that have solved some of the toughest problems in the world.

00:16:10 And so some of that is, what is that?

00:16:15 Methods, how to reuse weights,

00:16:19 how to learn, extract what’s generalizable

00:16:22 or at least has a chance to be

00:16:25 and throw away the other stuff.

00:16:27 And maybe the neural network itself

00:16:29 should be able to tell you that.

00:16:31 Like what, yeah, how do you,

00:16:34 what ideas do you have for better initialization of weights?

00:16:37 Maybe stepping back,

00:16:38 if we look at the field of machine learning,

00:16:41 but especially deep learning, right?

00:16:44 At the core of deep learning,

00:16:45 there’s this beautiful idea that is a single algorithm

00:16:49 can solve any task, right?

00:16:50 So it’s been proven over and over

00:16:54 with an increasing set of benchmarks

00:16:56 and things that were thought impossible

00:16:58 that are being cracked by this basic principle

00:17:01 that is you take a neural network of uninitialized weights,

00:17:05 so like a blank computational brain,

00:17:09 then you give it, in the case of supervised learning,

00:17:12 ideally a lot of examples of,

00:17:14 hey, here is what the input looks like

00:17:17 and the desired output should look like this.

00:17:19 I mean, image classification is very clear example,

00:17:22 images to maybe one of a thousand categories,

00:17:25 that’s what ImageNet is like,

00:17:26 but many, many, if not all problems can be mapped this way.

00:17:30 And then there’s a generic recipe, right?

00:17:33 That you can use.

00:17:35 And this recipe with very little change,

00:17:38 and I think that’s the core of deep learning research, right?

00:17:41 That what is the recipe that is universal?

00:17:44 That for any new given task,

00:17:46 I’ll be able to use without thinking,

00:17:48 without having to work very hard on the problem at stake.

00:17:52 We have not found this recipe,

00:17:54 but I think the field is excited to find fewer tweaks

00:18:00 or tricks that people find when they work

00:18:02 on important problems specific to those

00:18:05 and more of a general algorithm, right?

00:18:07 So at an algorithmic level,

00:18:09 I would say we have something general already,

00:18:11 which is this formula of training a very powerful model,

00:18:14 a neural network on a lot of data.

00:18:17 And in many cases, you need some specificity

00:18:21 to the actual problem you’re solving,

00:18:23 protein folding being such an important problem

00:18:26 has some basic recipe that is learned from before, right?

00:18:30 Like transformer models, graph neural networks,

00:18:34 ideas coming from NLP, like something called BERT,

00:18:38 that is a kind of loss that you can put in place

00:18:41 to help; knowledge distillation is another technique,

00:18:45 right?

00:18:46 So this is the formula.

00:18:47 We still had to find some particular things

00:18:50 that were specific to alpha fold, right?

00:18:53 That’s very important because protein folding

00:18:55 is such a high value problem that as humans,

00:18:59 we should solve it no matter

00:19:00 if we need to be a bit specific.

00:19:02 And it’s possible that some of these learnings

00:19:04 will apply then to the next iteration of this recipe

00:19:07 that deep learners are about.

00:19:09 But it is true that so far, the recipe is what’s common,

00:19:13 but the weights you generally throw away,

00:19:15 which feels very sad.

00:19:17 Although, maybe in the last,

00:19:20 especially in the last two, three years,

00:19:22 and when we last spoke,

00:19:23 I mentioned this area of meta learning,

00:19:25 which is the idea of learning to learn.

00:19:28 That idea and some progress has been had starting,

00:19:32 I would say, mostly from GPT-3 on the language domain only,

00:19:36 in which you could conceive a model that is trained once.

00:19:41 And then this model is not narrow in that it only knows

00:19:44 how to translate a pair of languages or even a set of languages,

00:19:47 or it only knows how to assign sentiment to a sentence.

00:19:51 These actually, you could teach it by a prompting,

00:19:55 it’s called, and this prompting is essentially

00:19:56 just showing it a few more examples,

00:19:59 almost like you do show examples, input, output examples,

00:20:03 algorithmically speaking to the process

00:20:04 of creating this model.

00:20:06 But now you’re doing it through language,

00:20:07 which is very natural way for us to learn from one another.

00:20:11 I tell you, hey, you should do this new task.

00:20:13 I’ll tell you a bit more.

00:20:14 Maybe you ask me some questions

00:20:16 and now you know the task, right?

00:20:17 You didn’t need to retrain it from scratch.

00:20:20 And we’ve seen these magical moments almost

00:20:24 in this way to do few shot promptings through language

00:20:26 on language only domain.
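
A hedged illustration of the few-shot prompting idea described here: the task is specified by example pairs placed inside the prompt itself, with no weight updates. The `language_model.generate` call is hypothetical, not a real library API:

```python
# Hypothetical sketch of few-shot prompting: the "teaching" is just text in the prompt.
prompt = """Classify the sentiment of each sentence.
Sentence: I loved this movie.  Sentiment: positive
Sentence: The food was cold and bland.  Sentiment: negative
Sentence: What a wonderful surprise!  Sentiment:"""

# completion = language_model.generate(prompt)  # hypothetical API call
# Expected continuation: " positive" -- inferred from the examples, not from retraining.
```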

00:20:28 And then in the last two years,

00:20:30 we’ve seen these expanded to beyond language,

00:20:34 adding vision, adding actions and games,

00:20:38 lots of progress to be had.

00:20:39 But this is maybe, if you ask me like about

00:20:42 how are we gonna crack this problem?

00:20:43 This is perhaps one way in which you have a single model.

00:20:48 The problem of this model is it’s hard to grow

00:20:52 in weights or capacity,

00:20:54 but the model is certainly so powerful

00:20:56 that you can teach it some tasks, right?

00:20:58 In this way that I teach you,

00:21:00 I could teach you a new task now,

00:21:02 if we were all at a text based task

00:21:05 or a classification of vision style task.

00:21:08 But it still feels like more breakthroughs should be had,

00:21:12 but it’s a great beginning, right?

00:21:14 We have a good baseline.

00:21:15 We have an idea that this maybe is the way we want

00:21:18 to benchmark progress towards AGI.

00:21:20 And I think in my view, that’s critical

00:21:22 to always have a way to benchmark the community

00:21:25 sort of converging to these overall,

00:21:27 which is good to see.

00:21:29 And then this is actually what excites me

00:21:33 in terms of also next steps for deep learning

00:21:36 is how to make these models more powerful,

00:21:39 how do you train them, how to grow them

00:21:41 if they must grow, should they change their weights

00:21:44 as you teach it tasks or not?

00:21:46 There’s some interesting questions, many to be answered.

00:21:48 Yeah, you’ve opened the door

00:21:49 to a bunch of questions I want to ask,

00:21:52 but let’s first return to your tweet

00:21:55 and read it like a Shakespeare.

00:21:57 You wrote, Gato is not the end, it’s the beginning.

00:22:01 And then you wrote meow and then an emoji of a cat.

00:22:06 So first two questions.

00:22:07 First, can you explain the meow and the cat emoji?

00:22:10 And second, can you explain what Godot is and how it works?

00:22:13 Right, indeed.

00:22:14 I mean, thanks for reminding me

00:22:16 that we’re all exposing on Twitter and.

00:22:19 Permanently there.

00:22:20 Yes, permanently there.

00:22:21 One of the greatest AI researchers of all time,

00:22:25 meow and cat emoji.

00:22:27 Yes. There you go.

00:22:28 Right, so.

00:22:29 Can you imagine, like, Turing tweeting meow and cat,

00:22:32 probably he would, probably would.

00:22:34 Probably.

00:22:35 So yeah, the tweet is important actually.

00:22:38 You know, I put thought on the tweets, I hope people.

00:22:40 Which part do you think?

00:22:41 Okay, so there’s three sentences.

00:22:44 Gato is not the end, Gato is the beginning,

00:22:48 meow, cat emoji.

00:22:50 Okay, which is the important part?

00:22:51 The meow, no, no.

00:22:53 Definitely that it is the beginning.

00:22:56 I mean, I probably was just explaining a bit

00:23:00 where the field is going, but let me tell you about Gato.

00:23:03 So first the name Gato comes from maybe a sequence

00:23:08 of releases that DeepMind had that named,

00:23:11 like used animal names to name some of their models

00:23:15 that are based on this idea of large sequence models.

00:23:19 Initially they’re only language,

00:23:20 but we are expanding to other modalities.

00:23:23 So we had, you know, we had Gopher, Chinchilla,

00:23:28 these were language only.

00:23:29 And then more recently we released Flamingo,

00:23:32 which adds vision to the equation.

00:23:35 And then Gato, which adds vision

00:23:38 and then also actions in the mix, right?

00:23:41 As we discuss actually actions,

00:23:44 especially discrete actions like up, down, left, right.

00:23:47 I just told you the actions, but they’re words.

00:23:49 So you can kind of see how actions naturally map

00:23:52 to sequence modeling of words,

00:23:54 which these models are very powerful.

00:23:57 So Gato was named after, I believe,

00:24:01 I can only go from memory, right?

00:24:03 These, you know, these things always happen

00:24:06 with an amazing team of researchers behind.

00:24:08 So before the release, we had a discussion

00:24:12 about which animal would we pick, right?

00:24:14 And I think because of the word general agent, right?

00:24:18 And this is a property quite unique to Gato.

00:24:21 We kind of were playing with the GA words

00:24:24 and then, you know, Gato.

00:24:26 Rhymes with cat.

00:24:26 Yes.

00:24:28 And Gato is obviously the Spanish word for cat.

00:24:30 I had nothing to do with it, although I’m from Spain.

00:24:32 Oh, how do you, wait, sorry.

00:24:33 How do you say cat in Spanish?

00:24:34 Gato.

00:24:35 Oh, gato, okay.

00:24:36 Now it all makes sense.

00:24:37 Okay, okay, I see, I see, I see.

00:24:38 Now it all makes sense.

00:24:39 Okay, so.

00:24:39 How do you say meow in Spanish?

00:24:40 No, that’s probably the same.

00:24:41 I think you say it the same way,

00:24:44 but you write it as M, I, A, U.

00:24:48 Okay, it’s universal.

00:24:49 Yes.

00:24:50 All right, so then how does the thing work?

00:24:51 So you said general is, so you said language, vision.

00:24:57 And action. Action.

00:24:59 How does this, can you explain

00:25:01 what kind of neural networks are involved?

00:25:04 What does the training look like?

00:25:06 And maybe what do you,

00:25:09 are some beautiful ideas within the system?

00:25:11 Yeah, so maybe the basics of Gato

00:25:16 are not that dissimilar from many, many works that came before.

00:25:19 So here is where the sort of the recipe,

00:25:22 I mean, hasn’t changed too much.

00:25:24 There is a transformer model

00:25:25 that’s the kind of neural network

00:25:28 that essentially takes a sequence of modalities,

00:25:33 observations that could be words,

00:25:36 could be vision or could be actions.

00:25:38 And then its own objective that you train it to do

00:25:42 when you train it is to predict what the next anything is.

00:25:46 And anything means what’s the next action.

00:25:48 If this sequence that I’m showing you to train

00:25:51 is a sequence of actions and observations,

00:25:53 then you’re predicting what’s the next action

00:25:55 and the next observation, right?

00:25:57 So you think of these really as a sequence of bytes, right?

00:26:00 So take any sequence of words,

00:26:04 a sequence of interleaved words and images,

00:26:07 a sequence of maybe observations that are images

00:26:11 and moves in Atari up, down, left, right.

00:26:14 And these you just think of them as bytes

00:26:17 and you’re modeling what’s the next byte gonna be like.

00:26:20 And you might interpret that as an action

00:26:23 and then play it in a game,

00:26:25 or you could interpret it as a word

00:26:27 and then write it down

00:26:29 if you’re chatting with the system and so on.

00:26:32 So Gato basically can be thought as inputs,

00:26:36 images, text, video, actions.

00:26:41 It also actually inputs some sort of proprioception sensors

00:26:45 from robotics because robotics is one of the tasks

00:26:48 that it’s been trained to do.

00:26:49 And then at the output, similarly,

00:26:51 it outputs words, actions.

00:26:53 It does not output images, that’s just by design,

00:26:57 we decided not to go that way for now.

00:27:00 That’s also in part why it’s the beginning

00:27:02 because there’s more to do clearly.

00:27:04 But that’s kind of what the Gato is,

00:27:06 is this brain that essentially you give it any sequence

00:27:09 of these observations and modalities

00:27:11 and it outputs the next step.

00:27:13 And then off you go, you feed the next step into

00:27:17 and predict the next one and so on.
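
A rough sketch of the loop just described, with hypothetical `model`, `env`, and `tokenizer` interfaces (not Gato's actual API): the predicted token is interpreted as an action, executed in the environment, and the resulting observation is appended to the sequence before the next prediction.

```python
# Hypothetical sketch of the predict-act-observe loop described above.
def run_episode(model, env, tokenizer, max_steps=100):
    sequence = tokenizer.encode(env.reset())        # initial observation as tokens
    for _ in range(max_steps):
        token = model.predict_next_token(sequence)  # next token, given everything so far
        action = tokenizer.decode_action(token)     # interpret that token as an action
        observation, done = env.step(action)        # the environment responds
        sequence += [token] + tokenizer.encode(observation)
        if done:
            break
```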

00:27:20 Now, it is more than a language model

00:27:24 because even though you can chat with Gato,

00:27:26 like you can chat with Chinchilla or Flamingo,

00:27:30 it also is an agent, right?

00:27:33 So that’s why we call it the A of Gato,

00:27:37 like the letter A and also it’s general.

00:27:41 It’s not an agent that’s been trained to be good

00:27:43 at only StarCraft or only Atari or only Go.

00:27:47 It’s been trained on a vast variety of datasets.

00:27:51 What makes it an agent, if I may interrupt,

00:27:53 the fact that it can generate actions?

00:27:56 Yes, so when we call it, I mean, it’s a good question, right?

00:28:00 When do we call a model?

00:28:02 I mean, everything is a model,

00:28:03 but what is an agent in my view is indeed the capacity

00:28:07 to take actions in an environment that you then send to it

00:28:11 and then the environment might return

00:28:13 with a new observation

00:28:15 and then you generate the next action.

00:28:17 This actually, this reminds me of the question

00:28:20 from the side of biology, what is life?

00:28:23 Which is actually a very difficult question as well.

00:28:25 What is living, what is living when you think about life

00:28:29 here on this planet Earth?

00:28:31 And a question interesting to me about aliens,

00:28:33 what is life when we visit another planet?

00:28:35 Would we be able to recognize it?

00:28:37 And this feels like, it sounds perhaps silly,

00:28:40 but I don’t think it is.

00:28:41 At which point is the neural network a being versus a tool?

00:28:48 And it feels like action, ability to modify its environment

00:28:52 is that fundamental leap.

00:28:54 Yeah, I think it certainly feels like action

00:28:57 is a necessary condition to be more alive,

00:29:01 but probably not sufficient either.

00:29:04 So sadly I…

00:29:05 It’s a soul consciousness thing, whatever.

00:29:06 Yeah, yeah, we can get back to that later.

00:29:09 But anyways, going back to the meow and the gato, right?

00:29:12 So one of the leaps forward and what took the team a lot

00:29:17 of effort and time was, as you were asking,

00:29:21 how has gato been trained?

00:29:23 So I told you gato is this transformer neural network,

00:29:26 models actions, sequences of actions, words, et cetera.

00:29:30 And then the way we train it is by essentially pulling

00:29:35 data sets of observations, right?

00:29:39 So it’s a massive imitation learning algorithm

00:29:42 that imitates, obviously, what

00:29:45 is the next word that comes next from the usual

00:29:48 datasets we used before, right?

00:29:50 So these are these web scale style data sets of people

00:29:54 writing on the web or chatting or whatnot, right?

00:29:58 So that’s an obvious source that we use on all language work.

00:30:02 But then we also took a lot of agents

00:30:05 that we have at DeepMind.

00:30:06 I mean, as you know, DeepMind, we’re quite interested

00:30:10 in learning reinforcement learning and learning agents

00:30:14 that play in different environments.

00:30:17 So we kind of created a data set of these trajectories,

00:30:20 as we call them, or agent experiences.

00:30:23 So in a way, there are other agents

00:30:25 we trained for a single-minded purpose to, let’s say,

00:30:29 control a 3D game environment and navigate a maze.

00:30:33 So we had all the experience that

00:30:35 was created through one agent interacting

00:30:38 with that environment.

00:30:39 And we added this to the data set, right?

00:30:41 And as I said, we just see all the data,

00:30:44 all these sequences of words or sequences

00:30:46 of this agent interacting with that environment or agents

00:30:51 playing Atari and so on.

00:30:52 We see it as the same kind of data.

00:30:54 And so we mix these data sets together.

00:30:57 And we train Gato.

00:31:00 That’s the G part, right?

00:31:01 It’s general because it really has mixed.

00:31:05 It doesn’t have different brains for each modality

00:31:07 or each narrow task.

00:31:09 It has a single brain.

00:31:10 It’s not that big of a brain compared

00:31:12 to most of the neural networks we see these days.

00:31:14 It has 1 billion parameters.

00:31:18 Some models we’re seeing get in the trillions these days.

00:31:21 And certainly, 100 billion feels like a size

00:31:25 that is very common when you train these jobs.

00:31:29 So the actual agent is relatively small.

00:31:32 But it’s been trained on a very challenging, diverse data set,

00:31:36 not only containing all of the internet

00:31:38 but containing all these agent experience playing

00:31:40 very different, distinct environments.

00:31:43 So this brings us to the part of the tweet of this

00:31:46 is not the end, it’s the beginning.

00:31:48 It feels very cool to see Gato, in principle,

00:31:53 is able to control any sort of environments, especially

00:31:57 the ones that it’s been trained to do, these 3D games, Atari

00:32:00 games, all sorts of robotics tasks, and so on.

00:32:04 But obviously, it’s not as proficient

00:32:07 as the teachers it learned from on these environments.

00:32:10 Not obvious.

00:32:11 It’s not obvious that it wouldn’t be more proficient.

00:32:15 It’s just the current beginning part

00:32:17 is that the performance is such that it’s not as good

00:32:21 as if it’s specialized to that task.

00:32:23 Right.

00:32:23 So it’s not as good, although I would argue size matters here.

00:32:28 So, in fact, I would argue size always matters.

00:32:31 That’s a different conversation.

00:32:33 But for neural networks, certainly size does matter.

00:32:36 So it’s the beginning because it’s relatively small.

00:32:39 So obviously, scaling this idea up

00:32:42 might make the connections that exist between text

00:32:48 on the internet and playing Atari and so on more

00:32:51 synergistic with one another.

00:32:53 And you might gain.

00:32:54 And that moment, we didn’t quite see.

00:32:56 But obviously, that’s why it’s the beginning.

00:32:58 That synergy might emerge with scale.

00:33:00 Right, might emerge with scale.

00:33:02 And also, I believe there’s some new research or ways

00:33:05 in which you prepare the data that you

00:33:08 might need to make it more clear to the model

00:33:11 that you’re not only playing Atari,

00:33:14 and you start from a screen.

00:33:16 And here is up and a screen and down.

00:33:18 Maybe you can think of playing Atari

00:33:20 as there’s some sort of context that is needed for the agent

00:33:23 before it starts seeing, oh, this is an Atari screen.

00:33:26 I’m going to start playing.

00:33:28 You might require, for instance, to be told in words,

00:33:33 hey, in this sequence that I’m showing,

00:33:36 you’re going to be playing an Atari game.

00:33:39 So text might actually be a good driver to enhance the data.

00:33:44 So then these connections might be made more easily.

00:33:46 So that’s an idea that we start seeing in language.

00:33:51 But obviously, beyond, this is going to be effective.

00:33:55 It’s not like I don’t show you a screen,

00:33:57 and you, from scratch, you’re supposed to learn a game.

00:34:01 There is a lot of context we might set.

00:34:03 So there might be some work needed as well

00:34:05 to set that context.

00:34:07 But anyways, there’s a lot of work.

00:34:10 So that context puts all the different modalities

00:34:13 on the same level ground to provide the context best.

00:34:16 So maybe on that point, so there’s

00:34:19 this task, which may not seem trivial, of tokenizing the data,

00:34:25 of converting the data into pieces,

00:34:28 into basic atomic elements that then could cross modalities

00:34:34 somehow.

00:34:35 So what’s tokenization?

00:34:37 How do you tokenize text?

00:34:39 How do you tokenize images?

00:34:42 How do you tokenize games and actions and robotics tasks?

00:34:47 Yeah, that’s a great question.

00:34:48 So tokenization is the entry point

00:34:52 to actually make all the data look like a sequence,

00:34:55 because tokens then are just these little puzzle pieces.

00:34:59 We break down anything into these puzzle pieces,

00:35:01 and then we just model, what’s this puzzle look like when

00:35:05 you make it lay down in a line, so to speak, in a sequence?

00:35:09 So in Gato, the text, there’s a lot of work.

00:35:15 You tokenize text usually by looking

00:35:17 at commonly used substrings, right?

00:35:20 So, 'ing' in English is a very common substring,

00:35:23 so that becomes a token.

00:35:25 There’s quite a well studied problem on tokenizing text.

00:35:29 And Gato just used the standard techniques

00:35:31 that have been developed from many years,

00:35:34 even starting from n-gram models in the 1950s and so on.
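
A toy illustration of the common-substring idea described here: greedily match the longest known piece, so frequent chunks like "ing" become single tokens. The vocabulary below is made up; real systems (BPE, SentencePiece) learn it from data.

```python
# Toy subword tokenizer: longest-match against a made-up vocabulary.
VOCAB = {"play": 1, "ing": 2, "the": 3, "cat": 4, "s": 5, " ": 6}

def tokenize(text: str) -> list[int]:
    tokens, i = [], 0
    while i < len(text):
        for j in range(len(text), i, -1):   # try the longest candidate first
            if text[i:j] in VOCAB:
                tokens.append(VOCAB[text[i:j]])
                i = j
                break
        else:
            i += 1                           # skip characters the toy vocab can't cover
    return tokens

print(tokenize("playing the cats"))  # -> [1, 2, 6, 3, 6, 4, 5]
```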

00:35:38 Just for context, how many tokens,

00:35:40 what order, magnitude, number of tokens

00:35:42 is required for a word, usually?

00:35:45 What are we talking about?

00:35:46 Yeah, for a word in English, I mean,

00:35:48 every language is very different.

00:35:51 The current level or granularity of tokenization

00:35:53 generally means it’s maybe two to five.

00:35:57 I mean, I don’t know the statistics exactly,

00:36:00 but to give you an idea, we don’t tokenize

00:36:03 at the level of letters.

00:36:04 Then it would probably be, I don’t

00:36:05 know what the average length of a word is in English,

00:36:08 but that would be the minimum set of tokens you could use.

00:36:11 It was bigger than letters, smaller than words.

00:36:13 Yes, yes.

00:36:13 And you could think of very, very common words like the.

00:36:16 I mean, that would be a single token,

00:36:18 but very quickly you’re talking two, three, four tokens or so.

00:36:22 Have you ever tried to tokenize emojis?

00:36:24 Emojis are actually just sequences of letters, so.

00:36:30 Maybe to you, but to me they mean so much more.

00:36:33 Yeah, you can render the emoji, but you

00:36:35 might if you actually just.

00:36:36 Yeah, this is a philosophical question.

00:36:39 Is emojis an image or a text?

00:36:43 The way we do these things is they’re actually

00:36:46 mapped to small sequences of characters.

00:36:49 So you can actually play with these models

00:36:52 and input emojis, it will output emojis back,

00:36:55 which is actually quite a fun exercise.

00:36:57 You probably can find other tweets about these out there.

00:37:02 But yeah, so anyways, text.

00:37:04 It’s very clear how this is done.

00:37:06 And then in Gato, what we did for images

00:37:10 is we map images to essentially we compressed images,

00:38:14 so to speak, into something that looks less

00:37:19 like every pixel with every intensity.

00:37:21 That would mean we have a very long sequence, right?

00:37:23 Like if we were talking about 100 by 100 pixel images,

00:37:27 that would make the sequences far too long.

00:37:30 So what was done there is you just

00:37:32 use a technique that essentially compresses an image

00:37:35 into maybe 16 by 16 patches of pixels,

00:37:40 and then that is mapped, again, tokenized.

00:37:42 You just essentially quantize this space

00:37:45 into a special word that actually

00:37:48 maps to these little sequence of pixels.

00:37:51 And then you put the pixels together in some raster order,

00:37:55 and then that’s how you get out or in the image

00:37:59 that you’re processing.
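
A sketch of the image tokenization described here, under simplifying assumptions: cut the image into 16x16 patches in raster order and quantize each patch to the index of its nearest codebook entry. The codebook below is random just to show the shape of the computation; in practice it would reflect the statistics of real images.

```python
import numpy as np

def image_to_tokens(image: np.ndarray, codebook: np.ndarray, patch: int = 16) -> list[int]:
    """Quantize 16x16 patches to the index of the nearest codebook vector."""
    h, w = image.shape[:2]
    tokens = []
    for y in range(0, h - h % patch, patch):       # raster order: row by row
        for x in range(0, w - w % patch, patch):
            flat = image[y:y+patch, x:x+patch].reshape(-1)
            tokens.append(int(np.argmin(np.linalg.norm(codebook - flat, axis=1))))
    return tokens

codebook = np.random.rand(1024, 16 * 16 * 3)       # 1,024 made-up "visual words"
image = np.random.rand(96, 96, 3)                  # a dummy 96x96 RGB image
print(len(image_to_tokens(image, codebook)))       # 6x6 = 36 patch tokens
```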

00:38:00 But there’s no semantic aspect to that,

00:38:04 so you’re doing some kind of,

00:38:05 you don’t need to understand anything about the image

00:38:07 in order to tokenize it currently.

00:38:09 No, you’re only using this notion of compression.

00:38:12 So you’re trying to find common,

00:39:15 it’s like JPEG or all these algorithms.

00:38:17 It’s actually very similar at the tokenization level.

00:38:20 All we’re doing is finding common patterns

00:38:23 and then making sure in a lossy way we compress these images

00:38:27 given the statistics of the images

00:38:29 that are contained in all the data we deal with.

00:38:31 Although you could probably argue that JPEG

00:38:34 does have some understanding of images.

00:38:38 Because visual information, maybe color,

00:38:44 compressing crudely based on color

00:38:46 does capture something important about an image

00:38:51 that’s about its meaning, not just about some statistics.

00:38:54 Yeah, I mean, JPEG, as I said,

00:38:56 the algorithms actually look very similar;

00:38:58 they use the cosine transform in JPEG.

00:39:04 The approach we usually do in machine learning

00:39:07 when we deal with images and we do this quantization step

00:39:10 is a bit more data driven.

00:39:11 So rather than have some sort of Fourier basis

00:39:14 for how frequencies appear in the natural world,

00:39:18 we actually just use the statistics of the images

00:39:23 and then quantize them based on the statistics,

00:39:27 much like you do in words, right?

00:39:28 So common substrings are allocated a token

00:39:32 and images is very similar.

00:39:34 But there’s no connection.

00:39:36 The token space, if you think of,

00:39:39 oh, like the tokens are an integer

00:39:41 and in the end of the day.

00:39:42 So now like we work on, maybe we have about,

00:39:46 let’s say, I don’t know the exact numbers,

00:39:48 but let’s say 10,000 tokens for text, right?

00:39:51 Certainly more than characters

00:39:52 because we have groups of characters and so on.

00:39:55 So from one to 10,000, those are representing

00:39:58 all the language and the words we’ll see.

00:40:01 And then images occupy the next set of integers.

00:40:04 So they’re completely independent, right?

00:40:05 So from 10,001 to 20,000,

00:40:08 those are the tokens that represent

00:40:10 these other modality images.

00:40:12 And that is an interesting aspect that makes it orthogonal.

00:40:18 So what connects these concepts is the data, right?

00:40:21 Once you have a data set,

00:40:23 for instance, that captions images that tells you,

00:40:26 oh, this is someone playing a frisbee on a green field.

00:40:30 Now the model will need to predict the tokens

00:40:34 from the text green field to then the pixels.

00:40:37 And that will start making the connections

00:40:39 between the tokens.

00:40:40 So these connections happen as the algorithm learns.

00:40:43 And then the last, if we think of these integers,

00:40:45 the first few are words, the next few are images.

00:40:48 In Gato, we also allocated the highest order of integers

00:40:55 to actions, right?

00:40:56 Which we discretize and actions are very diverse, right?

00:40:59 In Atari, there’s, I don’t know, maybe 17 discrete actions.

00:41:04 In robotics, actions might be torques

00:41:06 and forces that we apply.

00:41:08 So we just use kind of similar ideas

00:41:11 to compress these actions into tokens.

00:41:14 And then we just, that’s how we map now

00:41:18 all the space to these sequence of integers.

00:41:20 But they occupy different space

00:41:22 and what connects them is then the learning algorithm.

00:41:24 That’s where the magic happens.
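
A small sketch of the shared integer space being described, using the illustrative sizes from the conversation (roughly 10,000 text tokens, image tokens in the next block, discretized actions at the top). The exact vocabulary sizes here are assumptions, not Gato's real numbers.

```python
# Illustrative token-ID layout: each modality owns a disjoint range of integers.
TEXT_VOCAB = 10_000     # text tokens:   0 .. 9,999
IMAGE_VOCAB = 10_000    # image tokens:  10,000 .. 19,999
ACTION_VOCAB = 1_024    # action tokens: 20,000 .. 21,023 (assumed size)

def text_token(i: int) -> int:   return i
def image_token(i: int) -> int:  return TEXT_VOCAB + i
def action_token(i: int) -> int: return TEXT_VOCAB + IMAGE_VOCAB + i

# An interleaved observation/action sequence, like the Atari example a bit later:
sequence = [image_token(1), image_token(4), image_token(5), action_token(6)]
print(sequence)  # [10001, 10004, 10005, 20006]
```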

00:41:26 So the modalities are orthogonal

00:41:28 to each other in token space.

00:41:30 So in the input, everything you add, you add extra tokens.

00:41:35 And then you’re shoving all of that into one place.

00:41:40 Yes, the transformer.

00:41:41 And that transformer, that transformer tries

00:41:46 to look at this gigantic token space

00:41:49 and tries to form some kind of representation,

00:41:52 some kind of unique wisdom

00:41:56 about all of these different modalities.

00:41:59 How’s that possible?

00:42:02 If you were to sort of like put your psychoanalysis hat on

00:42:06 and try to psychoanalyze this neural network,

00:42:09 is it schizophrenic?

00:42:11 Does it try to, given this very few weights,

00:42:17 represent multiple disjoint things

00:42:19 and somehow have them not interfere with each other?

00:42:22 Or is it somehow building on the joint strength,

00:42:27 on whatever is common to all the different modalities?

00:42:31 Like what, if you were to ask a question,

00:42:34 is it schizophrenic or is it of one mind?

00:42:38 I mean, it is one mind and it’s actually

00:42:42 the simplest algorithm, which that’s kind of in a way

00:42:46 how it feels like the field hasn’t changed

00:42:49 since back propagation and gradient descent

00:42:52 was proposed for learning neural networks.

00:42:55 So there is obviously details on the architecture.

00:42:58 This has evolved.

00:42:59 The current iteration is still the transformer,

00:43:03 which is a powerful sequence modeling architecture.

00:43:07 But then the goal of this, you know,

00:43:11 setting these weights to predict the data

00:43:13 is essentially the same as basically I could describe.

00:43:17 I mean, we described a few years ago,

00:43:18 AlphaStar, language modeling and so on, right?

00:43:21 We take, let’s say an Atari game,

00:43:24 we map it to a string of numbers

00:43:27 that will all be probably image space

00:43:30 and action space interleaved.

00:43:32 And all we’re gonna do is say, okay, given the numbers,

00:43:37 you know, 10,001, 10,004, 10,005,

00:43:40 the next number that comes is 20,006,

00:43:43 which is in the action space.

00:43:45 And you’re just optimizing these weights

00:43:48 via very simple gradients.

00:43:51 Like, you know, mathematically it’s almost

00:43:53 the most boring algorithm you could imagine.

00:43:55 We set the weights so that

00:43:57 given this particular instance,

00:44:00 these weights are set to maximize the probability

00:44:04 of having seen this particular sequence of integers

00:44:07 for this particular game.

00:44:09 And then the algorithm does this

00:44:11 for many, many, many iterations,

00:44:14 looking at different modalities, different games, right?

00:44:17 That’s the mixture of the dataset we discussed.
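
A minimal sketch (in PyTorch, not the actual Gato code) of the objective just described: regardless of modality, the loss is cross-entropy on predicting the next token, and the same weights are updated for every sequence drawn from the mixed dataset.

```python
import torch
import torch.nn.functional as F

def next_token_loss(model, sequence: torch.Tensor) -> torch.Tensor:
    """sequence: 1-D long tensor of token IDs, e.g. [10001, 10004, 10005, 20006]."""
    inputs, targets = sequence[:-1], sequence[1:]
    logits = model(inputs.unsqueeze(0))             # hypothetical model: (1, T, vocab) logits
    return F.cross_entropy(logits.view(-1, logits.size(-1)), targets)

# Training, conceptually: sample a sequence from the mixture of datasets,
# compute this loss, take a gradient step -- the same weights for every modality.
```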

00:44:20 So in a way, it’s a very simple algorithm

00:44:24 and the weights, right, they’re all shared, right?

00:44:27 So in terms of, is it focusing on one modality or not?

00:44:30 The intermediate weights that are converting

00:44:33 from these input of integers

00:44:35 to the target integer you’re predicting next,

00:44:37 those weights certainly are common.

00:44:40 And then the way that tokenization happens,

00:44:43 there is a special place in the neural network,

00:44:45 which is we map this integer, like number 10,001,

00:44:49 to a vector of real numbers.

00:44:51 Like real numbers, we can optimize them

00:44:54 with gradient descent, right?

00:44:56 The functions we learn

00:44:57 are actually surprisingly differentiable.

00:44:59 That’s why we compute gradients.

00:45:01 So this step is the only one

00:45:03 that this orthogonality you mentioned applies.

00:45:06 So mapping a certain token for text or image or actions,

00:45:12 each of these tokens gets its own little vector

00:45:15 of real numbers that represents this.

00:45:17 If you look at the field back many years ago,

00:45:19 people were talking about word vectors or word embeddings.

00:45:23 These are the same.

00:45:24 We have word vectors or embeddings.

00:45:26 We have image vector or embeddings

00:45:28 and action vector of embeddings.

00:45:30 And the beauty here is that as you train this model,

00:45:33 if you visualize these little vectors,

00:45:36 it might be that they start aligning

00:45:38 even though they’re independent parameters.

00:45:41 There could be anything,

00:45:42 but then it might be that you take the word gato or cat,

00:45:47 which maybe is common enough

00:45:48 that it actually has its own token.

00:45:50 And then you take pixels that have a cat

00:45:52 and you might start seeing

00:45:53 that these vectors look like they align, right?

00:45:57 So by learning from this vast amount of data,

00:46:00 the model is realizing the potential connections

00:46:03 between these modalities.
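
A sketch of the embedding step being described: every token ID, whatever its modality, indexes its own learned vector, and after training one can check whether, say, the vector for the word "cat" has drifted toward the vectors of image tokens common in cat photos. The IDs and dimensions below are made up.

```python
import numpy as np

VOCAB_SIZE, DIM = 21_024, 512                        # made-up sizes
embedding_table = np.random.randn(VOCAB_SIZE, DIM)   # learned parameters in the real model

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

word_cat  = embedding_table[4_242]    # hypothetical token ID for the word "cat"
image_cat = embedding_table[13_337]   # hypothetical image token frequent in cat photos
print(cosine(word_cat, image_cat))    # after training, this similarity may grow
```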

00:46:05 Now, I will say there will be another way,

00:46:07 at least in part, to not have these different vectors

00:46:13 for each different modality.

00:46:15 For instance, when I tell you about actions

00:46:18 in certain space, I’m defining actions by words, right?

00:46:22 So you could imagine a world in which I’m not learning

00:46:26 that the action 'up' in Atari is its own number.

00:46:31 The action 'up' in Atari maybe is literally the word

00:46:34 or the sentence 'up in Atari', right?

00:46:37 And that would mean we now leverage

00:46:39 much more from the language.

00:46:41 This is not what we did here,

00:46:42 but certainly it might make these connections

00:46:45 much easier to learn and also to teach the model

00:46:49 to correct its own actions and so on, right?

00:46:51 So all these to say that gato is indeed the beginning,

00:46:55 that it is a radical idea to do this this way,

00:46:59 but there’s probably a lot more to be done

00:47:02 and the results to be more impressive,

00:47:04 not only through scale, but also through some new research

00:47:07 that will come hopefully in the years to come.

00:47:10 So just to elaborate quickly,

00:47:12 you mean one possible next step

00:47:16 or one of the paths that you might take next

00:47:20 is doing the tokenization fundamentally

00:47:25 as a kind of linguistic communication.

00:47:28 So like you convert even images into language.

00:47:31 So doing something like a crude semantic segmentation,

00:47:35 trying to just assign a bunch of words to an image

00:47:38 that like have almost like a dumb entity

00:47:42 explaining as much as it can about the image.

00:47:45 And so you convert that into words

00:47:46 and then you convert games into words

00:47:49 and then you provide the context in words and all of it.

00:47:52 And eventually getting to a point

00:47:56 where everybody agrees with Noam Chomsky

00:47:58 that language is actually at the core of everything.

00:48:00 That’s it’s the base layer of intelligence

00:48:04 and consciousness and all that kind of stuff, okay.

00:48:07 You mentioned early on like size, it’s hard to grow.

00:48:11 What did you mean by that?

00:48:12 Because we’re talking about scale might change.

00:48:17 There might be, and we’ll talk about this too,

00:48:18 like there’s a emergent, there’s certain things

00:48:23 about these neural networks that are emergent.

00:48:25 So certain like performance we can see only with scale

00:48:28 and there’s some kind of threshold of scale.

00:48:30 So why is it hard to grow something like this Meow network?

00:48:36 So the Meow network, it’s not hard to grow

00:48:41 if you retrain it.

00:48:42 What’s hard is, well, we have now 1 billion parameters.

00:48:46 We train them for a while.

00:48:48 We spend some amount of work towards building these weights

00:48:53 that are an amazing initial brain

00:48:55 for doing these kinds of tasks we care about.

00:48:58 Could we reuse the weights and expand to a larger brain?

00:49:03 And that is extraordinarily hard,

00:49:06 but also exciting from a research perspective

00:49:10 and a practical perspective point of view, right?

00:49:12 So there’s this notion of modularity in software engineering

00:49:17 and we starting to see some examples

00:49:20 and work that leverages modularity.

00:49:23 In fact, if we go back one step from Gato

00:49:26 to a work that I would say train much larger,

00:49:29 much more capable network called Flamingo.

00:49:32 Flamingo did not deal with actions,

00:49:34 but it definitely dealt with images in an interesting way,

00:49:38 kind of akin to what Gato did,

00:49:40 but slightly different technique for tokenizing,

00:49:42 but we don’t need to go into that detail.

00:49:45 But what Flamingo also did, which Gato didn’t do,

00:49:49 and that just happens because these projects,

00:49:51 they’re different, it’s a bit of like the exploratory nature

00:49:55 of research, which is great.

00:49:57 The research behind these projects is also modular.

00:50:00 Yes, exactly.

00:50:01 And it has to be, right?

00:50:02 We need to have creativity

00:50:05 and sometimes you need to protect pockets of people,

00:50:09 researchers and so on.

00:50:10 By we, you mean humans.

00:50:11 Yes.

00:50:12 And also in particular researchers

00:50:14 and maybe even further DeepMind or other such labs.

00:50:18 And then the neural networks themselves.

00:50:20 So it’s modularity all the way down.

00:50:23 All the way down.

00:50:24 So the way that we did modularity very beautifully

00:50:27 in Flamingo is we took Chinchilla,

00:50:30 which is a language only model, not an agent,

00:50:33 if we think of actions being necessary for agency.

00:50:36 So we took Chinchilla, we took the weights of Chinchilla

00:50:40 and then we froze them.

00:50:42 We said, these don’t change.

00:50:44 We trained them to be very good at predicting the next word.

00:50:47 It’s a very good language model, state of the art

00:50:50 at the time we released it, et cetera, et cetera.

00:50:52 We’re going to add a capability to see, right?

00:50:55 We are going to add the ability to see

00:50:56 to this language model.

00:50:58 So we’re going to attach small pieces of neural networks

00:51:01 at the right places in the model.

00:51:03 It’s almost like I’m injecting the network

00:51:07 with some weights and some substructures

00:51:10 in a good way, right?

00:51:12 So you need the research to say, what is effective?

00:51:15 How do you add this capability

00:51:16 without destroying others, et cetera.

00:51:18 So we created a small sub network initialized,

00:51:24 not from random, but actually from self supervised learning

00:51:28 from a model that understands vision in general.

00:51:32 And then we took data sets that connect the two modalities,

00:51:37 vision and language.

00:51:38 And then we froze the main part,

00:51:41 the largest portion of the network, which was Chinchilla,

00:51:43 that is 70 billion parameters.

00:51:45 And then we added a few more parameters on top,

00:51:49 trained from scratch, and then some others

00:51:51 that were pre trained with the capacity to see,

00:51:55 like it was not tokenization

00:51:57 in the way I described for Gato, but it’s a similar idea.

00:52:01 And then we trained the whole system.

00:52:03 Parts of it were frozen, parts of it were new.

00:52:06 And all of a sudden, we developed Flamingo,

00:52:09 which is an amazing model that is essentially,

00:52:12 I mean, describing it is a chatbot

00:52:14 where you can also upload images

00:52:16 and start conversing about images.

00:52:19 But it’s also kind of a dialogue style chatbot.

00:52:23 So the input is images and text and the output is text.

00:52:26 Exactly.

00:52:28 How many parameters, you said 70 billion for Chinchilla?

00:52:31 Yeah, Chinchilla is 70 billion.

00:52:33 And then the ones we add on top,

00:52:34 which is almost like a way

00:52:38 to override its activations a little

00:52:40 so that when it sees vision,

00:52:42 it does kind of a correct computation of what it's seeing,

00:52:45 mapping it back to words, so to speak.

00:52:47 That adds an extra 10 billion parameters, right?

00:52:50 So it’s total 80 billion, the largest one we released.

00:52:53 And then you train it on a few datasets

00:52:57 that contain vision and language.

00:52:59 And once you interact with the model,

00:53:01 you start seeing that you can upload an image

00:53:04 and start sort of having a dialogue about the image,

00:53:07 which is actually not something,

00:53:09 it’s very similar and akin to what we saw in language only.

00:53:12 These prompting abilities that it has,

00:53:15 you can teach it a new vision task, right?

00:53:17 It does things beyond the capabilities

00:53:20 that in theory the datasets provided in themselves,

00:53:24 but because it leverages a lot of the language knowledge

00:53:27 acquired from Chinchilla,

00:53:28 it actually has this few shot learning ability

00:53:31 and these emerging abilities

00:53:33 that we didn’t even measure

00:53:34 while we were developing the model,

00:53:36 but once developed, then as you play with the interface,

00:53:40 you can start seeing, wow, okay, yeah, it’s cool.

00:53:42 We can upload, I think one of the tweets

00:53:45 talking about Twitter was this image from Obama

00:53:47 that is placing a weight

00:53:49 and someone is kind of weighing themselves

00:53:52 and it’s kind of a joke style image.

00:53:54 And it’s notable because I think Andrew Carpati

00:53:57 a few years ago said,

00:53:59 no computer vision system can understand

00:54:02 the subtlety of this joke in this image,

00:54:04 all the things that go on.

00:54:06 And so what we tried to do, and it's very anecdotal,

00:54:09 I mean, this is not a proof that we solved this issue,

00:54:12 but it just shows that you can upload now this image

00:54:15 and start conversing with the model,

00:54:17 trying to make out if it gets that there’s a joke

00:54:21 because the person weighing themselves

00:54:23 doesn’t see that someone behind

00:54:25 is making the weight higher and so on and so forth.

00:54:27 So it’s a fascinating capability

00:54:30 and it comes from this key idea of modularity

00:54:33 where we took a frozen brain

00:54:34 and we just added a new capability.

00:54:37 So the question is, should we,

00:54:40 so in a way you can see even from DeepMind,

00:54:42 we have Flamingo with this modular approach

00:54:46 that thus could leverage the scale a bit more reasonably

00:54:49 because we didn’t need to retrain a system from scratch.

00:54:52 And on the other hand, we had Gato,

00:54:54 which used the same data sets,

00:54:55 but then we trained it from scratch, right?

00:54:57 And so I guess big question for the community

00:55:00 is should we train from scratch

00:55:02 or should we embrace modularity?

00:55:04 And this, like, goes back to modularity

00:55:08 as a way to grow, and reuse seems natural

00:55:12 and it was very effective, certainly.

00:55:14 The next question is, if you go the way of modularity,

00:55:18 is there a systematic way of freezing weights

00:55:22 and joining different modalities across,

00:55:27 you know, not just two or three or four networks,

00:55:29 but hundreds of networks from all different kinds of places,

00:55:32 maybe open source network that looks at weather patterns

00:55:36 and you shove that in somehow

00:55:37 and then you have networks that, I don’t know,

00:55:40 do all kinds of stuff, play StarCraft

00:55:42 and play all the other video games

00:55:43 and you can keep adding them in without significant effort,

00:55:49 like maybe the effort scales linearly or something like that

00:55:53 as opposed to like the more network you add,

00:55:54 the more you have to worry about the instabilities created.

00:55:57 Yeah, so that vision is beautiful.

00:55:59 I think there’s still the question

00:56:03 about within single modalities, like Chinchilla was reused,

00:56:06 but now if we train a next iteration of language models,

00:56:10 are we gonna use Chinchilla or not?

00:56:11 Yeah, how do you swap out Chinchilla?

00:56:13 Right, so there’s still big questions,

00:56:15 but that idea is actually really akin to software engineering,

00:56:19 which we’re not reimplementing libraries from scratch,

00:56:22 we’re reusing and then building ever more amazing things,

00:56:25 including neural networks with software that we’re reusing.

00:56:28 So I think this idea of modularity, I like it,

00:56:32 I think it’s here to stay

00:56:33 and that’s also why I mentioned

00:56:35 it’s just the beginning, not the end.

00:56:38 You’ve mentioned meta learning,

00:56:39 so given this promise of Gato,

00:56:42 can we try to redefine this term

00:56:45 that’s almost akin to consciousness

00:56:47 because it means different things to different people

00:56:50 throughout the history of artificial intelligence,

00:56:52 but what do you think meta learning is

00:56:56 and looks like now in the five years, 10 years,

00:57:00 will it look like the system like Gato, but scaled?

00:57:03 What’s your sense of, what does meta learning look like?

00:57:07 Do you think with all the wisdom we’ve learned so far?

00:57:10 Yeah, great question.

00:57:11 Maybe it’s good to give another data point

00:57:14 looking backwards rather than forward.

00:57:16 So when we talk in 2019,

00:57:22 meta learning meant something that has changed

00:57:26 mostly through the revolution of GPT3 and beyond.

00:57:31 So what meta learning meant at the time

00:57:35 was driven by what benchmarks people care about

00:57:37 in meta learning.

00:57:38 And the benchmarks were about a capability

00:57:42 to learn about object identities.

00:57:44 So it was very much overfitted to vision

00:57:48 and object classification.

00:57:50 And the part that was meta about that was that,

00:57:52 oh, we’re not just learning a thousand categories

00:57:55 that ImageNet tells us to learn.

00:57:57 We’re going to learn object categories

00:57:59 that can be defined when we interact with the model.

00:58:03 So it’s interesting to see the evolution, right?

00:58:06 The way this started was we have a special language

00:58:10 that was a data set, a small data set

00:58:13 that we prompted the model with saying,

00:58:15 hey, here is a new classification task.

00:58:18 I’ll give you one image and the name,

00:58:21 which was an integer at the time of the image

00:58:24 and a different image and so on.

00:58:25 So you have a small prompt in the form of a data set,

00:58:30 a machine learning data set.

00:58:31 And then you got then a system that could then predict

00:58:35 or classify these objects that you just

00:58:37 defined kind of on the fly.
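
As an illustration of the older "small dataset as a prompt" setup described here, a hypothetical few-shot episode might be laid out like this; the field names and shapes are made up.

```python
# Sketch of a few-shot "episode": a tiny labeled support set used as the prompt,
# plus a query to classify on the fly. Field names are made up for illustration.
import numpy as np

rng = np.random.default_rng(0)

episode = {
    "support": [                                   # the small "dataset" we prompt with
        {"image": rng.normal(size=8), "label": 0},
        {"image": rng.normal(size=8), "label": 1},
        {"image": rng.normal(size=8), "label": 0},
    ],
    "query": {"image": rng.normal(size=8)},        # new image, class defined on the fly
}

# A meta-learner would consume the whole episode as one sequence and
# predict a label for the query; here we just show the flattened prompt.
prompt = [(ex["label"], ex["image"]) for ex in episode["support"]]
prompt.append(("?", episode["query"]["image"]))
print(len(prompt), "items in the prompt; the last one is the query")
```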

00:58:40 So fast forward, it was revealed that language models

00:58:46 are few shot learners.

00:58:47 That’s the title of the paper.

00:58:49 So very good title.

00:58:50 Sometimes titles are really good.

00:58:51 So this one is really, really good.

00:58:53 Because that’s the point of GPT3 that showed that, look, sure,

00:58:58 we can focus on object classification

00:59:00 and what meta learning means within the space of learning

00:59:04 object categories.

00:59:05 This goes beyond, or before rather,

00:59:07 to also Omniglot, before ImageNet and so on.

00:59:10 So there’s a few benchmarks.

00:59:11 To now, all of a sudden, we’re a bit unlocked from benchmarks.

00:59:15 And through language, we can define tasks.

00:59:17 So we’re literally telling the model

00:59:20 some logical task or a little thing that we wanted to do.

00:59:23 We prompt it much like we did before,

00:59:26 but now we prompt it through natural language.

00:59:28 And then not perfectly, I mean, these models have failure modes

00:59:32 and that’s fine, but these models then

00:59:35 are now doing a new task.

00:59:37 And so they meta learn this new capability.

00:59:40 Now, that’s where we are now.

00:59:43 Flamingo expanded this to visual and language,

00:59:47 but it basically has the same abilities.

00:59:49 You can teach it, for instance, an emergent property

00:59:52 was that you can take pictures of numbers

00:59:55 and then do arithmetic with the numbers just by teaching it,

00:59:59 oh, when I show you 3 plus 6, I want you to output 9.

01:00:03 And you show it a few examples, and now it does that.
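
A minimal sketch of prompting-as-teaching with text only, in the spirit of the arithmetic example: a few worked examples define a new task with no weight updates. The prompt template here is invented, and real models vary in the format they expect.

```python
# A few worked examples in the prompt define a new task (here, addition)
# without any weight updates; the template is invented.
examples = [("3 + 6", "9"), ("2 + 5", "7"), ("4 + 4", "8")]
query = "7 + 1"

prompt = "".join(f"Q: {q}\nA: {a}\n" for q, a in examples) + f"Q: {query}\nA:"
print(prompt)
# A few-shot learner is expected to continue this text with "8",
# having picked up the task from the three examples alone.
```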

01:00:06 So it went way beyond the ImageNet categorization of images

01:00:12 that we were a bit stuck maybe before this revelation

01:00:17 moment that happened in,

01:00:19 I believe it was 2019, but it was after, we checked.

01:00:21 In that way, it has solved meta learning

01:00:24 as was previously defined.

01:00:26 Yes, it expanded what it meant.

01:00:27 So that’s what you say, what does it mean?

01:00:29 So it’s an evolving term.

01:00:31 But here is maybe now looking forward,

01:00:35 looking at what’s happening, obviously,

01:00:38 in the community with more modalities, what we can expect.

01:00:42 And I would certainly hope to see the following.

01:00:45 And this is a pretty drastic hope.

01:00:48 But in five years, maybe we chat again.

01:00:51 And we have a system, a set of weights

01:00:55 that we can teach it to play StarCraft.

01:00:59 Maybe not at the level of AlphaStar,

01:01:01 but play StarCraft, a complex game,

01:01:03 we teach it through interactions to prompting.

01:01:06 You can certainly prompt a system.

01:01:08 That’s what Gata shows to play some simple Atari games.

01:01:11 So imagine if you start talking to a system,

01:01:15 teaching it a new game, showing it

01:01:17 examples of in this particular game,

01:01:20 this user did something good.

01:01:22 Maybe the system can even play and ask you questions.

01:01:25 Say, hey, I played this game.

01:01:27 I just played this game.

01:01:28 Did I do well?

01:01:29 Can you teach me more?

01:01:30 So five, maybe to 10 years, these capabilities,

01:01:34 or what meta learning means, will

01:01:36 be much more interactive, much more rich,

01:01:38 and through domains that we were specializing in.

01:01:41 So you see the difference.

01:01:42 We built AlphaStar, specialized to play StarCraft.

01:01:47 The algorithms were general, but the weights were specialized.

01:01:50 And what we’re hoping is that we can teach a network

01:01:54 to play games, to play any game, just using games as an example,

01:01:58 through interacting with it, teaching it,

01:02:01 uploading the Wikipedia page of StarCraft.

01:02:04 This is on the horizon.

01:02:06 And obviously, there are details that need to be filled

01:02:09 and research needs to be done.

01:02:11 But that’s how I see meta learning above,

01:02:13 which is going to be beyond prompting.

01:02:15 It’s going to be a bit more interactive.

01:02:18 The system might tell us to give it feedback

01:02:20 after it maybe makes mistakes or it loses a game.
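
A purely hypothetical sketch of the interactive teaching loop described above, where feedback is appended to the model's context rather than used to update its weights; every class and method here is a made-up stub.

```python
# Made-up stubs sketching an interactive teaching loop: play, ask for feedback,
# append the feedback to the context (no gradient updates anywhere).
import random

class StubGameModel:
    def act(self, context):
        # a real model would condition on the whole context; this stub guesses
        return random.choice(["build worker", "attack", "scout"])

class StubTeacher:
    def describe(self):
        return "a tiny made-up strategy game"

    def step(self, action):
        return "an okay outcome"

    def feedback(self, action):
        return f"'{action}' was fine; try scouting earlier next time."

model, teacher = StubGameModel(), StubTeacher()
context = ["Rules: " + teacher.describe()]            # e.g. an uploaded wiki page
for _ in range(3):
    action = model.act(context)                       # the system plays
    context.append(f"I tried '{action}' and got {teacher.step(action)}. Did I do well?")
    context.append(teacher.feedback(action))          # teaching happens in context
print("\n".join(context))
```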

01:02:24 But it’s nonetheless very exciting

01:02:26 because if you think about this this way,

01:02:28 the benchmarks are already there.

01:02:30 We just repurposed the benchmarks.

01:02:33 So in a way, I like to map the space of what

01:02:38 maybe AGI means to say, OK, we went to 101% performance in Go,

01:02:45 in Chess, in StarCraft.

01:02:47 The next iteration might be 20% performance

01:02:51 across, quote unquote, all tasks.

01:02:54 And even if it’s not as good, it’s fine.

01:02:57 We have ways to also measure progress

01:02:59 because we have those specialized agents and so on.

01:03:04 So this is, to me, very exciting.

01:03:06 And these next iteration models are definitely

01:03:10 hinting at that direction of progress,

01:03:13 which hopefully we can have.

01:03:14 There are obviously some things that

01:03:16 could go wrong in terms of we might not have the tools.

01:03:20 Maybe transformers are not enough.

01:03:22 There are some breakthroughs to come, which

01:03:24 makes the field more exciting to people like me as well,

01:03:27 of course.

01:03:28 But that’s, if you ask me, five to 10 years,

01:03:32 you might see these models that start

01:03:33 to look more like weights that are already trained.

01:03:36 And then it’s more about teaching or make

01:03:40 their meta learn what you’re trying

01:03:44 to induce in terms of tasks and so on,

01:03:47 well beyond the simple now tasks we’re

01:03:49 starting to see emerge like small arithmetic tasks

01:03:53 and so on.

01:03:54 So a few questions around that.

01:03:55 This is fascinating.

01:03:57 So that kind of teaching, interactive,

01:04:01 so it’s beyond prompting.

01:04:02 So it’s interacting with the neural network.

01:04:05 That’s different than the training process.

01:04:08 So it’s different than the optimization

01:04:12 over differentiable functions.

01:04:15 This is already trained.

01:04:17 And now you’re teaching, I mean, it’s

01:04:21 almost akin to the brain, the neurons already

01:04:25 set with their connections.

01:04:26 On top of that, you’re now using that infrastructure

01:04:30 to build up further knowledge.

01:04:33 So that’s a really interesting distinction that’s actually

01:04:37 not obvious from a software engineering perspective,

01:04:40 that there’s a line to be drawn.

01:04:42 Because you always think for a neural network to learn,

01:04:44 it has to be retrained, trained and retrained.

01:04:49 And prompting is a way of teaching.

01:04:54 And you’ll now work a little bit of context

01:04:55 about whatever the heck you’re trying it to do.

01:04:57 So you can maybe expand this prompting capability

01:05:00 by making it interact.

01:05:03 That’s really, really interesting.

01:05:04 By the way, this is not new, if you look way back

01:05:08 at different ways to tackle even classification tasks.

01:05:11 So this comes from longstanding literature

01:05:16 in machine learning.

01:05:18 What I’m suggesting could sound to some

01:05:20 like a bit like nearest neighbor.

01:05:23 So nearest neighbor is almost the simplest algorithm

01:05:27 that does not require learning.

01:05:30 So it has this interesting, you don’t

01:05:32 need to compute gradients.

01:05:34 And what nearest neighbor does is you, quote unquote,

01:05:37 have a data set or upload a data set.

01:05:39 And then all you need to do is a way

01:05:42 to measure distance between points.

01:05:44 And then to classify a new point,

01:05:46 you’re just simply computing, what’s

01:05:48 the closest point in this massive amount of data?

01:05:51 And that’s my answer.

01:05:52 So you can think of prompting in a way

01:05:55 as you’re uploading not just simple points.

01:05:58 And the metric is not the distance between the images

01:06:02 or something simple.

01:06:03 It’s something that you compute that’s much more advanced.

01:06:06 But in a way, it’s very similar.

01:06:09 You simply are uploading some knowledge

01:06:12 to this pre trained system in nearest neighbor.

01:06:15 Maybe the metric is learned or not,

01:06:17 but you don’t need to further train it.

01:06:19 And then now you immediately get a classifier out of this.

01:06:23 Now it’s just an evolution of that concept,

01:06:25 very classical concept in machine learning, which

01:06:28 is just learning through what’s the closest point, closest

01:06:32 by some distance, and that’s it.

01:06:34 It’s an evolution of that.

01:06:36 And I will say how I saw meta learning when

01:06:39 we worked on a few ideas in 2016 was precisely

01:06:44 through the lens of nearest neighbor, which

01:06:47 is very common in computer vision community.

01:06:50 There’s a very active area of research

01:06:52 about how do you compute the distance between two images.

01:06:55 But if you have a good distance metric,

01:06:57 you also have a good classifier.

01:06:59 All I’m saying is now these distances and the points

01:07:02 are not just images.

01:07:03 They’re like words or sequences of words and images

01:07:08 and actions that teach you something new.

01:07:10 But it might be that technique wise those come back.

01:07:14 And I will say that it’s not necessarily true

01:07:18 that you might not ever train the weights a bit further.

01:07:21 Some aspect of meta learning, some techniques

01:07:24 in meta learning do actually do a bit of fine tuning

01:07:28 as it’s called.

01:07:29 They train the weights a little bit when they get a new task.

01:07:32 So on what I call the how, or how we're going to achieve this,

01:07:37 as a deep learner, I'm very skeptical.

01:07:39 We’re going to try a few things, whether it’s

01:07:41 a bit of training, adding a few parameters,

01:07:44 thinking of these as nearest neighbor,

01:07:45 or just simply thinking of there’s a sequence of words,

01:07:49 it’s a prefix.

01:07:50 And that’s the new classifier.

01:07:53 We’ll see.

01:07:53 There’s the beauty of research.

01:07:55 But what’s important is that is a good goal in itself

01:08:00 that I see as very worthwhile pursuing for the next stages

01:08:03 of not only meta learning.

01:08:05 I think this is basically what’s exciting about machine learning

01:08:10 period to me.

01:08:11 Well, and the interactive aspect of that

01:08:13 is also very interesting, the interactive version

01:08:16 of nearest neighbor to help you pull out the classifier

01:08:22 from this giant thing.

01:08:23 OK, is this the way we can go in 5, 10 plus years

01:08:31 from any task, sorry, from many tasks to any task?

01:08:38 And what does that mean?

01:08:39 What does it need to be actually trained on?

01:08:42 At which point has the network had enough?

01:08:45 So what does a network need to learn about this world

01:08:50 in order to be able to perform any task?

01:08:52 Is it just as simple as language, image, and action?

01:08:57 Or do you need some set of representative images?

01:09:02 Like if you only see land images,

01:09:05 will you know anything about underwater?

01:09:06 Is that something fundamentally different?

01:09:08 I don’t know.

01:09:09 I mean, those are open questions, I would say.

01:09:12 I mean, the way you put, let me maybe further your example.

01:09:15 If all you see is land images but you’re

01:09:18 reading all about land and water worlds

01:09:21 but in books, imagine, would that be enough?

01:09:25 Good question.

01:09:26 We don’t know.

01:09:27 But I guess maybe you can join us

01:09:30 if you want in our quest to find this.

01:09:32 That’s precisely.

01:09:33 Water world, yeah.

01:09:34 Yes, that’s precisely, I mean, the beauty of research.

01:09:37 And that’s the research business we’re in,

01:09:42 I guess, is to figure this out and ask the right questions

01:09:46 and then iterate with the whole community,

01:09:49 publishing findings and so on.

01:09:52 But yeah, this is a question.

01:09:55 It’s not the only question, but it’s certainly, as you ask,

01:09:58 on my mind constantly.

01:10:00 And so we’ll need to wait for maybe the, let’s say, five

01:10:03 years, let’s hope it’s not 10, to see what are the answers.

01:10:09 Some people will largely believe in unsupervised or

01:10:12 self supervised learning of single modalities

01:10:15 and then crossing them.

01:10:18 Some people might think end to end learning is the answer.

01:10:21 Modularity is maybe the answer.

01:10:23 So we don’t know, but we’re just definitely excited

01:10:27 to find out.

01:10:27 But it feels like this is the right time

01:10:29 and we’re at the beginning of this journey.

01:10:31 We’re finally ready to do these kind of general big models

01:10:36 and agents.

01:10:37 What sort of specific technical thing

01:10:42 about Gato, Flamingo, Chinchilla, Gopher, any of these

01:10:48 that is especially beautiful, that was surprising, maybe?

01:10:51 Is there something that just jumps out at you?

01:10:55 Of course, there’s the general thing of like,

01:10:57 you didn’t think it was possible and then you

01:11:00 realize it’s possible in terms of the generalizability

01:11:03 across modalities and all that kind of stuff.

01:11:05 Or maybe how small of a network, relatively speaking,

01:11:08 Gato is, all that kind of stuff.

01:11:10 But is there some weird little things that were surprising?

01:11:15 Look, I’ll give you an answer that’s very important

01:11:18 because maybe people don’t quite realize this,

01:11:22 but the teams behind these efforts, the actual humans,

01:11:27 that’s maybe the surprising in an obviously positive way.

01:11:31 So anytime you see these breakthroughs,

01:11:34 I mean, it’s easy to map it to a few people.

01:11:37 There’s people that are great at explaining things and so on.

01:11:39 And that’s very nice.

01:11:40 But maybe the learnings or the meta learnings

01:11:44 that I get as a human about this is, sure, we can move forward.

01:11:50 But the surprising bit is how important

01:11:55 are all the pieces of these projects,

01:11:58 how do they come together?

01:12:00 So I’ll give you maybe some of the ingredients of success

01:12:04 that are common across these, but not the obvious ones

01:12:07 on machine learning.

01:12:08 I can always also give you those.

01:12:11 But basically, there is engineering is critical.

01:12:17 So very good engineering because ultimately we’re

01:12:21 collecting data sets, right?

01:12:23 So the engineering of data and then

01:12:26 of deploying the models at scale into some compute cluster

01:12:31 that cannot be overstated, that is a huge factor of success.

01:12:36 And it’s hard to believe that details matter so much.

01:12:41 We would like to believe that it’s

01:12:43 true that there is more and more of a standard formula,

01:12:47 as I was saying, like this recipe that

01:12:49 works for everything.

01:12:50 But then when you zoom into each of these projects,

01:12:53 then you realize the devil is indeed in the details.

01:12:57 And then the teams have to work together towards these goals.

01:13:03 So engineering of data and obviously clusters

01:13:07 and large scale is very important.

01:13:09 And then one that is often not, maybe nowadays it is more clear

01:13:15 is benchmark progress, right?

01:13:17 So we’re talking here about multiple months of tens

01:13:20 of researchers and people that are

01:13:24 trying to organize the research and so on working together.

01:13:28 And you don’t know that you can get there.

01:13:32 I mean, this is the beauty.

01:13:34 If you’re not risking to trying to do something

01:13:37 that feels impossible, you’re not going to get there.

01:13:41 But you need a way to measure progress.

01:13:43 So the benchmarks that you build are critical.

01:13:47 I’ve seen this beautifully play out in many projects.

01:13:50 I mean, maybe the one I’ve seen it more consistently,

01:13:53 which means we establish the metric,

01:13:56 actually the community did.

01:13:58 And then we leveraged it massively, is AlphaFold.

01:14:01 This is a project where the data, the metrics

01:14:05 were all there.

01:14:06 And all it took was, and it’s easier said than done,

01:14:09 an amazing team working not to try

01:14:12 to find some incremental improvement

01:14:14 and publish, which is one way to do research that is valid,

01:14:17 but aim very high and work literally for years

01:14:22 to iterate over that process.

01:14:24 And working for years with the team,

01:14:25 I mean, it is tricky that also happened to happen partly

01:14:30 during a pandemic and so on.

01:14:32 So I think my meta learning from all this

01:14:34 is the teams are critical to the success.

01:14:37 And then, now going to the machine learning side,

01:14:40 the part that's surprising is, we like architectures,

01:14:46 like neural networks.

01:14:48 And I would say this was a very rapidly evolving field

01:14:53 until the transformer came.

01:14:54 So attention might indeed be all you need,

01:14:58 which is the title, also a good title,

01:15:00 although it's only in hindsight that it's good.

01:15:02 I don’t think at the time I thought

01:15:03 this is a great title for a paper.

01:15:05 But that architecture is proving that the dream of modeling

01:15:10 sequences of any bytes, there is something there that will stick.

01:15:15 And I think these advances in architectures,

01:15:18 in how neural networks are architected

01:15:21 to do what they do.

01:15:23 It’s been hard to find one that has been so stable

01:15:26 and relatively has changed very little

01:15:28 since it was invented five or so years ago.

01:15:33 So that is surprising, a surprise

01:15:35 that keeps recurring in other projects.

01:15:38 Try to, on a philosophical or technical level, introspect,

01:15:43 what is the magic of attention?

01:15:45 What is attention?

01:15:47 That’s attention in people that study cognition,

01:15:50 so human attention.

01:15:52 I think there’s giant wars over what attention means,

01:15:55 how it works in the human mind.

01:15:57 So there’s very simple looks at what

01:16:00 attention is in a neural network from the days of attention

01:16:03 is all you need.

01:16:04 But do you think there’s a general principle that’s

01:16:07 really powerful here?

01:16:08 Yeah, so a distinction between transformers and LSTMs,

01:16:13 which were what came before.

01:16:15 And there was a transitional period

01:16:17 where you could use both.

01:16:19 In fact, when we talked about AlphaStar,

01:16:22 we used transformers and LSTMs.

01:16:24 So it was still the beginning of transformers.

01:16:26 They were very powerful.

01:16:27 But LSTMs were also very powerful sequence models.

01:16:31 So the power of the transformer is

01:16:35 that it has built in what we call

01:16:38 an inductive bias of attention that makes the model.

01:16:43 When you think of a sequence of integers,

01:16:45 like we discussed this before, this is a sequence of words.

01:16:50 When you have to do very hard tasks over these words,

01:16:54 this could be we’re going to translate a whole paragraph

01:16:57 or we’re going to predict the next paragraph given

01:17:00 10 paragraphs before.

01:17:04 There’s some loose intuition from how we do it as a human

01:17:10 that is very nicely mimicked and replicated structurally

01:17:15 speaking in the transformer, which

01:17:16 is this idea of you’re looking for something.

01:17:21 So you’re sort of when you just read a piece of text,

01:17:25 now you’re thinking what comes next.

01:17:27 You might want to relook at the text or look at it from scratch.

01:17:31 I mean, literally, because there's no recurrence.

01:17:35 You’re just thinking what comes next.

01:17:37 And it’s almost hypothesis driven.

01:17:40 So if I’m thinking the next word that I write is cat or dog,

01:17:46 the way the transformer works almost philosophically

01:17:49 is it has these two hypotheses.

01:17:52 Is it going to be cat or is it going to be dog?

01:17:55 And then it says, OK, if it’s cat,

01:17:58 I’m going to look for certain words.

01:17:59 Not necessarily cat, although cat is an obvious word

01:18:01 you would look in the past to see

01:18:03 whether it makes more sense to output cat or dog.

01:18:05 And then it does some very deep computation

01:18:09 over the words and beyond.

01:18:11 So it combines the words, but it has the query

01:18:16 as we call it that is cat.

01:18:18 And then similarly for dog.

01:18:20 And so it’s a very computational way to think about, look,

01:18:24 if I’m thinking deeply about text,

01:18:27 I need to go back to look at all of the text, attend over it.

01:18:30 But it’s not just attention.

01:18:32 What is guiding the attention?

01:18:34 And that was the key insight from an earlier paper

01:18:36 is that it's not how far away it is.

01:18:39 I mean, how far away it is, is important.

01:18:40 What did I just write about?

01:18:42 That’s critical.

01:18:44 But what you wrote about 10 pages ago

01:18:46 might also be critical.

01:18:48 So you’re looking not positionally, but content wise.

01:18:53 And transformers have this beautiful way

01:18:56 to query for certain content and pull it out

01:18:59 in a compressed way.

01:19:00 So then you can make a more informed decision.
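
A bare-bones sketch of the content-based attention step being described: a query ("am I writing about cat or dog?") is scored against the content of every past token, and the relevant content is pulled out in a weighted, compressed way. Shapes and data are invented; real transformers use learned projections and many heads.

```python
# One content-based attention step: score a query against every past token,
# softmax the scores, and pull out a weighted summary of the relevant content.
import numpy as np

def attention(query, keys, values):
    """query: (d,), keys and values: (n, d); returns a content-weighted summary."""
    scores = keys @ query / np.sqrt(query.shape[0])   # similarity by content, not position
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                          # softmax over past positions
    return weights @ values                           # compressed, informed context

rng = np.random.default_rng(0)
past_tokens = rng.normal(size=(10, 16))   # representations of earlier words
query = rng.normal(size=16)               # "what am I looking for next?"
print(attention(query, past_tokens, past_tokens).shape)   # (16,)
```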

01:19:02 I mean, that’s one way to explain transformers.

01:19:05 But I think it’s a very powerful inductive bias.

01:19:10 There might be some details that might change over time,

01:19:12 but I think that is what makes transformers so much more

01:19:17 powerful than the recurrent networks that

01:19:20 were more recency bias based, which obviously works

01:19:23 in some tasks, but it has major flaws.

01:19:26 Transformer itself has flaws.

01:19:29 And I think the main one, the main challenge

01:19:31 is these prompts that we just were talking about,

01:19:35 they can be 1,000 words long.

01:19:38 But if I’m teaching you StarGraph,

01:19:40 I’ll have to show you videos.

01:19:41 I’ll have to point you to whole Wikipedia articles

01:19:44 about the game.

01:19:46 We’ll have to interact probably as you play.

01:19:48 You’ll ask me questions.

01:19:49 The context required for us to achieve

01:19:52 me being a good teacher to you on the game

01:19:54 as you would want to do it with a model, I think

01:19:58 goes well beyond the current capabilities.

01:20:01 So the question is, how do we benchmark this?

01:20:03 And then how do we change the structure of the architectures?

01:20:07 I think there’s ideas on both sides,

01:20:08 but we’ll have to see empirically, obviously,

01:20:11 what ends up working.

01:20:13 And as you talked about, some of the ideas

01:20:15 could be keeping the constraint of that length in place,

01:20:19 but then forming hierarchical representations

01:20:23 to where you can start being much clever in how

01:20:26 you use those 1,000 tokens.

01:20:28 Indeed.

01:20:30 Yeah, that’s really interesting.

01:20:32 But it also is possible that this attentional mechanism

01:20:34 where you basically, you don’t have a recency bias,

01:20:37 but you look more generally, you make it learnable.

01:20:42 The mechanism in which way you look back into the past,

01:20:45 you make that learnable.

01:20:46 It’s also possible we’re at the very beginning of that

01:20:50 because that, you might become smarter and smarter

01:20:54 in the way you query the past.

01:20:58 So recent past and distant past and maybe very, very distant

01:21:01 past.

01:21:02 So almost like the attention mechanism

01:21:04 will have to improve and evolve as good as the tokenization

01:21:11 mechanism so you can represent long term memory somehow.

01:21:14 Yes.

01:21:16 And I mean, hierarchies are very,

01:21:18 I mean, it’s a very nice word that sounds appealing.

01:21:22 There’s lots of work adding hierarchy to the memories.

01:21:25 In practice, it does seem like we keep coming back

01:21:29 to the main formula or main architecture.

01:21:33 That sometimes tells us something.

01:21:35 There is such a sentence that a friend of mine told me,

01:21:38 like, whether it wants to work or not.

01:21:41 So Transformer was clearly an idea that wanted to work.

01:21:44 And then I think there’s some principles

01:21:47 we believe will be needed.

01:21:49 But finding the exact details, details matter so much.

01:21:52 That’s going to be tricky.

01:21:54 I love the idea that there’s like you as a human being,

01:21:59 you want some ideas to work.

01:22:01 And then there’s the model that wants some ideas

01:22:03 to work and you get to have a conversation

01:22:05 to see which wins; more likely, the model will win in the end.

01:22:10 Because it’s the one, you don’t have to do any work.

01:22:12 The model is the one that has to do the work.

01:22:14 So you should listen to the model.

01:22:15 And I really love this idea that you

01:22:17 talked about the humans in this picture.

01:22:19 If I could just briefly ask, one is you’re

01:23:21 saying the benchmarks, and the modular humans working on this,

01:23:28 the benchmarks providing a sturdy ground for a wish

01:22:32 to do these things that seem impossible.

01:22:34 They give you, in the darkest of times,

01:22:37 give you hope because little signs of improvement.

01:22:41 Yes.

01:22:42 Like somehow you’re not lost if you have metrics

01:22:46 to measure your improvement.

01:22:48 And then there’s other aspect.

01:22:50 You said elsewhere and here today, like titles matter.

01:22:56 I wonder how much humans matter in the evolution

01:23:01 of all of this, meaning individual humans.

01:23:06 Something about their interactions,

01:23:08 something about their ideas, how much they change

01:23:11 the direction of all of this.

01:23:12 Like if you change the humans in this picture,

01:23:15 is it that the model is sitting there

01:23:18 and it wants some idea to work?

01:23:22 Or is it the humans, or maybe the model

01:23:25 is providing you 20 ideas that could work.

01:23:27 And depending on the humans you pick,

01:23:29 they’re going to be able to hear some of those ideas.

01:23:33 Because you’re now directing all of deep learning and deep mind,

01:23:35 you get to interact with a lot of projects,

01:23:37 a lot of brilliant researchers.

01:23:40 How much variability is created by the humans in all of this?

01:23:44 Yeah, I mean, I do believe humans matter a lot,

01:23:47 at the very least at the time scale of years

01:23:53 on when things are happening and what’s the sequencing of it.

01:23:56 So you get to interact with people that, I mean,

01:24:00 you mentioned this.

01:24:02 Some people really want some idea to work

01:24:05 and they’ll persist.

01:24:07 And then some other people might be more practical,

01:24:09 like I don’t care what idea works.

01:24:12 I care about cracking protein folding.

01:24:16 And at least these two kind of seem opposite sides.

01:24:21 We need both.

01:24:22 And we’ve clearly had both historically,

01:24:25 and that made certain things happen earlier or later.

01:24:28 So definitely humans involved in all of this endeavor

01:24:33 have had, I would say, years of change or of ordering

01:24:38 how things have happened, which breakthroughs came before,

01:24:41 which other breakthroughs, and so on.

01:24:43 So certainly that does happen.

01:24:45 And so one other, maybe one other axis of distinction

01:24:50 is what I called, and this is most commonly used

01:24:53 in reinforcement learning, is the exploration exploitation

01:24:56 trade off as well.

01:24:57 It’s not exactly what I meant, although quite related.

01:25:00 So when you start trying to help others,

01:25:07 like you become a bit more of a mentor

01:25:11 to a large group of people, be it a project or the deep

01:25:14 learning team or something, or even in the community

01:25:17 when you interact with people in conferences and so on,

01:25:20 you’re identifying quickly some things that are explorative

01:25:26 or exploitative.

01:25:27 And it’s tempting to try to guide people, obviously.

01:25:30 I mean, that’s what makes our experience.

01:25:33 We bring it, and we try to shape things sometimes wrongly.

01:25:36 And there’s many times that I’ve been wrong in the past.

01:25:39 That’s great.

01:25:40 But it would be wrong to dismiss any sort of the research

01:25:47 styles that I’m observing.

01:25:49 And I often get asked, well, you’re in industry, right?

01:25:52 So we do have access to large compute scale and so on.

01:25:55 So there are certain kinds of research

01:25:57 I almost feel like we need to do responsibly and so on.

01:26:01 But it is like we have the particle accelerator here,

01:26:05 so to speak, as in physics.

01:26:06 So we need to use it.

01:26:07 We need to answer the questions that we

01:26:09 should be answering right now for the scientific progress.

01:26:12 But then at the same time, I look at many advances,

01:26:15 including attention, which was discovered in Montreal

01:26:19 initially because of lack of compute, right?

01:26:22 So we were working on sequence to sequence

01:26:24 with my friends over at Google Brain at the time.

01:26:27 And we were using, I think, eight GPUs,

01:26:30 which was somehow a lot at the time.

01:26:32 And then I think Montreal was a bit more limited in the scale.

01:26:36 But then they discovered this content based attention

01:26:38 concept that then has obviously triggered things

01:26:42 like Transformer.

01:26:43 Not everything obviously starts with the Transformer.

01:26:46 There’s always a history that is important to recognize

01:26:49 because then you can make sure that then those who might feel

01:26:53 now, well, we don’t have so much compute,

01:26:56 you need to then help them optimize

01:27:00 that kind of research that might actually

01:27:02 produce amazing change.

01:27:04 Perhaps it’s not as short term as some of these advancements

01:27:07 or perhaps it’s a different time scale.

01:27:09 But the people and the diversity of the field

01:27:13 is quite critical that we maintain it.

01:27:15 And at times, especially mixed a bit with hype or other things,

01:27:19 it’s a bit tricky to be observing maybe

01:27:23 too much of the same thinking across the board.

01:27:27 But the humans definitely are critical.

01:27:30 And I can think of quite a few personal examples

01:27:33 where also someone told me something

01:27:36 that had a huge effect onto some idea.

01:27:40 And then that’s why I’m saying at least in terms of years,

01:27:43 probably some things do happen.

01:27:44 Yeah, it’s fascinating.

01:27:46 And it’s also fascinating how constraints somehow

01:27:48 are essential for innovation.

01:27:51 And the other thing you mentioned about engineering,

01:27:53 I have a sneaking suspicion.

01:27:54 Maybe I over, my love is with engineering.

01:28:00 So I have a sneaky suspicion that all the genius,

01:28:04 a large percentage of the genius is

01:28:06 in the tiny details of engineering.

01:28:09 So I think we like to think our genius,

01:28:14 the genius is in the big ideas.

01:28:17 I have a sneaking suspicion that because I’ve

01:28:20 seen the genius of details, of engineering details,

01:28:24 make the night and day difference.

01:28:28 And I wonder if those kind of have a ripple effect over time.

01:28:32 So that too, so that’s sort of taking the engineering

01:28:36 perspective that sometimes that quiet innovation

01:28:39 at the level of an individual engineer

01:28:41 or maybe at the small scale of a few engineers

01:28:44 can make all the difference.

01:28:46 Because we’re working on computers that

01:28:50 are scaled across large groups, that one engineering decision

01:28:55 can lead to ripple effects.

01:28:57 It’s interesting to think about.

01:28:59 Yeah, I mean, engineering, there’s

01:29:01 also kind of a historical, it might be a bit random.

01:29:06 Because if you think of the history of how especially

01:29:10 deep learning and neural networks took off,

01:29:12 feels like a bit random because GPUs happened

01:29:16 to be there at the right time for a different purpose, which

01:29:19 was to play video games.

01:29:20 So even the engineering that goes into the hardware

01:29:24 and it might have a time, the time frame

01:29:27 might be very different.

01:29:28 I mean, the GPUs were evolved throughout many years

01:29:31 where we didn’t even were looking at that.

01:29:33 So even at that level, that revolution, so to speak,

01:29:38 the ripples are like, we’ll see when they stop.

01:29:42 But in terms of thinking of why is this happening,

01:29:46 I think that when I try to categorize it

01:29:49 in sort of things that might not be so obvious,

01:29:52 I mean, clearly there’s a hardware revolution.

01:29:54 We are surfing thanks to that.

01:29:58 Data centers as well.

01:29:59 I mean, data centers are like, I mean, at Google,

01:30:02 for instance, obviously they’re serving Google.

01:30:04 But there’s also now thanks to that

01:30:06 and to have built such amazing data centers,

01:30:09 we can train these models.

01:30:11 Software is an important one.

01:30:13 I think if I look at the state of how

01:30:16 I had to implement things to implement my ideas,

01:30:20 how I discarded ideas because they were too hard

01:30:22 to implement.

01:30:23 Yeah, clearly the times have changed.

01:30:25 And thankfully, we are in a much better software position

01:30:28 as well.

01:30:29 And then, I mean, obviously there’s

01:30:31 research that happens at scale and more people

01:30:34 enter the field.

01:30:35 That’s great to see.

01:30:35 But it’s almost enabled by these other things.

01:30:38 And last but not least is also data, right?

01:30:40 Curating data sets, labeling data sets,

01:30:43 these benchmarks we think about.

01:30:44 Maybe we’ll want to have all the benchmarks in one system.

01:30:48 But it’s still very valuable that someone

01:30:51 put the thought and the time and the vision

01:30:53 to build certain benchmarks.

01:30:54 We’ve seen progress thanks to.

01:30:56 But we’re going to repurpose the benchmarks.

01:30:59 That’s the beauty of Atari is like we solved it in a way.

01:31:04 But we use it in Gato.

01:31:06 It was critical.

01:31:06 And I’m sure there’s still a lot more

01:31:09 to do thanks to that amazing benchmark

01:31:10 that someone took the time to put,

01:31:13 even though at the time maybe, oh, you

01:31:15 have to think what’s the next iteration of architectures.

01:31:19 That’s what maybe the field recognizes.

01:31:21 But that’s another thing we need to balance

01:31:24 in terms of humans behind.

01:31:25 We need to recognize all these aspects

01:31:27 because they’re all critical.

01:31:29 And we tend to think of the genius, the scientist,

01:31:33 and so on.

01:31:34 But I’m glad I know you have a strong engineering background.

01:31:38 But also, I’m a lover of data.

01:31:40 And the pushback on the engineering comment

01:31:43 ultimately could be that the creators of benchmarks

01:31:46 are the ones who have the most impact.

01:31:47 Andrej Karpathy, who you mentioned,

01:31:49 has recently been talking a lot of trash about ImageNet, which

01:31:52 he has the right to do because of how critical he was to

01:31:54 ImageNet, how essential he was to the development

01:31:57 and the success of deep learning around ImageNet.

01:32:01 And he’s saying that that’s actually

01:32:02 that benchmark is holding back the field.

01:32:05 Because I mean, especially in his context on Tesla Autopilot,

01:32:09 that’s looking at real world behavior of a system.

01:32:14 There’s something fundamentally missing

01:32:16 about ImageNet that doesn’t capture

01:32:17 the real worldness of things.

01:32:20 That we need to have data sets, benchmarks that

01:32:23 have the unpredictability, the edge cases, whatever

01:32:27 the heck it is that makes the real world so

01:32:30 difficult to operate in.

01:32:32 We need to have benchmarks of that.

01:32:34 But just to think about the impact of ImageNet

01:32:37 as a benchmark, and that really puts a lot of emphasis

01:32:42 on the importance of a benchmark,

01:32:43 both sort of internally at DeepMind and as a community.

01:32:46 So one is coming in from within, like,

01:32:50 how do I create a benchmark for me to mark and make progress?

01:32:55 And how do I make a benchmark for the community

01:32:58 to mark and push progress?

01:33:02 You have this amazing paper you coauthored,

01:33:05 a survey paper called Emergent Abilities

01:33:08 of Large Language Models.

01:33:10 It has, again, the philosophy here

01:33:12 that I’d love to ask you about.

01:33:14 What’s the intuition about the phenomena of emergence

01:33:17 in neural networks transformed as language models?

01:33:20 Is there a magic threshold beyond which

01:33:24 we start to see certain performance?

01:33:27 And is that different from task to task?

01:33:29 Is that us humans just being poetic and romantic?

01:33:32 Or is there literally some level at which we start

01:33:36 to see breakthrough performance?

01:33:38 Yeah, I mean, this is a property that we start seeing in systems

01:33:43 that actually tend to be so in machine learning,

01:33:48 traditionally, again, going to benchmarks.

01:33:51 I mean, if you have some input, output, right,

01:33:54 like that is just a single input and a single output,

01:33:58 you generally, when you train these systems,

01:34:01 you see reasonably smooth curves when

01:34:04 you analyze how much the data set size affects

01:34:10 the performance, or how the model size affects

01:34:12 the performance, or how long you train the system for affects

01:34:18 the performance, right?

01:34:19 So if we think of ImageNet, the training curves

01:34:23 look fairly smooth and predictable in a way.

01:34:28 And I would say that’s probably because it’s

01:34:31 kind of a one hop reasoning task, right?

01:34:36 It’s like, here is an input, and you

01:34:39 think for a few milliseconds or 100 milliseconds, 300

01:34:42 as a human, and then you tell me,

01:34:44 yeah, there’s an alpaca in this image.

01:34:47 So in language, we are seeing benchmarks that require more

01:34:55 pondering and more thought in a way, right?

01:34:58 This is just kind of you need to look for some subtleties.

01:35:02 It involves inputs that you might think of,

01:35:05 even if the input is a sentence describing

01:35:08 a mathematical problem, there is a bit more processing

01:35:13 required as a human and more introspection.

01:35:15 So I think how these benchmarks work

01:35:20 means that there is actually a threshold.

01:35:24 Just going back to how transformers

01:35:26 work in this way of querying for the right questions

01:35:29 to get the right answers, that might

01:35:31 mean that performance becomes random

01:35:35 until the right question is asked

01:35:37 by the querying system of a transformer or of a language

01:35:40 model like a transformer.

01:35:42 And then only then you might start

01:35:46 seeing performance going from random to nonrandom.

01:35:50 And this is more empirical.

01:35:53 There’s no formalism or theory behind this yet,

01:35:56 although it might be quite important.

01:35:57 But we are seeing these phase transitions

01:36:00 of random performance until some,

01:36:03 let’s say, scale of a model.

01:36:04 And then it goes beyond that.

01:36:06 And it might be that you need to fit

01:36:10 a few low order bits of thought before you can make progress

01:36:16 on the whole task.

01:36:17 And if you could measure, actually,

01:36:19 those breakdown of the task, maybe you

01:36:22 would see more smooth, like, yeah,

01:36:25 once you get these and these and these and these and these,

01:36:27 then you start making progress in the task.
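
A toy illustration of that point: if a benchmark scores a multi-step answer all-or-nothing, smooth per-step progress can look like a sudden jump, while a partial-credit breakdown of the same task looks smooth. The numbers below are synthetic and purely illustrative.

```python
# Synthetic numbers only: smooth per-step skill vs. all-or-nothing task accuracy.
steps = 10
for per_step_accuracy in [0.5, 0.7, 0.9, 0.97, 0.99]:    # smoothly improving skill
    exact_match = per_step_accuracy ** steps              # all 10 steps must be right
    print(f"per-step {per_step_accuracy:.2f} -> whole-task {exact_match:.3f}")
# per-step 0.50 -> whole-task 0.001
# per-step 0.99 -> whole-task 0.904  (looks like a sudden phase transition)
```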

01:36:30 But it’s somehow a bit annoying because then it

01:36:35 means that certain questions we might ask about architectures

01:36:40 possibly can only be done at a certain scale.

01:36:42 And one thing that, conversely, I’ve

01:36:46 seen great progress on in the last couple of years

01:36:49 is this notion of science of deep learning and science

01:36:53 of scale in particular.

01:36:55 So on the negative is that there are

01:36:57 some benchmarks for which progress might

01:37:01 need to be measured at minimum at a certain scale

01:37:04 until you see then what details of the model

01:37:07 matter to make that performance better.

01:37:09 So that’s a bit of a con.

01:37:11 But what we’ve also seen is that you can empirically

01:37:17 analyze behavior of models at scales that are smaller.

01:37:22 So let’s say, to put an example, we

01:37:25 had this Chinchilla paper that revised the so called scaling

01:37:30 laws of models.

01:37:31 And that whole study is done at a reasonably small scale,

01:37:35 of maybe hundreds of millions up to 1 billion parameters.

01:37:38 And then the cool thing is that you create some laws,

01:37:41 some laws, some trends, you extract trends from data

01:37:45 that you see, OK, it looks like the amount of data required

01:37:49 to train now a 10x larger model would be this.

01:37:52 And these laws so far, these extrapolations

01:37:55 have helped us save compute and just get to a better place

01:37:59 in terms of the science of how should we

01:38:02 run these models at scale, how much data, how much depth,

01:38:05 and all sorts of questions we start

01:38:07 asking extrapolating from a small scale.
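
A hedged sketch of what such an empirical extrapolation can look like: fit a simple power law to losses measured at small scales and extrapolate to a larger model. The numbers and the exact functional form here are invented, not the Chinchilla fit.

```python
# Invented measurements at small scales; fit loss ≈ a * N^(-b) and extrapolate.
import numpy as np

sizes = np.array([10., 30., 100., 300., 1000.])      # model size, millions of params
losses = np.array([3.80, 3.30, 2.90, 2.55, 2.30])    # made-up final losses

slope, intercept = np.polyfit(np.log(sizes), np.log(losses), 1)
a, b = np.exp(intercept), -slope                     # power-law coefficients

predicted = a * 10_000 ** (-b)                       # extrapolate to a 10x larger model
print(f"fitted loss ≈ {a:.2f} * N^-{b:.3f}; predicted at 10B params: {predicted:.2f}")
```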

01:38:10 But then this emergence, sadly, means that not everything

01:38:13 can be extrapolated from scale, depending on the benchmark.

01:38:16 And maybe the harder benchmarks are not

01:38:19 so good for extracting these laws.

01:38:21 But we have a variety of benchmarks at least.

01:38:24 So I wonder to which degree the threshold, the phase shift

01:38:29 scale is a function of the benchmark.

01:38:32 So some of the science of scale might

01:38:35 be engineering benchmarks where that threshold is low,

01:38:40 sort of taking a main benchmark and reducing it somehow

01:38:46 where the essential difficulty is left

01:38:48 but the scale at which the emergence happens

01:38:51 is lower just for the science aspect of it

01:38:54 versus the actual real world aspect.

01:38:56 Yeah, so luckily we have quite a few benchmarks, some of which

01:38:59 are simpler or maybe they’re more like I think people might

01:39:02 call this system one versus system two style.

01:39:05 So I think what we're now seeing, luckily,

01:39:09 is that extrapolations from maybe slightly more smooth

01:39:14 or simpler benchmarks are translating to the harder ones.

01:39:18 But that is not to say that this extrapolation won't

01:39:21 hit its limits.

01:39:22 And when it does, then how much we scale or how we scale

01:39:27 will sadly be a bit suboptimal until we find better laws.

01:39:31 And these laws, again, are very empirical laws.

01:39:33 They’re not like physical laws of models,

01:39:35 although I wish there would be better theory about these

01:39:39 things as well.

01:39:40 But so far, I would say empirical theory,

01:39:43 as I call it, is way ahead of actual theory

01:39:46 of machine learning.

01:39:47 Let me ask you almost for fun.

01:39:50 So this is not, Oriol, as a DeepMind person or anything

01:39:55 to do with DeepMind or Google, just as a human being,

01:39:59 looking at these news of a Google engineer who claimed

01:40:04 that, I guess, the LaMDA language model was sentient.

01:40:11 And you still need to look into the details of this.

01:40:14 But making an official report and the claim

01:40:19 that he believes there’s evidence that this system has

01:40:23 achieved sentience.

01:40:25 And I think this is a really interesting case

01:40:29 on a human level, on a psychological level,

01:40:31 on a technical machine learning level of how language models

01:40:37 transform our world, and also just philosophical level

01:40:39 of the role of AI systems in a human world.

01:40:44 So what do you find interesting?

01:40:48 What’s your take on all of this as a machine learning

01:40:51 engineer and a researcher and also as a human being?

01:40:54 Yeah, I mean, a few reactions.

01:40:57 Quite a few, actually.

01:40:58 Have you ever briefly thought, is this thing sentient?

01:41:02 Right, so never, absolutely never.

01:41:04 Like even with AlphaStar?

01:41:06 Wait a minute.

01:41:08 Sadly, though, I think, yeah, sadly, I have not.

01:41:11 Yeah, I think the current, any of the current models,

01:41:15 although very useful and very good,

01:41:18 yeah, I think we’re quite far from that.

01:41:22 And there’s kind of a converse side story.

01:41:25 So one of my passions is about science in general.

01:41:30 And I think I feel I’m a bit of a failed scientist.

01:41:34 That’s why I came to machine learning,

01:41:36 because you always feel, and you start seeing this,

01:41:40 that machine learning is maybe the science that

01:41:43 can help other sciences, as we’ve seen.

01:41:46 It’s such a powerful tool.

01:41:48 So thanks to that angle, that, OK, I love science.

01:41:52 I love, I mean, I love astronomy.

01:41:53 I love biology.

01:41:54 But I’m not an expert.

01:41:56 And I decided, well, the thing I can do better

01:41:58 at is computers.

01:41:59 But having, especially with when I was a bit more involved

01:42:04 in AlphaFold, learning a bit about proteins

01:42:07 and about biology and about life,

01:42:11 the complexity, it feels like it really is.

01:42:14 I mean, if you start looking at the things that are going on

01:42:19 at the atomic level, and also, I mean, there’s obviously the,

01:42:26 we are maybe inclined to try to think of neural networks

01:42:29 as like the brain.

01:42:30 But the complexities and the amount of magic

01:42:33 that it feels when, I mean, I’m not an expert,

01:42:37 so it naturally feels more magic.

01:42:38 But looking at biological systems,

01:42:40 as opposed to these computational brains,

01:42:46 just makes me like, wow, there’s such a level of complexity

01:42:50 difference still, like orders of magnitude complexity that,

01:42:54 sure, these weights, I mean, we train them

01:42:56 and they do nice things.

01:42:58 But they’re not at the level of biological entities, brains,

01:43:04 cells.

01:43:06 It just feels like it’s just not possible to achieve

01:43:09 the same level of complexity behavior.

01:43:12 And my belief, when I talk to other beings,

01:43:16 is certainly shaped by this amazement of biology

01:43:20 that, maybe because I know too much,

01:43:22 I don’t have about machine learning,

01:43:23 but I certainly feel it’s very far fetched and far

01:43:28 in the future to be calling or to be thinking,

01:43:31 well, this mathematical function that is differentiable

01:43:35 is, in fact, sentient and so on.

01:43:39 There’s something on that point that is very interesting.

01:43:42 So you know enough about machines and enough

01:43:46 about biology to know that there’s

01:43:47 many orders of magnitude of difference and complexity.

01:43:51 But you know how machine learning works.

01:43:56 So the interesting question for human beings

01:43:58 that are interacting with a system that don’t know

01:44:00 about the underlying complexity.

01:44:02 And I’ve seen people, probably including myself,

01:44:05 that have fallen in love with things that are quite simple.

01:44:08 And so maybe the complexity is one part of the picture,

01:44:11 but maybe that’s not a necessary condition for sentience,

01:44:18 for perception or emulation of sentience.

01:44:24 Right.

01:44:25 So I mean, I guess the other side of this

01:44:27 is that’s how I feel personally.

01:44:29 I mean, you asked me about the person, right?

01:44:32 Now, it’s very interesting to see how other humans feel

01:44:35 about things, right?

01:44:37 We are, again, I’m not as amazed about things

01:44:41 that I feel this is not as magical as this other thing

01:44:44 because of maybe how I got to learn about it

01:44:48 and how I see the curve a bit more smooth

01:44:50 because I’ve just seen the progress of language models

01:44:54 since Shannon in the 50s.

01:44:56 And actually looking at that time scale,

01:44:58 we’re not that fast progress, right?

01:45:00 I mean, what we were thinking at the time almost 100 years ago

01:45:06 is not that dissimilar to what we’re doing now.

01:45:08 But at the same time, yeah, obviously others,

01:45:11 my experience, the personal experience,

01:45:14 I think no one should tell others how they should feel.

01:45:20 I mean, the feelings are very personal, right?

01:45:22 So how others might feel about the models and so on.

01:45:26 That’s one part of the story that

01:45:27 is important to understand for me personally as a researcher.

01:45:31 And then when I maybe disagree or I

01:45:35 don’t understand or see that, yeah, maybe this is not

01:45:38 something I think right now is reasonable,

01:45:39 knowing all that I know, one of the other things

01:45:42 and perhaps partly why it’s great to be talking to you

01:45:46 and reaching out to the world about machine learning

01:45:49 is, hey, let's demystify the magic a bit

01:45:53 and try to see a bit more of the math

01:45:56 and the fact that literally to create these models,

01:45:59 if we had the right software, it would be 10 lines of code

01:46:03 and then just a dump of the internet.

01:46:06 Versus then the complexity of the creation of humans

01:46:11 from their inception, right?

01:46:13 And also the complexity of evolution of the whole universe

01:46:17 to where we are that feels orders of magnitude

01:46:21 more complex and fascinating to me.
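
To make the "few lines of code plus a dump of the internet" point concrete, here is a minimal, hedged sketch of what such a training loop can look like in PyTorch. The file name, model size, and hyperparameters are illustrative assumptions, and a small byte-level LSTM stands in for the large Transformers actually used; this is a sketch of the idea, not anyone's production code.

```python
import torch
import torch.nn as nn

# Hypothetical file standing in for "a dump of the internet"; bytes become token ids 0..255.
text = open("dump_of_the_internet.txt", "rb").read()
data = torch.tensor(list(text), dtype=torch.long)

class ByteLM(nn.Module):
    """A tiny byte-level language model; real systems use large Transformers instead."""
    def __init__(self, d=256):
        super().__init__()
        self.emb = nn.Embedding(256, d)
        self.rnn = nn.LSTM(d, d, num_layers=2, batch_first=True)  # causal by construction
        self.out = nn.Linear(d, 256)
    def forward(self, x):
        h, _ = self.rnn(self.emb(x))
        return self.out(h)

model = ByteLM()
opt = torch.optim.Adam(model.parameters(), lr=3e-4)

for step in range(10_000):
    idx = torch.randint(0, len(data) - 129, (32,))                # 32 random windows of 129 bytes
    batch = torch.stack([data[int(j):int(j) + 129] for j in idx])
    logits = model(batch[:, :-1])                                 # predict byte t+1 from bytes <= t
    loss = nn.functional.cross_entropy(logits.reshape(-1, 256), batch[:, 1:].reshape(-1))
    opt.zero_grad(); loss.backward(); opt.step()
    if step % 1000 == 0:
        print(step, loss.item())
```

The point of the sketch is only that the recipe itself is short; almost all of the effort in practice goes into the data, the scale, and the engineering around this loop.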

01:46:23 So I think, yeah, maybe the main thing

01:46:26 I'm trying to tell you is, yeah, I think

01:46:30 it's worth explaining a bit of the magic.

01:46:32 There is a bit of magic.

01:46:33 It’s good to be in love, obviously,

01:46:35 with what you do at work.

01:46:36 And I’m certainly fascinated and surprised quite often as well.

01:46:41 But I think, hopefully, experts in biology

01:46:45 will tell me this is not as magical.

01:46:47 And I’m happy to learn that through interactions

01:46:50 with the larger community, we can also

01:46:54 have a certain level of education

01:46:56 that in practice will also matter because, I mean,

01:46:58 one question is how you feel about this.

01:47:00 But then the other, very important one, is that

01:47:03 you're starting to interact with these in products and so on.

01:47:06 It’s good to understand a bit what’s going on,

01:47:09 what’s not going on, what’s safe, what’s not safe,

01:47:12 and so on, right?

01:47:13 Otherwise, the technology will not

01:47:15 be used properly for good, which is obviously

01:47:18 the goal of all of us, I hope.

01:47:20 So let me then ask the next question.

01:47:22 Do you think in order to solve intelligence

01:47:25 or to replace the Lex bot that does interviews

01:47:29 as we started this conversation with,

01:47:31 do you think the system needs to be sentient?

01:47:34 Do you think it needs to achieve something like consciousness?

01:47:38 And do you think about what consciousness

01:47:41 is in the human mind that could be instructive for creating AI

01:47:45 systems?

01:47:46 Yeah.

01:47:47 Honestly, I think probably not, to get to that degree of intelligence

01:47:53 where there's this brain that can learn,

01:47:58 can be extremely useful, can challenge you, can teach you.

01:48:02 Conversely, you can teach it to do things.

01:48:05 I’m not sure it’s necessary, personally speaking.

01:48:09 But if consciousness or any other biological or evolutionary

01:48:15 lesson can be repurposed to then influence

01:48:20 our next set of algorithms, that is a great way

01:48:24 to actually make progress, right?

01:48:25 And the same way I try to explain transformers a bit

01:48:28 how it feels we operate when we look at text specifically,

01:48:33 these insights are very important, right?

01:48:36 So there’s a distinction between details of how the brain might

01:48:41 be doing computation.

01:48:43 I think my understanding is, sure, there’s neurons

01:48:46 and there’s some resemblance to neural networks,

01:48:48 but we don’t quite understand enough of the brain in detail,

01:48:52 right, to be able to replicate it.

01:48:55 But then if you zoom out a bit, our thought process,

01:49:01 how memory works, maybe even how evolution got us here,

01:49:05 what’s exploration, exploitation,

01:49:07 like how these things happen, I think

01:49:09 these clearly can inform algorithmic level research.

01:49:12 And I’ve seen some examples of this

01:49:17 being quite useful to then guide the research,

01:49:19 even if it might be for the wrong reasons, right?

01:49:21 So I think biology and what we know about ourselves

01:49:26 can help a whole lot to build, essentially,

01:49:30 what we call AGI, this general intelligence, the real thing, right?

01:49:34 The last step of the chain, hopefully.

01:49:36 But consciousness in particular, I don’t myself

01:49:40 at least think too hard about how to add that to the system.

01:49:44 But maybe my understanding is also very personal

01:49:47 about what it means, right?

01:49:48 I think even that in itself is a long debate

01:49:51 that I know people have often.

01:49:55 And maybe I should learn more about this.

01:49:57 Yeah, and I personally, I notice the magic often

01:50:01 on a personal level, especially with physical systems

01:50:04 like robots.

01:50:06 I have a lot of legged robots now in Austin

01:50:10 that I play with.

01:50:11 And even when you program them, when

01:50:13 they do things you didn’t expect,

01:50:15 there’s an immediate anthropomorphization.

01:50:18 And you notice the magic, and you

01:50:19 start to think about things like sentience

01:50:22 that has to do more with effective communication

01:50:26 and less with any of these kind of dramatic things.

01:50:30 It seems like a useful part of communication.

01:50:32 Having the perception of consciousness

01:50:36 seems like useful for us humans.

01:50:38 We treat each other more seriously.

01:50:40 We are able to do a nearest neighbor shoving of that entity

01:50:46 into your memory correctly, all that kind of stuff.

01:50:48 It seems useful, at least to fake it,

01:50:50 even if you never make it.

01:50:52 So maybe, like, yeah, mirroring the question.

01:50:55 And since you talked to a few people,

01:50:57 then you do think that we’ll need

01:50:59 to figure something out in order to achieve intelligence

01:51:04 in a grander sense of the word?

01:51:06 Yeah, I personally believe yes, but I don’t even

01:51:09 think it’ll be like a separate island we’ll have to travel to.

01:51:14 I think it will emerge quite naturally.

01:51:16 OK, that’s easier for us then.

01:51:19 Thank you.

01:51:20 But the reason I think it’s important to think about

01:51:22 is you will start, I believe, like with this Google

01:51:25 engineer, you will start seeing this a lot more, especially

01:51:29 when you have AI systems that are actually interacting

01:51:31 with human beings that don’t have an engineering background.

01:51:35 And we have to prepare for that.

01:51:38 Because I do believe there will be a civil rights

01:51:41 movement for robots, as silly as it is to say.

01:51:44 There’s going to be a large number of people

01:51:46 that realize there’s these intelligent entities with whom

01:51:49 I have a deep relationship, and I don’t want to lose them.

01:51:53 They’ve come to be a part of my life, and they mean a lot.

01:51:55 They have a name.

01:51:57 They have a story.

01:51:58 They have a memory.

01:51:59 And we start to ask questions about ourselves.

01:52:01 Well, this thing sure seems like it’s capable of suffering,

01:52:07 because it tells all these stories of suffering.

01:52:09 It doesn’t want to die and all those kinds of things.

01:52:11 And we have to start to ask ourselves questions.

01:52:14 What is the difference between a human being and this thing?

01:52:16 And so when you engineer, I believe

01:52:20 from an engineering perspective, from a deep mind or anybody

01:52:23 that builds systems, there might be laws in the future

01:52:26 where you’re not allowed to engineer systems

01:52:29 with displays of sentience, unless they’re explicitly

01:52:35 designed to be that, unless it’s a pet.

01:52:37 So if you have a system that’s just doing customer support,

01:52:41 you’re legally not allowed to display sentience.

01:52:44 We’ll start to ask ourselves that question.

01:52:47 And then so that’s going to be part of the software

01:52:49 engineering process.

01:52:52 Which features do we have?

01:52:53 And one of them is the communication of sentience.

01:52:56 But it’s important to start thinking about that stuff,

01:52:58 especially how much it captivates public attention.

01:53:01 Yeah, absolutely.

01:53:03 It’s definitely a topic that is important.

01:53:06 We think about.

01:53:07 And I think, in a way, not every movie

01:53:12 is equally on point with certain things.

01:53:16 But certainly science fiction, in this sense

01:53:19 at least, has prepared society to start

01:53:22 thinking about certain topics. Even if it's

01:53:25 too early to talk about them, as long as we are reasonable,

01:53:29 it's certainly going to prepare us both for the research

01:53:33 to come and for how to handle it.

01:53:34 I mean, there’s many important challenges and topics

01:53:38 that come with building an intelligent system, many of

01:53:43 which you just mentioned.

01:53:44 So I think we’re never going to be fully ready

01:53:49 unless we talk about these.

01:53:51 And we also start, as I said, expanding the people

01:53:58 we talk to, to include not only our own researchers and so on.

01:54:03 And in fact, at places like DeepMind and elsewhere,

01:54:06 there are more interdisciplinary groups forming

01:54:10 to start asking and really working

01:54:12 with us on these questions.

01:54:14 Because obviously, this is not initially

01:54:17 what your passion is when you do your PhD,

01:54:19 but certainly it is coming.

01:54:21 So it’s fascinating.

01:54:23 It’s the thing that brings me to one of my passions

01:54:27 that is learning.

01:54:28 So in this sense, this is a new area

01:54:31 that, as a learning system myself,

01:54:35 I want to keep exploring.

01:54:36 And I think it’s great to see parts of the debate.

01:54:41 And even I’ve seen a level of maturity

01:54:43 in the conferences that deal with AI.

01:54:46 If you look five years ago to now,

01:54:49 just the amount of workshops and so on has changed so much.

01:54:53 It’s impressive to see how much topics of safety, ethics,

01:54:58 and so on come to the surface, which is great.

01:55:01 And if it were too early, clearly that's fine.

01:55:03 I mean, it's a big field, and there are

01:55:05 lots of people with lots of interests

01:55:09 who will make progress.

01:55:11 And obviously, I don’t believe we’re too late.

01:55:14 So in that sense, I think it’s great

01:55:16 that we’re doing this already.

01:55:18 It's better to be too early than too late

01:55:20 when it comes to super intelligent AI systems.

01:55:22 Let me ask, speaking of sentient AIs,

01:55:25 you gave props to your friend Ilya Sutskever

01:55:28 for being elected a Fellow of the Royal Society.

01:55:31 So just as a shout out to a fellow researcher

01:55:34 and a friend, what's the secret to the genius of Ilya

01:55:38 Sutskever?

01:55:39 And also, do you believe that his tweets,

01:55:42 as you’ve hypothesized and Andrej Karpathy did as well,

01:55:46 are generated by a language model?

01:55:48 Yeah.

01:55:49 So I strongly believe Ilya is going to visit in a few weeks,

01:55:54 actually.

01:55:54 So I’ll ask him in person.

01:55:58 Will he tell you the truth?

01:55:59 Yes, of course, hopefully.

01:56:00 I mean, ultimately, we all have shared paths,

01:56:04 and there’s friendships that go beyond, obviously,

01:56:08 institutions and so on.

01:56:09 So I hope he tells me the truth.

01:56:11 Well, maybe the AI system is holding him hostage somehow.

01:56:14 Maybe he has some videos that he doesn’t want to release.

01:56:16 So maybe it has taken control over him.

01:56:19 So he can’t tell the truth.

01:56:20 Well, if I see him in person, then I think he will know.

01:56:23 But I think Ilya’s personality, just knowing him for a while,

01:56:33 everyone on Twitter, I guess, gets a different persona.

01:56:36 And I think Ilya’s one does not surprise me.

01:56:40 So I think knowing Ilya from before social media

01:56:43 and before AI was so prevalent, I

01:56:46 recognize a lot of his character.

01:56:47 So that’s something for me that I

01:56:49 feel good about a friend that hasn’t changed

01:56:52 or is still true to himself.

01:56:55 Obviously, there is, though, a fact

01:56:58 that your field becomes more popular,

01:57:02 and he is obviously one of the main figures in the field,

01:57:05 having done a lot of advancement.

01:57:07 So I think that the tricky bit here

01:57:09 is how to balance your true self with the responsibility

01:57:12 that your words carry.

01:57:13 So in this sense, I appreciate the style, and I understand it.

01:57:19 But it created debates on some of his tweets

01:57:24 that maybe it’s good we have them early anyways.

01:57:27 But yeah, then the reactions are usually polarizing.

01:57:31 I think we’re just seeing the reality of social media

01:57:34 be there as well, reflected on that particular topic

01:57:38 or set of topics he’s tweeting about.

01:57:40 Yeah, I mean, it's funny. But to speak to this tension:

01:57:42 he was one of the early seminal figures

01:57:46 in the field of deep learning, so there's

01:57:47 a responsibility with that.

01:57:48 But he’s also, from having interacted with him quite a bit,

01:57:53 he’s just a brilliant thinker about ideas, which, as are you.

01:58:01 And there’s a tension between becoming

01:58:03 the manager versus the actual thinking

01:58:06 through very novel ideas, the scientist versus the manager.

01:58:13 And he’s one of the great scientists of our time.

01:58:17 So this was quite interesting.

01:58:18 And also, people tell me he's quite silly,

01:58:20 which I haven’t quite detected yet.

01:58:23 But in private, we’ll have to see about that.

01:58:26 Yeah, yeah.

01:58:27 I mean, just on the point of, I mean,

01:58:30 Ilya has been an inspiration.

01:58:33 I mean, quite a few colleagues, I can think of,

01:58:35 shaped the person I am.

01:58:38 And Ilya certainly gets probably the top spot,

01:58:42 or at least somewhere close to it.

01:58:43 And if we go back to the question about people in the field,

01:58:47 like how their role would have changed the field or not,

01:58:51 I think Ilya’s case is interesting

01:58:54 because he really has a deep belief in the scaling up

01:58:58 of neural networks.

01:58:59 There was a talk that is still famous to this day

01:59:03 from the Sequence to Sequence paper, where he was just

01:59:07 claiming, just give me supervised data

01:59:10 and a large neural network, and then you’ll

01:59:12 solve basically all the problems.

01:59:16 That vision was already there many years ago.

01:59:19 So it’s good to see someone who is, in this case,

01:59:22 very deeply into this style of research

01:59:27 and clearly has had a tremendous track record of successes

01:59:32 and so on.

01:59:34 The funny bit about that talk is that we rehearsed the talk

01:59:37 in a hotel room before, and the original version of that talk

01:59:42 would have been even more controversial.

01:59:44 So maybe I’m the only person that

01:59:46 has seen the unfiltered version of the talk.

01:59:49 And maybe when the time comes, maybe we

01:59:52 should revisit some of the skip slides

01:59:55 from the talk from Ilya.

01:59:57 But I really think the deep belief

02:00:01 in a certain style of research

02:00:03 pays off. It's good to be practical sometimes,

02:00:06 and I actually think Ilya and myself are practical,

02:00:09 but it's also good to have

02:00:10 some sort of long-term belief and trajectory.

02:00:14 Obviously, there's a bit of luck involved,

02:00:16 but it might be that that's the right path.

02:00:18 Then you are clearly ahead and hugely influential in the field,

02:00:22 as he has been.

02:00:23 Do you agree with that intuition that maybe

02:00:26 was written about by Rich Sutton in The Bitter Lesson,

02:00:33 that the biggest lesson that can be read from 70 years of AI

02:00:36 research is that general methods that leverage computation

02:00:40 are ultimately the most effective?

02:00:42 Do you think that intuition is ultimately correct?

02:00:48 General methods that leverage computation,

02:00:52 allowing the scaling of computation

02:00:54 to do a lot of the work.

02:00:56 And so the basic task of us humans

02:00:59 is to design methods that are more

02:01:01 and more general versus more and more specific to the tasks

02:01:05 at hand.

02:01:07 I certainly think this essentially mimics

02:01:10 a bit of the deep learning research,

02:01:14 almost like philosophy, that on the one hand,

02:01:18 we want to be data agnostic.

02:01:20 We don’t want to preprocess data sets.

02:01:22 We want to see the bytes, the true data as it is,

02:01:25 and then learn everything on top.

02:01:27 So very much agree with that.

02:01:30 And I think scaling up feels, at the very least, again,

02:01:33 necessary for building incredible complex systems.

02:01:38 It’s possibly not sufficient, barring that we

02:01:42 need a couple of breakthroughs.

02:01:45 I think Rich Sutton mentioned search

02:01:47 being part of the equation of scale and search.

02:01:52 I think search, I’ve seen it, that’s

02:01:55 been more mixed in my experience.

02:01:57 So from that lesson in particular,

02:01:59 search is a bit more tricky because it

02:02:02 is very appealing to search in domains like Go,

02:02:05 where you have a clear reward function with which you can then

02:02:08 discard some search traces.

02:02:10 But then in some other tasks, it’s

02:02:13 not very clear how you would do that,

02:02:15 although one of our recent works, which actually

02:02:19 was mostly a continuation,

02:02:22 and even the team and the people involved were pretty much

02:02:25 intersecting with AlphaStar, was AlphaCode,

02:02:28 in which we actually saw the bitter lesson: how

02:02:31 the scale of the models and then a massive amount of search

02:02:34 yielded this kind of very interesting result

02:02:36 of reaching human-level performance in code competitions.

02:02:41 So I’ve seen examples of it being

02:02:43 literally mapped to search and scale.

02:02:46 I’m not so convinced about the search bit,

02:02:48 but certainly I’m convinced scale will be needed.
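
To make the "scale plus search" recipe concrete, here is a minimal sketch of the sample-and-filter idea behind systems like AlphaCode: draw a very large number of candidate programs from a language model and keep only those that pass the available tests. The `sample_program` callable, the test format, and the numbers are illustrative assumptions, not AlphaCode's actual implementation.

```python
import subprocess
import sys
from typing import Callable, List, Tuple

def solve_by_sampling(
    sample_program: Callable[[str], str],   # stand-in for a large pretrained code model
    problem_statement: str,
    tests: List[Tuple[str, str]],           # (stdin, expected stdout) pairs
    num_samples: int = 100_000,             # the "massive amount of search"
) -> List[str]:
    """Sample many candidate programs (scale) and keep those passing all tests (filtering)."""
    survivors = []
    for _ in range(num_samples):
        code = sample_program(problem_statement)
        if all(run_candidate(code, stdin) == expected for stdin, expected in tests):
            survivors.append(code)
    return survivors

def run_candidate(code: str, stdin: str) -> str:
    """Run a candidate program in a subprocess and capture stdout (proper sandboxing omitted)."""
    try:
        result = subprocess.run([sys.executable, "-c", code], input=stdin,
                                capture_output=True, text=True, timeout=2)
        return result.stdout.strip()
    except Exception:
        return ""
```

A real system would also deduplicate and cluster the surviving candidates before submitting only a handful of them, which is where the search budget pays off.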

02:02:51 So we need general methods.

02:02:52 We need to test them, and maybe we

02:02:54 need to make sure that we can scale them given the hardware

02:02:57 that we have in practice.

02:02:59 But then maybe we should also shape how the hardware looks

02:03:01 based on which methods might be needed to scale.

02:03:05 And that’s an interesting contrast of these GPU comments

02:03:11 that is we got it for free almost because games

02:03:14 were using these.

02:03:15 But maybe now if sparsity is required,

02:03:19 we don’t have the hardware.

02:03:20 Although in theory, many people are

02:03:22 building different kinds of hardware these days.

02:03:24 But there’s a bit of this notion of hardware lottery

02:03:27 for scale that might actually have an impact at least

02:03:31 on the scale of years on how fast we will make progress

02:03:35 to maybe a version of neural nets

02:03:37 or whatever comes next that might enable

02:03:41 truly intelligent agents.

02:03:44 Do you think in your lifetime we will build an AGI system that

02:03:50 would undeniably be a thing that achieves human level

02:03:55 intelligence and goes far beyond?

02:03:58 I definitely think it’s possible that it will go far beyond.

02:04:03 But I’m definitely convinced that it will

02:04:05 be human level intelligence.

02:04:08 And I’m hypothesizing about the beyond

02:04:11 because the beyond bit is a bit tricky to define,

02:04:16 especially when we look at the current formula of starting

02:04:21 from this imitation learning standpoint.

02:04:23 So we can certainly imitate humans at language and beyond.

02:04:30 So getting at human level through imitation

02:04:33 feels very possible.

02:04:34 Going beyond will require reinforcement learning

02:04:39 and other things.

02:04:39 And I think in some areas that certainly already has paid out.

02:04:43 I mean, Go being an example that’s

02:04:46 my favorite so far in terms of going

02:04:48 beyond human capabilities.

02:04:50 But in general, I'm not sure we can define reward functions

02:04:55 that, from a seed of imitating human-level

02:04:59 intelligence, are general and then go beyond it.

02:05:02 That bit is not so clear to me, within my lifetime.

02:05:05 But certainly, human level, yes.

02:05:08 And I mean, that in itself is already quite powerful,

02:05:11 I think.

02:05:11 So going beyond, I think, obviously,

02:05:14 we're not going to not try that, if then we

02:05:17 get to superhuman scientists and discovery

02:05:20 and advancing the world.

02:05:22 But at least human level in general

02:05:25 is also very, very powerful.
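
As a rough sketch of that formula, imitate humans first and then try to go beyond with reinforcement learning, here is an illustrative training skeleton. The model interface, the reward function, and the plain REINFORCE-style update are all assumptions for illustration, not any particular DeepMind system.

```python
import torch

def imitation_then_rl(model, demos, reward_fn, optimizer, rl_steps=1000):
    """Phase 1: supervised imitation of human demonstrations.
    Phase 2: a plain REINFORCE-style update on a reward, to try to go beyond the imitation seed.
    `model.loss(x, y)` and `model.sample(x)` are illustrative interfaces, not a real API."""
    # Phase 1: imitation learning on (input, human output) pairs.
    for x, y in demos:
        loss = model.loss(x, y)                      # e.g. cross-entropy on the human tokens
        optimizer.zero_grad(); loss.backward(); optimizer.step()

    # Phase 2: reinforcement learning; the hard part is defining a general reward_fn.
    for _ in range(rl_steps):
        x, _ = demos[torch.randint(len(demos), (1,)).item()]
        output, log_prob = model.sample(x)           # sample an output and its log-probability
        reward = reward_fn(x, output)
        loss = -(reward * log_prob)                  # REINFORCE: reinforce rewarded outputs
        optimizer.zero_grad(); loss.backward(); optimizer.step()
```

The sketch makes the asymmetry in the conversation visible: phase 1 only needs human data, while phase 2 needs a reward function, which is easy to write for Go and much less clear for open-ended tasks.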

02:05:27 Well, especially if human level or slightly beyond

02:05:31 is integrated deeply with human society

02:05:33 and there’s billions of agents like that,

02:05:36 do you think there’s a singularity moment beyond which

02:05:39 our world will be just very deeply transformed

02:05:44 by these kinds of systems?

02:05:45 Because now you're talking about intelligent systems

02:05:47 that are just, I mean, this is no longer just going

02:05:53 from horse and buggy to the car.

02:05:56 It feels like a very different kind of shift

02:05:59 in what it means to be a living entity on Earth.

02:06:03 Are you afraid?

02:06:04 Are you excited of this world?

02:06:06 I’m afraid if there’s a lot more.

02:06:09 So I think maybe we'll need to think about that if we truly

02:06:13 get there. Just thinking of limited resources,

02:06:18 humanity has clearly hit some limits,

02:06:21 and then there's some balance, hopefully,

02:06:23 that the planet is imposing biologically.

02:06:26 And we should actually try to get better at this.

02:06:28 As we know, there’s quite a few issues

02:06:31 with having too many people coexisting

02:06:35 in a resource limited way.

02:06:37 So for digital entities, it’s an interesting question.

02:06:40 I think such a limit maybe should exist.

02:06:43 But maybe it’s going to be imposed by energy availability

02:06:47 because this also consumes energy.

02:06:49 In fact, most systems are more inefficient

02:06:53 than we are in terms of energy required.

02:06:56 But definitely, I think as a society,

02:06:59 we’ll need to just work together to find

02:07:03 what would be reasonable in terms of growth

02:07:06 or how we coexist if that is to happen.

02:07:11 I am very excited about, obviously,

02:07:14 the aspects of automation that let people

02:07:17 who obviously don't have access to certain resources

02:07:20 or knowledge gain that access.

02:07:23 I think those are the applications in a way

02:07:26 that I’m most excited to see and to personally work towards.

02:07:30 Yeah, there’s going to be significant improvements

02:07:32 in productivity and the quality of life

02:07:34 across the whole population, which is very interesting.

02:07:36 But I’m looking even far beyond

02:07:39 us becoming a multiplanetary species.

02:07:42 And just as a quick bet, last question.

02:07:45 Do you think as humans become multiplanetary species,

02:07:49 go outside our solar system, all that kind of stuff,

02:07:52 do you think there will be more humans

02:07:54 or more robots in that future world?

02:07:57 So will humans be the quirky, intelligent beings of the past

02:08:04 or is there something deeply fundamental

02:08:07 to human intelligence that’s truly special,

02:08:09 where we will be part of those other planets,

02:08:12 not just AI systems?

02:08:13 I think we’re all excited to build AGI

02:08:18 to empower or make us more powerful as human species.

02:08:25 That's not to say there might not be some hybridization.

02:08:27 I mean, this is obviously speculation,

02:08:29 but there are companies also trying that,

02:08:32 the same way medicine is making us better.

02:08:35 Maybe there are other things that are yet to happen on that front.

02:08:39 But if the ratio is not at most one to one,

02:08:43 I would not be happy.

02:08:44 So I would hope that we are part of the equation.

02:08:49 Maybe a one-to-one ratio feels

02:08:53 possible, constructive and so on,

02:08:56 but it would not be good to have an imbalance,

02:08:59 at least from my core beliefs and why I'm doing

02:09:03 what I’m doing when I go to work and I research

02:09:05 what I research.

02:09:07 Well, this is how I know you’re human

02:09:09 and this is how you’ve passed the Turing test.

02:09:12 And you are one of the special humans, Oriol.

02:09:15 It’s a huge honor that you would talk with me

02:09:17 and I hope we get the chance to speak again,

02:09:19 maybe once before the singularity, once after

02:09:23 and see how our view of the world changes.

02:09:25 Thank you again for talking today.

02:09:26 Thank you for the amazing work you do.

02:09:28 You’re a shining example of a research

02:09:31 and a human being in this community.

02:09:32 Thanks a lot.

02:09:33 Likewise, yeah, looking forward to before the singularity

02:09:36 certainly and maybe after.

02:09:39 Thanks for listening to this conversation

02:09:41 with Oriol Vinyals.

02:09:43 To support this podcast, please check out our sponsors

02:09:45 in the description.

02:09:46 And now let me leave you with some words from Alan Turing.

02:09:51 Those who can imagine anything can create the impossible.

02:09:55 Thank you for listening and hope to see you next time.