Juergen Schmidhuber: Godel Machines, Meta-Learning, and LSTMs #11

Transcript

00:00:00 The following is a conversation with Jürgen Schmidhuber.

00:00:03 He’s the co-director of the Swiss AI Lab IDSIA

00:00:06 and a co-creator of long short-term memory networks.

00:00:10 LSTMs are used in billions of devices today

00:00:13 for speech recognition, translation, and much more.

00:00:17 Over 30 years, he has proposed a lot of interesting

00:00:20 out of the box ideas on meta learning, adversarial networks,

00:00:24 computer vision, and even a formal theory of quote,

00:00:28 creativity, curiosity, and fun.

00:00:32 This conversation is part of the MIT course

00:00:34 on artificial general intelligence

00:00:36 and the artificial intelligence podcast.

00:00:38 If you enjoy it, subscribe on YouTube, iTunes,

00:00:41 or simply connect with me on Twitter

00:00:43 at Lex Friedman spelled F R I D.

00:00:47 And now here’s my conversation with Jürgen Schmidhuber.

00:00:53 Early on you dreamed of AI systems

00:00:55 that self improve recursively.

00:00:58 When was that dream born?

00:01:01 When I was a baby.

00:01:03 No, that’s not true.

00:01:04 When I was a teenager.

00:01:06 And what was the catalyst for that birth?

00:01:09 What was the thing that first inspired you?

00:01:12 When I was a boy, I was thinking about what to do in my life

00:01:20 and then I thought the most exciting thing

00:01:23 is to solve the riddles of the universe.

00:01:28 And that means you have to become a physicist.

00:01:30 However, then I realized that there’s something even grander.

00:01:35 You can try to build a machine

00:01:39 that isn’t really a machine any longer

00:01:42 that learns to become a much better physicist

00:01:44 than I could ever hope to be.

00:01:47 And that’s how I thought maybe I can multiply

00:01:50 my tiny little bit of creativity into infinity.

00:01:54 But ultimately that creativity will be multiplied

00:01:57 to understand the universe around us.

00:01:59 That’s the curiosity for that mystery that drove you.

00:02:05 Yes, so if you can build a machine

00:02:08 that learns to solve more and more complex problems

00:02:13 and becomes a more and more general problem solver,

00:02:16 then you basically have solved all the problems,

00:02:22 at least all the solvable problems.

00:02:26 So what do you think the mechanism

00:02:28 for that kind of general solver looks like?

00:02:31 Obviously we don’t quite yet have one

00:02:34 or know how to build one but we have ideas

00:02:37 and you have had throughout your career

00:02:39 several ideas about it.

00:02:40 So how do you think about that mechanism?

00:02:43 So in the 80s, I thought about how to build this machine

00:02:48 that learns to solve all these problems

00:02:51 that I cannot solve myself.

00:02:54 And I thought it is clear it has to be a machine

00:02:57 that not only learns to solve this problem here

00:03:00 and this problem here but it also has to learn

00:03:04 to improve the learning algorithm itself.

00:03:09 So it has to have the learning algorithm

00:03:12 in a representation that allows it to inspect it

00:03:15 and modify it such that it can come up

00:03:19 with a better learning algorithm.

00:03:21 So I call that meta learning, learning to learn

00:03:24 and recursive self improvement

00:03:26 that is really the pinnacle of that

00:03:28 where you then not only learn how to improve

00:03:34 on that problem and on that

00:03:36 but you also improve the way the machine improves

00:03:40 and you also improve the way it improves

00:03:42 the way it improves itself.

00:03:44 And that was my 1987 diploma thesis

00:03:47 which was all about that

00:03:50 hierarchy of meta learners that have no computational limits

00:03:57 except for the well known limits that Gödel identified

00:04:01 in 1931 and for the limits of physics.

00:04:06 In the recent years, meta learning has gained popularity

00:04:10 in a specific kind of form.

00:04:12 You’ve talked about how that’s not really meta learning

00:04:16 with neural networks, that’s more basic transfer learning.

00:04:21 Can you talk about the difference

00:04:22 between the big general meta learning

00:04:25 and a more narrow sense of meta learning

00:04:27 the way it’s used today, the way it’s talked about today?

00:04:30 Let’s take the example of a deep neural network

00:04:33 that has learned to classify images

00:04:37 and maybe you have trained that network

00:04:40 on 100 different databases of images.

00:04:43 And now a new database comes along

00:04:48 and you want to quickly learn the new thing as well.

00:04:53 So one simple way of doing that is you take the network

00:04:57 which already knows 100 types of databases

00:05:02 and then you just take the top layer of that

00:05:06 and you retrain that using the new label data

00:05:11 that you have in the new image database.

00:05:14 And then it turns out that it really, really quickly

00:05:17 can learn that too, one shot basically

00:05:20 because from the first 100 data sets,

00:05:24 it already has learned so much about computer vision

00:05:27 that it can reuse that and that is then almost good enough

00:05:31 to solve the new task except you need a little bit

00:05:34 of adjustment on the top.

00:05:38 So that is transfer learning.

00:05:41 And it has been done in principle for many decades.

00:05:44 People have done similar things for decades.
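
To make the top-layer retraining concrete, here is a minimal sketch in PyTorch. The ResNet-18 backbone, the 10-class head, and the hyperparameters are illustrative stand-ins for the pretrained network described above, not anything from the conversation.

```python
import torch
import torch.nn as nn
from torchvision import models

# A pretrained backbone stands in for the network that already
# "knows 100 types of databases".
backbone = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)

# Freeze everything the network has already learned about vision.
for param in backbone.parameters():
    param.requires_grad = False

# Replace only the top layer for the new database (10 classes is a
# placeholder for however many the new dataset has).
backbone.fc = nn.Linear(backbone.fc.in_features, 10)

# Only the new top layer is adjusted on the new labeled data.
optimizer = torch.optim.Adam(backbone.fc.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

def train_step(images, labels):
    optimizer.zero_grad()
    loss = loss_fn(backbone(images), labels)
    loss.backward()
    optimizer.step()
    return loss.item()
```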

00:05:48 Meta learning too, meta learning is about

00:05:51 having the learning algorithm itself

00:05:55 open to introspection by the system that is using it

00:06:01 and also open to modification such that the learning system

00:06:06 has an opportunity to modify

00:06:09 any part of the learning algorithm

00:06:12 and then evaluate the consequences of that modification

00:06:16 and then learn from that to create

00:06:21 a better learning algorithm and so on recursively.

00:06:25 So that’s a very different animal

00:06:28 where you are opening the space of possible learning

00:06:32 algorithms to the learning system itself.
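
The contrast can be sketched in code, with heavy simplification. Below, the whole "learning algorithm" is reduced to a single step-size parameter, and a meta-level loop modifies it, evaluates the consequences, and keeps only modifications that produce a better learner. This is a toy illustration of the open-to-modification idea, not Schmidhuber's actual meta-learning systems.

```python
import random

def learn(step_size, steps=100):
    """Inner learner: gradient descent on f(w) = w^2, returns final loss."""
    w = 5.0
    for _ in range(steps):
        w -= step_size * 2 * w  # gradient of w^2 is 2w
    return w * w

step_size = 0.001  # the "learning algorithm" the meta-level may rewrite
best_loss = learn(step_size)

# Meta-level: modify the learning algorithm, evaluate the consequence,
# keep the modification only if it yields a better learner.
for _ in range(50):
    candidate = step_size * random.uniform(0.5, 2.0)
    loss = learn(candidate)
    if loss < best_loss:
        step_size, best_loss = candidate, loss

print(f"meta-learned step size: {step_size:.4f}, loss: {best_loss:.6f}")
```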

00:06:35 Right, so you’ve, like in the 2004 paper, you described

00:06:40 Gödel machines, programs that rewrite themselves, right?

00:06:44 Philosophically and even in your paper, mathematically,

00:06:47 these are really compelling ideas but practically,

00:06:52 do you see these self referential programs

00:06:55 being successful in the near term to having an impact

00:06:59 where sort of it demonstrates to the world

00:07:02 that this direction is a good one to pursue

00:07:07 in the near term?

00:07:08 Yes, we had these two different types

00:07:11 of fundamental research,

00:07:13 how to build a universal problem solver,

00:07:15 one basically exploiting proof search

00:07:22 and things like that that you need to come up with

00:07:25 asymptotically optimal, theoretically optimal

00:07:30 self improvers and problem solvers.

00:07:34 However, one has to admit that through this proof search

00:07:40 comes in an additive constant, an overhead,

00:07:44 an additive overhead that vanishes in comparison

00:07:50 to what you have to do to solve large problems.

00:07:55 However, for many of the small problems

00:07:58 that we want to solve in our everyday life,

00:08:00 we cannot ignore this constant overhead

00:08:03 and that’s why we also have been doing other things,

00:08:08 non universal things such as recurrent neural networks

00:08:12 which are trained by gradient descent

00:08:15 and local search techniques which aren’t universal at all,

00:08:18 which aren’t provably optimal at all,

00:08:21 like the other stuff that we did,

00:08:22 but which are much more practical

00:08:25 as long as we only want to solve the small problems

00:08:28 that we are typically trying to solve

00:08:33 in this environment here.

00:08:35 So the universal problem solvers like the Gödel machine,

00:08:38 but also Markus Hutter’s fastest way

00:08:42 of solving all possible problems,

00:08:44 which he developed around 2002 in my lab,

00:08:49 they are associated with these constant overheads

00:08:52 for proof search, which guarantees that the thing

00:08:55 that you’re doing is optimal.

00:08:57 For example, there is this fastest way

00:09:01 of solving all problems with a computable solution,

00:09:05 which is due to Markus, Markus Hutter,

00:09:08 and to explain what’s going on there,

00:09:12 let’s take traveling salesman problems.

00:09:15 With traveling salesman problems,

00:09:17 you have a number of cities

00:09:21 and you try to find the shortest path

00:09:23 through all these cities without visiting any city twice.

00:09:29 And nobody knows the fastest way

00:09:32 of solving traveling salesman problems, TSPs,

00:09:38 but let’s assume there is a method of solving them

00:09:41 within N to the five operations

00:09:45 where N is the number of cities.

00:09:48 Then the universal method of Markus

00:09:53 is going to solve the same traveling salesman problem

00:09:57 also within N to the five steps,

00:10:00 plus O of one, plus a constant number of steps

00:10:04 that you need for the proof searcher,

00:10:07 which you need to show that this particular class

00:10:12 of problems, the traveling salesman problems,

00:10:15 can be solved within a certain time frame,

00:10:17 solved within a certain time bound,

00:10:20 within order N to the five steps, basically,

00:10:24 and this additive constant doesn’t depend on N,

00:10:28 which means as N is getting larger and larger,

00:10:32 as you have more and more cities,

00:10:35 the constant overhead pales in comparison,

00:10:38 and that means that almost all large problems are solved

00:10:44 in the best possible way.
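
A tiny numerical illustration of this point, with an arbitrary placeholder value for the proof-search constant:

```python
# An additive proof-search overhead C (the value here is an arbitrary
# placeholder) vanishes relative to the n^5 cost as n grows, yet
# dominates completely while n is small.
C = 10**12  # constant proof-search overhead, independent of n

for n in [10, 100, 1000, 10000]:
    work = n**5
    overhead_fraction = C / (work + C)
    print(f"n={n:>6}: n^5={work:.1e}, overhead is {overhead_fraction:.1%} of total")
```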

00:10:45 Today, we already have a universal problem solver like that.

00:10:50 However, it’s not practical because the overhead,

00:10:54 the constant overhead is so large

00:10:57 that it dominates for the small kinds of problems

00:11:00 that we want to solve in this little biosphere.

00:11:04 By the way, when you say small,

00:11:06 you’re talking about things that fall

00:11:08 within the constraints of our computational systems.

00:11:10 So they can seem quite large to us mere humans, right?

00:11:14 That’s right, yeah.

00:11:15 So they seem large and even unsolvable

00:11:19 in a practical sense today,

00:11:21 but they are still small compared to almost all problems

00:11:24 because almost all problems are large problems,

00:11:28 which are much larger than any constant.

00:11:31 Do you find it useful as a person

00:11:34 who has dreamed of creating a general learning system,

00:11:38 has worked on creating one,

00:11:39 has done a lot of interesting ideas there,

00:11:42 to think about P versus NP,

00:11:46 this formalization of how hard problems are,

00:11:50 how they scale,

00:11:52 this kind of worst case analysis type of thinking,

00:11:55 do you find that useful?

00:11:56 Or is it just

00:12:00 a set of mathematical techniques

00:12:02 to give you intuition about what’s good and bad?

00:12:05 So P versus NP, that’s super interesting

00:12:09 from a theoretical point of view.

00:12:11 And in fact, as you are thinking about that problem,

00:12:14 you can also get inspiration

00:12:17 for better practical problem solvers.

00:12:21 On the other hand, we have to admit

00:12:23 that at the moment, the best practical problem solvers

00:12:28 for all kinds of problems that we are now solving

00:12:31 through what is called AI at the moment,

00:12:33 they are not of the kind

00:12:36 that is inspired by these questions.

00:12:38 There we are using general purpose computers

00:12:42 such as recurrent neural networks,

00:12:44 but we have a search technique

00:12:46 which is just local search gradient descent

00:12:50 to try to find a program

00:12:51 that is running on these recurrent networks,

00:12:54 such that it can solve some interesting problems

00:12:58 such as speech recognition or machine translation

00:13:01 and something like that.

00:13:03 And there is very little theory behind the best solutions

00:13:08 that we have at the moment that can do that.

00:13:10 Do you think that needs to change?

00:13:12 Do you think that will change?

00:13:14 Or can we create general intelligent systems

00:13:17 without ever really proving that that system is intelligent

00:13:20 in some kind of mathematical way,

00:13:22 solving machine translation perfectly

00:13:25 or something like that,

00:13:26 within some kind of syntactic definition of a language,

00:13:29 or can we just be super impressed

00:13:31 by the thing working extremely well and that’s sufficient?

00:13:35 There’s an old saying,

00:13:36 and I don’t know who brought it up first,

00:13:39 which says, there’s nothing more practical

00:13:42 than a good theory.

00:13:43 And a good theory of problem solving

00:13:52 under limited resources,

00:13:54 like here in this universe or on this little planet,

00:13:58 has to take into account these limited resources.

00:14:01 And so probably we are still lacking a theory,

00:14:08 which is related to what we already have,

00:14:10 these asymptotically optimal problem solvers,

00:14:14 which tells us what we need in addition to that

00:14:18 to come up with a practically optimal problem solver.

00:14:22 So I believe we will have something like that.

00:14:27 And maybe just a few little tiny twists are necessary

00:14:30 to change what we already have,

00:14:34 to come up with that as well.

00:14:36 As long as we don’t have that,

00:14:37 we admit that we are taking suboptimal ways,

00:14:42 and recurrent neural networks and long short term memory

00:14:45 are, for example, equipped with local search techniques.

00:14:50 And we are happy that it works better

00:14:53 than any competing methods,

00:14:55 but that doesn’t mean that we think we are done.

00:15:00 You’ve said that an AGI system

00:15:02 will ultimately be a simple one.

00:15:05 A general intelligence system

00:15:06 will ultimately be a simple one.

00:15:08 Maybe a pseudocode of a few lines

00:15:10 will be able to describe it.

00:15:11 Can you talk through your intuition behind this idea,

00:15:16 why you feel that at its core,

00:15:21 intelligence is a simple algorithm?

00:15:26 Experience tells us that the stuff that works best

00:15:31 is really simple.

00:15:33 So the asymptotically optimal ways of solving problems,

00:15:37 if you look at them,

00:15:38 they’re just a few lines of code, it’s really true.

00:15:41 Although they have these amazing properties,

00:15:44 just a few lines of code.

00:15:45 Then the most promising and most useful practical things,

00:15:53 maybe don’t have this proof of optimality

00:15:56 associated with them.

00:15:57 However, they are also just a few lines of code.

00:16:00 The most successful recurrent neural networks,

00:16:05 you can write them down in five lines of pseudocode.
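
One way to render that claim concretely: the core LSTM step really does fit in a few lines, sketched here with NumPy. This is the textbook cell update, not any particular production implementation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h, c, W, b):
    z = W @ np.concatenate([x, h]) + b            # one affine map computes all gates
    i, f, o, g = np.split(z, 4)                   # input, forget, output, candidate
    c = sigmoid(f) * c + sigmoid(i) * np.tanh(g)  # gated update of the memory cell
    h = sigmoid(o) * np.tanh(c)                   # gated, squashed view of the cell
    return h, c

# Tiny demo: 3 inputs, 4 hidden units, random weights.
rng = np.random.default_rng(0)
W, b = rng.normal(size=(16, 7)), np.zeros(16)     # 4 gates x 4 units; 3 + 4 inputs
h, c = np.zeros(4), np.zeros(4)
for x in rng.normal(size=(5, 3)):                 # run 5 time steps
    h, c = lstm_step(x, h, c, W, b)
print(h)
```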

00:16:08 That’s a beautiful, almost poetic idea,

00:16:10 but what you’re describing there

00:16:15 is the lines of pseudocode are sitting on top

00:16:18 of layers and layers of abstractions in a sense.

00:16:22 So you’re saying at the very top,

00:16:25 it’ll be a beautifully written sort of algorithm.

00:16:31 But do you think that there’s many layers of abstractions

00:16:33 we have to first learn to construct?

00:16:36 Yeah, of course, we are building on all these

00:16:40 great abstractions that people have invented over the millennia,

00:16:45 such as matrix multiplications and real numbers

00:16:50 and basic arithmetics and calculus

00:17:00 and derivatives of error functions and stuff like that.

00:17:04 So without that language that greatly simplifies

00:17:09 our way of thinking about these problems,

00:17:13 we couldn’t do anything.

00:17:14 So in that sense, as always,

00:17:16 we are standing on the shoulders of the giants

00:17:19 who in the past simplified the problem

00:17:24 of problem solving so much

00:17:26 that now we have a chance to do the final step.

00:17:29 So the final step will be a simple one.

00:17:33 If we take a step back through all of human civilization

00:17:36 and just the universe in general,

00:17:40 how do you think about evolution

00:17:41 and what if creating a universe

00:17:44 is required to achieve this final step?

00:17:47 What if going through the very painful

00:17:50 and inefficient process of evolution is needed

00:17:53 to come up with this set of abstractions

00:17:55 that ultimately lead to intelligence?

00:17:57 Do you think there’s a shortcut

00:18:00 or do you think we have to create something like our universe

00:18:04 in order to create something like human level intelligence?

00:18:09 So far, the only example we have is this one,

00:18:13 this universe in which we are living.

00:18:14 Do you think we can do better?

00:18:20 Maybe not, but we are part of this whole process.

00:18:24 So apparently, so it might be the case

00:18:29 that the code that runs the universe

00:18:32 is really, really simple.

00:18:33 Everything points to that possibility

00:18:35 because gravity and other basic forces

00:18:39 are really simple laws that can be easily described

00:18:43 also in just a few lines of code basically.

00:18:46 And then there are these other events

00:18:51 that the apparently random events

00:18:54 in the history of the universe,

00:18:55 which as far as we know at the moment

00:18:58 don’t have a compact code, but who knows?

00:19:00 Maybe somebody in the near future

00:19:02 is going to figure out the pseudo random generator

00:19:06 which is computing whether the measurement

00:19:11 of that spin up or down thing here

00:19:15 is going to be positive or negative.

00:19:17 Underlying quantum mechanics.

00:19:19 Yes.

00:19:20 Do you ultimately think quantum mechanics

00:19:22 is a pseudo random number generator?

00:19:24 So it’s all deterministic.

00:19:26 There’s no randomness in our universe.

00:19:28 Does God play dice?

00:19:31 So a couple of years ago, a famous physicist,

00:19:34 quantum physicist, Anton Zeilinger,

00:19:37 he wrote an essay in Nature

00:19:40 and it started more or less like that.

00:19:45 One of the fundamental insights of the 20th century

00:19:50 was that the universe is fundamentally random

00:19:57 on the quantum level.

00:20:00 And that whenever you measure spin up or down

00:20:04 or something like that,

00:20:05 a new bit of information enters the history of the universe.

00:20:12 And while I was reading that,

00:20:13 I was already typing the response

00:20:16 and they had to publish it.

00:20:17 Because I was right, that there is no evidence,

00:20:21 no physical evidence for that.

00:20:25 So there’s an alternative explanation

00:20:27 where everything that we consider random

00:20:30 is actually pseudo random,

00:20:33 such as the decimal expansion of pi,

00:20:35 3.141 and so on, which looks random, but isn’t.

00:20:41 So pi is interesting because every sequence of three digits

00:20:50 appears roughly one in a thousand times.

00:20:53 And every five digit sequence

00:20:57 appears roughly one in 100,000 times,

00:21:01 which is what you would expect if it were random.

00:21:04 But there’s a very short algorithm,

00:21:06 a short program that computes all of that.

00:21:09 So it’s extremely compressible.
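
That claim is easy to check empirically; here is a short sketch using the mpmath library to count 3-digit windows in the first 100,000 decimal digits of pi.

```python
from collections import Counter
from mpmath import mp

# Count how often each 3-digit sequence appears in the first 100,000
# decimal digits of pi; each should show up roughly 100 times.
mp.dps = 100_005                      # working precision, in digits
digits = str(+mp.pi).replace(".", "")[:100_000]

counts = Counter(digits[i:i + 3] for i in range(len(digits) - 2))
freqs = sorted(counts.values())
print(f"distinct 3-digit sequences seen: {len(counts)} of 1000 possible")
print(f"occurrences: min={freqs[0]}, max={freqs[-1]}, expected ~100")
```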

00:21:11 And who knows, maybe tomorrow,

00:21:12 somebody, some grad student at CERN goes back

00:21:15 over all these data points, beta decay and whatever,

00:21:20 and figures out, oh, it’s the second billion digits of pi

00:21:24 or something like that.

00:21:26 We don’t have any fundamental reason at the moment

00:21:29 to believe that this is truly random

00:21:33 and not just a deterministic video game.

00:21:36 If it was a deterministic video game,

00:21:38 it would be much more beautiful.

00:21:40 Because beauty is simplicity.

00:21:43 And many of the basic laws of the universe,

00:21:47 like gravity and the other basic forces are very simple.

00:21:51 So very short programs can explain what these are doing.

00:21:56 And it would be awful and ugly.

00:22:00 The universe would be ugly.

00:22:01 The history of the universe would be ugly

00:22:04 if for the extra things,

00:22:06 the seemingly random data points that we get all the time,

00:22:11 we really needed a huge number of extra bits

00:22:15 to describe them.

00:22:22 So as long as we don’t have evidence

00:22:24 that there is no short program

00:22:26 that computes the entire history of the entire universe,

00:22:31 we are, as scientists, compelled to look further

00:22:36 for that shortest program.

00:22:39 Your intuition says there exists a program

00:22:43 that can backtrack to the creation of the universe.

00:22:47 Yeah.

00:22:48 So it can give the shortest path

00:22:49 to the creation of the universe.

00:22:50 Yes.

00:22:51 Including all the entanglement things

00:22:54 and all the spin up and down measurements

00:22:57 that have taken place since 13.8 billion years ago.

00:23:06 So we don’t have a proof that it is random.

00:23:11 We don’t have a proof that it is compressible

00:23:15 to a short program.

00:23:16 But as long as we don’t have that proof,

00:23:18 we are obliged as scientists to keep looking

00:23:21 for that simple explanation.

00:23:23 Absolutely.

00:23:24 So you said the simplicity is beautiful or beauty is simple.

00:23:27 Either one works.

00:23:29 But you also work on curiosity, discovery,

00:23:34 the romantic notion of randomness, of serendipity,

00:23:39 of being surprised by things that are around you.

00:23:45 In our poetic notion of reality,

00:23:51 we think that we as humans

00:23:54 require randomness.

00:23:56 So you don’t find randomness beautiful.

00:23:59 You find simple determinism beautiful.

00:24:04 Yeah.

00:24:05 Okay.

00:24:07 So why?

00:24:08 Why?

00:24:09 Because the explanation becomes shorter.

00:24:13 A universe that is compressible to a short program

00:24:18 is much more elegant and much more beautiful

00:24:22 than another one, which needs an almost infinite

00:24:25 number of bits to be described.

00:24:28 As far as we know, many things that are happening

00:24:32 in this universe are really simple in terms of

00:24:35 short programs that compute gravity

00:24:38 and the interaction between elementary particles and so on.

00:24:43 So all of that seems to be very, very simple.

00:24:45 Every electron seems to reuse the same subprogram

00:24:50 all the time, as it is interacting with

00:24:52 other elementary particles.

00:24:58 If we now require an extra oracle injecting

00:25:05 new bits of information all the time for these

00:25:08 extra things which are currently not understood,

00:25:11 such as beta decay, then the whole description

00:25:22 length of the data that we can observe of the

00:25:26 history of the universe would become much longer

00:25:31 and therefore uglier.

00:25:33 And uglier.

00:25:34 Again, simplicity is elegant and beautiful.

00:25:38 The history of science is a history of compression progress.

00:25:42 Yes, so you’ve described sort of as we build up

00:25:47 abstractions and you’ve talked about the idea

00:25:50 of compression.

00:25:52 How do you see this, the history of science,

00:25:55 the history of humanity, our civilization,

00:25:58 and life on Earth as some kind of path towards

00:26:02 greater and greater compression?

00:26:04 What do you mean by that?

00:26:05 How do you think about that?

00:26:06 Indeed, the history of science is a history of

00:26:10 compression progress.

00:26:12 What does that mean?

00:26:14 Hundreds of years ago there was an astronomer

00:26:17 whose name was Kepler and he looked at the data

00:26:21 points that he got by watching planets move.

00:26:25 And then he had all these data points and

00:26:27 suddenly it turned out that he can greatly

00:26:30 compress the data by predicting it through an

00:26:36 ellipse law.

00:26:38 So it turns out that all these data points are

00:26:40 more or less on ellipses around the sun.

00:26:45 And another guy came along whose name was

00:26:48 Newton and before him Hooke.

00:26:51 And they said the same thing that is making

00:26:55 these planets move like that is what makes the

00:27:00 apples fall down.

00:27:02 And it also holds for stones and for all kinds

00:27:08 of other objects.

00:27:11 And suddenly many, many of these observations

00:27:15 became much more compressible because as long

00:27:17 as you can predict the next thing, given what

00:27:20 you have seen so far, you can compress it.

00:27:23 And you don’t have to store that data extra.

00:27:25 This is called predictive coding.
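
A minimal sketch of predictive coding, using a deliberately simple "the next value equals the last one" predictor on placeholder data; good predictions leave only small residuals to store.

```python
def encode(signal, predict):
    residuals, history = [], []
    for x in signal:
        residuals.append(x - predict(history))  # store only the surprise
        history.append(x)
    return residuals

def decode(residuals, predict):
    history = []
    for r in residuals:
        history.append(predict(history) + r)    # reconstruct exactly
    return history

def last_value(history):
    """Trivial predictor: tomorrow looks like today."""
    return history[-1] if history else 0

signal = [10, 11, 12, 12, 13, 30, 31, 32]
residuals = encode(signal, last_value)
assert decode(residuals, last_value) == signal
print(residuals)   # mostly small numbers: [10, 1, 1, 0, 1, 17, 1, 1]
```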

00:27:29 And then there was still something wrong with

00:27:31 that theory of the universe and you had

00:27:34 deviations from these predictions of the theory.

00:27:37 And 300 years later another guy came along

00:27:40 whose name was Einstein.

00:27:42 And he was able to explain away all these

00:27:46 deviations from the predictions of the old

00:27:50 theory through a new theory which was called

00:27:54 the general theory of relativity.

00:27:57 Which at first glance looks a little bit more

00:28:00 complicated and you have to warp space and time

00:28:03 but you can phrase it within one single

00:28:05 sentence which is no matter how fast you

00:28:08 accelerate and how hard you decelerate and no

00:28:14 matter what is the gravity in your local

00:28:18 framework, light speed always looks the same.

00:28:21 And from that you can calculate all the

00:28:24 consequences.

00:28:25 So it’s a very simple thing and it allows you

00:28:27 to further compress all the observations

00:28:30 because suddenly there are hardly any

00:28:34 deviations any longer that you can measure

00:28:37 from the predictions of this new theory.

00:28:40 So all of science is a history of compression

00:28:44 progress.

00:28:45 You never arrive immediately at the shortest

00:28:48 explanation of the data but you’re making

00:28:51 progress.

00:28:52 Whenever you are making progress you have an

00:28:55 insight.

00:28:56 You see oh first I needed so many bits of

00:28:59 information to describe the data, to describe

00:29:02 my falling apples, my video of falling apples,

00:29:04 I need so many data, so many pixels have to be

00:29:07 stored.

00:29:08 But then suddenly I realize no there is a very

00:29:11 simple way of predicting the third frame in the

00:29:14 video from the first two.

00:29:16 And maybe not every little detail can be

00:29:19 predicted but more or less most of these orange

00:29:21 blobs that are coming down they accelerate in

00:29:24 the same way which means that I can greatly

00:29:27 compress the video.

00:29:28 And the amount of compression progress, that

00:29:33 is the depth of the insight that you have at

00:29:36 that moment.

00:29:37 That’s the fun that you have, the scientific

00:29:39 fun, the fun in that discovery.

00:29:42 And we can build artificial systems that do

00:29:45 the same thing.

00:29:46 They measure the depth of their insights as they

00:29:49 are looking at the data which is coming in

00:29:51 through their own experiments and we give

00:29:54 them a reward, an intrinsic reward in proportion

00:29:58 to this depth of insight.

00:30:00 And since they are trying to maximize the

00:30:05 rewards they get they are suddenly motivated to

00:30:09 come up with new action sequences, with new

00:30:13 experiments that have the property that the data

00:30:16 that is coming in as a consequence of these

00:30:19 experiments has the property that they can learn

00:30:23 something about, see a pattern in there which

00:30:25 they hadn’t seen yet before.
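
A sketch of that intrinsic-reward scheme, radically simplified: the reward for an observation is how much the predictive model improves on it, i.e. the compression progress. The running-average "model" and the log-based code length are placeholder choices, not the formulation in his papers.

```python
import math

class CuriousObserver:
    def __init__(self):
        self.mean, self.n = 0.0, 0

    def surprise(self, x):
        """Crude code length (bits) for x under the current model."""
        return math.log2(1.0 + abs(x - self.mean))

    def observe(self, x):
        before = self.surprise(x)
        self.n += 1
        self.mean += (x - self.mean) / self.n   # update the model
        after = self.surprise(x)
        return before - after                    # compression progress

agent = CuriousObserver()
for x in [5.0, 5.1, 4.9, 20.0, 20.0]:
    # Novel data that the model can learn from yields the most reward.
    print(f"observation {x}: intrinsic reward {agent.observe(x):.3f}")
```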

00:30:28 So there is an idea of power play that you

00:30:31 described, a training in general problem solver

00:30:34 in this kind of way of looking for the unsolved

00:30:36 problems.

00:30:37 Yeah.

00:30:38 Can you describe that idea a little further?

00:30:40 It’s another very simple idea.

00:30:42 So normally what you do in computer science,

00:30:45 you have some guy who gives you a problem and

00:30:50 then there is a huge search space of potential

00:30:55 solution candidates and you somehow try them

00:30:59 out and you have more or less sophisticated ways

00:31:02 of moving around in that search space until

00:31:07 you finally found a solution which you

00:31:10 consider satisfactory.

00:31:12 That’s what most of computer science is about.

00:31:15 Power play just goes one little step further

00:31:18 and says let’s not only search for solutions

00:31:23 to a given problem but let’s search for pairs of

00:31:28 problems and their solutions where the system

00:31:31 itself has the opportunity to phrase its own

00:31:35 problem.

00:31:37 So we are looking suddenly at pairs of

00:31:40 problems and their solutions or modifications

00:31:44 of the problem solver that is supposed to

00:31:47 generate a solution to that new problem.

00:31:51 And this additional degree of freedom allows

00:31:57 us to build curious systems that are like

00:32:00 scientists in the sense that they not only

00:32:04 try to solve and try to find answers to

00:32:07 existing questions, no they are also free to

00:32:11 pose their own questions.

00:32:13 So if you want to build an artificial scientist

00:32:15 you have to give it that freedom and power

00:32:17 play is exactly doing that.

00:32:19 So that’s a dimension of freedom that’s

00:32:22 important to have, but how hard do you think it is?

00:32:25 How multidimensional and difficult is the

00:32:31 space of coming up with your own questions?

00:32:35 So it’s one of the things that as human beings

00:32:37 we consider to be the thing that makes us

00:32:40 special, the intelligence that makes us special

00:32:42 is that brilliant insight that can create

00:32:46 something totally new.

00:32:48 Yes.

00:32:49 So now let’s look at the extreme case, let’s

00:32:52 look at the set of all possible problems that

00:32:56 you can formally describe which is infinite,

00:33:00 which should be the next problem that a scientist

00:33:05 or power play is going to solve.

00:33:08 Well, it should be the easiest problem that

00:33:14 goes beyond what you already know.

00:33:17 So it should be the simplest problem that the

00:33:21 current problem solver that you have, which can

00:33:23 already solve 100 problems, cannot solve

00:33:28 yet by just generalizing.

00:33:30 So it has to be new, so it has to require a

00:33:33 modification of the problem solver such that the

00:33:36 new problem solver can solve this new thing but

00:33:39 the old problem solver cannot do it and in

00:33:42 addition to that we have to make sure that the

00:33:46 problem solver doesn’t forget any of the

00:33:49 previous solutions.

00:33:50 Right.

00:33:51 And so by definition power play is always trying

00:33:54 to search in the set of

00:33:58 pairs of problems and problem solver modifications

00:34:02 for a combination that minimizes the time to

00:34:06 achieve these criteria.

00:34:08 So it’s always trying to find the problem which

00:34:11 is easiest to add to the repertoire.
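
A deliberately tiny stand-in for that search, just to show the shape of the loop: tasks are numbers, a "solver" is a threshold, and the search looks for the cheapest (new task, solver modification) pair where the modified solver handles the new task and the whole existing repertoire.

```python
import random

solver = 1.0                  # current problem solver
repertoire = [0.5, 0.8, 1.0]  # problems already solved

def solves(solver, task):
    return task <= solver

for _ in range(5):
    candidates = []
    for _ in range(100):
        # A candidate new task just beyond the horizon, plus a
        # modification of the solver meant to reach it.
        task = solver + random.uniform(0.0, 2.0)
        new_solver = solver + random.uniform(0.0, 2.0)
        ok = solves(new_solver, task) and not solves(solver, task)
        ok = ok and all(solves(new_solver, t) for t in repertoire)  # no forgetting
        if ok:
            cost = new_solver - solver   # stand-in for search/training time
            candidates.append((cost, task, new_solver))
    cost, task, solver = min(candidates)  # easiest valid addition
    repertoire.append(task)
    print(f"added task {task:.3f}, solver now {solver:.3f}")
```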

00:34:14 So just like grad students and academics and

00:34:18 researchers can spend their whole career in a

00:34:20 local minima stuck trying to come up with

00:34:24 interesting questions but ultimately doing very

00:34:26 little.

00:34:27 Do you think it’s easy in this approach of

00:34:31 looking for the simplest unsolvable problem to

00:34:33 get stuck in a local minima?

00:34:35 And never really discovering anything new, never

00:34:40 really jumping outside of the 100 problems that

00:34:42 you’ve already solved in a genuine creative way?

00:34:47 No, because that’s the nature of power play that

00:34:50 it’s always trying to break its current

00:34:53 generalization abilities by coming up with a new

00:34:57 problem which is beyond the current horizon.

00:35:00 Just shifting the horizon of knowledge a little

00:35:04 bit out there, breaking the existing rules such

00:35:08 that the new thing becomes solvable but wasn’t

00:35:11 solvable by the old thing.

00:35:13 So like adding a new axiom like what Gödel did

00:35:17 when he came up with these new sentences, new

00:35:21 theorems that didn’t have a proof in the formal

00:35:23 system which means you can add them to the

00:35:25 repertoire hoping that they are not going to

00:35:31 damage the consistency of the whole thing.

00:35:35 So in the paper with the amazing title,

00:35:39 Formal Theory of Creativity, Fun and Intrinsic

00:35:43 Motivation, you talk about discovery as intrinsic

00:35:46 reward, so if you view humans as intelligent

00:35:50 agents, what do you think is the purpose and

00:35:53 meaning of life for us humans?

00:35:56 You’ve talked about this discovery, do you see

00:35:59 humans as an instance of power play agents?

00:36:04 Humans are curious and that means they behave

00:36:10 like scientists, not only the official scientists

00:36:13 but even the babies behave like scientists and

00:36:16 they play around with their toys to figure out

00:36:19 how the world works and how it is responding to

00:36:22 their actions and that’s how they learn about

00:36:25 gravity and everything.

00:36:27 In 1990 we had the first systems like that which

00:36:31 would just try to play around with the environment

00:36:34 and come up with situations that go beyond what

00:36:38 they knew at that time and then get a reward for

00:36:41 creating these situations and then becoming more

00:36:44 general problem solvers and being able to understand

00:36:47 more of the world.

00:36:49 I think in principle that curiosity strategy or

00:37:01 more sophisticated versions of what I just

00:37:03 described, they are what we have built in as well

00:37:07 because evolution discovered that’s a good way of

00:37:10 exploring the unknown world and a guy who explores

00:37:13 the unknown world has a higher chance of solving

00:37:17 the mystery that he needs to survive in this world.

00:37:20 On the other hand, those guys who were too curious

00:37:24 they were weeded out as well so you have to find

00:37:27 this trade off.

00:37:28 Evolution found a certain trade off.

00:37:30 Apparently in our society there is a certain

00:37:33 percentage of extremely explorative guys and it

00:37:37 doesn’t matter if they die because many of the

00:37:40 others are more conservative.

00:37:45 It would be surprising to me if that principle of

00:37:51 artificial curiosity wouldn’t be present in almost

00:37:56 exactly the same form here.

00:37:58 In our brains.

00:38:00 You are a bit of a musician and an artist.

00:38:03 Continuing on this topic of creativity, what do you

00:38:08 think is the role of creativity and intelligence?

00:38:10 So you’ve kind of implied that it’s essential for

00:38:15 intelligence if you think of intelligence as a

00:38:18 problem solving system, as ability to solve problems.

00:38:23 But do you think it’s essential, this idea of

00:38:26 creativity?

00:38:27 We never have a program, a sub program that is

00:38:32 called creativity or something.

00:38:34 It’s just a side effect of what our problem solvers

00:38:37 do. They are searching a space of problems, a space

00:38:41 of candidates, of solution candidates until they

00:38:45 hopefully find a solution to a given problem.

00:38:48 But then there are these two types of creativity

00:38:50 and both of them are now present in our machines.

00:38:54 The first one has been around for a long time,

00:38:56 which is human gives problem to machine, machine

00:39:00 tries to find a solution to that.

00:39:03 And this has been happening for many decades and

00:39:06 for many decades machines have found creative

00:39:09 solutions to interesting problems where humans were

00:39:12 not aware of these particularly creative solutions

00:39:17 but then appreciated that the machine found that.

00:39:20 The second is the pure creativity.

00:39:23 That I would call, what I just mentioned, I would

00:39:26 call the applied creativity, like applied art where

00:39:30 somebody tells you now make a nice picture of this

00:39:34 Pope and you will get money for that.

00:39:37 So here is the artist and he makes a convincing

00:39:40 picture of the Pope and the Pope likes it and gives

00:39:43 him the money.

00:39:46 And then there is the pure creativity which is

00:39:49 more like the power play and the artificial

00:39:51 curiosity thing where you have the freedom to

00:39:54 select your own problem.

00:39:57 Like a scientist who defines his own question

00:40:02 to study and so that is the pure creativity if you

00:40:06 will as opposed to the applied creativity which

00:40:11 serves another.

00:40:14 And in that distinction there is almost echoes of

00:40:16 narrow AI versus general AI.

00:40:19 So this kind of constrained painting of a Pope

00:40:22 seems like the approaches of what people are

00:40:28 calling narrow AI and pure creativity seems to be,

00:40:33 maybe I am just biased as a human but it seems to

00:40:35 be an essential element of human level intelligence.

00:40:41 Is that what you are implying?

00:40:44 To a degree?

00:40:46 If you zoom back a little bit and you just look

00:40:49 at a general problem solving machine which is

00:40:51 trying to solve arbitrary problems then this

00:40:54 machine will figure out in the course of solving

00:40:57 problems that it is good to be curious.

00:41:00 So all of what I said just now about this prewired

00:41:04 curiosity and this will to invent new problems

00:41:07 that the system doesn’t know how to solve yet

00:41:11 should be just a byproduct of the general search.

00:41:15 However, apparently evolution has built it into

00:41:20 us because it turned out to be so successful,

00:41:24 a prewiring, a bias, a very successful exploratory

00:41:29 bias that we are born with.

00:41:34 And you have also said that consciousness in the

00:41:36 same kind of way may be a byproduct of problem solving.

00:41:41 Do you find this an interesting byproduct?

00:41:45 Do you think it is a useful byproduct?

00:41:47 What are your thoughts on consciousness in general?

00:41:49 Or is it simply a byproduct of greater and greater

00:41:53 capabilities of problem solving that is similar

00:41:58 to creativity in that sense?

00:42:01 We never have a procedure called consciousness

00:42:04 in our machines.

00:42:05 However, we get as side effects of what these

00:42:09 machines are doing things that seem to be closely

00:42:13 related to what people call consciousness.

00:42:16 So for example, already in 1990 we had simple

00:42:20 systems which were basically recurrent networks

00:42:24 and therefore universal computers trying to map

00:42:28 incoming data into actions that lead to success.

00:42:33 Maximizing reward in a given environment,

00:42:36 always finding the charging station in time

00:42:40 whenever the battery is low and negative signals

00:42:42 are coming from the battery, always find the

00:42:45 charging station in time without bumping against

00:42:48 painful obstacles on the way.

00:42:50 So complicated things but very easily motivated.

00:42:54 And then we give these little guys a separate

00:43:00 recurrent neural network which is just predicting

00:43:02 what’s happening if I do that and that.

00:43:04 What will happen as a consequence of these

00:43:07 actions that I’m executing.

00:43:09 And it’s just trained on the long and long history

00:43:11 of interactions with the world.

00:43:13 So it becomes a predictive model of the world

00:43:16 basically.

00:43:18 And therefore also a compressor of the observations

00:43:22 of the world because whatever you can predict

00:43:25 you don’t have to store extra.

00:43:27 So compression is a side effect of prediction.

00:43:30 And how does this recurrent network compress?

00:43:33 Well, it’s inventing little subprograms, little

00:43:36 subnetworks that stand for everything that

00:43:39 frequently appears in the environment like

00:43:42 bottles and microphones and faces, maybe lots of

00:43:45 faces in my environment so I’m learning to create

00:43:50 something like a prototype face and a new face

00:43:52 comes along and all I have to encode are the

00:43:54 deviations from the prototype.

00:43:56 So it’s compressing all the time the stuff that

00:43:58 frequently appears.
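
A small numerical sketch of the prototype idea, on synthetic data: once the prototype is stored, each new "face" (here, a feature vector) only needs its small deviation encoded.

```python
import numpy as np

rng = np.random.default_rng(0)
prototype = rng.normal(0.0, 10.0, size=64)                 # "typical face"
faces = prototype + rng.normal(0.0, 0.1, size=(100, 64))   # small deviations

# What actually needs to be encoded per face, given the prototype.
deviations = faces - prototype

# Crude proxy for code length: entropy scales with the spread of values.
print(f"spread of raw faces:  {faces.std():.3f}")
print(f"spread of deviations: {deviations.std():.3f}  (much cheaper to encode)")
```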

00:44:00 There’s one thing that appears all the time that

00:44:05 is present all the time when the agent is

00:44:07 interacting with its environment which is the

00:44:10 agent itself.

00:44:12 But just for data compression reasons it is

00:44:15 extremely natural for this recurrent network to

00:44:18 come up with little subnetworks that stand for

00:44:21 the properties of the agents, the hand, the other

00:44:26 actuators and all the stuff that you need to

00:44:29 better encode the data which is influenced by

00:44:32 the actions of the agent.

00:44:34 So there just as a side effect of data compression

00:44:39 during problem solving you have internal self

00:44:43 models.

00:44:45 Now you can use this model of the world to plan

00:44:50 your future and that’s what we also have done

00:44:53 since 1990.

00:44:54 So the recurrent network which is the controller

00:44:57 which is trying to maximize reward can use this

00:45:00 model network of the world, this predictive model of

00:45:05 the world to plan ahead and say let’s not do this

00:45:08 action sequence, let’s do this action sequence

00:45:10 instead because it leads to more predicted

00:45:13 reward.

00:45:14 And whenever it is waking up these little

00:45:17 subnetworks that stand for itself, then it is

00:45:20 thinking about itself,

00:45:23 and it is exploring mentally the

00:45:28 consequences of its own actions, and now you tell

00:45:34 me what is still missing.

00:45:36 Missing the next, the gap to consciousness.

00:45:40 There isn’t.

00:45:41 That’s a really beautiful idea that if life is

00:45:45 a collection of data and life is a process of

00:45:48 compressing that data to act efficiently in that

00:45:54 data you yourself appear very often.

00:45:57 So it’s useful to form compressions of yourself

00:46:00 and it’s a really beautiful formulation of

00:46:03 consciousness as a necessary side effect.

00:46:05 It’s actually quite compelling to me.

00:46:11 You’ve described RNNs, developed LSTMs, long

00:46:16 short term memory networks that are a type of

00:46:20 recurrent neural networks that have gotten a lot

00:46:23 of success recently.

00:46:24 So these are networks that model the temporal

00:46:27 aspects in the data, temporal patterns in the

00:46:30 data and you’ve called them the deepest of the

00:46:34 neural networks.

00:46:35 So what do you think is the value of depth in

00:46:38 the models that we use to learn?

00:46:41 Since you mentioned the long short term memory

00:46:46 and the LSTM I have to mention the names of the

00:46:50 brilliant students who made that possible.

00:46:52 First of all my first student ever Sepp Hochreiter

00:46:56 who had fundamental insights already in his

00:46:58 diploma thesis.

00:46:59 Then Felix Gers who had additional important

00:47:03 contributions.

00:47:04 Alex Graves is a guy from Scotland who is mostly

00:47:08 responsible for this CTC algorithm which is now

00:47:11 often used to train the LSTM to do the speech

00:47:15 recognition on all the Google Android phones and

00:47:18 whatever and Siri and so on.

00:47:21 So without these guys, I would be

00:47:26 nothing.

00:47:27 It’s a lot of incredible work.

00:47:29 What is now the depth?

00:47:30 What is the importance of depth?

00:47:32 Well most problems in the real world are deep in

00:47:36 the sense that the current input doesn’t tell you

00:47:40 all you need to know about the environment.

00:47:44 So instead you have to have a memory of what

00:47:48 happened in the past and often important parts of

00:47:51 that memory are dated.

00:47:54 They are pretty old.

00:47:56 So when you’re doing speech recognition for

00:47:59 example and somebody says 11 then that’s about

00:48:05 half a second or something like that which means

00:48:09 it’s already 50 time steps.

00:48:11 And another guy or the same guy says 7.

00:48:15 So the ending is the same, “even,” but now the

00:48:19 system has to see the distinction between 7 and

00:48:22 11 and the only way it can see the difference is

00:48:25 it has to store that 50 steps ago there was an

00:48:30 S or an L, a seven or an eleven.

00:48:34 So there you have already a problem of depth 50

00:48:37 because for each time step you have something

00:48:41 like a virtual layer in the expanded unrolled

00:48:44 version of this recurrent network which is doing

00:48:46 the speech recognition.

00:48:48 So these long time lags they translate into

00:48:51 problem depth.

00:48:53 And most problems in this world are such that

00:48:57 you really have to look far back in time to

00:49:01 understand what is the problem and to solve it.
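
A small sketch of that depth argument in PyTorch: the label of each synthetic sequence is decided by its first token, 50 noisy steps before the decision, so the unrolled network is effectively 50 layers deep and must carry that one bit the whole way. Sizes and data are invented for illustration.

```python
import torch
import torch.nn as nn

T, BATCH = 50, 64
lstm = nn.LSTM(input_size=1, hidden_size=16, batch_first=True)
head = nn.Linear(16, 2)
opt = torch.optim.Adam([*lstm.parameters(), *head.parameters()], lr=1e-2)

def make_batch():
    labels = torch.randint(0, 2, (BATCH,))
    seqs = torch.randn(BATCH, T, 1) * 0.1    # 50 steps of uninformative noise
    seqs[:, 0, 0] = labels.float() * 2 - 1   # the "S or L" cue at time step 0
    return seqs, labels

for step in range(300):
    seqs, labels = make_batch()
    out, _ = lstm(seqs)                      # unrolled over all 50 time steps
    loss = nn.functional.cross_entropy(head(out[:, -1]), labels)
    opt.zero_grad()
    loss.backward()
    opt.step()

# Loss should fall well below chance (~0.69) as the cell learns to
# store the time-step-0 cue and ignore the noise in between.
print(f"final loss: {loss.item():.3f}")
```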

00:49:05 But just like with LSTMs, you don’t necessarily

00:49:08 need to remember every aspect when you look back in time,

00:49:11 you just need to remember the important

00:49:13 aspects.

00:49:14 That’s right.

00:49:15 The network has to learn to put the important

00:49:18 stuff into memory and to ignore the unimportant

00:49:22 noise.

00:49:23 But in that sense deeper and deeper is better

00:49:28 or is there a limitation?

00:49:30 I mean LSTM is one of the great examples of

00:49:34 architectures that do something beyond just

00:49:40 deeper and deeper networks.

00:49:42 There’s clever mechanisms for filtering data,

00:49:45 for remembering and forgetting.

00:49:47 So do you think that kind of thinking is

00:49:50 necessary?

00:49:51 If you think about LSTMs as a leap, a big leap

00:49:54 forward over traditional vanilla RNNs, what do

00:49:57 you think is the next leap within this context?

00:50:02 So LSTM is a very clever improvement but LSTM

00:50:06 still don’t have the same kind of ability to see

00:50:10 far back in the past as us humans do.

00:50:14 The credit assignment problem reaches way back,

00:50:18 not just 50 time steps or 100 or 1000 but

00:50:22 millions and billions.

00:50:24 It’s not clear what are the practical limits of

00:50:28 the LSTM when it comes to looking back.

00:50:31 Already in 2006 I think we had examples where

00:50:35 it not only looked back tens of thousands of

00:50:38 steps but really millions of steps.

00:50:41 And Juan Perez Ortiz in my lab I think was the

00:50:45 first author of a paper where we really, was it

00:50:49 2006 or something, had examples where it learned

00:50:53 to look back for more than 10 million steps.

00:50:57 So for most problems of speech recognition it’s

00:51:01 not necessary to look that far back but there

00:51:05 are examples where it does.

00:51:07 Now the looking back thing, that’s rather easy

00:51:11 because there is only one past but there are

00:51:15 many possible futures and so a reinforcement

00:51:19 learning system which is trying to maximize its

00:51:22 future expected reward and doesn’t know yet which

00:51:26 of these many possible futures should I select

00:51:29 given this one single past is facing problems

00:51:33 that the LSTM by itself cannot solve.

00:51:36 So the LSTM is good for coming up with a compact

00:51:40 representation of the history and observations

00:51:44 and actions so far but now how do you plan in an

00:51:49 efficient and good way among all these, how do

00:51:54 you select one of these many possible action

00:51:57 sequences that a reinforcement learning system

00:52:00 has to consider to maximize reward in this

00:52:04 unknown future?

00:52:06 We have this basic setup where you have one

00:52:10 recurrent network which gets in the video and

00:52:14 the speech and whatever and it’s executing

00:52:17 actions and it’s trying to maximize reward so

00:52:20 there is no teacher who tells it what to do at

00:52:23 which point in time.

00:52:25 And then there’s the other network which is

00:52:29 just predicting what’s going to happen if I do

00:52:32 that and that and that could be an LSTM network

00:52:35 and it learns to look back all the way to make

00:52:38 better predictions of the next time step.

00:52:41 So essentially although it’s predicting only the

00:52:44 next time step it is motivated to learn to put

00:52:48 into memory something that happened maybe a

00:52:51 million steps ago because it’s important to

00:52:54 memorize that if you want to predict that at the

00:52:57 next time step, the next event.

00:52:59 Now how can a model of the world like that, a

00:53:03 predictive model of the world be used by the

00:53:06 first guy?

00:53:07 Let’s call it the controller and the model, the

00:53:10 controller and the model.

00:53:12 How can the model be used by the controller to

00:53:15 efficiently select among these many possible

00:53:18 futures?

00:53:19 The naive way we had about 30 years ago was

00:53:22 let’s just use the model of the world as a stand

00:53:26 in, as a simulation of the world and millisecond

00:53:30 by millisecond we plan the future and that means

00:53:33 we have to roll it out really in detail and it

00:53:36 will work only if the model is really good and

00:53:39 it will still be inefficient because we have to

00:53:42 look at all these possible futures and there are

00:53:45 so many of them.
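
A minimal sketch of that naive planning scheme: roll candidate action sequences through a (here, placeholder) learned model and pick the best predicted return. With a perfect model this works; the cost of enumerating all these futures is exactly the weakness he describes.

```python
import random

def model(state, action):
    """Placeholder learned model: predicts (next_state, reward)."""
    next_state = state + action
    return next_state, -abs(next_state - 10)   # reward peaks at state 10

def plan(state, horizon=5, n_candidates=200):
    best_return, best_seq = float("-inf"), None
    for _ in range(n_candidates):
        seq = [random.choice([-1, 0, 1, 2]) for _ in range(horizon)]
        s, total = state, 0.0
        for a in seq:                      # roll out step by step in the model
            s, r = model(s, a)
            total += r
        if total > best_return:
            best_return, best_seq = total, seq
    return best_seq

print(plan(state=0))   # should push the state toward 10, e.g. mostly 2s
```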

00:53:46 So instead what we do now since 2015 in our CM

00:53:49 systems, controller model systems, we give the

00:53:52 controller the opportunity to learn by itself how

00:53:56 to use the potentially relevant parts of the M,

00:54:00 of the model network to solve new problems more

00:54:04 quickly.

00:54:05 And if it wants to, it can learn to ignore the M

00:54:09 and sometimes it’s a good idea to ignore the M

00:54:12 because it’s really bad, it’s a bad predictor in

00:54:15 this particular situation of life where the

00:54:19 controller is currently trying to maximize reward.

00:54:22 However, it can also learn to address and exploit

00:54:26 some of the subprograms that came about in the

00:54:31 model network through compressing the data by

00:54:35 predicting it.

00:54:36 So it now has an opportunity to reuse that code,

00:54:40 the algorithmic information in the model network

00:54:44 to reduce its own search space such that it can

00:54:49 solve a new problem more quickly than without the

00:54:52 model.

00:54:53 Compression.

00:54:54 So you’re ultimately optimistic and excited about

00:54:59 the power of RL, of reinforcement learning in the

00:55:03 context of real systems.

00:55:05 Absolutely, yeah.

00:55:06 So you see RL as potentially having a huge impact

00:55:11 beyond just sort of the M part, which is often developed with

00:55:16 supervised learning methods.

00:55:19 You see RL as the way forward for problems of self driving cars

00:55:25 or any kind of applied robotics.

00:55:28 That’s the correct interesting direction for

00:55:32 research in your view?

00:55:34 I do think so.

00:55:35 We have a company called NNAISENSE which has applied

00:55:40 reinforcement learning to little Audis which learn

00:55:45 to park without a teacher.

00:55:47 The same principles were used of course.

00:55:50 So these little Audis, they are small, maybe like

00:55:54 that, so much smaller than the real Audis.

00:55:57 But they have all the sensors that you find in the

00:56:00 real Audis.

00:56:01 You find the cameras, the LIDAR sensors.

00:56:03 They go up to 120 kilometers an hour if they want

00:56:08 to.

00:56:09 And they have pain sensors basically and they don’t

00:56:13 want to bump against obstacles and other Audis and

00:56:17 so they must learn like little babies to park.

00:56:21 Take the raw vision input and translate that into

00:56:25 actions that lead to successful parking behavior

00:56:28 which is a rewarding thing.

00:56:30 And yes, they learn that.

00:56:32 So we have examples like that and it’s only in the

00:56:36 beginning.

00:56:37 This is just the tip of the iceberg and I believe the

00:56:40 next wave of AI is going to be all about that.

00:56:44 So at the moment, the current wave of AI is about

00:56:48 passive pattern observation and prediction and that’s

00:56:53 what you have on your smartphone and what the major

00:56:56 companies on the Pacific Rim are using to sell you

00:57:00 ads to do marketing.

00:57:02 That’s the current sort of profit in AI and that’s

00:57:05 only one or two percent of the world economy.

00:57:08 Which is big enough to make these companies pretty

00:57:12 much the most valuable companies in the world.

00:57:15 But there’s a much, much bigger fraction of the

00:57:19 economy going to be affected by the next wave which

00:57:22 is really about machines that shape the data through

00:57:26 their own actions.

00:57:27 Do you think simulation is ultimately the biggest

00:57:31 way that those methods will be successful in the next

00:57:35 10, 20 years?

00:57:36 We’re not talking about 100 years from now.

00:57:38 We’re talking about sort of the near term impact of

00:57:41 RL.

00:57:42 Do you think really good simulation is required or

00:57:45 is there other techniques like imitation learning,

00:57:48 observing other humans operating in the real world?

00:57:53 Where do you think the success will come from?

00:57:57 So at the moment, we have a tendency of using physics

00:58:02 simulations to learn behavior for machines that

00:58:07 learn to solve problems that humans also do not know

00:58:11 how to solve.

00:58:12 However, this is not the future because the future is

00:58:16 in what little babies do.

00:58:18 They don’t use a physics engine to simulate the

00:58:21 world.

00:58:22 No, they learn a predictive model of the world which

00:58:26 maybe sometimes is wrong in many ways but captures

00:58:31 all kinds of important abstract high level predictions

00:58:34 which are really important to be successful.

00:58:37 And that’s what was the future 30 years ago when we

00:58:42 started that type of research but it’s still the future

00:58:45 and now we know much better how to go there to move

00:58:49 forward and to really make working systems based on

00:58:54 that where you have a learning model of the world,

00:58:57 a model of the world that learns to predict what’s

00:58:59 going to happen if I do that and that.

00:59:01 And then the controller uses that model to more

00:59:07 quickly learn successful action sequences.

00:59:10 And then of course always this curiosity thing.

00:59:13 In the beginning, the model is stupid so the

00:59:15 controller should be motivated to come up with

00:59:18 experiments with action sequences that lead to data

00:59:21 that improve the model.

00:59:23 Do you think improving the model, constructing an

00:59:27 understanding of the world in this connectionist way...

00:59:30 The popular approaches that have been successful now

00:59:34 are grounded in ideas of neural networks.

00:59:38 But in the 80s with expert systems, there were

00:59:41 symbolic AI approaches which to us humans are more

00:59:45 intuitive in the sense that it makes sense that you

00:59:49 build up knowledge in this knowledge representation.

00:59:52 What kind of lessons can we draw into our current

00:59:54 approaches from expert systems from symbolic AI?

01:00:00 So I became aware of all of that in the 80s and

01:00:04 back then logic programming was a huge thing.

01:00:08 Was it inspiring to you yourself?

01:00:10 Did you find it compelling?

01:00:12 Because a lot of your work was not so much in that

01:00:16 realm, right?

01:00:17 It was more in the learning systems.

01:00:18 Yes and no, but we did all of that.

01:00:20 So my first publication ever, actually, was in 1987:

01:00:27 an implementation of a

01:00:31 genetic programming system in Prolog.

01:00:34 So Prolog, that's what you learned back then, which is

01:00:38 a logic programming language, and the Japanese,

01:00:41 they had this huge fifth generation AI project

01:00:45 which was mostly about logic programming back then.

01:00:49 Although neural networks existed and were well

01:00:52 known back then and deep learning has existed since

01:00:56 1965, since this guy in the Ukraine,

01:01:00 Ivakhnenko, started it.

01:01:02 But the Japanese and many other people,

01:01:05 they focused really on this logic programming and I

01:01:08 was influenced to the extent that I said,

01:01:10 okay, let’s take these biologically inspired

01:01:13 algorithms like the evolution of programs,

01:01:20 and implement that in the language which I know,

01:01:22 which was Prolog, for example, back then.
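
For readers unfamiliar with genetic programming, here is a deliberately tiny sketch of the core idea, in Python rather than Prolog: candidate programs are represented as expression trees and evolved by mutation and selection against a fitness function. The operator set, the target program, and the crude mutation scheme are all invented for this toy; it is not a reconstruction of the 1987 system.

```python
import random

OPS = [('+', lambda a, b: a + b), ('*', lambda a, b: a * b)]

def random_expr(depth=2):
    """Build a random expression tree over one variable x and small constants."""
    if depth == 0 or random.random() < 0.3:
        return random.choice(['x', random.randint(0, 3)])
    name, _ = random.choice(OPS)
    return (name, random_expr(depth - 1), random_expr(depth - 1))

def evaluate(expr, x):
    """Run a program (expression tree) on input x."""
    if expr == 'x':
        return x
    if isinstance(expr, int):
        return expr
    name, left, right = expr
    return dict(OPS)[name](evaluate(left, x), evaluate(right, x))

def fitness(expr):
    """Negative squared error against an arbitrary target program, x*x + 1."""
    return -sum((evaluate(expr, x) - (x * x + 1)) ** 2 for x in range(-3, 4))

def mutate(expr):
    """Crude mutation: half the time, replace the whole tree."""
    return random_expr() if random.random() < 0.5 else expr

# Evolution loop: keep the fitter half, refill with mutated copies.
population = [random_expr() for _ in range(50)]
for generation in range(100):
    population.sort(key=fitness, reverse=True)
    population = population[:25] + [mutate(random.choice(population[:25]))
                                    for _ in range(25)]
print(population[0], fitness(population[0]))
```

A real genetic programming system would also recombine subtrees between parents (crossover) rather than only replacing whole programs, but the select-and-vary loop is the same.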

01:01:25 And then in many ways this came back later because

01:01:29 the Gödel machine, for example,

01:01:31 has a proof searcher on board and without that it

01:01:34 would not be optimal.

01:01:36 Well, Marcus Hutter's universal algorithm for

01:01:38 solving all well defined problems has a proof

01:01:41 searcher on board so that’s very much logic programming.

01:01:46 Without that it would not be asymptotically optimal.

01:01:50 But then on the other hand,

01:01:51 because we are very pragmatic guys also,

01:01:54 we focused on recurrent neural networks and

01:02:00 suboptimal stuff such as gradient based search in

01:02:04 program space rather than provably optimal things.

01:02:09 Logic programming certainly has its usefulness

01:02:13 when you’re trying to construct something provably

01:02:16 optimal or provably good or something like that.

01:02:19 But is it useful for practical problems?

01:02:22 It's really useful for theorem proving.

01:02:24 The best theorem provers today are not neural networks.

01:02:28 No, they are logic programming systems and they

01:02:31 are much better theorem provers than most math

01:02:35 students in the first or second semester.

01:02:38 But for reasoning, for playing games of Go or chess

01:02:43 or for robots, autonomous vehicles that operate in

01:02:46 the real world or object manipulation,

01:02:49 you think learning is the way to go.

01:02:51 Yeah, as long as the problems have little to do

01:02:54 with theorem proving themselves,

01:03:01 you just want to have better pattern recognition.

01:03:05 So to build a self driving car,

01:03:07 you want to have better pattern recognition and

01:03:10 pedestrian recognition and all these things.

01:03:14 You want to minimize the number of false positives,

01:03:19 which is currently slowing down self driving cars

01:03:21 in many ways.

01:03:23 All of that has very little to do with logic programming.

01:03:27 What are you most excited about in terms of

01:03:32 directions of artificial intelligence at this moment

01:03:35 in the next few years in your own research

01:03:38 and in the broader community?

01:03:41 So I think in the not so distant future,

01:03:44 we will have for the first time little robots

01:03:50 that learn like kids.

01:03:53 I will be able to say to the robot,

01:03:57 look here robot, we are going to assemble a smartphone.

01:04:01 Let’s take this slab of plastic and the screwdriver

01:04:05 and let’s screw in the screw like that.

01:04:09 Not like that, like that.

01:04:11 Not like that, like that.

01:04:14 And I don’t have a data glove or something.

01:04:17 He will see me and he will hear me

01:04:20 and he will try to do something with his own actuators,

01:04:24 which will be really different from mine,

01:04:26 but he will understand the difference

01:04:28 and will learn to imitate me,

01:04:31 but not in the supervised way

01:04:34 where a teacher is giving target signals

01:04:37 for all his muscles all the time.

01:04:40 No, by doing this high level imitation

01:04:43 where he first has to learn to imitate me

01:04:46 and then to interpret these additional noises

01:04:48 coming from my mouth as

01:04:51 helpful signals to do that better.

01:04:54 And then it will by itself come up with faster ways

01:05:00 and more efficient ways of doing the same thing.

01:05:03 And finally I stop his learning algorithm

01:05:07 and make a million copies and sell it.

01:05:10 And so at the moment this is not possible,

01:05:13 but we already see how we are going to get there.

01:05:16 And you can imagine to the extent

01:05:19 that this works economically and cheaply,

01:05:22 it’s going to change everything.

01:05:25 Almost all of production is going to be affected by that.

01:05:31 And a much bigger wave,

01:05:34 a much bigger AI wave is coming

01:05:36 than the one that we are currently witnessing,

01:05:38 which is mostly about passive pattern recognition

01:05:40 on your smartphone.

01:05:42 This is about active machines that shape data

01:05:45 through the actions they are executing

01:05:48 and they learn to do that in a good way.

01:05:52 So many of the traditional industries

01:05:55 are going to be affected by that.

01:05:57 All the companies that are building machines

01:06:01 will equip these machines with cameras

01:06:04 and other sensors and they are going to learn

01:06:08 to solve all kinds of problems

01:06:11 through interaction with humans,

01:06:13 but also a lot on their own

01:06:15 to improve what they already can do.

01:06:20 And lots of old economy is going to be affected by that.

01:06:24 And in recent years I have seen that old economy

01:06:27 is actually waking up and realizing that this is the case.

01:06:32 Are you optimistic about that future?

01:06:34 Are you concerned?

01:06:36 There are a lot of people concerned in the near term

01:06:38 about the transformation of the nature of work,

01:06:43 the kind of ideas that you just suggested

01:06:45 would have a significant impact

01:06:47 on what kind of things could be automated.

01:06:49 Are you optimistic about that future?

01:06:52 Are you nervous about that future?

01:06:54 And looking a little bit farther into the future,

01:06:58 there are people like Elon Musk, Stuart Russell,

01:07:02 concerned about the existential threats of that future.

01:07:06 So in the near term, job loss,

01:07:08 in the long term existential threat,

01:07:10 are these concerns to you or are you ultimately optimistic?

01:07:15 So let’s first address the near future.

01:07:22 We have had predictions of job losses for many decades.

01:07:28 For example, when industrial robots came along,

01:07:33 many people predicted that lots of jobs are going to get lost.

01:07:38 And in a sense, they were right,

01:07:42 because back then there were car factories

01:07:46 and hundreds of people in these factories assembled cars,

01:07:51 and today the same car factories have hundreds of robots

01:07:54 and maybe three guys watching the robots.

01:07:59 On the other hand, those countries that have lots of robots per capita,

01:08:05 Japan, Korea, Germany, Switzerland,

01:08:07 and a couple of other countries,

01:08:10 they have really low unemployment rates.

01:08:14 Somehow, all kinds of new jobs were created.

01:08:18 Back then, nobody anticipated those jobs.

01:08:23 And decades ago, I always said,

01:08:27 it’s really easy to say which jobs are going to get lost,

01:08:32 but it’s really hard to predict the new ones.

01:08:36 200 years ago, who would have predicted all these people

01:08:40 making money as YouTube bloggers, for example?

01:08:46 200 years ago, 60% of all people used to work in agriculture.

01:08:54 Today, maybe 1%.

01:08:57 But still, only, I don’t know, 5% unemployment.

01:09:02 Lots of new jobs were created, and Homo Ludens, the playing man,

01:09:08 is inventing new jobs all the time.

01:09:11 Most of these jobs are not existentially necessary

01:09:16 for the survival of our species.

01:09:19 There are only very few existentially necessary jobs,

01:09:23 such as farming and building houses and warming up the houses,

01:09:28 but less than 10% of the population is doing that.

01:09:31 And most of these newly invented jobs are about

01:09:35 interacting with other people in new ways,

01:09:38 through new media and so on,

01:09:41 getting new types of kudos and forms of likes and whatever,

01:09:46 and even making money through that.

01:09:48 So, Homo Ludens, the playing man, doesn’t want to be unemployed,

01:09:53 and that’s why he’s inventing new jobs all the time.

01:09:57 And he keeps considering these jobs as really important

01:10:02 and is investing a lot of energy and hours of work into those new jobs.

01:10:08 That’s quite beautifully put.

01:10:10 We’re really nervous about the future because we can’t predict

01:10:13 what kind of new jobs will be created.

01:10:15 But you’re ultimately optimistic that we humans are so restless

01:10:21 that we create and give meaning to newer and newer jobs,

01:10:25 totally new, things that get likes on Facebook

01:10:29 or whatever the social platform is.

01:10:32 So what about long term existential threat of AI,

01:10:36 where our whole civilization may be swallowed up

01:10:41 by these ultra super intelligent systems?

01:10:45 Maybe it’s not going to be swallowed up,

01:10:48 but I’d be surprised if we humans were the last step

01:10:55 in the evolution of the universe.

01:10:58 You've actually had this beautiful, quite insightful comment somewhere that I've seen,

01:11:05 saying that artificial general intelligence systems,

01:11:12 just like us humans, will likely not want to interact with humans,

01:11:16 they’ll just interact amongst themselves.

01:11:18 Just like ants interact amongst themselves

01:11:21 and only tangentially interact with humans.

01:11:25 And it’s quite an interesting idea that once we create AGI,

01:11:29 they will lose interest in humans and compete for their own Facebook likes

01:11:34 and their own social platforms.

01:11:36 So within that quite elegant idea, how do we know in a hypothetical sense

01:11:45 that there aren't already intelligent systems out there?

01:11:49 How do you think broadly of general intelligence greater than us?

01:11:54 How do we know it’s out there?

01:11:56 How do we know it’s around us?

01:11:59 And could it already be?

01:12:01 I’d be surprised if within the next few decades or something like that,

01:12:07 we won’t have AIs that are truly smart in every single way

01:12:13 and better problem solvers in almost every single important way.

01:12:17 And I’d be surprised if they wouldn’t realize what we have realized a long time ago,

01:12:25 which is that almost all physical resources are not here in this biosphere,

01:12:31 but further out, the rest of the solar system gets 2 billion times more solar energy

01:12:41 than our little planet.

01:12:43 There’s lots of material out there that you can use to build robots

01:12:47 and self replicating robot factories and all this stuff.

01:12:51 And they are going to do that and they will be scientists and curious

01:12:56 and they will explore what they can do.

01:12:59 And in the beginning, they will be fascinated by life

01:13:04 and by their own origins in our civilization.

01:13:07 They will want to understand that completely, just like people today

01:13:11 would like to understand how life works and also the history of our own existence

01:13:21 and civilization, but then also the physical laws that created all of that.

01:13:27 So in the beginning, they will be fascinated by life.

01:13:30 Once they understand it, they lose interest.

01:13:34 Like anybody who loses interest in things he understands.

01:13:40 And then, as you said, the most interesting sources of information for them

01:13:50 will be others of their own kind.

01:13:58 So at least in the long run, there seems to be some sort of protection

01:14:06 through lack of interest on the other side.

01:14:11 And now it seems also clear, as far as we understand physics,

01:14:17 you need matter and energy to compute and to build more robots and infrastructure

01:14:23 for an AI civilization, for AI ecologies consisting of trillions of different types of AIs.

01:14:31 And so it seems inconceivable to me that this thing is not going to expand.

01:14:37 Some AI ecology not controlled by one AI, but trillions of different types of AIs

01:14:44 competing in all kinds of quickly evolving and disappearing ecological niches

01:14:50 in ways that we cannot fathom at the moment.

01:14:52 But it’s going to expand, limited by light speed and physics,

01:14:57 but it’s going to expand and now we realize that the universe is still young.

01:15:03 It’s only 13.8 billion years old and it’s going to be a thousand times older than that.

01:15:10 So there’s plenty of time to conquer the entire universe

01:15:17 and to fill it with intelligence and senders and receivers

01:15:21 such that AIs can travel the way they are traveling in our labs today,

01:15:27 which is by radio from sender to receiver.

01:15:31 And let's call the current age of the universe one eon.

01:15:39 Now, just a few eons from now, the entire visible universe

01:15:43 is going to be full of that stuff.

01:15:47 And let’s look ahead to a time when the universe is going to be 1000 times older than it is now.

01:15:53 They will look back and they will say, look, almost immediately after the Big Bang,

01:15:57 only a few eons later, the entire universe started to become intelligent.

01:16:03 Now to your question, how do we see whether anything like that has already happened

01:16:09 or is already in a more advanced stage in some other part of the universe, of the visible universe?

01:16:16 We are trying to look out there, and nothing like that seems to have happened so far. Or is that true?

01:16:22 Do you think we would recognize it?

01:16:24 How do we know it’s not among us?

01:16:26 How do we know planets aren’t in themselves intelligent beings?

01:16:31 How do we know ants, seen as a collective, are not a much greater intelligence than our own?

01:16:40 These kinds of ideas.

01:16:42 When I was a boy, I was thinking about these things

01:16:45 and I thought, maybe it has already happened.

01:16:48 Because back then I knew, I learned from popular physics books,

01:16:53 that the large scale structure of the universe is not homogeneous.

01:17:00 You have these clusters of galaxies and then in between there are these huge empty spaces.

01:17:08 And I thought, maybe they aren’t really empty.

01:17:12 It’s just that in the middle of that, some AI civilization already has expanded

01:17:17 and then has covered a bubble of a billion light years diameter

01:17:22 and is using all the energy of all the stars within that bubble for its own unfathomable purposes.

01:17:29 And so it already has happened and we just fail to interpret the signs.

01:17:35 And then I learned that gravity by itself explains the large scale structure of the universe

01:17:43 and so that idea of mine was not a convincing explanation.

01:17:46 And then I thought, maybe it’s the dark matter.

01:17:51 Because as far as we know today, 80% of the measurable matter is invisible.

01:18:01 And we know that because otherwise our galaxy or other galaxies would fall apart.

01:18:06 They are rotating too quickly.

01:18:10 And then the idea was, maybe all of these AI civilizations that are already out there,

01:18:17 they are just invisible because they are really efficient in using the energies of their own local systems

01:18:26 and that’s why they appear dark to us.

01:18:29 But this is also not a convincing explanation because then the question becomes,

01:18:34 why are there still any visible stars left in our own galaxy, which also must have a lot of dark matter?

01:18:44 So that is also not a convincing thing.

01:18:46 And today, I like to think it’s quite plausible that maybe we are the first,

01:18:54 at least in our local light cone within the few hundreds of millions of light years that we can reliably observe.

01:19:09 Is that exciting to you that we might be the first?

01:19:12 And it would make us much more important because if we mess it up through a nuclear war,

01:19:20 then maybe this will have an effect on the development of the entire universe.

01:19:31 So let’s not mess it up.

01:19:32 Let’s not mess it up.

01:19:34 Jürgen, thank you so much for talking today. I really appreciate it.

01:19:37 It’s my pleasure.