Transcript
00:00:00 The following is a conversation with Jürgen Schmidhuber.
00:00:03 He’s the co-director of the Swiss AI Lab IDSIA
00:00:06 and a co-creator of long short-term memory networks.
00:00:10 LSTMs are used in billions of devices today
00:00:13 for speech recognition, translation, and much more.
00:00:17 Over 30 years, he has proposed a lot of interesting
00:00:20 out of the box ideas on meta learning, adversarial networks,
00:00:24 computer vision, and even a formal theory of quote,
00:00:28 creativity, curiosity, and fun.
00:00:32 This conversation is part of the MIT course
00:00:34 on artificial general intelligence
00:00:36 and the artificial intelligence podcast.
00:00:38 If you enjoy it, subscribe on YouTube, iTunes,
00:00:41 or simply connect with me on Twitter
00:00:43 at Lex Fridman spelled F R I D.
00:00:47 And now here’s my conversation with Jürgen Schmidhuber.
00:00:53 Early on you dreamed of AI systems
00:00:55 that self improve recursively.
00:00:58 When was that dream born?
00:01:01 When I was a baby.
00:01:03 No, that’s not true.
00:01:04 When I was a teenager.
00:01:06 And what was the catalyst for that birth?
00:01:09 What was the thing that first inspired you?
00:01:12 When I was a boy, I was thinking about what to do in my life
00:01:20 and then I thought the most exciting thing
00:01:23 is to solve the riddles of the universe.
00:01:28 And that means you have to become a physicist.
00:01:30 However, then I realized that there’s something even grander.
00:01:35 You can try to build a machine
00:01:39 that isn’t really a machine any longer
00:01:42 that learns to become a much better physicist
00:01:44 than I could ever hope to be.
00:01:47 And that’s how I thought maybe I can multiply
00:01:50 my tiny little bit of creativity into infinity.
00:01:54 But ultimately that creativity will be multiplied
00:01:57 to understand the universe around us.
00:01:59 That’s the curiosity for that mystery that drove you.
00:02:05 Yes, so if you can build a machine
00:02:08 that learns to solve more and more complex problems
00:02:13 and becomes a more and more general problem solver
00:02:16 then you basically have solved all the problems,
00:02:22 at least all the solvable problems.
00:02:26 So how do you think, what does the mechanism
00:02:28 for that kind of general solver look like?
00:02:31 Obviously we don’t quite yet have one
00:02:34 or know how to build one but we have ideas
00:02:37 and you have had throughout your career
00:02:39 several ideas about it.
00:02:40 So how do you think about that mechanism?
00:02:43 So in the 80s, I thought about how to build this machine
00:02:48 that learns to solve all these problems
00:02:51 that I cannot solve myself.
00:02:54 And I thought it is clear it has to be a machine
00:02:57 that not only learns to solve this problem here
00:03:00 and this problem here but it also has to learn
00:03:04 to improve the learning algorithm itself.
00:03:09 So it has to have the learning algorithm
00:03:12 in a representation that allows it to inspect it
00:03:15 and modify it such that it can come up
00:03:19 with a better learning algorithm.
00:03:21 So I call that meta learning, learning to learn
00:03:24 and recursive self improvement
00:03:26 that is really the pinnacle of that
00:03:28 where you then not only learn how to improve
00:03:34 on that problem and on that
00:03:36 but you also improve the way the machine improves
00:03:40 and you also improve the way it improves
00:03:42 the way it improves itself.
00:03:44 And that was my 1987 diploma thesis
00:03:47 which was all about that hierarchy of meta learners
00:03:50 that have no computational limits
00:03:57 except for the well known limits that Gödel identified
00:04:01 in 1931 and for the limits of physics.
00:04:06 In the recent years, meta learning has gained popularity
00:04:10 in a specific kind of form.
00:04:12 You’ve talked about how that’s not really meta learning
00:04:16 with neural networks, that’s more basic transfer learning.
00:04:21 Can you talk about the difference
00:04:22 between the big general meta learning
00:04:25 and a more narrow sense of meta learning
00:04:27 the way it’s used today, the way it’s talked about today?
00:04:30 Let’s take the example of a deep neural network
00:04:33 that has learned to classify images
00:04:37 and maybe you have trained that network
00:04:40 on 100 different databases of images.
00:04:43 And now a new database comes along
00:04:48 and you want to quickly learn the new thing as well.
00:04:53 So one simple way of doing that is you take the network
00:04:57 which already knows 100 types of databases
00:05:02 and then you just take the top layer of that
00:05:06 and you retrain that using the new label data
00:05:11 that you have in the new image database.
00:05:14 And then it turns out that it really, really quickly
00:05:17 can learn that too, one shot basically
00:05:20 because from the first 100 data sets,
00:05:24 it already has learned so much about computer vision
00:05:27 that it can reuse that and that is then almost good enough
00:05:31 to solve the new task except you need a little bit
00:05:34 of adjustment on the top.
00:05:38 So that is transfer learning.
00:05:41 And it has been done in principle for many decades.
00:05:44 People have done similar things for decades.
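For concreteness, here is a minimal sketch of the retrain-the-top-layer recipe just described, assuming PyTorch and a recent torchvision; the class count is a hypothetical placeholder:

```python
import torch
import torch.nn as nn
from torchvision import models

# Load a backbone pretrained on earlier image databases.
model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)

# Freeze everything the network already knows from the old databases.
for param in model.parameters():
    param.requires_grad = False

# Replace the top layer with a fresh one for the new image database.
num_new_classes = 10  # hypothetical
model.fc = nn.Linear(model.fc.in_features, num_new_classes)

# Only the fresh top layer is trained on the new labeled data.
optimizer = torch.optim.Adam(model.fc.parameters(), lr=1e-3)
```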
00:05:48 Meta learning too, meta learning is about
00:05:51 having the learning algorithm itself
00:05:55 open to introspection by the system that is using it
00:06:01 and also open to modification such that the learning system
00:06:06 has an opportunity to modify
00:06:09 any part of the learning algorithm
00:06:12 and then evaluate the consequences of that modification
00:06:16 and then learn from that to create
00:06:21 a better learning algorithm and so on recursively.
00:06:25 So that’s a very different animal
00:06:28 where you are opening the space of possible learning
00:06:32 algorithms to the learning system itself.
00:06:35 Right, so you’ve, like in the 2004 paper, you described
00:06:40 Gödel machines, programs that rewrite themselves, right?
00:06:44 Philosophically and even in your paper, mathematically,
00:06:47 these are really compelling ideas but practically,
00:06:52 do you see these self referential programs
00:06:55 being successful in the near term to having an impact
00:06:59 where sort of it demonstrates to the world
00:07:02 that this direction is a good one to pursue
00:07:07 in the near term?
00:07:08 Yes, we had these two different types
00:07:11 of fundamental research,
00:07:13 how to build a universal problem solver,
00:07:15 one basically exploiting proof search
00:07:22 and things like that that you need to come up with
00:07:25 asymptotically optimal, theoretically optimal
00:07:30 self improvers and problem solvers.
00:07:34 However, one has to admit that through this proof search
00:07:40 comes in an additive constant, an overhead,
00:07:44 an additive overhead that vanishes in comparison
00:07:50 to what you have to do to solve large problems.
00:07:55 However, for many of the small problems
00:07:58 that we want to solve in our everyday life,
00:08:00 we cannot ignore this constant overhead
00:08:03 and that’s why we also have been doing other things,
00:08:08 non universal things such as recurrent neural networks
00:08:12 which are trained by gradient descent
00:08:15 and local search techniques which aren’t universal at all,
00:08:18 which aren’t provably optimal at all,
00:08:21 like the other stuff that we did,
00:08:22 but which are much more practical
00:08:25 as long as we only want to solve the small problems
00:08:28 that we are typically trying to solve
00:08:33 in this environment here.
00:08:35 So the universal problem solvers like the Gödel machine,
00:08:38 but also Marcus Hutter’s fastest way
00:08:42 of solving all possible problems,
00:08:44 which he developed around 2002 in my lab,
00:08:49 they are associated with these constant overheads
00:08:52 for proof search, which guarantees that the thing
00:08:55 that you’re doing is optimal.
00:08:57 For example, there is this fastest way
00:09:01 of solving all problems with a computable solution,
00:09:05 which is due to Marcus, Marcus Hutter,
00:09:08 and to explain what’s going on there,
00:09:12 let’s take traveling salesman problems.
00:09:15 With traveling salesman problems,
00:09:17 you have a number of cities
00:09:21 and you try to find the shortest path
00:09:23 through all these cities without visiting any city twice.
00:09:29 And nobody knows the fastest way
00:09:32 of solving traveling salesman problems, TSPs,
00:09:38 but let’s assume there is a method of solving them
00:09:41 within N to the five operations
00:09:45 where N is the number of cities.
00:09:48 Then the universal method of Marcus
00:09:53 is going to solve the same traveling salesman problem
00:09:57 also within N to the five steps,
00:10:00 plus O of one, plus a constant number of steps
00:10:04 that you need for the proof searcher,
00:10:07 which you need to show that this particular class
00:10:12 of problems, the traveling salesman problems,
00:10:15 can be solved within a certain time frame,
00:10:17 solved within a certain time bound,
00:10:20 within order N to the five steps, basically,
00:10:24 and this additive constant doesn’t depend on N,
00:10:28 which means as N is getting larger and larger,
00:10:32 as you have more and more cities,
00:10:35 the constant overhead pales in comparison,
00:10:38 and that means that almost all large problems are solved
00:10:44 in the best possible way.
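In symbols, a hedged restatement of that bound (the exponent follows the example above; the constant is whatever the proof searcher costs):

```latex
% If some method solves size-n TSP instances within n^5 steps, the
% universal method matches it up to an additive proof-search constant:
T_{\mathrm{universal}}(n) \;\le\; n^5 + c_{\mathrm{proof}},
\qquad c_{\mathrm{proof}} \text{ independent of } n,
\qquad \lim_{n \to \infty} \frac{c_{\mathrm{proof}}}{n^5} = 0.
```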
00:10:45 Today, we already have a universal problem solver like that.
00:10:50 However, it’s not practical because the overhead,
00:10:54 the constant overhead is so large
00:10:57 that for the small kinds of problems
00:11:00 that we want to solve in this little biosphere.
00:11:04 By the way, when you say small,
00:11:06 you’re talking about things that fall
00:11:08 within the constraints of our computational systems.
00:11:10 So they can seem quite large to us mere humans, right?
00:11:14 That’s right, yeah.
00:11:15 So they seem large and even unsolvable
00:11:19 in a practical sense today,
00:11:21 but they are still small compared to almost all problems
00:11:24 because almost all problems are large problems,
00:11:28 which are much larger than any constant.
00:11:31 Do you find it useful as a person
00:11:34 who has dreamed of creating a general learning system,
00:11:38 has worked on creating one,
00:11:39 has done a lot of interesting ideas there,
00:11:42 to think about P versus NP,
00:11:46 this formalization of how hard problems are,
00:11:50 how they scale,
00:11:52 this kind of worst case analysis type of thinking,
00:11:55 do you find that useful?
00:11:56 Or is it only just a mathematical,
00:12:00 it’s a set of mathematical techniques
00:12:02 to give you intuition about what’s good and bad.
00:12:05 So P versus NP, that’s super interesting
00:12:09 from a theoretical point of view.
00:12:11 And in fact, as you are thinking about that problem,
00:12:14 you can also get inspiration
00:12:17 for better practical problem solvers.
00:12:21 On the other hand, we have to admit
00:12:23 that at the moment, the best practical problem solvers
00:12:28 for all kinds of problems that we are now solving
00:12:31 through what is called AI at the moment,
00:12:33 they are not of the kind
00:12:36 that is inspired by these questions.
00:12:38 There we are using general purpose computers
00:12:42 such as recurrent neural networks,
00:12:44 but we have a search technique
00:12:46 which is just local search gradient descent
00:12:50 to try to find a program
00:12:51 that is running on these recurrent networks,
00:12:54 such that it can solve some interesting problems
00:12:58 such as speech recognition or machine translation
00:13:01 and something like that.
00:13:03 And there is very little theory behind the best solutions
00:13:08 that we have at the moment that can do that.
00:13:10 Do you think that needs to change?
00:13:12 Do you think that will change?
00:13:14 Or can we go, can we create general intelligent systems
00:13:17 without ever really proving that that system is intelligent
00:13:20 in some kind of mathematical way,
00:13:22 solving machine translation perfectly
00:13:25 or something like that,
00:13:26 within some kind of syntactic definition of a language,
00:13:29 or can we just be super impressed
00:13:31 by the thing working extremely well and that’s sufficient?
00:13:35 There’s an old saying,
00:13:36 and I don’t know who brought it up first,
00:13:39 which says, there’s nothing more practical
00:13:42 than a good theory.
00:13:43 And a good theory of problem solving
00:13:52 under limited resources,
00:13:54 like here in this universe or on this little planet,
00:13:58 has to take into account these limited resources.
00:14:01 And so probably we are lacking a theory,
00:14:08 which is related to what we already have,
00:14:10 these asymptotically optimal problem solvers,
00:14:14 which tells us what we need in addition to that
00:14:18 to come up with a practically optimal problem solver.
00:14:22 So I believe we will have something like that.
00:14:27 And maybe just a few little tiny twists are necessary
00:14:30 to change what we already have,
00:14:34 to come up with that as well.
00:14:36 As long as we don’t have that,
00:14:37 we admit that we are taking suboptimal ways:
00:14:42 recurrent neural networks and long short term memory
00:14:45 equipped with local search techniques.
00:14:50 And we are happy that it works better
00:14:53 than any competing methods,
00:14:55 but that doesn’t mean that we think we are done.
00:15:00 You’ve said that an AGI system
00:15:02 will ultimately be a simple one.
00:15:05 A general intelligence system
00:15:06 will ultimately be a simple one.
00:15:08 Maybe a pseudocode of a few lines
00:15:10 will be able to describe it.
00:15:11 Can you talk through your intuition behind this idea,
00:15:16 why you feel that at its core,
00:15:21 intelligence is a simple algorithm?
00:15:26 Experience tells us that the stuff that works best
00:15:31 is really simple.
00:15:33 So the asymptotically optimal ways of solving problems,
00:15:37 if you look at them,
00:15:38 they’re just a few lines of code, it’s really true.
00:15:41 Although they have these amazing properties,
00:15:44 just a few lines of code.
00:15:45 Then the most promising and most useful practical things,
00:15:53 maybe don’t have this proof of optimality
00:15:56 associated with them.
00:15:57 However, they are also just a few lines of code.
00:16:00 The most successful recurrent neural networks,
00:16:05 you can write them down in five lines of pseudocode.
00:16:08 That’s a beautiful, almost poetic idea,
00:16:10 but what you’re describing there
00:16:15 is the lines of pseudocode are sitting on top
00:16:18 of layers and layers of abstractions in a sense.
00:16:22 So you’re saying at the very top,
00:16:25 it’ll be a beautifully written sort of algorithm.
00:16:31 But do you think that there’s many layers of abstractions
00:16:33 we have to first learn to construct?
00:16:36 Yeah, of course, we are building on all these
00:16:40 great abstractions that people have invented over the millennia,
00:16:45 such as matrix multiplications and real numbers
00:16:50 and basic arithmetics and calculus
00:16:56 and derivations of error functions
00:17:00 and derivatives of error functions and stuff like that.
00:17:04 So without that language that greatly simplifies
00:17:09 our way of thinking about these problems,
00:17:13 we couldn’t do anything.
00:17:14 So in that sense, as always,
00:17:16 we are standing on the shoulders of the giants
00:17:19 who in the past simplified the problem
00:17:24 of problem solving so much
00:17:26 that now we have a chance to do the final step.
00:17:29 So the final step will be a simple one.
00:17:33 If we take a step back through all of human civilization
00:17:36 and just the universe in general,
00:17:40 how do you think about evolution
00:17:41 and what if creating a universe
00:17:44 is required to achieve this final step?
00:17:47 What if going through the very painful
00:17:50 and inefficient process of evolution is needed
00:17:53 to come up with this set of abstractions
00:17:55 that ultimately lead to intelligence?
00:17:57 Do you think there’s a shortcut
00:18:00 or do you think we have to create something like our universe
00:18:04 in order to create something like human level intelligence?
00:18:09 So far, the only example we have is this one,
00:18:13 this universe in which we are living.
00:18:14 Do you think we can do better?
00:18:20 Maybe not, but we are part of this whole process.
00:18:24 So apparently, so it might be the case
00:18:29 that the code that runs the universe
00:18:32 is really, really simple.
00:18:33 Everything points to that possibility
00:18:35 because gravity and other basic forces
00:18:39 are really simple laws that can be easily described
00:18:43 also in just a few lines of code basically.
00:18:46 And then there are these other events,
00:18:51 the apparently random events
00:18:54 in the history of the universe,
00:18:55 which as far as we know at the moment
00:18:58 don’t have a compact code, but who knows?
00:19:00 Maybe somebody in the near future
00:19:02 is going to figure out the pseudo random generator
00:19:06 which is computing whether the measurement
00:19:11 of that spin up or down thing here
00:19:15 is going to be positive or negative.
00:19:17 Underlying quantum mechanics.
00:19:19 Yes.
00:19:20 Do you ultimately think quantum mechanics
00:19:22 is a pseudo random number generator?
00:19:24 So it’s all deterministic.
00:19:26 There’s no randomness in our universe.
00:19:28 Does God play dice?
00:19:31 So a couple of years ago, a famous physicist,
00:19:34 quantum physicist, Anton Zeilinger,
00:19:37 he wrote an essay in Nature
00:19:40 and it started more or less like that.
00:19:45 One of the fundamental insights of the 20th century
00:19:50 was that the universe is fundamentally random
00:19:57 on the quantum level.
00:20:00 And that whenever you measure spin up or down
00:20:04 or something like that,
00:20:05 a new bit of information enters the history of the universe.
00:20:12 And while I was reading that,
00:20:13 I was already typing the response,
00:20:16 and they had to publish it
00:20:17 because I was right: there is no evidence,
00:20:21 no physical evidence for that.
00:20:25 So there’s an alternative explanation
00:20:27 where everything that we consider random
00:20:30 is actually pseudo random,
00:20:33 such as the decimal expansion of pi,
00:20:35 3.141 and so on, which looks random, but isn’t.
00:20:41 So pi is interesting because every three digits
00:20:45 sequence, every sequence of three digits
00:20:50 appears roughly one in a thousand times.
00:20:53 And every five digit sequence
00:20:57 appears roughly one in 10,000 times,
00:21:01 what you would expect if it was random.
00:21:04 But there’s a very short algorithm,
00:21:06 a short program that computes all of that.
00:21:09 So it’s extremely compressible.
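This near-uniformity of digit blocks is easy to check empirically; a small sketch, assuming the mpmath library is available:

```python
from collections import Counter
from mpmath import mp

# Get 100,000 decimal digits of pi (string "3.1415...", then strip "3.").
mp.dps = 100_010
digits = mp.nstr(mp.pi, 100_000).replace("3.", "", 1)

# Count every overlapping three-digit sequence.
counts = Counter(digits[i:i + 3] for i in range(len(digits) - 2))
expected = (len(digits) - 2) / 1000  # ~1 in 1000 if the digits look random
print(counts.most_common(3), expected)
```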
00:21:11 And who knows, maybe tomorrow,
00:21:12 somebody, some grad student at CERN goes back
00:21:15 over all these data points, beta decay and whatever,
00:21:20 and figures out, oh, it’s the second billion digits of pi
00:21:24 or something like that.
00:21:26 We don’t have any fundamental reason at the moment
00:21:29 to believe that this is truly random
00:21:33 and not just a deterministic video game.
00:21:36 If it was a deterministic video game,
00:21:38 it would be much more beautiful.
00:21:40 Because beauty is simplicity.
00:21:43 And many of the basic laws of the universe,
00:21:47 like gravity and the other basic forces are very simple.
00:21:51 So very short programs can explain what these are doing.
00:21:56 And it would be awful and ugly.
00:22:00 The universe would be ugly.
00:22:01 The history of the universe would be ugly
00:22:04 if for the extra things, the random,
00:22:06 the seemingly random data points that we get all the time,
00:22:11 that we really need a huge number of extra bits
00:22:15 to describe all these extra bits of information.
00:22:22 So as long as we don’t have evidence
00:22:24 that there is no short program
00:22:26 that computes the entire history of the entire universe,
00:22:31 we are, as scientists, compelled to look further
00:22:36 for that shortest program.
00:22:39 Your intuition says there exists a program
00:22:43 that can backtrack to the creation of the universe.
00:22:47 Yeah.
00:22:48 So it can give the shortest path
00:22:49 to the creation of the universe.
00:22:50 Yes.
00:22:51 Including all the entanglement things
00:22:54 and all the spin up and down measurements
00:22:57 that have taken place over the past 13.8 billion years.
00:23:06 So we don’t have a proof that it is random.
00:23:11 We don’t have a proof that it is compressible
00:23:15 to a short program.
00:23:16 But as long as we don’t have that proof,
00:23:18 we are obliged as scientists to keep looking
00:23:21 for that simple explanation.
00:23:23 Absolutely.
00:23:24 So you said the simplicity is beautiful or beauty is simple.
00:23:27 Either one works.
00:23:29 But you also work on curiosity, discovery,
00:23:34 the romantic notion of randomness, of serendipity,
00:23:39 of being surprised by things around you.
00:23:45 In our poetic notion of reality,
00:23:49 we think that we as humans
00:23:54 require randomness.
00:23:56 So you don’t find randomness beautiful.
00:23:59 You find simple determinism beautiful.
00:24:04 Yeah.
00:24:05 Okay.
00:24:07 So why?
00:24:08 Why?
00:24:09 Because the explanation becomes shorter.
00:24:13 A universe that is compressible to a short program
00:24:18 is much more elegant and much more beautiful
00:24:22 than another one, which needs an almost infinite
00:24:25 number of bits to be described.
00:24:28 As far as we know, many things that are happening
00:24:32 in this universe are really simple in terms of
00:24:35 short programs that compute gravity
00:24:38 and the interaction between elementary particles and so on.
00:24:43 So all of that seems to be very, very simple.
00:24:45 Every electron seems to reuse the same subprogram
00:24:50 all the time, as it is interacting with
00:24:52 other elementary particles.
00:24:58 If we now require an extra oracle injecting
00:25:05 new bits of information all the time for these
00:25:08 extra things which are currently not understood,
00:25:11 such as beta decay, then the whole description
00:25:22 length of the data that we can observe of the
00:25:26 history of the universe would become much longer
00:25:31 and therefore uglier.
00:25:33 And uglier.
00:25:34 Again, simplicity is elegant and beautiful.
00:25:38 The history of science is a history of compression progress.
00:25:42 Yes, so you’ve described sort of as we build up
00:25:47 abstractions and you’ve talked about the idea
00:25:50 of compression.
00:25:52 How do you see this, the history of science,
00:25:55 the history of humanity, our civilization,
00:25:58 and life on Earth as some kind of path towards
00:26:02 greater and greater compression?
00:26:04 What do you mean by that?
00:26:05 How do you think about that?
00:26:06 Indeed, the history of science is a history of
00:26:10 compression progress.
00:26:12 What does that mean?
00:26:14 Hundreds of years ago there was an astronomer
00:26:17 whose name was Kepler and he looked at the data
00:26:21 points that he got by watching planets move.
00:26:25 And then he had all these data points and
00:26:27 suddenly it turned out that he can greatly
00:26:30 compress the data by predicting it through an
00:26:36 ellipse law.
00:26:38 So it turns out that all these data points are
00:26:40 more or less on ellipses around the sun.
00:26:45 And another guy came along whose name was
00:26:48 Newton and before him Hooke.
00:26:51 And they said the same thing that is making
00:26:55 these planets move like that is what makes the
00:27:00 apples fall down.
00:27:02 And it also holds for stones and for all kinds
00:27:08 of other objects.
00:27:11 And suddenly many, many of these observations
00:27:15 became much more compressible because as long
00:27:17 as you can predict the next thing, given what
00:27:20 you have seen so far, you can compress it.
00:27:23 And you don’t have to store that data extra.
00:27:25 This is called predictive coding.
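A toy sketch of predictive coding in that spirit: even a trivial predictor (predict that the next value equals the current one) turns a raw sequence into small, cheap-to-store residuals:

```python
# Store only the prediction errors (residuals), not the raw data.
def encode(signal):
    prev, residuals = 0, []
    for x in signal:
        residuals.append(x - prev)  # what the predictor got wrong
        prev = x
    return residuals

def decode(residuals):
    prev, signal = 0, []
    for r in residuals:
        prev += r
        signal.append(prev)
    return signal

data = [10, 11, 12, 13, 20, 21, 22]
assert decode(encode(data)) == data   # lossless
print(encode(data))                   # mostly small numbers, cheaper to store
```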
00:27:29 And then there was still something wrong with
00:27:31 that theory of the universe and you had
00:27:34 deviations from these predictions of the theory.
00:27:37 And 300 years later another guy came along
00:27:40 whose name was Einstein.
00:27:42 And he was able to explain away all these
00:27:46 deviations from the predictions of the old
00:27:50 theory through a new theory which was called
00:27:54 the general theory of relativity.
00:27:57 Which at first glance looks a little bit more
00:28:00 complicated and you have to warp space and time
00:28:03 but you can phrase it within one single
00:28:05 sentence which is no matter how fast you
00:28:08 accelerate and how hard you decelerate and no
00:28:14 matter what is the gravity in your local
00:28:18 frame of reference, light speed always looks the same.
00:28:21 And from that you can calculate all the
00:28:24 consequences.
00:28:25 So it’s a very simple thing and it allows you
00:28:27 to further compress all the observations
00:28:30 because suddenly there are hardly any
00:28:34 deviations any longer that you can measure
00:28:37 from the predictions of this new theory.
00:28:40 So all of science is a history of compression
00:28:44 progress.
00:28:45 You never arrive immediately at the shortest
00:28:48 explanation of the data but you’re making
00:28:51 progress.
00:28:52 Whenever you are making progress you have an
00:28:55 insight.
00:28:56 You see oh first I needed so many bits of
00:28:59 information to describe the data, to describe
00:29:02 my falling apples, my video of falling apples,
00:29:04 I need so many data, so many pixels have to be
00:29:07 stored.
00:29:08 But then suddenly I realize no there is a very
00:29:11 simple way of predicting the third frame in the
00:29:14 video from the first two.
00:29:16 And maybe not every little detail can be
00:29:19 predicted but more or less most of these orange
00:29:21 blobs that are coming down they accelerate in
00:29:24 the same way which means that I can greatly
00:29:27 compress the video.
00:29:28 And the amount of compression, progress, that
00:29:33 is the depth of the insight that you have at
00:29:36 that moment.
00:29:37 That’s the fun that you have, the scientific
00:29:39 fun, the fun in that discovery.
00:29:42 And we can build artificial systems that do
00:29:45 the same thing.
00:29:46 They measure the depth of their insights as they
00:29:49 are looking at the data which is coming in
00:29:51 through their own experiments and we give
00:29:54 them a reward, an intrinsic reward in proportion
00:29:58 to this depth of insight.
00:30:00 And since they are trying to maximize the
00:30:05 rewards they get they are suddenly motivated to
00:30:09 come up with new action sequences, with new
00:30:13 experiments that have the property that the data
00:30:16 that is coming in as a consequence of these
00:30:19 experiments has the property that they can learn
00:30:23 something about, see a pattern in there which
00:30:25 they hadn’t seen yet before.
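A hedged sketch of such an intrinsic reward: the agent is paid in proportion to how much its predictive world model improves on the new data (compression progress). `model.loss` is a hypothetical placeholder returning the prediction error on a batch, PyTorch assumed:

```python
import torch

def intrinsic_reward(model, optimizer, batch):
    with torch.no_grad():
        bits_before = model.loss(batch).item()  # coding cost before learning
    optimizer.zero_grad()
    model.loss(batch).backward()                # one step of model improvement
    optimizer.step()
    with torch.no_grad():
        bits_after = model.loss(batch).item()   # coding cost after learning
    return bits_before - bits_after             # compression progress = depth of insight
```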
00:30:28 So there is an idea of power play that you
00:30:31 described, a training in general problem solver
00:30:34 in this kind of way of looking for the unsolved
00:30:36 problems.
00:30:37 Yeah.
00:30:38 Can you describe that idea a little further?
00:30:40 It’s another very simple idea.
00:30:42 So normally what you do in computer science,
00:30:45 you have some guy who gives you a problem and
00:30:50 then there is a huge search space of potential
00:30:55 solution candidates and you somehow try them
00:30:59 out and you have more less sophisticated ways
00:31:02 of moving around in that search space until
00:31:07 you finally found a solution which you
00:31:10 consider satisfactory.
00:31:12 That’s what most of computer science is about.
00:31:15 Power play just goes one little step further
00:31:18 and says let’s not only search for solutions
00:31:23 to a given problem but let’s search to pairs of
00:31:28 problems and their solutions where the system
00:31:31 itself has the opportunity to phrase its own
00:31:35 problem.
00:31:37 So we are looking suddenly at pairs of
00:31:40 problems and their solutions or modifications
00:31:44 of the problem solver that is supposed to
00:31:47 generate a solution to that new problem.
00:31:51 And this additional degree of freedom allows
00:31:57 us to build curious systems that are like
00:32:00 scientists in the sense that they not only
00:32:04 try to solve and try to find answers to
00:32:07 existing questions, no they are also free to
00:32:11 pose their own questions.
00:32:13 So if you want to build an artificial scientist
00:32:15 you have to give it that freedom and power
00:32:17 play is exactly doing that.
00:32:19 So that’s a dimension of freedom that’s
00:32:22 important to have but how hard do you think
00:32:25 that, how multidimensional and difficult the
00:32:31 space of then coming up with your own questions
00:32:34 is.
00:32:35 So it’s one of the things that as human beings
00:32:37 we consider to be the thing that makes us
00:32:40 special, the intelligence that makes us special
00:32:42 is that brilliant insight that can create
00:32:46 something totally new.
00:32:48 Yes.
00:32:49 So now let’s look at the extreme case, let’s
00:32:52 look at the set of all possible problems that
00:32:56 you can formally describe which is infinite,
00:33:00 which should be the next problem that a scientist
00:33:05 or power play is going to solve.
00:33:08 Well, it should be the easiest problem that
00:33:14 goes beyond what you already know.
00:33:17 So it should be the simplest problem that the
00:33:21 current problem solver that you have, which can
00:33:23 already solve 100 problems, cannot solve
00:33:28 yet just by generalizing.
00:33:30 So it has to be new, so it has to require a
00:33:33 modification of the problem solver such that the
00:33:36 new problem solver can solve this new thing but
00:33:39 the old problem solver cannot do it and in
00:33:42 addition to that we have to make sure that the
00:33:46 problem solver doesn’t forget any of the
00:33:49 previous solutions.
00:33:50 Right.
00:33:51 And so by definition power play is now trying
00:33:54 always to search in the set of
00:33:58 pairs of problems and problem solver modifications
00:34:02 for a combination that minimizes the time to
00:34:06 achieve these criteria.
00:34:08 So it’s always trying to find the problem which
00:34:11 is easiest to add to the repertoire.
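A hedged pseudocode sketch of this search over (problem, solver modification) pairs; every helper name here is a hypothetical placeholder, not PowerPlay's actual implementation:

```python
def power_play(solver, solved):
    while True:
        for problem, modification in enumerate_pairs():  # ordered by simplicity
            candidate = modification(solver)
            if (not solver.solves(problem)               # genuinely new problem
                    and candidate.solves(problem)        # now becomes solvable
                    and all(candidate.solves(p) for p in solved)):  # no forgetting
                solver = candidate                       # adopt the modification
                solved.append(problem)
                break                                    # easiest addition found
```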
00:34:14 So just like grad students and academics and
00:34:18 researchers can spend their whole career in a
00:34:20 local minima stuck trying to come up with
00:34:24 interesting questions but ultimately doing very
00:34:26 little.
00:34:27 Do you think it’s easy in this approach of
00:34:31 looking for the simplest unsolvable problem to
00:34:33 get stuck in a local minima?
00:34:35 Is not never really discovering new, you know
00:34:40 really jumping outside of the 100 problems that
00:34:42 you’ve already solved in a genuine creative way?
00:34:47 No, because that’s the nature of power play that
00:34:50 it’s always trying to break its current
00:34:53 generalization abilities by coming up with a new
00:34:57 problem which is beyond the current horizon.
00:35:00 Just shifting the horizon of knowledge a little
00:35:04 bit out there, breaking the existing rules such
00:35:08 that the new thing becomes solvable but wasn’t
00:35:11 solvable by the old thing.
00:35:13 So like adding a new axiom like what Gödel did
00:35:17 when he came up with these new sentences, new
00:35:21 theorems that didn’t have a proof in the formal
00:35:23 system which means you can add them to the
00:35:25 repertoire hoping that they are not going to
00:35:31 damage the consistency of the whole thing.
00:35:35 So in the paper with the amazing title,
00:35:39 Formal Theory of Creativity, Fun and Intrinsic
00:35:43 Motivation, you talk about discovery as intrinsic
00:35:46 reward, so if you view humans as intelligent
00:35:50 agents, what do you think is the purpose and
00:35:53 meaning of life for us humans?
00:35:56 You’ve talked about this discovery, do you see
00:35:59 humans as an instance of power play, agents?
00:36:04 Humans are curious and that means they behave
00:36:10 like scientists, not only the official scientists
00:36:13 but even the babies behave like scientists and
00:36:16 they play around with their toys to figure out
00:36:19 how the world works and how it is responding to
00:36:22 their actions and that’s how they learn about
00:36:25 gravity and everything.
00:36:27 In 1990 we had the first systems like that which
00:36:31 would just try to play around with the environment
00:36:34 and come up with situations that go beyond what
00:36:38 they knew at that time and then get a reward for
00:36:41 creating these situations and then becoming more
00:36:44 general problem solvers and being able to understand
00:36:47 more of the world.
00:36:49 I think in principle that curiosity strategy or
00:37:01 more sophisticated versions of what I just
00:37:03 described, they are what we have built in as well
00:37:07 because evolution discovered that’s a good way of
00:37:10 exploring the unknown world and a guy who explores
00:37:13 the unknown world has a higher chance of solving
00:37:17 the mystery that he needs to survive in this world.
00:37:20 On the other hand, those guys who were too curious
00:37:24 they were weeded out as well so you have to find
00:37:27 this trade off.
00:37:28 Evolution found a certain trade off.
00:37:30 Apparently in our society there is a certain
00:37:33 percentage of extremely explorative guys and it
00:37:37 doesn’t matter if they die because many of the
00:37:40 others are more conservative.
00:37:45 It would be surprising to me if that principle of
00:37:51 artificial curiosity wouldn’t be present in almost
00:37:56 exactly the same form here.
00:37:58 In our brains.
00:38:00 You are a bit of a musician and an artist.
00:38:03 Continuing on this topic of creativity, what do you
00:38:08 think is the role of creativity and intelligence?
00:38:10 So you’ve kind of implied that it’s essential for
00:38:15 intelligence if you think of intelligence as a
00:38:18 problem solving system, as ability to solve problems.
00:38:23 But do you think it’s essential, this idea of
00:38:26 creativity?
00:38:27 We never have a program, a sub program that is
00:38:32 called creativity or something.
00:38:34 It’s just a side effect of what our problem solvers
00:38:37 do. They are searching a space of problems, a space
00:38:41 of candidates, of solution candidates until they
00:38:45 hopefully find a solution to a given problem.
00:38:48 But then there are these two types of creativity
00:38:50 and both of them are now present in our machines.
00:38:54 The first one has been around for a long time,
00:38:56 which is human gives problem to machine, machine
00:39:00 tries to find a solution to that.
00:39:03 And this has been happening for many decades and
00:39:06 for many decades machines have found creative
00:39:09 solutions to interesting problems where humans were
00:39:12 not aware of these particularly creative solutions
00:39:17 but then appreciated that the machine found that.
00:39:20 The second is the pure creativity.
00:39:23 That I would call, what I just mentioned, I would
00:39:26 call the applied creativity, like applied art where
00:39:30 somebody tells you now make a nice picture of this
00:39:34 Pope and you will get money for that.
00:39:37 So here is the artist and he makes a convincing
00:39:40 picture of the Pope and the Pope likes it and gives
00:39:43 him the money.
00:39:46 And then there is the pure creativity which is
00:39:49 more like the power play and the artificial
00:39:51 curiosity thing where you have the freedom to
00:39:54 select your own problem.
00:39:57 Like a scientist who defines his own question
00:40:02 to study and so that is the pure creativity if you
00:40:06 will as opposed to the applied creativity which
00:40:11 serves another.
00:40:14 And in that distinction there are almost echoes of
00:40:16 narrow AI versus general AI.
00:40:19 So this kind of constrained painting of a Pope
00:40:22 seems like the approaches of what people are
00:40:28 calling narrow AI and pure creativity seems to be,
00:40:33 maybe I am just biased as a human but it seems to
00:40:35 be an essential element of human level intelligence.
00:40:41 Is that what you are implying?
00:40:44 To a degree?
00:40:46 If you zoom back a little bit and you just look
00:40:49 at a general problem solving machine which is
00:40:51 trying to solve arbitrary problems then this
00:40:54 machine will figure out in the course of solving
00:40:57 problems that it is good to be curious.
00:41:00 So all of what I said just now about this prewired
00:41:04 curiosity and this will to invent new problems
00:41:07 that the system doesn’t know how to solve yet
00:41:11 should be just a byproduct of the general search.
00:41:15 However, apparently evolution has built it into
00:41:20 us because it turned out to be so successful,
00:41:24 a prewiring, a bias, a very successful exploratory
00:41:29 bias that we are born with.
00:41:34 And you have also said that consciousness in the
00:41:36 same kind of way may be a byproduct of problem solving.
00:41:41 Do you find this an interesting byproduct?
00:41:45 Do you think it is a useful byproduct?
00:41:47 What are your thoughts on consciousness in general?
00:41:49 Or is it simply a byproduct of greater and greater
00:41:53 capabilities of problem solving that is similar
00:41:58 to creativity in that sense?
00:42:01 We never have a procedure called consciousness
00:42:04 in our machines.
00:42:05 However, we get as side effects of what these
00:42:09 machines are doing things that seem to be closely
00:42:13 related to what people call consciousness.
00:42:16 So for example, already in 1990 we had simple
00:42:20 systems which were basically recurrent networks
00:42:24 and therefore universal computers trying to map
00:42:28 incoming data into actions that lead to success.
00:42:33 Maximizing reward in a given environment,
00:42:36 always finding the charging station in time
00:42:40 whenever the battery is low and negative signals
00:42:42 are coming from the battery, always find the
00:42:45 charging station in time without bumping against
00:42:48 painful obstacles on the way.
00:42:50 So complicated things but very easily motivated.
00:42:54 And then we give these little guys a separate
00:43:00 recurrent neural network which is just predicting
00:43:02 what’s happening if I do that and that.
00:43:04 What will happen as a consequence of these
00:43:07 actions that I’m executing.
00:43:09 And it’s just trained on the long and long history
00:43:11 of interactions with the world.
00:43:13 So it becomes a predictive model of the world
00:43:16 basically.
00:43:18 And therefore also a compressor of the observations
00:43:22 of the world because whatever you can predict
00:43:25 you don’t have to store extra.
00:43:27 So compression is a side effect of prediction.
00:43:30 And how does this recurrent network compress?
00:43:33 Well, it’s inventing little subprograms, little
00:43:36 subnetworks that stand for everything that
00:43:39 frequently appears in the environment like
00:43:42 bottles and microphones and faces, maybe lots of
00:43:45 faces in my environment so I’m learning to create
00:43:50 something like a prototype face and a new face
00:43:52 comes along and all I have to encode are the
00:43:54 deviations from the prototype.
00:43:56 So it’s compressing all the time the stuff that
00:43:58 frequently appears.
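A minimal sketch of this prototype-plus-deviations coding; vectors are plain Python lists for brevity:

```python
# Keep a prototype (e.g. a "prototype face") and encode each new
# observation only as its deviation from the prototype.
def encode_deviation(prototype, face):
    return [f - p for f, p in zip(face, prototype)]

def decode_deviation(prototype, deviation):
    return [p + d for p, d in zip(prototype, deviation)]

prototype = [0.5, 0.5, 0.5]                    # learned average of frequent inputs
new_face = [0.52, 0.48, 0.51]
code = encode_deviation(prototype, new_face)   # small numbers, cheap to store
decoded = decode_deviation(prototype, code)
assert all(abs(a - b) < 1e-9 for a, b in zip(decoded, new_face))
```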
00:44:00 There’s one thing that appears all the time that
00:44:05 is present all the time when the agent is
00:44:07 interacting with its environment which is the
00:44:10 agent itself.
00:44:12 But just for data compression reasons it is
00:44:15 extremely natural for this recurrent network to
00:44:18 come up with little subnetworks that stand for
00:44:21 the properties of the agents, the hand, the other
00:44:26 actuators and all the stuff that you need to
00:44:29 better encode the data which is influenced by
00:44:32 the actions of the agent.
00:44:34 So there just as a side effect of data compression
00:44:39 during problem solving you have internal self
00:44:43 models.
00:44:45 Now you can use this model of the world to plan
00:44:50 your future and that’s what we also have done
00:44:53 since 1990.
00:44:54 So the recurrent network which is the controller
00:44:57 which is trying to maximize reward can use this
00:45:00 model of the network of the world, this model
00:45:03 network of the world, this predictive model of
00:45:05 the world to plan ahead and say let’s not do this
00:45:08 action sequence, let’s do this action sequence
00:45:10 instead because it leads to more predicted
00:45:13 reward.
00:45:14 And whenever it is waking up these little
00:45:17 subnetworks that stand for itself then it is
00:45:20 thinking about itself, and it is
00:45:23 exploring mentally the
00:45:28 consequences of its own actions and now you tell
00:45:34 me what is still missing.
00:45:36 Missing the next, the gap to consciousness.
00:45:40 There isn’t.
00:45:41 That’s a really beautiful idea, that if life is
00:45:45 a collection of data and life is a process of
00:45:48 compressing that data to act efficiently, then in that
00:45:54 data you yourself appear very often.
00:45:57 So it’s useful to form compressions of yourself,
00:46:00 and it’s a really beautiful formulation that
00:46:03 consciousness is a necessary side effect.
00:46:05 It’s actually quite compelling to me.
00:46:11 You’ve described RNNs, developed LSTMs, long
00:46:16 short term memory networks that are a type of
00:46:20 recurrent neural networks that have gotten a lot
00:46:23 of success recently.
00:46:24 So these are networks that model the temporal
00:46:27 aspects in the data, temporal patterns in the
00:46:30 data and you’ve called them the deepest of the
00:46:34 neural networks.
00:46:35 So what do you think is the value of depth in
00:46:38 the models that we use to learn?
00:46:41 Since you mentioned the long short term memory
00:46:46 and the LSTM I have to mention the names of the
00:46:50 brilliant students who made that possible.
00:46:52 First of all my first student ever Sepp Hochreiter
00:46:56 who had fundamental insights already in his
00:46:58 diploma thesis.
00:46:59 Then Felix Gers who had additional important
00:47:03 contributions.
00:47:04 Alex Graves is a guy from Scotland who is mostly
00:47:08 responsible for this CTC algorithm which is now
00:47:11 often used to train the LSTM to do the speech
00:47:15 recognition on all the Google Android phones and
00:47:18 whatever and Siri and so on.
00:47:21 So these guys without these guys I would be
00:47:26 nothing.
00:47:27 It’s a lot of incredible work.
00:47:29 What is now the depth?
00:47:30 What is the importance of depth?
00:47:32 Well most problems in the real world are deep in
00:47:36 the sense that the current input doesn’t tell you
00:47:40 all you need to know about the environment.
00:47:44 So instead you have to have a memory of what
00:47:48 happened in the past and often important parts of
00:47:51 that memory are dated.
00:47:54 They are pretty old.
00:47:56 So when you’re doing speech recognition for
00:47:59 example and somebody says 11 then that’s about
00:48:05 half a second or something like that which means
00:48:09 it’s already 50 time steps.
00:48:11 And another guy or the same guy says 7.
00:48:15 So the ending is the same even but now the
00:48:19 system has to see the distinction between 7 and
00:48:22 11 and the only way it can see the difference is
00:48:25 it has to store that 50 steps ago there was an
00:48:30 S or an L, a 7 or an 11.
00:48:34 So there you have already a problem of depth 50
00:48:37 because for each time step you have something
00:48:41 like a virtual layer in the expanded unrolled
00:48:44 version of this recurrent network which is doing
00:48:46 the speech recognition.
00:48:48 So these long time lags they translate into
00:48:51 problem depth.
00:48:53 And most problems in this world are such that
00:48:57 you really have to look far back in time to
00:49:01 understand what is the problem and to solve it.
00:49:05 But just like with LSTMs you don’t necessarily
00:49:08 need to when you look back in time remember every
00:49:11 aspect you just need to remember the important
00:49:13 aspects.
00:49:14 That’s right.
00:49:15 The network has to learn to put the important
00:49:18 stuff into memory and to ignore the unimportant
00:49:22 noise.
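A toy version of this situation as a sketch, assuming PyTorch: the label of each sequence is set by its first token, which the LSTM must carry across roughly 50 noisy steps before deciding:

```python
import torch
import torch.nn as nn

T, BATCH = 50, 32
labels = torch.randint(0, 2, (BATCH,))             # the early "S or L" bit
seq = torch.cat([labels.float().view(BATCH, 1, 1),
                 torch.randn(BATCH, T - 1, 1)], dim=1)

lstm = nn.LSTM(input_size=1, hidden_size=16, batch_first=True)
head = nn.Linear(16, 2)
opt = torch.optim.Adam(list(lstm.parameters()) + list(head.parameters()), lr=1e-2)

for step in range(200):
    out, _ = lstm(seq)                             # unrolled in time: depth ~ T
    loss = nn.functional.cross_entropy(head(out[:, -1]), labels)
    opt.zero_grad()
    loss.backward()
    opt.step()
```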
00:49:23 But in that sense deeper and deeper is better
00:49:28 or is there a limitation?
00:49:30 I mean LSTM is one of the great examples of
00:49:34 architectures that do something beyond just
00:49:40 deeper and deeper networks.
00:49:42 There’s clever mechanisms for filtering data,
00:49:45 for remembering and forgetting.
00:49:47 So do you think that kind of thinking is
00:49:50 necessary?
00:49:51 If you think about LSTMs as a leap, a big leap
00:49:54 forward over traditional vanilla RNNs, what do
00:49:57 you think is the next leap within this context?
00:50:02 So LSTM is a very clever improvement but LSTM
00:50:06 still don’t have the same kind of ability to see
00:50:10 far back in the past as us humans do.
00:50:14 The credit assignment problem reaches way back,
00:50:18 not just 50 time steps or 100 or 1000 but
00:50:22 millions and billions.
00:50:24 It’s not clear what are the practical limits of
00:50:28 the LSTM when it comes to looking back.
00:50:31 Already in 2006 I think we had examples where
00:50:35 it not only looked back tens of thousands of
00:50:38 steps but really millions of steps.
00:50:41 And Juan Pérez-Ortiz in my lab I think was the
00:50:45 first author of a paper where we really, was it
00:50:49 2006 or something, had examples where it learned
00:50:53 to look back for more than 10 million steps.
00:50:57 So for most problems of speech recognition it’s
00:51:01 not necessary to look that far back but there
00:51:05 are examples where it does.
00:51:07 Now the looking back thing, that’s rather easy
00:51:11 because there is only one past but there are
00:51:15 many possible futures and so a reinforcement
00:51:19 learning system which is trying to maximize its
00:51:22 future expected reward and doesn’t know yet which
00:51:26 of these many possible futures should I select
00:51:29 given this one single past is facing problems
00:51:33 that the LSTM by itself cannot solve.
00:51:36 So the LSTM is good for coming up with a compact
00:51:40 representation of the history and observations
00:51:44 and actions so far but now how do you plan in an
00:51:49 efficient and good way among all these, how do
00:51:54 you select one of these many possible action
00:51:57 sequences that a reinforcement learning system
00:52:00 has to consider to maximize reward in this
00:52:04 unknown future?
00:52:06 We have this basic setup where you have one
00:52:10 recurrent network which gets in the video and
00:52:14 the speech and whatever and it’s executing
00:52:17 actions and it’s trying to maximize reward so
00:52:20 there is no teacher who tells it what to do at
00:52:23 which point in time.
00:52:25 And then there’s the other network which is
00:52:29 just predicting what’s going to happen if I do
00:52:32 that and that and that could be an LSTM network
00:52:35 and it learns to look back all the way to make
00:52:38 better predictions of the next time step.
00:52:41 So essentially although it’s predicting only the
00:52:44 next time step it is motivated to learn to put
00:52:48 into memory something that happened maybe a
00:52:51 million steps ago because it’s important to
00:52:54 memorize that if you want to predict that at the
00:52:57 next time step, the next event.
00:52:59 Now how can a model of the world like that, a
00:53:03 predictive model of the world be used by the
00:53:06 first guy?
00:53:07 Let’s call it the controller and the model, the
00:53:10 controller and the model.
00:53:12 How can the model be used by the controller to
00:53:15 efficiently select among these many possible
00:53:18 futures?
00:53:19 The naive way we had about 30 years ago was
00:53:22 let’s just use the model of the world as a stand
00:53:26 in, as a simulation of the world and millisecond
00:53:30 by millisecond we plan the future and that means
00:53:33 we have to roll it out really in detail and it
00:53:36 will work only if the model is really good and
00:53:39 it will still be inefficient because we have to
00:53:42 look at all these possible futures and there are
00:53:45 so many of them.
00:53:46 So instead what we do now since 2015 in our CM
00:53:49 systems, controller model systems, we give the
00:53:52 controller the opportunity to learn by itself how
00:53:56 to use the potentially relevant parts of the M,
00:54:00 of the model network to solve new problems more
00:54:04 quickly.
00:54:05 And if it wants to, it can learn to ignore the M
00:54:09 and sometimes it’s a good idea to ignore the M
00:54:12 because it’s really bad, it’s a bad predictor in
00:54:15 this particular situation of life where the
00:54:19 controller is currently trying to maximize reward.
00:54:22 However, it can also learn to address and exploit
00:54:26 some of the subprograms that came about in the
00:54:31 model network through compressing the data by
00:54:35 predicting it.
00:54:36 So it now has an opportunity to reuse that code,
00:54:40 the algorithmic information in the model network
00:54:44 to reduce its own search space such that it can
00:54:49 solve a new problem more quickly than without the
00:54:52 model.
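A hedged structural sketch of such a controller/model pair, assuming PyTorch: M is an LSTM trained to predict the next observation, and the controller reads M's hidden state through a learned gate, so it can also learn to ignore M. All names and sizes are illustrative:

```python
import torch
import torch.nn as nn

OBS, ACT, HID = 8, 4, 32

class M(nn.Module):  # predictive model of the world
    def __init__(self):
        super().__init__()
        self.rnn = nn.LSTM(OBS + ACT, HID, batch_first=True)
        self.predict = nn.Linear(HID, OBS)      # next-observation prediction

    def forward(self, obs_act_seq):
        h, _ = self.rnn(obs_act_seq)
        return self.predict(h), h               # predictions and hidden states

class C(nn.Module):  # controller: picks actions, may exploit or ignore M
    def __init__(self):
        super().__init__()
        self.gate = nn.Parameter(torch.zeros(HID))   # learned usage of M
        self.policy = nn.Linear(OBS + HID, ACT)

    def forward(self, obs, m_hidden):
        gated = torch.sigmoid(self.gate) * m_hidden  # can shrink toward zero
        return self.policy(torch.cat([obs, gated], dim=-1))
```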
00:54:53 Compression.
00:54:54 So you’re ultimately optimistic and excited about
00:54:59 the power of RL, of reinforcement learning in the
00:55:03 context of real systems.
00:55:05 Absolutely, yeah.
00:55:06 So you see RL as potentially having a huge impact,
00:55:11 beyond just the M part, which is often developed with
00:55:16 supervised learning methods.
00:55:19 You see RL as a fit for problems of self driving cars
00:55:25 or any kind of applied robotics.
00:55:28 That’s the correct, interesting direction for
00:55:32 research in your view?
00:55:34 I do think so.
00:55:35 We have a company called NNAISENSE which has applied
00:55:40 reinforcement learning to little Audis which learn
00:55:45 to park without a teacher.
00:55:47 The same principles were used of course.
00:55:50 So these little Audis, they are small, maybe like
00:55:54 that, so much smaller than the real Audis.
00:55:57 But they have all the sensors that you find in the
00:56:00 real Audis.
00:56:01 You find the cameras, the LIDAR sensors.
00:56:03 They go up to 120 kilometers an hour if they want
00:56:08 to.
00:56:09 And they have pain sensors basically and they don’t
00:56:13 want to bump against obstacles and other Audis and
00:56:17 so they must learn like little babies to park.
00:56:21 Take the raw vision input and translate that into
00:56:25 actions that lead to successful parking behavior
00:56:28 which is a rewarding thing.
00:56:30 And yes, they learn that.
00:56:32 So we have examples like that and it’s only in the
00:56:36 beginning.
00:56:37 This is just the tip of the iceberg and I believe the
00:56:40 next wave of AI is going to be all about that.
00:56:44 So at the moment, the current wave of AI is about
00:56:48 passive pattern observation and prediction and that’s
00:56:53 what you have on your smartphone and what the major
00:56:56 companies on the Pacific Rim are using to sell you
00:57:00 ads to do marketing.
00:57:02 That’s the current sort of profit in AI and that’s
00:57:05 only one or two percent of the world economy.
00:57:08 Which is big enough to make these companies pretty
00:57:12 much the most valuable companies in the world.
00:57:15 But there’s a much, much bigger fraction of the
00:57:19 economy going to be affected by the next wave which
00:57:22 is really about machines that shape the data through
00:57:26 their own actions.
00:57:27 Do you think simulation is ultimately the biggest
00:57:31 way that those methods will be successful in the next
00:57:35 10, 20 years?
00:57:36 We’re not talking about 100 years from now.
00:57:38 We’re talking about sort of the near term impact of
00:57:41 RL.
00:57:42 Do you think really good simulation is required or
00:57:45 is there other techniques like imitation learning,
00:57:48 observing other humans operating in the real world?
00:57:53 Where do you think the success will come from?
00:57:57 So at the moment, we have a tendency of using physics
00:58:02 simulations to learn behavior for machines that
00:58:07 learn to solve problems that humans also do not know
00:58:11 how to solve.
00:58:12 However, this is not the future because the future is
00:58:16 in what little babies do.
00:58:18 They don’t use a physics engine to simulate the
00:58:21 world.
00:58:22 No, they learn a predictive model of the world which
00:58:26 maybe sometimes is wrong in many ways but captures
00:58:31 all kinds of important abstract high level predictions
00:58:34 which are really important to be successful.
00:58:37 And that’s what was the future 30 years ago when we
00:58:42 started that type of research but it’s still the future
00:58:45 and now we know much better how to go there to move
00:58:49 forward and to really make working systems based on
00:58:54 that where you have a learning model of the world,
00:58:57 a model of the world that learns to predict what’s
00:58:59 going to happen if I do that and that.
00:59:01 And then the controller uses that model to more
00:59:07 quickly learn successful action sequences.
00:59:10 And then of course always this curiosity thing.
00:59:13 In the beginning, the model is stupid so the
00:59:15 controller should be motivated to come up with
00:59:18 experiments with action sequences that lead to data
00:59:21 that improve the model.
00:59:23 Do you think improving the model, constructing an
00:59:27 understanding of the world in this connectionist way...
00:59:30 Now, the popular approaches that have been successful
00:59:34 are grounded in ideas of neural networks.
00:59:38 But in the 80s with expert systems, there were
00:59:41 symbolic AI approaches which to us humans are more
00:59:45 intuitive in the sense that it makes sense that you
00:59:49 build up knowledge in this knowledge representation.
00:59:52 What kind of lessons can we draw into our current
00:59:54 approaches from expert systems from symbolic AI?
01:00:00 So I became aware of all of that in the 80s and
01:00:04 back then logic programming was a huge thing.
01:00:08 Was it inspiring to you yourself?
01:00:10 Did you find it compelling?
01:00:12 Because a lot of your work was not so much in that
01:00:16 realm, right?
01:00:17 It was more in the learning systems.
01:00:18 Yes and no, but we did all of that.
01:00:20 So my first publication ever actually was 1987,
01:00:27 was the implementation of a genetic algorithm, of a
01:00:31 genetic programming system in Prolog.
01:00:34 So Prolog, that’s what you learn back then which is
01:00:38 a logic programming language and the Japanese,
01:00:41 they have this huge fifth generation AI project
01:00:45 which was mostly about logic programming back then.
01:00:49 Although neural networks existed and were well
01:00:52 known back then and deep learning has existed since
01:00:56 1965, since this guy in the Ukraine,
01:01:00 Alexey Ivakhnenko, started it.
01:01:02 But the Japanese and many other people,
01:01:05 they focused really on this logic programming and I
01:01:08 was influenced to the extent that I said,
01:01:10 okay, let’s take these biologically inspired
01:01:13 algorithms like evolution programs,
01:01:20 and implement that in the language which I know,
01:01:22 which was Prolog, for example, back then.
01:01:25 And then in many ways this came back later because
01:01:29 the Gödel machine, for example,
01:01:31 has a proof searcher on board and without that it
01:01:34 would not be optimal.
01:01:36 Well, Marcus Hutter’s universal algorithm for
01:01:38 solving all well defined problems has a proof
01:01:41 searcher on board so that’s very much logic programming.
01:01:46 Without that it would not be asymptotically optimal.
01:01:50 But then on the other hand,
01:01:51 because we are very pragmatic guys also,
01:01:54 we focused on recurrent neural networks and
01:02:00 suboptimal stuff such as gradient based search in
01:02:04 program space rather than provably optimal things.
01:02:09 The logic programming certainly has a usefulness
01:02:13 when you’re trying to construct something provably
01:02:16 optimal or provably good or something like that.
01:02:19 But is it useful for practical problems?
01:02:22 It’s really useful for theorem proving.
01:02:24 The best theorem provers today are not neural networks.
01:02:28 No, they are logic programming systems and they
01:02:31 are much better theorem provers than most math
01:02:35 students in the first or second semester.
01:02:38 But for reasoning, for playing games of Go or chess
01:02:43 or for robots, autonomous vehicles that operate in
01:02:46 the real world or object manipulation,
01:02:49 you think learning.
01:02:51 Yeah, as long as the problems have little to do
01:02:54 with theorem proving itself,
01:03:01 you just want to have better pattern recognition.
01:03:05 So to build a self driving car,
01:03:07 you want to have better pattern recognition and
01:03:10 pedestrian recognition and all these things.
01:03:14 You want to minimize the number of false positives,
01:03:19 which is currently slowing down self driving cars
01:03:21 in many ways.
01:03:23 All of that has very little to do with logic programming.
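As a purely hypothetical illustration of that false positive trade-off, with made-up detector scores: raising the decision threshold suppresses phantom detections (which cause unnecessary braking) at the cost of missing real pedestrians.

```python
# Made-up detector confidence scores for illustration only.
scores_pedestrian = [0.9, 0.8, 0.75, 0.6]   # real pedestrians
scores_background = [0.7, 0.4, 0.3, 0.1]    # background clutter

for threshold in (0.5, 0.65, 0.85):
    false_pos = sum(s >= threshold for s in scores_background)
    false_neg = sum(s < threshold for s in scores_pedestrian)
    print(f"threshold={threshold}: {false_pos} false positives, "
          f"{false_neg} missed pedestrians")
```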
01:03:27 What are you most excited about in terms of
01:03:32 directions of artificial intelligence at this moment
01:03:35 in the next few years in your own research
01:03:38 and in the broader community?
01:03:41 So I think in the not so distant future,
01:03:44 we will have for the first time little robots
01:03:50 that learn like kids.
01:03:53 I will be able to say to the robot,
01:03:57 look here robot, we are going to assemble a smartphone.
01:04:01 Let’s take this slab of plastic and the screwdriver
01:04:05 and let’s screw in the screw like that.
01:04:09 Not like that, like that.
01:04:11 Not like that, like that.
01:04:14 And I don’t have a data glove or something.
01:04:17 He will see me and he will hear me
01:04:20 and he will try to do something with his own actuators,
01:04:24 which will be really different from mine,
01:04:26 but he will understand the difference
01:04:28 and will learn to imitate me,
01:04:31 but not in the supervised way
01:04:34 where a teacher is giving target signals
01:04:37 for all his muscles all the time.
01:04:40 No, by doing this high level imitation
01:04:43 where he first has to learn to imitate me
01:04:46 and then to interpret these additional noises
01:04:48 coming from my mouth as helpful
01:04:51 signals to do that better.
01:04:54 And then it will by itself come up with faster ways
01:05:00 and more efficient ways of doing the same thing.
01:05:03 And finally I stop his learning algorithm
01:05:07 and make a million copies and sell it.
01:05:10 And so at the moment this is not possible,
01:05:13 but we already see how we are going to get there.
01:05:16 And you can imagine to the extent
01:05:19 that this works economically and cheaply,
01:05:22 it’s going to change everything.
01:05:25 Almost all of production is going to be affected by that.
01:05:31 And a much bigger wave,
01:05:34 a much bigger AI wave is coming
01:05:36 than the one that we are currently witnessing,
01:05:38 which is mostly about passive pattern recognition
01:05:40 on your smartphone.
01:05:42 This is about active machines that shape data
01:05:45 through the actions they are executing
01:05:48 and they learn to do that in a good way.
01:05:52 So many of the traditional industries
01:05:55 are going to be affected by that.
01:05:57 All the companies that are building machines
01:06:01 will equip these machines with cameras
01:06:04 and other sensors and they are going to learn
01:06:08 to solve all kinds of problems
01:06:11 through interaction with humans,
01:06:13 but also a lot on their own
01:06:15 to improve what they already can do.
01:06:20 And lots of old economy is going to be affected by that.
01:06:24 And in recent years I have seen that old economy
01:06:27 is actually waking up and realizing that this is the case.
01:06:32 Are you optimistic about that future?
01:06:34 Are you concerned?
01:06:36 There is a lot of people concerned in the near term
01:06:38 about the transformation of the nature of work,
01:06:43 the kind of ideas that you just suggested
01:06:45 would have a significant impact
01:06:47 of what kind of things could be automated.
01:06:49 Are you optimistic about that future?
01:06:52 Are you nervous about that future?
01:06:54 And looking a little bit farther into the future,
01:06:58 there are people like Elon Musk, Stuart Russell,
01:07:02 concerned about the existential threats of that future.
01:07:06 So in the near term, job loss,
01:07:08 in the long term existential threat,
01:07:10 are these concerns to you or are you ultimately optimistic?
01:07:15 So let’s first address the near future.
01:07:22 We have had predictions of job losses for many decades.
01:07:28 For example, when industrial robots came along,
01:07:33 many people predicted that lots of jobs are going to get lost.
01:07:38 And in a sense, they were right,
01:07:42 because back then there were car factories
01:07:46 and hundreds of people in these factories assembled cars,
01:07:51 and today the same car factories have hundreds of robots
01:07:54 and maybe three guys watching the robots.
01:07:59 On the other hand, those countries that have lots of robots per capita,
01:08:05 Japan, Korea, Germany, Switzerland,
01:08:07 and a couple of other countries,
01:08:10 they have really low unemployment rates.
01:08:14 Somehow, all kinds of new jobs were created.
01:08:18 Back then, nobody anticipated those jobs.
01:08:23 And decades ago, I always said,
01:08:27 it’s really easy to say which jobs are going to get lost,
01:08:32 but it’s really hard to predict the new ones.
01:08:36 200 years ago, who would have predicted all these people
01:08:40 making money as YouTube bloggers, for example?
01:08:46 200 years ago, 60% of all people used to work in agriculture.
01:08:54 Today, maybe 1%.
01:08:57 But still, only, I don’t know, 5% unemployment.
01:09:02 Lots of new jobs were created, and Homo Ludens, the playing man,
01:09:08 is inventing new jobs all the time.
01:09:11 Most of these jobs are not existentially necessary
01:09:16 for the survival of our species.
01:09:19 There are only very few existentially necessary jobs,
01:09:23 such as farming and building houses and warming up the houses,
01:09:28 but less than 10% of the population is doing that.
01:09:31 And most of these newly invented jobs are about
01:09:35 interacting with other people in new ways,
01:09:38 through new media and so on,
01:09:41 getting new types of kudos and forms of likes and whatever,
01:09:46 and even making money through that.
01:09:48 So, Homo Ludens, the playing man, doesn’t want to be unemployed,
01:09:53 and that’s why he’s inventing new jobs all the time.
01:09:57 And he keeps considering these jobs as really important
01:10:02 and is investing a lot of energy and hours of work into those new jobs.
01:10:08 That’s quite beautifully put.
01:10:10 We’re really nervous about the future because we can’t predict
01:10:13 what kind of new jobs will be created.
01:10:15 But you’re ultimately optimistic that we humans are so restless
01:10:21 that we create and give meaning to newer and newer jobs,
01:10:25 totally new, things that get likes on Facebook
01:10:29 or whatever the social platform is.
01:10:32 So what about long term existential threat of AI,
01:10:36 where our whole civilization may be swallowed up
01:10:41 by these ultra super intelligent systems?
01:10:45 Maybe it’s not going to be swallowed up,
01:10:48 but I’d be surprised if we humans were the last step
01:10:55 in the evolution of the universe.
01:10:58 You’ve actually had this beautiful comment somewhere that I’ve seen
01:11:05 saying, quite insightfully, that artificial general intelligence systems,
01:11:12 just like us humans, will likely not want to interact with humans,
01:11:16 they’ll just interact amongst themselves.
01:11:18 Just like ants interact amongst themselves
01:11:21 and only tangentially interact with humans.
01:11:25 And it’s quite an interesting idea that once we create AGI,
01:11:29 they will lose interest in humans and compete for their own Facebook likes
01:11:34 and their own social platforms.
01:11:36 So within that quite elegant idea, how do we know in a hypothetical sense
01:11:45 that there’s not already intelligence systems out there?
01:11:49 How do you think broadly of general intelligence greater than us?
01:11:54 How do we know it’s out there?
01:11:56 How do we know it’s around us?
01:11:59 And could it already be?
01:12:01 I’d be surprised if within the next few decades or something like that,
01:12:07 we won’t have AIs that are truly smart in every single way
01:12:13 and better problem solvers in almost every single important way.
01:12:17 And I’d be surprised if they wouldn’t realize what we have realized a long time ago,
01:12:25 which is that almost all physical resources are not here in this biosphere,
01:12:31 but further out, the rest of the solar system gets 2 billion times more solar energy
01:12:41 than our little planet.
01:12:43 There’s lots of material out there that you can use to build robots
01:12:47 and self replicating robot factories and all this stuff.
01:12:51 And they are going to do that and they will be scientists and curious
01:12:56 and they will explore what they can do.
01:12:59 And in the beginning, they will be fascinated by life
01:13:04 and by their own origins in our civilization.
01:13:07 They will want to understand that completely, just like people today
01:13:11 would like to understand how life works and also the history of our own existence
01:13:21 and civilization, but then also the physical laws that created all of that.
01:13:27 So in the beginning, they will be fascinated by life.
01:13:30 Once they understand it, they lose interest.
01:13:34 Like anybody who loses interest in things he understands.
01:13:40 And then, as you said, the most interesting sources of information for them
01:13:50 will be others of their own kind.
01:13:58 So at least in the long run, there seems to be some sort of protection
01:14:06 through lack of interest on the other side.
01:14:11 And now it seems also clear, as far as we understand physics,
01:14:17 you need matter and energy to compute and to build more robots and infrastructure
01:14:23 for an AI civilization and AI ecologies consisting of trillions of different types of AIs.
01:14:31 And so it seems inconceivable to me that this thing is not going to expand.
01:14:37 Some AI ecology not controlled by one AI, but trillions of different types of AIs
01:14:44 competing in all kinds of quickly evolving and disappearing ecological niches
01:14:50 in ways that we cannot fathom at the moment.
01:14:52 But it’s going to expand, limited by light speed and physics,
01:14:57 but it’s going to expand and now we realize that the universe is still young.
01:15:03 It’s only 13.8 billion years old and it’s going to be a thousand times older than that.
01:15:10 So there’s plenty of time to conquer the entire universe
01:15:17 and to fill it with intelligence and senders and receivers
01:15:21 such that AIs can travel the way they are traveling in our labs today,
01:15:27 which is by radio from sender to receiver.
01:15:31 And let’s call the current age of the universe one eon.
01:15:39 Now, just a few eons from now, the entire visible universe
01:15:43 is going to be full of that stuff.
01:15:47 And let’s look ahead to a time when the universe is going to be 1000 times older than it is now.
01:15:53 They will look back and they will say, look, almost immediately after the Big Bang,
01:15:57 only a few eons later, the entire universe started to become intelligent.
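As a rough sanity check on “a few eons”, assume expansion at close to light speed and take the comoving radius of the observable universe to be about 46 billion light years; one eon of travel at light speed covers 13.8 billion light years, so

$$\frac{46 \times 10^9\ \text{ly}}{13.8 \times 10^9\ \text{ly per eon}} \approx 3.3\ \text{eons},$$

ignoring the universe’s continued expansion.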
01:16:03 Now to your question, how do we see whether anything like that has already happened
01:16:09 or is already in a more advanced stage in some other part of the universe, of the visible universe?
01:16:16 We are trying to look out there, and nothing like that has happened so far. Or is that true?
01:16:22 Do you think we would recognize it?
01:16:24 How do we know it’s not among us?
01:16:26 How do we know planets aren’t in themselves intelligent beings?
01:16:31 How do we know ants seen as a collective are not much greater intelligence than our own?
01:16:40 These kinds of ideas.
01:16:42 When I was a boy, I was thinking about these things
01:16:45 and I thought, maybe it has already happened.
01:16:48 Because back then I knew, I learned from popular physics books,
01:16:53 that the large scale structure of the universe is not homogeneous.
01:17:00 You have these clusters of galaxies and then in between there are these huge empty spaces.
01:17:08 And I thought, maybe they aren’t really empty.
01:17:12 It’s just that in the middle of that, some AI civilization already has expanded
01:17:17 and then has covered a bubble of a billion light years diameter
01:17:22 and is using all the energy of all the stars within that bubble for its own unfathomable purposes.
01:17:29 And so it already has happened and we just fail to interpret the signs.
01:17:35 And then I learned that gravity by itself explains the large scale structure of the universe
01:17:43 and that this is not a convincing explanation.
01:17:46 And then I thought, maybe it’s the dark matter.
01:17:51 Because as far as we know today, 80% of the measurable matter is invisible.
01:18:01 And we know that because otherwise our galaxy or other galaxies would fall apart.
01:18:06 They are rotating too quickly.
01:18:10 And then the idea was, maybe all of these AI civilizations that are already out there,
01:18:17 they are just invisible because they are really efficient in using the energies of their own local systems
01:18:26 and that’s why they appear dark to us.
01:18:29 But this is also not a convincing explanation because then the question becomes,
01:18:34 why are there still any visible stars left in our own galaxy, which also must have a lot of dark matter?
01:18:44 So that is also not a convincing thing.
01:18:46 And today, I like to think it’s quite plausible that maybe we are the first,
01:18:54 at least in our local light cone within the few hundreds of millions of light years that we can reliably observe.
01:19:09 Is that exciting to you that we might be the first?
01:19:12 And it would make us much more important because if we mess it up through a nuclear war,
01:19:20 then maybe this will have an effect on the development of the entire universe.
01:19:31 So let’s not mess it up.
01:19:32 Let’s not mess it up.
01:19:34 Jürgen, thank you so much for talking today. I really appreciate it.
01:19:37 It’s my pleasure.