Vladimir Vapnik: Statistical Learning #5

Transcript

00:00:00 The following is a conversation with Vladimir Vapnik.

00:00:03 He’s the co-inventor of support vector machines,

00:00:05 support vector clustering, VC theory,

00:00:07 and many foundational ideas in statistical learning.

00:00:11 He was born in the Soviet Union

00:00:13 and worked at the Institute of Control Sciences in Moscow.

00:00:16 Then in the United States, he worked at AT&T,

00:00:19 NEC Labs, Facebook Research,

00:00:22 and now is a professor at Columbia University.

00:00:25 His work has been cited over 170,000 times.

00:00:30 He has some very interesting ideas

00:00:31 about artificial intelligence and the nature of learning,

00:00:34 especially on the limits of our current approaches

00:00:37 and the open problems in the field.

00:00:40 This conversation is part of MIT course

00:00:42 on artificial general intelligence

00:00:44 and the artificial intelligence podcast.

00:00:46 If you enjoy it, please subscribe on YouTube

00:00:49 or rate it on iTunes or your podcast provider of choice,

00:00:52 or simply connect with me on Twitter

00:00:55 or other social networks at Lex Fridman, spelled F-R-I-D.

00:01:00 And now here’s my conversation with Vladimir Vapnik.

00:01:04 Einstein famously said that God doesn’t play dice.

00:01:08 Yeah.

00:01:09 You have studied the world through the eyes of statistics.

00:01:12 So let me ask you in terms of the nature of reality,

00:01:17 fundamental nature of reality, does God play dice?

00:01:21 We don’t know some factors.

00:01:25 And because we don’t know some factors,

00:01:28 which could be important,

00:01:30 it looks like God plays dice.

00:01:35 But we should describe it.

00:01:38 In philosophy, they distinguish between two positions,

00:01:42 positions of instrumentalism,

00:01:44 where you’re creating theory for prediction

00:01:48 and position of realism,

00:01:50 where you’re trying to understand what God did.

00:01:54 Can you describe instrumentalism and realism a little bit?

00:01:58 For example, if you have some mechanical laws,

00:02:04 what is that?

00:02:06 Is it law which is true always and everywhere?

00:02:11 Or it is law which allow you to predict

00:02:14 position of moving element?

00:02:19 What you believe.

00:02:23 You believe that it is God’s law,

00:02:25 that God created the world,

00:02:28 which obeys this physical law.

00:02:33 Or it is just law for predictions.

00:02:36 And which one is instrumentalism?

00:02:38 For predictions.

00:02:39 If you believe that this is law of God,

00:02:43 and it’s always true everywhere,

00:02:47 that means that you’re realist.

00:02:50 So you’re trying to really understand God’s thought.

00:02:55 So the way you see the world is as an instrumentalist?

00:03:00 You know, I’m working for some models,

00:03:03 model of machine learning.

00:03:07 So in this model, we consider a setting,

00:03:12 and we try to solve,

00:03:15 resolve the setting, to solve the problem.

00:03:18 And you can do it in two different ways.

00:03:20 From the point of view of instrumentalist,

00:03:23 and that’s what everybody does now.

00:03:27 Because they say that goal of machine learning

00:03:31 is to find the rule for classification.

00:03:36 That is true.

00:03:38 But it is instrument for prediction.

00:03:41 But I can say the goal of machine learning

00:03:46 is to learn about conditional probability.

00:03:50 So how God plays dice, and if he plays,

00:03:54 what is the probability for one,

00:03:56 what is the probability for another, in a given situation.

00:04:00 But for prediction, I don’t need this.

00:04:02 I need the rule.

00:04:04 But for understanding, I need conditional probability.

00:04:08 So let me just step back a little bit first to talk about,

00:04:11 you mentioned, which I read last night,

00:04:14 the parts of the 1960 paper by Eugene Wigner,

00:04:21 The Unreasonable Effectiveness of Mathematics

00:04:23 in the Natural Sciences.

00:04:24 Such a beautiful paper, by the way.

00:04:29 Made me feel, to be honest,

00:04:32 to confess my own work in the past few years

00:04:35 on deep learning, heavily applied.

00:04:38 Made me feel that I was missing out

00:04:40 on some of the beauty of nature

00:04:43 in the way that math can uncover.

00:04:45 So let me just step away from the poetry of that for a second.

00:04:50 How do you see the role of math in your life?

00:04:53 Is it a tool, is it poetry?

00:04:55 Where does it sit?

00:04:57 And does math for you have limits of what it can describe?

00:05:01 Some people say that math is language which use God.

00:05:06 Use God.

00:05:08 So I believe that…

00:05:10 Speak to God or use God or…

00:05:12 Use God.

00:05:13 Use God.

00:05:14 Yeah.

00:05:15 So I believe that this article

00:05:23 about effectiveness, unreasonable effectiveness of math,

00:05:27 is that if you’re looking at mathematical structures,

00:05:32 they know something about reality.

00:05:36 And the most scientists from Natural Science,

00:05:41 they’re looking on equation and trying to understand reality.

00:05:47 So the same in machine learning.

00:05:50 If you try very carefully look on all equations

00:05:56 which define conditional probability,

00:05:59 you can understand something about reality

00:06:04 more than from your fantasy.

00:06:07 So math can reveal the simple underlying principles of reality perhaps.

00:06:13 You know what means simple?

00:06:16 It is very hard to discover them.

00:06:19 But then when you discover them and look at them,

00:06:23 you see how beautiful they are.

00:06:26 And it is surprising why people did not see that before.

00:06:33 You’re looking on equation and derive it from equations.

00:06:37 For example, I talked yesterday about least square method.

00:06:43 And people had a lot of fantasy how to improve least square method.

00:06:48 But if you’re going step by step by solving some equations,

00:06:52 you suddenly will get some term which after thinking,

00:06:59 you understand that it describes position of observation point.

00:07:04 In least square method, we throw out a lot of information.

00:07:08 We don’t look in composition of point of observations,

00:07:11 we’re looking only on residuals.

00:07:14 But when you understood that, that’s very simple idea,

00:07:19 but it’s not too simple to understand.

00:07:22 And you can derive this just from equations.

00:07:26 So some simple algebra, a few steps will take you to something surprising

00:07:31 that when you think about, you understand.

00:07:34 And that is proof that human intuition is not too rich and very primitive.

00:07:42 And it does not see very simple situations.
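The least-squares point above can be made concrete with a small numerical sketch (a toy illustration of my own, with made-up data, not from the conversation): ordinary least squares sees only the residuals, while the hat-matrix diagonal, the leverages, captures the positions of the observation points, the information the classical method throws away.

```python
import numpy as np

# Toy sketch: ordinary least squares uses only residuals; the leverages
# h_ii = diag(X (X^T X)^{-1} X^T) describe where the observation points
# sit, which plain least squares ignores.
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(20), rng.uniform(0, 10, 20)])  # design matrix
y = 2.0 + 0.5 * X[:, 1] + rng.normal(0.0, 1.0, 20)          # noisy line

beta, *_ = np.linalg.lstsq(X, y, rcond=None)   # least-squares fit
residuals = y - X @ beta                       # all that OLS looks at

H = X @ np.linalg.inv(X.T @ X) @ X.T           # hat matrix
leverage = np.diag(H)                          # position of each observation

print("coefficients:", beta)
# The leverages always sum to the number of parameters (trace of H).
print("sum of leverages:", leverage.sum())
```

Deriving the leverage term by simple algebra from the normal equations, rather than from intuition, is exactly the kind of step-by-step surprise described here.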

00:07:48 So let me take a step back.

00:07:50 In general, yes.

00:07:54 But what about human, as opposed to intuition, ingenuity?

00:08:01 Moments of brilliance.

00:08:06 Do you have to be so hard on human intuition?

00:08:09 Are there moments of brilliance in human intuition?

00:08:12 They can leap ahead of math and then the math will catch up?

00:08:17 I don’t think so.

00:08:19 I think that the best human intuition, it is putting in axioms.

00:08:26 And then it is technical.

00:08:28 See where the axioms take you.

00:08:31 But if they correctly take axioms.

00:08:35 But the axioms are polished during generations of scientists.

00:08:41 And this is integral wisdom.

00:08:45 That is beautifully put.

00:08:47 But if you maybe look at, when you think of Einstein and special relativity,

00:08:56 what is the role of imagination coming first there in the moment of discovery of an idea?

00:09:04 So there is obviously a mix of math and out of the box imagination there.

00:09:10 That I don’t know.

00:09:12 Whatever I did, I exclude any imagination.

00:09:17 Because whatever I saw in machine learning that comes from imagination,

00:09:22 like features, like deep learning, they are not relevant to the problem.

00:09:29 When you are looking very carefully from mathematical equations,

00:09:34 you are deriving very simple theory, which goes far beyond theoretically

00:09:39 than whatever people can imagine.

00:09:42 Because it is not good fantasy.

00:09:44 It is just interpretation.

00:09:46 It is just fantasy.

00:09:48 But it is not what you need.

00:09:51 You don’t need any imagination to derive the main principle of machine learning.

00:09:59 When you think about learning and intelligence,

00:10:02 maybe thinking about the human brain and trying to describe mathematically

00:10:06 the process of learning, that is something like what happens in the human brain.

00:10:13 Do you think we have the tools currently?

00:10:17 Do you think we will ever have the tools to try to describe that process of learning?

00:10:21 It is not description what is going on.

00:10:25 It is interpretation.

00:10:27 It is your interpretation.

00:10:29 Your vision can be wrong.

00:10:32 You know, one guy invented the microscope for the first time, Leeuwenhoek.

00:10:39 Only he got this instrument and he kept secret about microscope.

00:10:45 But he wrote a report in London Academy of Science.

00:10:49 In his report, when he was looking at the blood,

00:10:52 he looked everywhere, on the water, on the blood, on the sperm.

00:10:56 But he described blood like fight between queen and king.

00:11:04 So, he saw blood cells, red cells, and he imagined that it is army fighting each other.

00:11:12 And it was his interpretation of situation.

00:11:17 And he sent this report in Academy of Science.

00:11:20 They very carefully looked because they believed that he is right.

00:11:24 He saw something.

00:11:25 Yes.

00:11:26 But he gave wrong interpretation.

00:11:28 And I believe the same can happen with brain.

00:11:32 With brain, yeah.

00:11:33 The most important part.

00:11:35 You know, I believe in human language.

00:11:39 In some proverbs, there is so much wisdom.

00:11:43 For example, people say that one day with a great teacher is better than a thousand days of diligent study.

00:11:54 But if I will ask you what teacher does, nobody knows.

00:11:59 And that is intelligence.

00:12:01 But we know from history and now from math and machine learning that teacher can do a lot.

00:12:12 So, what from a mathematical point of view is the great teacher?

00:12:16 I don’t know.

00:12:17 That’s an open question.

00:12:18 No, but we can say what teacher can do.

00:12:25 He can introduce some invariants, some predicate for creating invariants.

00:12:32 How he doing it?

00:12:33 I don’t know because teacher knows reality and can describe from this reality a predicate, invariants.

00:12:41 But he knows that when you’re using invariant, you can decrease number of observations hundred times.

00:12:49 So, but maybe try to pull that apart a little bit.

00:12:53 I think you mentioned like a piano teacher saying to the student, play like a butterfly.

00:12:59 Yeah.

00:13:00 I play piano.

00:13:01 I play guitar for a long time.

00:13:03 Yeah, maybe it’s romantic, poetic, but it feels like there’s a lot of truth in that statement.

00:13:12 Like there is a lot of instruction in that statement.

00:13:15 And so, can you pull that apart?

00:13:17 What is that?

00:13:19 The language itself may not contain this information.

00:13:22 It is not blah, blah, blah.

00:13:24 It is not blah, blah, blah.

00:13:25 It affects you.

00:13:26 It’s what?

00:13:27 It affects you.

00:13:28 It affects your playing.

00:13:34 Yes, it does, but it’s not the playing itself.

00:13:34 It feels like what is the information being exchanged there?

00:13:38 What is the nature of information?

00:13:39 What is the representation of that information?

00:13:41 I believe that it is sort of predicate, but I don’t know.

00:13:45 That is exactly what intelligence and machine learning should be.

00:13:49 Yes.

00:13:50 Because the rest is just mathematical technique.

00:13:53 I think that what was discovered recently is that there is two mechanism of learning.

00:14:03 One called strong convergence mechanism and weak convergence mechanism.

00:14:08 Before, people use only one convergence.

00:14:11 In weak convergence mechanism, you can use predicate.

00:14:16 That’s what play like butterfly and it will immediately affect your playing.

00:14:23 You know, there is English proverb, great.

00:14:27 If it looks like a duck, swims like a duck, and quack like a duck, then it is probably duck.

00:14:35 Yes.

00:14:36 But this is exact about predicate.

00:14:40 Looks like a duck, what it means.

00:14:43 You saw many ducks; that’s your training data.

00:14:47 So, you have an integral description of how ducks look.

00:14:56 Yeah.

00:14:57 The visual characteristics of a duck.

00:14:59 Yeah.

00:15:00 But you want and you have model for recognition.

00:15:04 So, you would like the theoretical description from the model to coincide with the empirical description,

00:15:12 which you saw in the training data.

00:15:14 So, about looks like a duck, it is general.

00:15:18 But what about swims like a duck?

00:15:21 You should know that duck swims.

00:15:23 You can say it play chess like a duck.

00:15:26 Okay.

00:15:27 Duck doesn’t play chess.

00:15:29 And it is completely legal predicate, but it is useless.

00:15:35 So, a good teacher can recognize predicates that are not useless.

00:15:41 So, up to now, we don’t use this predicate in existing machine learning.

00:15:47 So, why we need zillions of data.

00:15:50 But in this English proverb, they use only three predicate.

00:15:55 Looks like a duck, swims like a duck, and quack like a duck.

00:15:59 So, you can’t deny the fact that swims like a duck and quacks like a duck has humor in it, has ambiguity.

00:16:08 Let’s talk about swim like a duck.

00:16:12 It doesn’t say jump like a duck.

00:16:16 Why?

00:16:17 Because…

00:16:18 It’s not relevant.

00:16:20 But that means that you know ducks, you know different birds, you know animals.

00:16:27 And you derive from this that it is relevant to say swim like a duck.

00:16:32 So, underneath, in order for us to understand swims like a duck, it feels like we need to know millions of other little pieces of information.

00:16:42 Which we pick up along the way.

00:16:44 You don’t think so.

00:16:45 There doesn’t need to be this knowledge base; those statements carry some rich information that helps us understand the essence of a duck.

00:16:55 Yeah.

00:16:57 How far are we from integrating predicates?

00:17:01 You know that when you consider complete theory of machine learning.

00:17:07 So, what it does, you have a lot of functions.

00:17:12 And then you’re talking it looks like a duck.

00:17:17 You see your training data.

00:17:20 From training data, you recognize how an expected duck should look.

00:17:31 Then you remove all functions which does not look like you think it should look from training data.

00:17:40 So, you decrease amount of function from which you pick up one.

00:17:46 Then you give a second predicate and again decrease the set of function.

00:17:52 And after that you pick up the best function you can find.

00:17:56 It is standard machine learning.

00:17:58 So, why you need not too many examples?

00:18:03 Because your predicates aren’t very good?

00:18:06 That means that predicates are very good because every predicate is invented to decrease admissible set of function.
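The shrinking of the admissible set can be sketched in a few lines (a toy of my own construction, with made-up predicates, not Vapnik's actual procedure): each predicate keeps only the candidate functions whose empirical statistic matches that of the training labels, and the best function is then picked from what remains.

```python
import numpy as np

# Toy sketch: candidate functions are 1-D thresholds "x > t".  Each
# predicate removes candidates whose empirical statistic disagrees with
# the training labels, shrinking the admissible set.
rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, 40)
y = (x > 0.5).astype(int)                      # true concept

thresholds = np.linspace(0.0, 1.0, 101)        # candidate functions
preds = (x[:, None] > thresholds[None, :]).astype(int)   # shape (40, 101)

# Predicate 1 ("looks like"): overall fraction of positives must match.
admissible = np.abs(preds.mean(axis=0) - y.mean()) < 0.1

# Predicate 2 ("swims like"): fraction of positives among large x must match.
right = x > 0.7
admissible &= np.abs(preds[right].mean(axis=0) - y[right].mean()) < 0.1

print("candidates left:", int(admissible.sum()), "of", len(thresholds))

# Finally pick the best remaining function by training accuracy.
acc = (preds == y[:, None]).mean(axis=0)
best = int(np.argmax(np.where(admissible, acc, -1.0)))
print("picked threshold:", thresholds[best], "accuracy:", acc[best])
```

With the candidate pool cut down first, the final selection needs far fewer examples than a search over the full set of functions.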

00:18:17 So, you talk about admissible set of functions and you talk about good functions.

00:18:22 So, what makes a good function?

00:18:24 So, admissible set of function is set of function which has small capacity or small diversity, small VC dimension example.

00:18:35 Which contain good function inside.

00:18:37 So, by the way for people who don’t know VC, you’re the V in the VC.

00:18:45 So, how would you describe to lay person what VC theory is?

00:18:50 How would you describe VC?

00:18:52 So, when you have a machine.

00:18:54 So, machine capable to pick up one function from the admissible set of function.

00:19:02 But set of admissible function can be big.

00:19:07 So, it contain all continuous functions and it’s useless.

00:19:11 You don’t have so many examples to pick up function.

00:19:15 But it can be small.

00:19:17 Small, we call it capacity but maybe better called diversity.

00:19:24 So, not very different function in the set.

00:19:27 It’s infinite set of function but not very diverse.

00:19:31 So, it is small VC dimension.

00:19:34 When VC dimension is small, you need small amount of training data.

00:19:41 So, the goal is to create an admissible set of functions which has small VC dimension and contains a good function.

00:19:53 Then you will be able to pick up the function using small amount of observations.
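VC dimension can be sketched by brute force (an illustration of mine, not from the talk): a class shatters a set of points if it realizes every possible labeling of them, and the VC dimension is the size of the largest shatterable set. Thresholds on the line, for instance, have VC dimension 1.

```python
import numpy as np

# Toy sketch: check shattering by brute force.  A class of classifiers
# shatters a point set if it realizes every 0/1 labeling of the points;
# the VC dimension is the size of the largest shatterable set.

def shatters(points, classifiers):
    realized = {tuple(f(p) for p in points) for f in classifiers}
    return len(realized) == 2 ** len(points)

# Candidate class: 1-D thresholds f_t(x) = [x > t].
thresholds = [lambda x, t=t: int(x > t) for t in np.linspace(-2, 2, 401)]

print("shatters one point:", shatters([0.0], thresholds))         # True
print("shatters two points:", shatters([0.0, 1.0], thresholds))   # False
# The labeling (1, 0) is impossible for any threshold, so the
# VC dimension of thresholds on the line is 1.
```

A small, not-very-diverse class like this needs only a little training data, which is the point being made about capacity.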

00:20:02 So, that is the task of learning?

00:20:06 Yeah.

00:20:07 Is creating a set of admissible functions that has a small VC dimension, and then you figure out a clever way of picking one up?

00:20:17 No, that is goal of learning which I formulated yesterday.

00:20:22 Statistical learning theory does not involve creating the admissible set of functions.

00:20:30 In classical learning theory, everywhere, 100% in textbooks, the admissible set of functions is given.

00:20:39 But this is science about nothing, because the most difficult problem is to create the admissible set of functions:

00:20:47 given, say, a lot of functions, a continuum set of functions, create an admissible set of functions.

00:20:55 That means one that has finite VC dimension, small VC dimension, and contains a good function.

00:21:02 So, this was out of consideration.

00:21:05 So, what’s the process of doing that?

00:21:07 I mean, it’s fascinating.

00:21:08 What is the process of creating this admissible set of functions?

00:21:13 That is invariant.

00:21:15 That’s invariant.

00:21:16 Yeah, you’re looking at properties of the training data, and properties means that you have some function

00:21:30 and you just count what is the value, the average value of the function, on the training data.

00:21:39 You have model and what is expectation of this function on the model and they should coincide.

00:21:46 So, the problem is about how to pick up functions.

00:21:51 It can be any function.

00:21:54 In fact, it is true for all functions.

00:22:00 But because when we’re talking, say, a duck does not jump, you don’t ask the question jumps like a duck

00:22:11 because it is trivial.

00:22:13 It does not jump, and that doesn’t help you to recognize it.

00:22:16 But you know something, which question to ask, and you’re asking swims like a duck;

00:22:24 but looks like a duck, that is a general situation.

00:22:28 Looks like, say, a guy who has this illness, this disease.

00:22:36 It is legal.

00:22:39 So, there is a general type of predicate looks like and special type of predicate,

00:22:47 which related to this specific problem.

00:22:51 And that is intelligence part of all this business and that where teacher is involved.

00:22:56 Incorporating the specialized predicates.

00:23:01 What do you think about deep learning as neural networks, these arbitrary architectures

00:23:08 as helping accomplish some of the tasks you’re thinking about?

00:23:13 Their effectiveness or lack thereof?

00:23:15 What are the weaknesses and what are the possible strengths?

00:23:20 You know, I think that this is fantasy, everything which like deep learning, like features.

00:23:29 Let me give you this example.

00:23:34 One of the greatest books is Churchill book about history of Second World War.

00:23:39 And he started this book describing that in old time when war is over, so the great kings,

00:23:53 they gathered together, almost all of them were relatives, and they discussed what should

00:24:00 be done, how to create peace.

00:24:03 And they came to agreement.

00:24:05 And when happened First World War, the general public came in power.

00:24:13 And they were so greedy that they robbed Germany.

00:24:18 And it was clear for everybody that it is not peace, that peace will last only 20 years

00:24:24 because they were not professionals.

00:24:28 And the same I see in machine learning.

00:24:32 There are mathematicians who are looking for the problem from a very deep point of view,

00:24:38 mathematical point of view.

00:24:40 And there are computer scientists who mostly does not know mathematics.

00:24:46 They just have interpretation of that.

00:24:49 And they invented a lot of blah, blah, blah interpretations like deep learning.

00:24:54 Why you need deep learning?

00:24:55 Mathematics does not know deep learning.

00:24:57 Mathematics does not know neurons.

00:25:00 It is just function.

00:25:02 If you like piecewise linear functions, say that, and work in the class of piecewise linear

00:25:09 functions.

00:25:10 But they invent something.

00:25:12 And then they try to prove advantage of that through interpretations, which mostly wrong.

00:25:22 And when it’s not enough, they appeal to brain, which they know nothing about that.

00:25:27 Nobody knows what’s going on in the brain.

00:25:30 So, I think that more reliable work on math.

00:25:34 This is a mathematical problem.

00:25:36 Do your best to solve this problem.

00:25:39 Try to understand that there is not only one way of convergence, which is strong way of

00:25:45 convergence.

00:25:46 There is a weak way of convergence, which requires predicate.

00:25:49 And if you will go through all this stuff, you will see that you don’t need deep learning.

00:25:56 Even more, I would say one of the theories, which is called the representer theorem,

00:26:03 says that the optimal solution of the mathematical problem which describes learning is on a

00:26:16 shallow network, not on deep learning.

00:26:21 And a shallow network.

00:26:22 Yeah.

00:26:23 The ultimate problem is there.

00:26:24 Absolutely.

00:26:25 In the end, what you’re saying is exactly right.

00:26:29 The question is you have no value for throwing something on the table, playing with it, not

00:26:37 math.

00:26:38 It’s like a neural network where you said throwing something in the bucket or the biological

00:26:43 example and looking at kings and queens or the cells or the microscope.

00:26:47 You don’t see value in imagining the cells or kings and queens and using that as inspiration

00:26:55 and imagination for where the math will eventually lead you.

00:26:59 You think that interpretation basically deceives you in a way that’s not productive.

00:27:06 I think that if you’re trying to analyze this business of learning and especially discussion

00:27:17 about deep learning, it is discussion about interpretation, not about things, about what

00:27:24 you can say about things.

00:27:26 That’s right.

00:27:27 But aren’t you surprised by the beauty of it?

00:27:29 So not mathematical beauty, but the fact that it works at all or are you criticizing that

00:27:38 very beauty, our human desire to interpret, to find our silly interpretations in these

00:27:47 constructs?

00:27:49 Let me ask you this.

00:27:51 Are you surprised and does it inspire you?

00:27:57 How do you feel about the success of a system like AlphaGo at beating the game of Go?

00:28:03 Using neural networks to estimate the quality of a board and the quality of the position.

00:28:11 That is your interpretation, quality of the board.

00:28:14 Yeah, yes.

00:28:15 Yeah.

00:28:16 So it’s not our interpretation.

00:28:20 The fact is, a neural network system, it doesn’t matter, a learning system that we don’t, I

00:28:25 think, mathematically understand that well, beats the best human player, does something

00:28:30 that was thought impossible.

00:28:31 That means that it’s not a very difficult problem.

00:28:35 So you empirically, we’ve empirically have discovered that this is not a very difficult

00:28:40 problem.

00:28:41 Yeah.

00:28:42 It’s true.

00:28:44 So maybe, can’t argue.

00:28:48 So even more I would say that if they use deep learning, it is not the most effective

00:28:56 way of learning theory.

00:29:00 And usually when people use deep learning, they’re using zillions of training data.

00:29:08 Yeah.

00:29:10 But you don’t need this.

00:29:13 So I described a challenge: can we solve some problems, on which deep learning methods, these deep

00:29:23 nets, do well, using a hundred times less training data?

00:29:28 Even more, some problems deep learning cannot solve, because it does not necessarily create an

00:29:38 admissible set of functions.

00:29:40 To create deep architecture means to create admissible set of functions.

00:29:45 You cannot say that you’re creating good admissible set of functions.

00:29:50 You just, it’s your fantasy.

00:29:52 It does not come from us.

00:29:54 But it is possible to create admissible set of functions because you have your training

00:30:00 data.

00:30:01 Actually, for mathematicians, when you consider an invariant, you need to use the law of

00:30:10 large numbers.

00:30:11 When you’re making training in existing algorithm, you need uniform law of large numbers, which

00:30:20 is much more difficult, it requires VC dimension and all this stuff.

00:30:25 But nevertheless, if you use both weak and strong way of convergence, you can decrease

00:30:33 a lot of training data.

00:30:35 You could do the three, the swims like a duck and quacks like a duck.

00:30:41 So let’s step back and think about human intelligence in general.

00:30:48 Clearly that has evolved in a non mathematical way.

00:30:54 It wasn’t, as far as we know, God or whoever didn’t come up with a model and place in our

00:31:04 brain of admissible functions.

00:31:05 It kind of evolved.

00:31:06 I don’t know, maybe you have a view on this.

00:31:09 So Alan Turing in the 50s, in his paper, asked and rejected the question, can machines think?

00:31:16 It’s not a very useful question, but can you briefly entertain this useful, useless question?

00:31:23 Can machines think?

00:31:25 So talk about intelligence and your view of it.

00:31:28 I don’t know that.

00:31:29 I know that Turing described imitation.

00:31:35 If computer can imitate human being, let’s call it intelligent.

00:31:43 And he understands that it is not thinking computer.

00:31:46 He completely understands what he’s doing.

00:31:49 But he set up problem of imitation.

00:31:53 So now we understand that the problem is not in imitation.

00:31:58 I’m not sure that intelligence is just inside of us.

00:32:04 It may be also outside of us.

00:32:06 I have several observations.

00:32:09 So when I prove some theorem, it’s very difficult theorem, in couple of years, in several places,

00:32:20 people prove the same theorem; say, Sauer’s lemma, after ours was done, then other guys

00:32:27 proved the same theorem.

00:32:28 In the history of science, it’s happened all the time.

00:32:32 For example, geometry: it happened simultaneously, first it was done by Lobachevsky, and then Gauss and

00:32:40 Bolyai and other guys, and it’s approximately in a 10-year period of time.

00:32:48 And I saw a lot of examples like that.

00:32:51 And many mathematicians think that when they develop something, they develop something

00:32:57 in general which affect everybody.

00:33:01 So maybe our model that intelligence is only inside of us is incorrect.

00:33:07 It’s our interpretation.

00:33:09 It might be there exists some connection with world intelligence.

00:33:15 I don’t know.

00:33:16 You’re almost like plugging in into…

00:33:19 Yeah, exactly.

00:33:21 And contributing to this…

00:33:22 Into a big network.

00:33:24 Into a big, maybe in your own network.

00:33:28 On the flip side of that, maybe you can comment on big O complexity and how you see classifying

00:33:37 algorithms by worst case running time in relation to their input.

00:33:42 So that way of thinking about functions, do you think P equals NP, do you think that’s

00:33:47 an interesting question?

00:33:49 Yeah, it is an interesting question.

00:33:52 But let me talk about complexity in about worst case scenario.

00:34:00 There is a mathematical setting.

00:34:04 When I came to the United States in 1990, people did not know, they did not know statistical

00:34:11 learning theory.

00:34:13 In Russia, two monographs were published, our monographs, but in America they didn’t

00:34:19 know.

00:34:20 Then they learned, and somebody told me that it is a worst-case theory and they would create

00:34:26 a real-case theory, but till now they did not.

00:34:30 Because it is mathematical too.

00:34:34 You can do only what you can do using mathematics.

00:34:38 And which has a clear understanding and clear description.

00:34:45 And for this reason, we introduce complexity.

00:34:52 And we need this because using, actually it is diversity, I like this one more.

00:35:01 Using VC dimension, you can prove some theorems.

00:35:05 But we also create theory for case when you know probability measure.

00:35:12 And that is the best case which can happen, it is entropy theory.

00:35:18 So from mathematical point of view, you know the best possible case and the worst possible

00:35:24 case.

00:35:25 You can derive different model in medium, but it’s not so interesting.

00:35:30 You think the edges are interesting?

00:35:33 The edges are interesting because it is not so easy to get a good bound, an exact bound.

00:35:44 There are not many cases where you have an exact bound.

00:35:49 But there are interesting principles which the math discovers.

00:35:54 Do you think it’s interesting because it’s challenging and reveals interesting principles

00:36:00 that allow you to get those bounds?

00:36:02 Or do you think it’s interesting because it’s actually very useful for understanding the

00:36:06 essence of a function of an algorithm?

00:36:11 So it’s like me judging your life as a human being by the worst thing you did and the best

00:36:17 thing you did versus all the stuff in the middle.

00:36:21 It seems not productive.

00:36:24 I don’t think so because you cannot describe situation in the middle.

00:36:31 So it will be not general.

00:36:34 So you can describe edges cases and it is clear it has some model, but you cannot describe

00:36:44 model for every new case.

00:36:47 So you will be never accurate when you’re using model.

00:36:53 But from a statistical point of view, the way you’ve studied functions and the nature

00:36:59 of learning in the world, don’t you think that the real world has a very long tail?

00:37:07 That the edge cases are very far away from the mean, the stuff in the middle or no?

00:37:19 I don’t know that.

00:37:21 I think that from my point of view, if you will use formal statistic, you need uniform

00:37:36 law of large numbers.

00:37:40 If you will use this invariance business, you will need just law of large numbers.

00:37:52 And there’s this huge difference between uniform law of large numbers and large numbers.
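That difference can be sketched numerically (my illustration, with arbitrary numbers): for one fixed function, the plain law of large numbers controls the deviation of the empirical mean; the uniform law must control the worst deviation over a whole class at once, which is what VC theory bounds.

```python
import numpy as np

# Sketch: plain LLN controls ONE fixed function; the uniform LLN must
# control the supremum of deviations over a whole class at once.
rng = np.random.default_rng(0)
n = 2000
x = rng.uniform(0.0, 1.0, n)

# One fixed predicate f(z) = [z > 0.3]: empirical mean vs true mean 0.7.
single_dev = abs((x > 0.3).mean() - 0.7)

# Whole class of thresholds: sup over t of |empirical P(z > t) - (1 - t)|.
ts = np.linspace(0.0, 1.0, 1001)
emp = (x[:, None] > ts[None, :]).mean(axis=0)
uniform_dev = float(np.abs(emp - (1.0 - ts)).max())

print(f"single-function deviation: {single_dev:.4f}")
print(f"uniform deviation over the class: {uniform_dev:.4f}")
```

The uniform deviation is always at least the single-function one, and controlling it over a rich class is what forces the large sample sizes of formal statistics.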

00:37:56 Is it useful to describe that a little more or should we just take it to…

00:38:01 For example, when I’m talking about duck, I give three predicates and that was enough.

00:38:09 But if you will try to do formal distinguish, you will need a lot of observations.

00:38:19 So that means that information about looks like a duck contain a lot of bit of information,

00:38:27 formal bits of information.

00:38:29 So we don’t know that how much bit of information contain things from artificial and from intelligence.

00:38:39 And that is the subject of analysis.

00:38:42 Till now, all business, I don’t like how people consider artificial intelligence.

00:38:54 They consider it as some code which imitates the activity of a human being.

00:39:01 It is not science, it is applications.

00:39:03 If you would like to imitate, go ahead; it is very useful and a good problem.

00:39:09 But you need to learn something more.

00:39:15 How do people try to do it, how can people develop, say, predicates like “seems like a duck” or “plays

00:39:25 like a butterfly” or something like that?

00:39:29 Not what the teacher tells you: how did it come into his mind, how did he choose this image?

00:39:37 So that process…

00:39:38 That is problem of intelligence.

00:39:39 That is the problem of intelligence and you see that connected to the problem of learning?

00:39:44 Absolutely.

00:39:45 Because you immediately give this specific predicate, like “seems like a duck”

00:39:52 or “quacks like a duck”.

00:39:54 It was chosen somehow.

00:39:57 So what is the line of work, would you say?

00:40:01 If you were to formulate it as a set of open problems that will take us there, to “play

00:40:08 like a butterfly”?

00:40:09 We’ll get a system to be able to…

00:40:12 Let’s separate two stories.

00:40:14 One is a mathematical story: if you have a predicate, you can do something.

00:40:20 And another story is how to get the predicate.

00:40:23 That is the intelligence problem, and people have not even started to understand intelligence.

00:40:32 Because to understand intelligence, first of all, try to understand what teachers do.

00:40:39 How does a teacher teach, why is one teacher better than another one?

00:40:43 Yeah.

00:40:44 And so you think we really haven’t even started on the journey of generating the predicates?

00:40:50 No.

00:40:51 We don’t understand.

00:40:52 We don’t even understand that this problem exists.

00:40:56 Because did you hear…

00:40:57 You do.

00:40:58 No, I just know the name.

00:41:02 I want to understand why one teacher is better than another, and how the teacher affects the student.

00:41:13 It is not because he is repeating the problems which are in the textbook.

00:41:18 He makes some remarks.

00:41:20 He makes some philosophy of reasoning.

00:41:23 Yeah, that’s a beautiful…

00:41:24 So it is the formulation of a question that is the open problem.

00:41:31 Why is one teacher better than another?

00:41:34 Right.

00:41:35 What does he do better?

00:41:37 Yeah.

00:41:38 What…

00:41:39 What…

00:41:40 Why in…

00:41:41 At every level?

00:41:42 How do they get better?

00:41:45 What does it mean to be better?

00:41:48 The whole…

00:41:49 Yeah.

00:41:50 Yeah.

00:41:51 From whatever model I have, one teacher can give a very good predicate.

00:41:56 One teacher can say “swims like a dog” and another can say “jumps like a dog”.

00:42:03 And “jumps like a dog” carries zero information.

00:42:09 So what is the most exciting problem in statistical learning you’ve ever worked on or are working

00:42:14 on now?

00:42:17 I just finished this invariant story and I’m happy that…

00:42:24 I believe that it is the ultimate learning story.

00:42:30 At least I can show that there is no other mechanism, only two mechanisms.

00:42:38 But they separate the statistical part from the intelligent part, and I know nothing about the intelligent

00:42:46 part.

00:42:47 And if we come to know this intelligent part, it will help us a lot in teaching, in learning.

00:42:59 In learning.

00:43:00 Yeah.

00:43:01 Will we know it when we see it?

00:43:02 So for example, in my talk, the last slide was a challenge.

00:43:07 So you have, say, the MNIST digit recognition problem, and deep learning claims that they did it

00:43:14 very well, say 99.5% correct answers.

00:43:22 But they use 60,000 observations.

00:43:25 Can you do the same using a hundred times less data?

00:43:29 But by incorporating invariants. What does it mean? You know digits one, two, three.

00:43:35 Just looking at them, explain to me which invariants I should use so I can take a hundred examples, or say

00:43:44 a hundred times fewer examples, and do the same job.

00:43:47 Yeah, that last slide, unfortunately your talk ended quickly, but that last slide was

00:43:56 a powerful open challenge and a formulation of the essence here.

00:44:01 What is the exact problem of intelligence?

00:44:06 Because everybody, when machine learning started and it was developed by mathematicians, they

00:44:15 immediately recognized that we use much more training data than humans need.

00:44:22 But now again, we come to the same story: we have to decrease the amount of training data.

00:44:27 That is the problem of learning.

00:44:30 It is not like in deep learning, where they use zillions of training data, because maybe zillions

00:44:37 are not enough if you do not have good invariants.

00:44:44 Maybe you will never be able to collect enough observations.

00:44:49 But now it is a question for intelligence: how to do that?

00:44:56 Because the statistical part is ready: as soon as you supply us with a predicate, we can do

00:45:03 a good job with a small number of observations.

00:45:06 And the very first challenge is the well known digit recognition problem.

00:45:11 You know digits, so please tell me the invariants.

00:45:15 Thinking about that, I can say that for digit three, I would introduce the concept of horizontal symmetry.

00:45:25 So the digit three has more horizontal symmetry than, say, digit two, or something

00:45:32 like that.

00:45:33 But as soon as I get the idea of horizontal symmetry, I can mathematically invent a lot

00:45:40 of measures of horizontal symmetry, or then vertical symmetry, or diagonal symmetry, whatever,

00:45:47 if I have the idea of symmetry.
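One concrete way to read “a measure of horizontal symmetry” is the fraction of pixels that survive a top-to-bottom flip. The sketch below uses tiny hypothetical 5x3 bitmaps, not MNIST data; the measure is just one of the many Vapnik says you could invent once you have the idea of symmetry.

```python
# Hypothetical toy bitmaps ('1' = ink); real digits would be 28x28 images.
THREE = ["111",
         "001",
         "111",
         "001",
         "111"]

TWO   = ["111",
         "001",
         "111",
         "100",
         "111"]

def horizontal_symmetry(img):
    """Fraction of pixels unchanged when the image is mirrored
    about a horizontal axis (flipped top-to-bottom)."""
    flipped = img[::-1]
    total = sum(len(row) for row in img)
    same = sum(a == b
               for row, frow in zip(img, flipped)
               for a, b in zip(row, frow))
    return same / total

print(horizontal_symmetry(THREE))  # 1.0: this three mirrors onto itself
print(horizontal_symmetry(TWO))    # ~0.73: this two does not
```

Used as a predicate, such a score lets a learner reject candidate functions wholesale, which is exactly the role invariants play in the challenge described above.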

00:45:49 But what else?

00:45:52 Thinking about digits, I see that it is a meta predicate, which is not a shape; it is something like symmetry,

00:46:07 like how dark the whole picture is, something like that, which can itself give rise to a predicate.

00:46:16 You think such a predicate could arise out of something that is not general, meaning

00:46:29 it feels like for me to be able to understand the difference between a two and a three, I would

00:46:35 need to have had a childhood of 10 to 15 years playing with kids, going to school, being

00:46:48 yelled at by parents, all of that, walking, jumping, looking at ducks, and then I would be able

00:46:57 to generate the right predicate for telling the difference between a two and a three.

00:47:03 Or do you think there’s a more efficient way?

00:47:05 I don’t know.

00:47:06 I know for sure that you must know something more than digits.

00:47:12 Yes.

00:47:13 And that’s a powerful statement.

00:47:15 Yeah.

00:47:16 But maybe there are several languages for describing these elements of digits.

00:47:24 So I’m talking about symmetry, about some properties of geometry, I’m talking about

00:47:32 something abstract.

00:47:33 I don’t know that.

00:47:34 But this is a problem of intelligence.

00:47:38 So in one of our articles, it is trivial to show that every example can carry not more

00:47:47 than one bit of information in reality.

00:47:50 Because when you show an example and you say “this is a one”, you can remove, say, the functions

00:48:00 which do not say it is a one, and the best strategy,

00:48:05 if you can do it perfectly, is to remove half of the functions.

00:48:10 But when you use one predicate, like “looks like a duck”, you can remove many more functions

00:48:17 than half.

00:48:18 And that means that it contains many bits of information from a formal point of view.
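The halving argument can be sketched as a toy version-space computation. This is an editorial illustration on four points, not the formal argument from the article: the mirror-symmetry constraint stands in for a predicate like “looks like a duck”.

```python
from itertools import product
from math import log2

X = [0, 1, 2, 3]
# Version space: all 2^4 = 16 boolean functions on X, as label tuples.
version_space = list(product([0, 1], repeat=len(X)))

# One labeled example (x=0 is labeled 1) keeps exactly the functions
# that agree with it: half of them, i.e. log2(16/8) = 1 bit.
after_example = [f for f in version_space if f[0] == 1]
bits_example = log2(len(version_space) / len(after_example))

# A predicate constrains the SHAPE of the function everywhere at once,
# e.g. "f is mirror-symmetric: f(x) = f(3 - x)". Only 4 of 16 survive,
# so one statement delivers log2(16/4) = 2 bits.
after_predicate = [f for f in version_space
                   if all(f[x] == f[3 - x] for x in X)]
bits_predicate = log2(len(version_space) / len(after_predicate))

print(bits_example, bits_predicate)  # 1.0 2.0
```

The gap only widens on larger domains: a single example still removes at best half the functions, while a global constraint can eliminate an exponentially large fraction of them.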

00:48:26 But when you have a general picture of what you want to recognize and a general picture

00:48:34 of the world, can you invent this predicate?

00:48:40 And that predicate carries a lot of information.

00:48:47 Beautifully put.

00:48:48 Maybe just me, but in all the math you show, in your work, which is some of the most profound

00:48:56 mathematical work in the field of learning AI and just math in general, I hear a lot

00:49:02 of poetry and philosophy.

00:49:04 You really kind of talk about philosophy of science.

00:49:09 There’s a poetry and music to a lot of the work you’re doing and the way you’re thinking

00:49:13 about it.

00:49:14 So do you, where does that come from?

00:49:16 Do you escape to poetry?

00:49:18 Do you escape to music or not?

00:49:21 I think that there exists ground truth.

00:49:23 There exists ground truth?

00:49:25 Yeah.

00:49:26 And that can be seen everywhere.

00:49:30 Smart guys, philosophers, sometimes I’m surprised how deep they see.

00:49:39 Sometimes I see that some of them are completely off the subject.

00:49:45 But the ground truth I see in music.

00:49:50 Music is the ground truth?

00:49:51 Yeah.

00:49:52 And in poetry: many poets believe that they take dictation.

00:50:01 So what piece of music as a piece of empirical evidence gave you a sense that they are touching

00:50:12 something in the ground truth?

00:50:14 It is structure.

00:50:16 The structure of the math of music.

00:50:17 Yeah, because when you’re listening to Bach, you see the structure.

00:50:22 Very clear, very classic, very simple, and the same in math: when you have axioms in geometry,

00:50:31 you have the same feeling.

00:50:32 And in poetry, sometimes you see the same.

00:50:38 And if you look back at your childhood, you grew up in Russia, you maybe were born as

00:50:44 a researcher in Russia, you developed as a researcher in Russia, and you came to the United

00:50:48 States and a few places.

00:50:51 If you look back, what was some of your happiest moments as a researcher, some of the most

00:51:00 profound moments, not in terms of their impact on society, but in terms of their impact on

00:51:09 how damn good you feel that day and you remember that moment?

00:51:15 You know, every time you find something, it is a great moment in life, even simple things.

00:51:26 But my general feeling is that most of the time I was wrong.

00:51:32 You should go again and again and again, and try to be honest with yourself, not

00:51:39 to make interpretations, but to try to understand whether it is related to the ground truth, that it is not

00:51:47 my blah, blah, blah interpretation or something like that.

00:51:52 But you’re allowed to get excited at the possibility of discovery.

00:51:56 Oh yeah.

00:51:57 You have to double check it.

00:51:59 No, but how is it related to the ground truth: is it just temporary, or is it forever?

00:52:10 You know, you always have a feeling, when you find something, of how big it is.

00:52:19 So 20 years ago when we discovered statistical learning theory, nobody believed in it, except for

00:52:26 one guy, Dudley from MIT, and then in 20 years it became fashionable, and the same with support

00:52:37 vector machines, that is, kernel machines.

00:52:41 So with support vector machines and learning theory, when you were working on it, you had

00:52:49 a sense of the profundity of it, that this seems to be right, this seems

00:52:59 to be powerful.

00:53:00 Right.

00:53:01 Absolutely.

00:53:02 Immediately.

00:53:03 I recognized that it would last forever, and now, when I found this invariant story, I have

00:53:18 a feeling that it is the complete learning story, because I have proof that there are no other mechanisms.

00:53:24 You can make some cosmetic improvements, but in terms of invariants, you need

00:53:35 both invariants and statistical learning, and they should work together.

00:53:41 But also I’m happy that from this we can formulate what intelligence is, and separate it

00:53:52 from the technical part, and that is completely different.

00:53:57 Absolutely.

00:53:58 Well, Vladimir, thank you so much for talking today.

00:54:00 Thank you.

00:54:01 It’s an honor.