Ian Goodfellow: Generative Adversarial Networks (GANs) #19

Transcript

00:00:00 The following is a conversation with Ian Goodfellow.

00:00:03 He’s the author of the popular textbook on deep learning

00:00:06 simply titled Deep Learning.

00:00:08 He coined the term Generative Adversarial Networks,

00:00:12 otherwise known as GANs,

00:00:14 and with his 2014 paper is responsible

00:00:18 for launching the incredible growth

00:00:20 of research and innovation in this subfield

00:00:23 of deep learning.

00:00:24 He got his BS and MS at Stanford,

00:00:27 his PhD at University of Montreal

00:00:30 with Yoshua Bengio and Aaron Courville.

00:00:33 He held several research positions

00:00:35 including at OpenAI, Google Brain,

00:00:37 and now at Apple as the Director of Machine Learning.

00:00:41 This recording happened while Ian was still at Google Brain,

00:00:45 but we don’t talk about anything specific to Google

00:00:48 or any other organization.

00:00:50 This conversation is part

00:00:52 of the Artificial Intelligence Podcast.

00:00:54 If you enjoy it, subscribe on YouTube, iTunes,

00:00:57 or simply connect with me on Twitter at Lex Fridman,

00:01:00 spelled F R I D.

00:01:03 And now here’s my conversation with Ian Goodfellow.

00:01:08 You open your popular deep learning book

00:01:11 with a Russian doll type diagram

00:01:13 that shows deep learning is a subset

00:01:15 of representation learning,

00:01:17 which in turn is a subset of machine learning

00:01:19 and finally a subset of AI.

00:01:22 So this kind of implies that there may be limits

00:01:25 to deep learning in the context of AI.

00:01:27 So what do you think are the current limits of deep learning

00:01:31 and are those limits something

00:01:33 that we can overcome with time?

00:01:35 Yeah, I think one of the biggest limitations

00:01:37 of deep learning is that right now it requires

00:01:40 really a lot of data, especially labeled data.

00:01:43 There are some unsupervised

00:01:45 and semi supervised learning algorithms

00:01:47 that can reduce the amount of labeled data you need,

00:01:49 but they still require a lot of unlabeled data.

00:01:52 Reinforcement learning algorithms,

00:01:53 they don’t need labels,

00:01:54 but they need really a lot of experiences.

00:01:57 As human beings, we don’t learn to play Pong

00:01:58 by failing at Pong 2 million times.

00:02:02 So just getting the generalization ability better

00:02:05 is one of the most important bottlenecks

00:02:08 in the capability of the technology today.

00:02:10 And then I guess I’d also say deep learning

00:02:12 is like a component of a bigger system.

00:02:16 So far, nobody is really proposing to have

00:02:19 only what you’d call deep learning

00:02:22 as the entire ingredient of intelligence.

00:02:25 You use deep learning as sub modules of other systems,

00:02:29 like AlphaGo has a deep learning model

00:02:32 that estimates the value function.

00:02:35 Most reinforcement learning algorithms

00:02:36 have a deep learning module

00:02:37 that estimates which action to take next,

00:02:40 but you might have other components.

00:02:42 So you’re basically building a function estimator.

00:02:46 Do you think it’s possible,

00:02:48 you said nobody’s kind of been thinking about this so far,

00:02:51 but do you think neural networks could be made to reason

00:02:54 in the way symbolic systems did in the 80s and 90s

00:02:57 to do more, create more like programs

00:03:00 as opposed to functions?

00:03:01 Yeah, I think we already see that a little bit.

00:03:04 I already kind of think of neural nets

00:03:06 as a kind of program.

00:03:08 I think of deep learning as basically learning programs

00:03:12 that have more than one step.

00:03:15 So if you draw a flow chart

00:03:16 or if you draw a TensorFlow graph

00:03:19 describing your machine learning model,

00:03:21 I think of the depth of that graph

00:03:23 as describing the number of steps that run in sequence.

00:03:25 And then the width of that graph

00:03:27 is the number of steps that run in parallel.

00:03:30 Now it’s been long enough

00:03:31 that we’ve had deep learning working

00:03:32 that it’s a little bit silly

00:03:33 to even discuss shallow learning anymore.

00:03:35 But back when I first got involved in AI,

00:03:38 when we used machine learning,

00:03:40 we were usually learning things like support vector machines.

00:03:43 You could have a lot of input features to the model

00:03:45 and you could multiply each feature by a different weight.

00:03:48 All those multiplications were done

00:03:49 in parallel to each other.

00:03:51 There wasn’t a lot done in series.

00:03:52 I think what we got with deep learning

00:03:54 was really the ability to have steps of a program

00:03:58 that run in sequence.

00:04:00 And I think that we’ve actually started to see

00:04:03 that what’s important with deep learning

00:04:05 is more the fact that we have a multi step program

00:04:07 rather than the fact that we’ve learned a representation.

00:04:10 If you look at things like ResNets, for example,

00:04:15 they take one particular kind of representation

00:04:18 and they update it several times.

00:04:21 Back when deep learning first really took off

00:04:23 in the academic world in 2006,

00:04:25 when Jeff Hinton showed that you could train

00:04:28 deep belief networks,

00:04:30 everybody who was interested in the idea

00:04:31 thought of it as each layer

00:04:33 learns a different level of abstraction.

00:04:35 That the first layer trained on images

00:04:37 learns something like edges

00:04:38 and the second layer learns corners.

00:04:40 And eventually you get these kind of grandmother cell units

00:04:43 that recognize specific objects.

00:04:45 Today I think most people think of it more

00:04:48 as a computer program where as you add more layers

00:04:52 you can do more updates before you output your final number.

00:04:55 But I don’t think anybody believes that

00:04:57 layer 150 of the ResNet is a grandmother cell

00:05:02 and layer 100 is contours or something like that.

00:05:06 Okay, so you’re not thinking of it

00:05:08 as a singular representation that keeps building.

00:05:11 You think of it as a program,

00:05:14 sort of almost like a state.

00:05:15 Representation is a state of understanding.

00:05:18 Yeah, I think of it as a program

00:05:20 that makes several updates

00:05:21 and arrives at better and better understandings,

00:05:23 but it’s not replacing the representation at each step.

00:05:27 It’s refining it.

00:05:29 And in some sense, that’s a little bit like reasoning.

00:05:31 It’s not reasoning in the form of deduction,

00:05:33 but it’s reasoning in the form of taking a thought

00:05:36 and refining it and refining it carefully

00:05:39 until it’s good enough to use.

00:05:41 So do you think, and I hope you don’t mind,

00:05:43 we’ll jump philosophical every once in a while.

00:05:46 Do you think of cognition, human cognition,

00:05:50 or even consciousness as simply a result

00:05:53 of this kind of sequential representation learning?

00:05:58 Do you think that can emerge?

00:06:00 Cognition, yes, I think so.

00:06:02 Consciousness, it’s really hard to even define

00:06:05 what we mean by that.

00:06:07 I guess there’s, consciousness is often defined

00:06:09 as things like having self awareness,

00:06:12 and that’s relatively easy to turn into something actionable

00:06:16 for a computer scientist to reason about.

00:06:18 People also define consciousness

00:06:19 in terms of having qualitative states of experience,

00:06:22 like qualia, and there’s all these philosophical problems,

00:06:25 like could you imagine a zombie

00:06:27 who does all the same information processing as a human,

00:06:30 but doesn’t really have the qualitative experiences

00:06:33 that we have?

00:06:34 That sort of thing, I have no idea how to formalize

00:06:37 or turn it into a scientific question.

00:06:40 I don’t know how you could run an experiment

00:06:41 to tell whether a person is a zombie or not.

00:06:44 And similarly, I don’t know how you could run

00:06:46 an experiment to tell whether an advanced AI system

00:06:49 had become conscious in the sense of qualia or not.

00:06:53 But in the more practical sense,

00:06:54 like almost like self attention,

00:06:56 you think consciousness and cognition can,

00:06:58 in an impressive way, emerge from current types

00:07:03 of architectures that we think of as learning.

00:07:06 Or if you think of consciousness

00:07:07 in terms of self awareness and just making plans

00:07:12 based on the fact that the agent itself exists in the world,

00:07:16 reinforcement learning algorithms

00:07:18 are already more or less forced

00:07:20 to model the agent’s effect on the environment.

00:07:23 So that more limited version of consciousness

00:07:26 is already something that we get limited versions of

00:07:31 with reinforcement learning algorithms

00:07:32 if they’re trained well.

00:07:34 But you say limited, so the big question really

00:07:39 is how you jump from limited to human level, right?

00:07:42 And whether it’s possible,

00:07:46 even just building common sense reasoning

00:07:49 seems to be exceptionally difficult.

00:07:50 So if we scale things up,

00:07:52 if we get much better on supervised learning,

00:07:55 if we get better at labeling,

00:07:56 if we get bigger data sets, more compute,

00:08:00 do you think we’ll start to see really impressive things

00:08:03 that go from limited to something,

00:08:08 echoes of human level cognition?

00:08:10 I think so, yeah.

00:08:11 I’m optimistic about what can happen

00:08:13 just with more computation and more data.

00:08:16 I do think it’ll be important

00:08:17 to get the right kind of data.

00:08:20 Today, most of the machine learning systems we train

00:08:23 are mostly trained on one type of data for each model.

00:08:27 But the human brain, we get all of our different senses

00:08:31 and we have many different experiences

00:08:33 like riding a bike, driving a car,

00:08:36 talking to people, reading.

00:08:39 I think when we get that kind of integrated data set,

00:08:42 working with a machine learning model

00:08:44 that can actually close the loop and interact,

00:08:47 we may find that algorithms not so different

00:08:50 from what we have today learn really interesting things

00:08:53 when you scale them up a lot

00:08:54 and train them on a large amount of multimodal data.

00:08:58 So multimodal is really interesting,

00:08:59 but within, like you’re working on adversarial examples.

00:09:04 So selecting, within one mode of data,

00:09:11 the difficult cases

00:09:13 that are most useful to learn from.

00:09:16 Oh yeah, like could we get a whole lot of mileage

00:09:18 out of designing a model that’s resistant

00:09:22 to adversarial examples or something like that?

00:09:24 Right, that’s the question.

00:09:26 My thinking on that has evolved a lot

00:09:27 over the last few years.

00:09:29 When I first started to really invest

00:09:31 in studying adversarial examples,

00:09:32 I was thinking of it mostly as adversarial examples

00:09:36 reveal a big problem with machine learning

00:09:38 and we would like to close the gap

00:09:41 between how machine learning models respond

00:09:44 to adversarial examples and how humans respond.

00:09:47 After studying the problem more,

00:09:49 I still think that adversarial examples are important.

00:09:51 I think of them now more of as a security liability

00:09:55 than as an issue that necessarily shows

00:09:57 there’s something uniquely wrong

00:09:59 with machine learning as opposed to humans.

00:10:02 Also, do you see them as a tool

00:10:04 to improve the performance of the system?

00:10:06 Not on the security side, but literally just accuracy.

00:10:10 I do see them as a kind of tool on that side,

00:10:13 but maybe not quite as much as I used to think.

00:10:16 We’ve started to find that there’s a trade off

00:10:18 between accuracy on adversarial examples

00:10:21 and accuracy on clean examples.

00:10:24 Back in 2014, when I did the first

00:10:27 adversarily trained classifier that showed resistance

00:10:30 to some kinds of adversarial examples,

00:10:33 it also got better at the clean data on MNIST.

00:10:36 And that’s something we’ve replicated several times

00:10:37 on MNIST, that when we train

00:10:39 against weak adversarial examples,

00:10:41 MNIST classifiers get more accurate.

00:10:43 So far that hasn’t really held up on other data sets

00:10:47 and hasn’t held up when we train

00:10:48 against stronger adversaries.

00:10:50 It seems like when you confront

00:10:53 a really strong adversary,

00:10:55 you tend to have to give something up.
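
To make the adversarial training discussed above concrete, here is a minimal sketch, in PyTorch, of training against weak adversarial examples generated with the fast gradient sign method. The model, the epsilon, and the 50/50 loss weighting are illustrative assumptions, not details from the conversation.

```python
# A minimal sketch of adversarial training with the fast gradient sign
# method (FGSM), the kind of "weak adversary" training discussed above.
# Model, epsilon, and loss weighting are illustrative placeholders.
import torch
import torch.nn.functional as F

def fgsm_perturb(model, x, y, epsilon=0.1):
    """Create adversarial examples by taking one signed-gradient step."""
    x_adv = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x_adv), y)
    loss.backward()
    # Move each input in the direction that increases the loss.
    return (x_adv + epsilon * x_adv.grad.sign()).detach()

def adversarial_training_step(model, optimizer, x, y, epsilon=0.1):
    """One training step that mixes clean and adversarial examples."""
    x_adv = fgsm_perturb(model, x, y, epsilon)
    optimizer.zero_grad()
    loss = 0.5 * F.cross_entropy(model(x), y) \
         + 0.5 * F.cross_entropy(model(x_adv), y)
    loss.backward()
    optimizer.step()
    return loss.item()
```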

00:10:58 Interesting.

00:10:59 But it’s such a compelling idea

00:11:00 because it feels like that’s how us humans learn

00:11:04 is through the difficult cases.

00:11:06 We try to think of what would we screw up

00:11:08 and then we make sure we fix that.

00:11:11 It’s also in a lot of branches of engineering,

00:11:13 you do a worst case analysis

00:11:15 and make sure that your system will work in the worst case.

00:11:18 And then that guarantees that it’ll work

00:11:20 in all of the messy average cases that happen

00:11:24 when you go out into a really randomized world.

00:11:27 Yeah, with driving with autonomous vehicles,

00:11:29 there seems to be a desire to just look for,

00:11:33 think adversarially,

00:11:34 try to figure out how to mess up the system.

00:11:36 And if you can be robust to all those difficult cases,

00:11:40 then you can, it’s a hand wavy empirical way

00:11:43 to show your system is safe.

00:11:47 Today, most adversarial example research

00:11:49 isn’t really focused on a particular use case,

00:11:51 but there are a lot of different use cases

00:11:54 where you’d like to make sure that the adversary

00:11:56 can’t interfere with the operation of your system.

00:12:00 Like in finance,

00:12:01 if you have an algorithm making trades for you,

00:12:03 people go to a lot of effort

00:12:04 to obfuscate their algorithm.

00:12:06 That’s both to protect their IP

00:12:08 because you don’t want to research

00:12:10 and develop a profitable trading algorithm

00:12:13 then have somebody else capture the gains.

00:12:16 But it’s at least partly

00:12:17 because you don’t want people to make adversarial examples

00:12:19 that fool your algorithm into making bad trades.

00:12:24 Or I guess one area that’s been popular

00:12:26 in the academic literature is speech recognition.

00:12:30 If you use speech recognition to hear an audio wave form

00:12:34 and then turn that into a command

00:12:37 that a phone executes for you,

00:12:39 you don’t want a malicious adversary

00:12:41 to be able to produce audio

00:12:43 that gets interpreted as malicious commands,

00:12:46 especially if a human in the room doesn’t realize

00:12:48 that something like that is happening.

00:12:50 And speech recognition,

00:12:52 has there been much success

00:12:53 in being able to create adversarial examples

00:12:58 that fool the system?

00:12:59 Yeah, actually.

00:13:00 I guess the first work that I’m aware of

00:13:02 is a paper called Hidden Voice Commands

00:13:05 that came out in 2016, I believe.

00:13:08 And they were able to show that they could make sounds

00:13:11 that are not understandable by a human

00:13:14 but are recognized as the target phrase

00:13:18 that the attacker wants the phone to recognize it as.

00:13:21 Since then, things have gotten a little bit better

00:13:24 on the attacker’s side

00:13:25 and worse on the defender’s side.

00:13:28 It’s become possible to make sounds

00:13:33 that sound like normal speech

00:13:35 but are actually interpreted as a different sentence

00:13:38 than the human hears.

00:13:40 The level of perceptibility

00:13:42 of the adversarial perturbation is still kind of high.

00:13:46 When you listen to the recording,

00:13:48 it sounds like there’s some noise in the background,

00:13:51 just like rustling sounds.

00:13:52 But those rustling sounds

00:13:53 are actually the adversarial perturbation

00:13:55 that makes the phone hear a completely different sentence.

00:13:58 Yeah, that’s so fascinating.

00:14:00 Peter Norvig mentioned

00:14:01 that you’re writing the deep learning chapter

00:14:02 for the fourth edition

00:14:04 of the Artificial Intelligence, A Modern Approach book.

00:14:07 So how do you even begin summarizing

00:14:10 the field of deep learning in a chapter?

00:14:13 Well, in my case, I waited like a year

00:14:16 before I actually wrote anything.

00:14:19 Even having written a full length textbook before,

00:14:22 it’s still pretty intimidating

00:14:25 to try to start writing just one chapter

00:14:27 that covers everything.

00:14:31 One thing that helped me make that plan

00:14:33 was actually the experience

00:14:34 of having written the full book before

00:14:36 and then watching how the field changed

00:14:39 after the book came out.

00:14:41 I’ve realized there’s a lot of topics

00:14:42 that were maybe extraneous in the first book

00:14:45 and just seeing what stood the test

00:14:47 of a few years of being published

00:14:49 and what seems a little bit less important

00:14:52 to have included now helped me pare down the topics

00:14:54 I wanted to cover for the book.

00:14:56 It’s also really nice now

00:14:58 that the field is kind of stabilized

00:15:00 to the point where some core ideas from the 1980s

00:15:02 are still used today.

00:15:04 When I first started studying machine learning,

00:15:06 almost everything from the 1980s had been rejected

00:15:09 and now some of it has come back.

00:15:11 So that stuff that’s really stood the test of time

00:15:13 is what I focused on putting into the book.

00:15:16 There’s also, I guess, two different philosophies

00:15:21 about how you might write a book.

00:15:23 One philosophy is you try to write a reference

00:15:24 that covers everything.

00:15:26 The other philosophy is you try to provide

00:15:28 a high level summary that gives people the language

00:15:31 to understand a field

00:15:32 and tells them what the most important concepts are.

00:15:35 The first deep learning book that I wrote

00:15:37 with Yoshua and Aaron was somewhere

00:15:39 between the two philosophies,

00:15:41 that it’s trying to be both a reference

00:15:43 and an introductory guide.

00:15:45 Writing this chapter for Russell and Norvig’s book,

00:15:48 I was able to focus more on just a concise introduction

00:15:52 of the key concepts and the language

00:15:54 you need to read about them more.

00:15:55 In a lot of cases, I actually just wrote paragraphs

00:15:57 that said, here’s a rapidly evolving area

00:16:00 that you should pay attention to.

00:16:02 It’s pointless to try to tell you what the latest

00:16:04 and best version of a learn to learn model is.

00:16:11 I can point you to a paper that’s recent right now,

00:16:13 but there isn’t a whole lot of a reason to delve

00:16:16 into exactly what’s going on

00:16:18 with the latest learning to learn approach

00:16:21 or the latest module produced

00:16:23 by a learning to learn algorithm.

00:16:25 You should know that learning to learn is a thing

00:16:26 and that it may very well be the source of the latest

00:16:30 and greatest convolutional net or recurrent net module

00:16:33 that you would want to use in your latest project.

00:16:36 But there isn’t a lot of point in trying to summarize

00:16:38 exactly which architecture and which learning approach

00:16:42 got to which level of performance.

00:16:44 So you maybe focus more on the basics of the methodology.

00:16:49 So from back propagation to feed forward

00:16:52 to recurrent neural networks, convolutional,

00:16:54 that kind of thing?

00:16:55 Yeah, yeah.

00:16:56 So if I were to ask you, I remember I took algorithms

00:17:00 and data structures algorithms course.

00:17:03 I remember the professor asked, what is an algorithm?

00:17:09 And yelled at everybody in a good way

00:17:12 that nobody was answering it correctly.

00:17:14 It was a graduate course.

00:17:16 Everybody knew what an algorithm was,

00:17:18 but they weren’t able to answer it well.

00:17:19 So let me ask you in that same spirit,

00:17:22 what is deep learning?

00:17:24 I would say deep learning is any kind of machine learning

00:17:29 that involves learning parameters of more than one

00:17:34 consecutive step.

00:17:37 So that, I mean, shallow learning is things

00:17:39 where you learn a lot of operations that happen in parallel.

00:17:43 You might have a system that makes multiple steps.

00:17:46 Like you might have hand designed feature extractors,

00:17:50 but really only one step is learned.

00:17:52 Deep learning is anything where you have multiple operations

00:17:55 in sequence, and that includes the things

00:17:58 that are really popular today,

00:17:59 like convolutional networks and recurrent networks.

00:18:03 But it also includes some of the things that have died out

00:18:06 like Boltzmann machines,

00:18:08 where we weren’t using back propagation.

00:18:11 Today, I hear a lot of people define deep learning

00:18:14 as gradient descent applied

00:18:18 to these differentiable functions.

00:18:21 And I think that’s a legitimate usage of the term.

00:18:24 It’s just different from the way that I use the term myself.

00:18:27 So what’s an example of deep learning

00:18:31 that is not gradient descent and differentiable functions?

00:18:34 In your, I mean, not specifically perhaps,

00:18:37 but more even looking into the future,

00:18:39 what’s your thought about that space of approaches?

00:18:44 Yeah, so I tend to think of machine learning algorithms

00:18:46 as decomposed into really three different pieces.

00:18:50 There’s the model, which can be something like a neural net

00:18:52 or a Boltzmann machine or a recurrent model.

00:18:56 And that basically just describes how do you take data

00:18:59 and how do you take parameters?

00:19:01 And what function do you use to make a prediction

00:19:04 given the data and the parameters?

00:19:07 Another piece of the learning algorithm

00:19:09 is the optimization algorithm.

00:19:12 Not every algorithm can really be described

00:19:14 in terms of optimization,

00:19:15 but what’s the algorithm for updating the parameters

00:19:18 or updating whatever the state of the network is?

00:19:22 And then the last part is the data set,

00:19:26 like how do you actually represent the world

00:19:29 as it comes into your machine learning system?

00:19:33 So I think of deep learning as telling us something about

00:19:36 what does the model look like?

00:19:39 And basically to qualify as deep,

00:19:41 I say that it just has to have multiple layers.

00:19:44 That can be multiple steps

00:19:46 in a feed forward differentiable computation.

00:19:49 That can be multiple layers in a graphical model.

00:19:52 There’s a lot of ways that you could satisfy me

00:19:53 that something has multiple steps

00:19:56 that are each parameterized separately.

00:19:58 I think of gradient descent

00:19:59 as being all about that other piece,

00:20:01 the how do you actually update the parameters piece?

00:20:04 So you could imagine having a deep model

00:20:05 like a convolutional net

00:20:07 and training it with something like evolution

00:20:09 or a genetic algorithm.

00:20:11 And I would say that still qualifies as deep learning.
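
As a rough illustration of that point, the sketch below trains a small multi-layer (deep) model with a simple evolution-strategies update instead of backpropagation. The architecture, population size, and noise scale are illustrative assumptions.

```python
# A toy sketch of a deep (multi-layer) model trained without
# backpropagation, using a simple evolution-strategies update.
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Conv2d(1, 8, 3, padding=1), nn.ReLU(),
    nn.Flatten(), nn.Linear(8 * 28 * 28, 10),
)

def fitness(m, x, y):
    # Higher is better, so use the negative classification loss.
    with torch.no_grad():
        return -nn.functional.cross_entropy(m(x), y).item()

def es_step(model, x, y, pop=20, sigma=0.02, lr=0.01):
    """Estimate an update direction from random parameter perturbations."""
    params = torch.nn.utils.parameters_to_vector(model.parameters())
    noises, scores = [], []
    for _ in range(pop):
        eps = torch.randn_like(params)
        torch.nn.utils.vector_to_parameters(params + sigma * eps, model.parameters())
        noises.append(eps)
        scores.append(fitness(model, x, y))
    scores = torch.tensor(scores)
    scores = (scores - scores.mean()) / (scores.std() + 1e-8)
    grad_est = sum(s * n for s, n in zip(scores, noises)) / (pop * sigma)
    torch.nn.utils.vector_to_parameters(params + lr * grad_est, model.parameters())
```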

00:20:14 And then in terms of models

00:20:16 that aren’t necessarily differentiable,

00:20:18 I guess Boltzmann machines are probably

00:20:21 the main example of something

00:20:23 where you can’t really take a derivative

00:20:25 and use that for the learning process.

00:20:27 But you can still argue that the model

00:20:30 has many steps of processing that it applies

00:20:33 when you run inference in the model.

00:20:35 So it’s the steps of processing that’s key.

00:20:38 So Jeff Hinton suggests that we need to throw away

00:20:41 back propagation and start all over.

00:20:44 What do you think about that?

00:20:46 What could an alternative direction

00:20:48 of training neural networks look like?

00:20:50 I don’t know that back propagation

00:20:52 is gonna go away entirely.

00:20:54 Most of the time when we decide

00:20:57 that a machine learning algorithm

00:20:59 isn’t on the critical path to research for improving AI,

00:21:03 the algorithm doesn’t die.

00:21:04 It just becomes used for some specialized set of things.

00:21:08 A lot of algorithms like logistic regression

00:21:11 don’t seem that exciting to AI researchers

00:21:13 who are working on things like speech recognition

00:21:16 or autonomous cars today.

00:21:18 But there’s still a lot of use for logistic regression

00:21:21 and things like analyzing really noisy data

00:21:24 in medicine and finance

00:21:25 or making really rapid predictions

00:21:28 in really time limited contexts.

00:21:30 So I think back propagation and gradient descent

00:21:33 are around to stay, but they may not end up being

00:21:38 everything that we need to get to real human level

00:21:40 or super human AI.

00:21:42 Are you optimistic about us discovering

00:21:46 back propagation has been around for a few decades?

00:21:50 So are you optimistic about us as a community

00:21:54 being able to discover something better?

00:21:56 Yeah, I am.

00:21:57 I think we likely will find something that works better.

00:22:01 You could imagine things like having stacks of models

00:22:05 where some of the lower level models

00:22:07 predict parameters of the higher level models.

00:22:10 And so at the top level,

00:22:12 you’re not learning in terms of literally

00:22:13 calculating gradients,

00:22:14 but just predicting how different values will perform.

00:22:17 You can kind of see that already in some areas

00:22:19 like Bayesian optimization,

00:22:21 where you have a Gaussian process

00:22:22 that predicts how well different parameter values

00:22:24 will perform.

00:22:25 We already use those kinds of algorithms

00:22:27 for things like hyper parameter optimization.
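
A minimal sketch of that idea, assuming scikit-learn’s Gaussian process regressor: fit the GP to hyperparameter values already tried and pick the next candidate with an upper-confidence-bound rule. The candidate grid and the scoring of each hyperparameter value are assumed to happen elsewhere.

```python
# A minimal sketch of the Bayesian-optimization idea described above:
# a Gaussian process predicts how well different hyperparameter values
# will perform, and we pick the next value optimistically.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor

def suggest_next(observed_x, observed_y, candidates, kappa=1.0):
    gp = GaussianProcessRegressor().fit(
        np.array(observed_x).reshape(-1, 1), np.array(observed_y))
    mean, std = gp.predict(np.array(candidates).reshape(-1, 1), return_std=True)
    # Upper-confidence-bound acquisition: favor high predicted score
    # plus uncertainty, balancing exploitation and exploration.
    return candidates[int(np.argmax(mean + kappa * std))]
```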

00:22:30 And in general, we know a lot of things other than back prop

00:22:32 that work really well for specific problems.

00:22:34 The main thing we haven’t found is

00:22:37 a way of taking one of these other

00:22:38 non back prop based algorithms

00:22:41 and having it really advanced the state of the art

00:22:43 on an AI level problem.

00:22:46 Right.

00:22:47 But I wouldn’t be surprised if eventually

00:22:49 we find that some of these algorithms,

00:22:50 even the ones that already exist,

00:22:52 not necessarily a new one,

00:22:54 we might find some way of customizing

00:22:58 one of these algorithms to do something really interesting

00:23:00 at the level of cognition or the level of,

00:23:06 I think one system that we really don’t have working

00:23:08 quite right yet is like short term memory.

00:23:12 We have things like LSTMs,

00:23:14 they’re called long short term memory.

00:23:16 They still don’t do quite what a human does

00:23:20 with short term memory.

00:23:22 Like gradient descent to learn a specific fact

00:23:26 has to do multiple steps on that fact.

00:23:29 Like if I tell you the meeting today is at 3 p.m.,

00:23:34 I don’t need to say over and over again,

00:23:35 it’s at 3 p.m., it’s at 3 p.m., it’s at 3 p.m.,

00:23:37 it’s at 3 p.m.

00:23:38 for you to do a gradient step on each one.

00:23:40 You just hear it once and you remember it.

00:23:43 There’s been some work on things like self attention

00:23:46 and attention like mechanisms,

00:23:48 like the neural Turing machine

00:23:50 that can write to memory cells

00:23:52 and update themselves with facts like that right away.

00:23:54 But I don’t think we’ve really nailed it yet.

00:23:56 And that’s one area where I’d imagine

00:23:59 that new optimization algorithms

00:24:02 or different ways of applying

00:24:03 existing optimization algorithms

00:24:05 could give us a way of just lightning fast

00:24:08 updating the state of a machine learning system

00:24:11 to contain a specific fact like that

00:24:14 without needing to have it presented

00:24:15 over and over and over again.
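
A toy sketch of the attention-style memory idea mentioned above: a key-value store where a fact is written once and later retrieved by similarity, with no gradient steps needed to store it. This is a simplification for illustration, not the neural Turing machine itself.

```python
# A simplified key-value memory in the spirit of the attention-based
# mechanisms mentioned above: a fact is written once and retrieved by
# similarity, rather than stored through repeated gradient steps.
import torch
import torch.nn.functional as F

class KeyValueMemory:
    def __init__(self, key_dim, value_dim):
        self.keys = torch.empty(0, key_dim)
        self.values = torch.empty(0, value_dim)

    def write(self, key, value):
        # One-shot storage of a new fact.
        self.keys = torch.cat([self.keys, key.unsqueeze(0)])
        self.values = torch.cat([self.values, value.unsqueeze(0)])

    def read(self, query):
        # Soft attention over stored keys.
        weights = F.softmax(self.keys @ query, dim=0)
        return weights @ self.values
```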

00:24:16 So some of the success of symbolic systems in the 80s

00:24:21 is they were able to assemble these kinds of facts better.

00:24:26 But there’s a lot of expert input required

00:24:29 and it’s very limited in that sense.

00:24:31 Do you ever look back to that

00:24:33 as something that we’ll have to return to eventually?

00:24:36 Sort of dust off the book from the shelf

00:24:38 and think about how we build knowledge,

00:24:41 representation, knowledge base.

00:24:42 Like will we have to use graph searches?

00:24:44 Graph searches, right.

00:24:45 And like first order logic and entailment

00:24:47 and things like that.

00:24:48 That kind of thing, yeah, exactly.

00:24:49 In my particular line of work,

00:24:51 which has mostly been machine learning security

00:24:54 and also generative modeling,

00:24:56 I haven’t usually found myself moving in that direction.

00:25:00 For generative models, I could see a little bit of,

00:25:03 it could be useful if you had something

00:25:04 like a differentiable knowledge base

00:25:09 or some other kind of knowledge base

00:25:10 where it’s possible for some of our

00:25:13 fuzzier machine learning algorithms

00:25:14 to interact with a knowledge base.

00:25:16 I mean, a neural network is kind of like that.

00:25:19 It’s a differentiable knowledge base of sorts.

00:25:21 Yeah.

00:25:22 But.

00:25:23 If we had a really easy way of giving feedback

00:25:27 to machine learning models,

00:25:29 that would clearly help a lot with generative models.

00:25:32 And so you could imagine one way of getting there

00:25:33 would be get a lot better at natural language processing.

00:25:36 But another way of getting there would be

00:25:38 take some kind of knowledge base

00:25:40 and figure out a way for it to actually

00:25:42 interact with a neural network.

00:25:44 Being able to have a chat with a neural network.

00:25:46 Yeah.

00:25:47 So like one thing in generative models we see a lot today

00:25:50 is you’ll get things like faces that are not symmetrical,

00:25:54 like people that have two eyes that are different colors.

00:25:58 I mean, there are people with eyes

00:25:59 that are different colors in real life,

00:26:00 but not nearly as many of them as you tend to see

00:26:03 in the machine learning generated data.

00:26:06 So if you had either a knowledge base

00:26:08 that could contain the fact,

00:26:10 people’s faces are generally approximately symmetric

00:26:13 and eye color is especially likely

00:26:15 to be the same on both sides.

00:26:17 Being able to just inject that hint

00:26:20 into the machine learning model

00:26:22 without it having to discover that itself

00:26:23 after studying a lot of data

00:26:25 would be a really useful feature.

00:26:28 I could see a lot of ways of getting there

00:26:30 without bringing back some of the 1980s technology,

00:26:32 but I also see some ways that you could imagine

00:26:35 extending the 1980s technology to play nice with neural nets

00:26:38 and have it help get there.

00:26:40 Awesome.

00:26:40 So you talked about the story of you coming up

00:26:44 with the idea of GANs at a bar with some friends.

00:26:47 You were arguing that this, you know, GANs would work,

00:26:51 generative adversarial networks,

00:26:53 and the others didn’t think so.

00:26:54 Then you went home at midnight, coded it up, and it worked.

00:26:58 So if I was a friend of yours at the bar,

00:27:01 I would also have doubts.

00:27:02 It’s a really nice idea,

00:27:03 but I’m very skeptical that it would work.

00:27:06 What was the basis of their skepticism?

00:27:09 What was the basis of your intuition why it should work?

00:27:14 I don’t want to be someone who goes around

00:27:15 promoting alcohol for the purposes of science,

00:27:18 but in this case,

00:27:20 I do actually think that drinking helped a little bit.

00:27:23 When your inhibitions are lowered,

00:27:25 you’re more willing to try out things

00:27:27 that you wouldn’t try out otherwise.

00:27:29 So I have noticed in general

00:27:32 that I’m less prone to shooting down some of my own ideas

00:27:34 when I have had a little bit to drink.

00:27:37 I think if I had had that idea at lunchtime,

00:27:41 I probably would have thought,

00:27:42 it’s hard enough to train one neural net,

00:27:43 you can’t train a second neural net

00:27:44 in the inner loop of the outer neural net.

00:27:48 That was basically my friend’s objection,

00:27:49 was that trying to train two neural nets at the same time

00:27:52 would be too hard.

00:27:54 So it was more about the training process,

00:27:56 unless, so my skepticism would be,

00:27:58 you know, I’m sure you could train it,

00:28:01 but the thing it would converge to

00:28:03 would not be able to generate anything reasonable,

00:28:05 any kind of reasonable realism.

00:28:08 Yeah, so part of what all of us were thinking about

00:28:11 when we had this conversation was deep Boltzmann machines,

00:28:15 which a lot of us in the lab, including me,

00:28:16 were big fans of at the time.

00:28:20 They involved two separate processes

00:28:22 running at the same time.

00:28:25 One of them is called the positive phase,

00:28:28 where you load data into the model

00:28:31 and tell the model to make the data more likely.

00:28:33 The other one is called the negative phase,

00:28:35 where you draw samples from the model

00:28:37 and tell the model to make those samples less likely.

00:28:41 In a deep Boltzmann machine,

00:28:42 it’s not trivial to generate a sample.

00:28:43 You have to actually run an iterative process

00:28:46 that gets better and better samples

00:28:49 coming closer and closer to the distribution

00:28:51 the model represents.

00:28:52 So during the training process,

00:28:53 you’re always running these two systems at the same time,

00:28:56 one that’s updating the parameters of the model

00:28:58 and another one that’s trying to generate samples

00:29:00 from the model.

00:29:01 And they worked really well in things like MNIST,

00:29:04 but a lot of us in the lab, including me,

00:29:05 had tried to get deep Boltzmann machines

00:29:07 to scale past MNIST to things like generating color photos,

00:29:11 and we just couldn’t get the two processes

00:29:14 to stay synchronized.

00:29:17 So when I had the idea for GANs,

00:29:18 a lot of people thought that the discriminator

00:29:20 would have more or less the same problem

00:29:22 as the negative phase in the Boltzmann machine,

00:29:25 that trying to train the discriminator in the inner loop,

00:29:27 you just couldn’t get it to keep up

00:29:29 with the generator in the outer loop,

00:29:31 and that would prevent it from converging

00:29:33 to anything useful.

00:29:35 Yeah, I share that intuition.

00:29:36 Yeah.

00:29:39 But turns out to not be the case.

00:29:41 A lot of the time with machine learning algorithms,

00:29:43 it’s really hard to predict ahead of time

00:29:45 how well they’ll actually perform.

00:29:46 You have to just run the experiment and see what happens.

00:29:49 And I would say I still today don’t have

00:29:52 like one factor I can put my finger on and say,

00:29:54 this is why GANs worked for photo generation

00:29:58 and deep Boltzmann machines don’t.

00:30:01 There are a lot of theory papers

00:30:03 showing that under some theoretical settings,

00:30:06 the GAN algorithm does actually converge,

00:30:10 but those settings are restricted enough

00:30:14 that they don’t necessarily explain the whole picture

00:30:17 in terms of all the results that we see in practice.

00:30:20 So taking a step back,

00:30:22 can you, in the same way as we talked about deep learning,

00:30:24 can you tell me what generative adversarial networks are?

00:30:29 Yeah, so generative adversarial networks

00:30:31 are a particular kind of generative model.

00:30:33 A generative model is a machine learning model

00:30:36 that can train on some set of data.

00:30:38 Like, so you have a collection of photos of cats

00:30:41 and you want to generate more photos of cats,

00:30:43 or you want to estimate a probability distribution over cats.

00:30:47 So you can ask how likely it is

00:30:49 that some new image is a photo of a cat.

00:30:52 GANs are one way of doing this.

00:30:55 Some generative models are good at creating new data.

00:30:59 Other generative models are good at estimating

00:31:01 that density function and telling you how likely

00:31:04 particular pieces of data are to come

00:31:07 from the same distribution as the training data.

00:31:09 GANs are more focused on generating samples

00:31:12 rather than estimating the density function.

00:31:15 There are some kinds of GANs like FlowGAN that can do both,

00:31:18 but mostly GANs are about generating samples,

00:31:21 generating new photos of cats that look realistic.

00:31:24 And they do that completely from scratch.

00:31:29 It’s analogous to human imagination.

00:31:32 When a GAN creates a new image of a cat,

00:31:34 it’s using a neural network to produce a cat

00:31:39 that has not existed before.

00:31:41 It isn’t doing something like compositing photos together.

00:31:44 You’re not literally taking the eye off of one cat

00:31:47 and the ear off of another cat.

00:31:48 It’s more of this digestive process

00:31:51 where the neural net trains in a lot of data

00:31:53 and comes up with some representation

00:31:55 of the probability distribution

00:31:57 and generates entirely new cats.

00:31:59 There are a lot of different ways

00:32:00 of building a generative model.

00:32:01 What’s specific to GANs is that we have a two player game

00:32:05 in the game theoretic sense.

00:32:08 And as the players in this game compete,

00:32:10 one of them becomes able to generate realistic data.

00:32:13 The first player is called the generator.

00:32:16 It produces output data such as just images, for example.

00:32:20 And at the start of the learning process,

00:32:22 it’ll just produce completely random images.

00:32:25 The other player is called the discriminator.

00:32:27 The discriminator takes images as input

00:32:29 and guesses whether they’re real or fake.

00:32:32 You train it both on real data,

00:32:34 so photos that come from your training set,

00:32:36 actual photos of cats,

00:32:37 and you train it to say that those are real.

00:32:39 You also train it on images

00:32:41 that come from the generator network

00:32:43 and you train it to say that those are fake.

00:32:46 As the two players compete in this game,

00:32:49 the discriminator tries to become better

00:32:50 at recognizing whether images are real or fake.

00:32:53 And the generator becomes better

00:32:54 at fooling the discriminator into thinking

00:32:57 that its outputs are real.

00:33:00 And you can analyze this through the language of game theory

00:33:03 and find that there’s a Nash equilibrium

00:33:06 where the generator has captured

00:33:08 the correct probability distribution.

00:33:10 So in the cat example,

00:33:12 it makes perfectly realistic cat photos.

00:33:14 And the discriminator is unable to do better

00:33:17 than random guessing

00:33:18 because all the samples coming from both the data

00:33:21 and the generator look equally likely

00:33:24 to have come from either source.
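
Here is a minimal sketch of the two-player game just described, in PyTorch: the discriminator is trained to label real data 1 and generated data 0, and the generator is trained to make the discriminator output 1 on its samples. The tiny fully connected networks and hyperparameters are illustrative assumptions; real image GANs use convolutional architectures.

```python
# A minimal sketch of the generator/discriminator game described above.
import torch
import torch.nn as nn

latent_dim, data_dim = 16, 784  # e.g. flattened 28x28 images
G = nn.Sequential(nn.Linear(latent_dim, 128), nn.ReLU(), nn.Linear(128, data_dim))
D = nn.Sequential(nn.Linear(data_dim, 128), nn.ReLU(), nn.Linear(128, 1))

opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCEWithLogitsLoss()

def gan_step(real_batch):
    batch = real_batch.size(0)
    z = torch.randn(batch, latent_dim)
    fake = G(z)

    # Discriminator: real samples labeled 1, generated samples labeled 0.
    opt_d.zero_grad()
    d_loss = bce(D(real_batch), torch.ones(batch, 1)) + \
             bce(D(fake.detach()), torch.zeros(batch, 1))
    d_loss.backward()
    opt_d.step()

    # Generator: try to make the discriminator call its samples real.
    opt_g.zero_grad()
    g_loss = bce(D(fake), torch.ones(batch, 1))
    g_loss.backward()
    opt_g.step()
    return d_loss.item(), g_loss.item()
```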

00:33:25 So do you ever sit back

00:33:28 and does it just blow your mind that this thing works?

00:33:31 So from very,

00:33:33 so it’s able to estimate that density function

00:33:35 enough to generate realistic images.

00:33:38 I mean, does it, yeah.

00:33:40 Do you ever sit back and think how does this even,

00:33:44 why, this is quite incredible,

00:33:46 especially where GANs have gone in terms of realism.

00:33:49 Yeah, and not just to flatter my own work,

00:33:51 but generative models,

00:33:53 all of them have this property that

00:33:56 if they really did what we ask them to do,

00:33:58 they would do nothing but memorize the training data.

00:34:01 Right, exactly.

00:34:02 Models that are based on maximizing the likelihood,

00:34:05 the way that you obtain the maximum likelihood

00:34:08 for a specific training set

00:34:09 is you assign all of your probability mass

00:34:12 to the training examples and nowhere else.

00:34:15 For GANs, the game is played using a training set.

00:34:18 So the way that you become unbeatable in the game

00:34:21 is you literally memorize training examples.

00:34:25 One of my former interns wrote a paper,

00:34:28 his name is Vaishnavh Nagarajan,

00:34:31 and he showed that it’s actually hard for the generator

00:34:33 to memorize the training data,

00:34:36 hard in a statistical learning theory sense,

00:34:39 that you can actually create reasons

00:34:42 for why it would require quite a lot of learning steps

00:34:48 and a lot of observations of different latent variables

00:34:52 before you could memorize the training data.

00:34:54 That still doesn’t really explain why

00:34:56 when you produce samples that are new,

00:34:58 why do you get compelling images

00:34:59 rather than just garbage

00:35:01 that’s different from the training set.

00:35:03 And I don’t think we really have a good answer for that,

00:35:06 especially if you think about

00:35:07 how many possible images are out there

00:35:10 and how few images the generative model sees

00:35:14 during training.

00:35:15 It seems just unreasonable

00:35:16 that generative models create new images as well as they do,

00:35:20 especially considering that we’re basically

00:35:22 training them to memorize rather than generalize.

00:35:26 I think part of the answer is

00:35:28 there’s a paper called Deep Image Prior

00:35:30 where they show that you can take a convolutional net

00:35:33 and you don’t even need to learn

00:35:34 the parameters of it at all,

00:35:34 you just use the model architecture.

00:35:36 And it’s already useful for things like inpainting images.

00:35:40 I think that shows us

00:35:41 that the convolutional network architecture

00:35:43 captures something really important

00:35:45 about the structure of images.

00:35:47 And we don’t need to actually use the learning

00:35:50 to capture all the information

00:35:51 coming out of the convolutional net.

00:35:54 That would imply that it would be much harder

00:35:57 to make generative models in other domains.

00:36:00 So far, we’re able to make reasonable speech models

00:36:02 and things like that.

00:36:04 But to be honest, we haven’t actually explored

00:36:06 a whole lot of different data sets all that much.

00:36:09 We don’t, for example, see a lot of deep learning models

00:36:13 of like biology data sets

00:36:17 where you have lots of microarrays measuring

00:36:20 the amount of different enzymes and things like that.

00:36:22 So we may find that some of the progress

00:36:24 that we’ve seen for images and speech

00:36:26 turns out to really rely heavily on the model architecture.

00:36:29 And we were able to do what we did for vision

00:36:32 by trying to reverse engineer the human visual system.

00:36:35 And maybe it’ll turn out that we can’t just use

00:36:39 that same trick for arbitrary kinds of data.

00:36:42 Right, so there’s aspect to the human vision system,

00:36:45 the hardware of it, that makes it without learning,

00:36:49 without cognition, just makes it really effective

00:36:51 at detecting the patterns we see in the visual world.

00:36:54 Yeah.

00:36:55 Yeah, that’s really interesting.

00:36:57 What, in a big, quick overview,

00:37:01 in your view, what types of GANs are there

00:37:05 and what other generative models besides GANs are there?

00:37:09 Yeah, so it’s maybe a little bit easier to start

00:37:12 with what kinds of generative models are there

00:37:14 other than GANs.

00:37:16 So most generative models are likelihood based

00:37:20 where to train them, you have a model that tells you

00:37:24 how much probability it assigns to a particular example

00:37:28 and you just maximize the probability assigned

00:37:30 to all the training examples.

00:37:33 It turns out that it’s hard to design a model

00:37:35 that can create really complicated images

00:37:38 or really complicated audio waveforms

00:37:41 and still have it be possible to estimate

00:37:45 the likelihood function from a computational point of view.

00:37:51 Most interesting models that you would just write down

00:37:53 intuitively, it turns out that it’s almost impossible

00:37:56 to calculate the amount of probability they assign

00:37:58 to a particular point.

00:38:01 So there’s a few different schools of generative models

00:38:04 in the likelihood family.

00:38:07 One approach is to very carefully design the model

00:38:09 so that it is computationally tractable

00:38:12 to measure the density it assigns to a particular point.

00:38:15 So there are things like autoregressive models,

00:38:18 like PixelCNN, those basically break down

00:38:23 the probability distribution into a product

00:38:26 over every single feature.

00:38:28 So for an image, you estimate the probability

00:38:31 of each pixel given all of the pixels that came before it.

00:38:35 There’s tricks where if you want to measure

00:38:37 the density function, you can actually calculate

00:38:40 the density for all these pixels more or less in parallel.

00:38:44 Generating the image still tends to require you

00:38:46 to go one pixel at a time, and that can be very slow.

00:38:50 But there are, again, tricks for doing this

00:38:52 in a hierarchical pattern where you can keep

00:38:54 the runtime under control.
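
A sketch of the autoregressive factorization described above, with a hypothetical conditional_logits standing in for a PixelCNN-style network: density evaluation scores every pixel in one parallel pass, while sampling has to proceed one pixel at a time.

```python
# A sketch of the autoregressive factorization described above:
# p(x) = product over i of p(x_i | x_1 ... x_{i-1}).
# `conditional_logits` is a placeholder for a PixelCNN-style network
# that outputs, for each pixel, a distribution over pixel values.
import torch
import torch.nn.functional as F

def log_likelihood(conditional_logits, x):
    """Density evaluation: all pixels scored in one parallel pass."""
    logits = conditional_logits(x)             # (batch, num_values, num_pixels)
    return -F.cross_entropy(logits, x.long(), reduction='none').sum(dim=-1)

def sample(conditional_logits, num_pixels, num_values):
    """Generation: one pixel at a time, which is why sampling is slow."""
    x = torch.zeros(1, num_pixels, dtype=torch.long)
    for i in range(num_pixels):
        logits = conditional_logits(x)[:, :, i]   # distribution for pixel i
        probs = F.softmax(logits, dim=-1)[0]
        x[0, i] = torch.multinomial(probs, 1).item()
    return x
```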

00:38:55 Are the quality of the images it generates,

00:38:59 putting runtime aside, pretty good?

00:39:02 They’re reasonable, yeah.

00:39:04 I would say a lot of the best results

00:39:07 are from GANs these days, but it can be hard to tell

00:39:11 how much of that is based on who’s studying

00:39:14 which type of algorithm, if that makes sense.

00:39:17 The amount of effort invested in a particular.

00:39:18 Yeah, or like the kind of expertise.

00:39:21 So a lot of people who’ve traditionally been excited

00:39:23 about graphics or art and things like that

00:39:25 have gotten interested in GANs.

00:39:27 And to some extent, it’s hard to tell

00:39:28 are GANs doing better because they have a lot

00:39:31 of graphics and art experts behind them,

00:39:34 or are GANs doing better because they’re more

00:39:37 computationally efficient, or are GANs doing better

00:39:40 because they prioritize the realism of samples

00:39:43 over the accuracy of the density function.

00:39:45 I think all of those are potentially valid explanations,

00:39:48 and it’s hard to tell.

00:39:51 So can you give a brief history of GANs from 2014?

00:39:57 Were you paper 13?

00:39:59 Yeah, so a few highlights.

00:40:00 In the first paper, we just showed

00:40:03 that GANs basically work.

00:40:04 If you look back at the samples we had,

00:40:06 they look terrible now.

00:40:08 On the CIFAR 10 data set,

00:40:10 you can’t even recognize objects in them.

00:40:12 Your paper, sorry, you used CIFAR 10?

00:40:15 We used MNIST, which is little handwritten digits.

00:40:18 We used the Toronto Face database,

00:40:19 which is small grayscale photos of faces.

00:40:22 We did have recognizable faces.

00:40:24 My colleague Bing Xu put together

00:40:25 the first GAN face model for that paper.

00:40:29 We also had the CIFAR 10 data set,

00:40:32 which is things like very small 32 by 32 pixels

00:40:36 of cars and cats and dogs.

00:40:40 For that, we didn’t get recognizable objects,

00:40:42 but all the deep learning people back then

00:40:46 were really used to looking at these failed samples

00:40:48 and kind of reading them like tea leaves.

00:40:50 And people who are used to reading the tea leaves

00:40:53 recognize that our tea leaves at least look different.

00:40:56 Maybe not necessarily better,

00:40:57 but there was something unusual about them.

00:41:01 And that got a lot of us excited.

00:41:03 One of the next really big steps was LAPGAN

00:41:06 by Emily Denton and Soumith Chintala at Facebook AI Research,

00:41:10 where they actually got really good high resolution photos

00:41:14 working with GANs for the first time.

00:41:16 They had a complicated system

00:41:18 where they generated the image starting at low res

00:41:20 and then scaling up to high res,

00:41:22 but they were able to get it to work.

00:41:24 And then in 2015, I believe later that same year,

00:41:31 Alec Radford and Soumith Chintala and Luke Metz

00:41:35 published the DCGAN paper,

00:41:38 which it stands for deep convolutional GAN.

00:41:41 It’s kind of a non unique name

00:41:43 because these days basically all GANs

00:41:46 and even some before that were deep and convolutional,

00:41:48 but they just kind of picked a name

00:41:50 for a really great recipe

00:41:52 where they were able to actually using only one model

00:41:55 instead of a multi step process,

00:41:57 actually generate realistic images of faces

00:41:59 and things like that.

00:42:01 That was sort of like the beginning

00:42:05 of the Cambrian explosion of GANs.

00:42:07 Like once you had animals that had a backbone,

00:42:09 you suddenly got lots of different versions of fish

00:42:12 and four legged animals and things like that.

00:42:15 So DCGAN became kind of the backbone

00:42:17 for many different models that came out.

00:42:19 It’s used as a baseline even still.

00:42:21 Yeah, yeah.

00:42:23 And so from there,

00:42:24 I would say some interesting things we’ve seen

00:42:26 are there’s a lot you can say

00:42:29 about how just the quality

00:42:30 of standard image generation GANs has increased,

00:42:33 but what’s also maybe more interesting

00:42:35 on an intellectual level

00:42:36 is how the things you can use GANs for has also changed.

00:42:41 One thing is that you can use them to learn classifiers

00:42:44 without having to have class labels

00:42:46 for every example in your training set.

00:42:48 So that’s called semi supervised learning.

00:42:51 My colleague at OpenAI, Tim Salimans,

00:42:53 who’s at Brain now,

00:42:55 wrote a paper called Improved Techniques for Training GANs.

00:42:59 I’m a coauthor on this paper,

00:43:00 but I can’t claim any credit for this particular part.

00:43:03 One thing he showed in the paper

00:43:04 is that you can take the GAN discriminator

00:43:07 and use it as a classifier that actually tells you,

00:43:11 this image is a cat, this image is a dog,

00:43:13 this image is a car, this image is a truck, and so on.

00:43:16 Not just to say whether the image is real or fake,

00:43:18 but if it is real to say specifically

00:43:20 what kind of object it is.

00:43:22 And he found that you can train these classifiers

00:43:25 with far fewer labeled examples

00:43:28 than traditional classifiers.

00:43:30 So if you supervise based on

00:43:33 not just your discrimination ability,

00:43:35 but your ability to classify,

00:43:38 you’re going to converge much faster

00:43:40 to being effective at being a discriminator.

00:43:43 Yeah.

00:43:44 So for example, for the MNIST dataset,

00:43:46 you want to look at an image of a handwritten digit

00:43:48 and say whether it’s a zero, a one, or a two, and so on.

00:43:54 To get down to less than 1% error

00:43:56 required around 60,000 examples

00:44:00 until maybe about 2014 or so.

00:44:02 In 2016 with this semi supervised GAN project,

00:44:07 Tim was able to get below 1% error

00:44:11 using only 100 labeled examples.

00:44:13 So that was about a 600X decrease

00:44:15 in the amount of labels that he needed.

00:44:17 He’s still using more images than that,

00:44:21 but he doesn’t need to have each of them labeled

00:44:22 as this one’s a one, this one’s a two,

00:44:25 this one’s a zero, and so on.
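
A sketch of the discriminator-as-classifier idea in the semi-supervised setting: the discriminator outputs the K real classes plus one extra "fake" class, so labeled examples, unlabeled real examples, and generator samples each contribute a loss term. The exact formulation in Improved Techniques for Training GANs differs in details; this is only to make the structure concrete, and D is an assumed classifier network with K+1 outputs.

```python
# A sketch of the semi-supervised discriminator described above: K real
# classes plus one extra "fake" class. Labeled examples use their class,
# unlabeled real examples only need to avoid the fake class, and
# generator samples are pushed toward it.
import torch
import torch.nn.functional as F

NUM_CLASSES = 10           # e.g. MNIST digits
FAKE = NUM_CLASSES         # index of the extra "fake" class

def discriminator_losses(D, x_labeled, y, x_unlabeled, x_fake):
    # Supervised part: ordinary cross-entropy on the few labeled examples.
    supervised = F.cross_entropy(D(x_labeled), y)

    # Unsupervised part: real data should have low probability of "fake",
    # generated data should have high probability of "fake".
    p_unlab = F.softmax(D(x_unlabeled), dim=1)
    p_fake = F.softmax(D(x_fake), dim=1)
    real_loss = -torch.log(1.0 - p_unlab[:, FAKE] + 1e-8).mean()
    fake_loss = -torch.log(p_fake[:, FAKE] + 1e-8).mean()
    return supervised + real_loss + fake_loss
```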

00:44:27 Then to be able to,

00:44:28 for GANs to be able to generate recognizable objects,

00:44:31 so objects from a particular class,

00:44:33 you still need labeled data

00:44:37 because you need to know what it means

00:44:38 to be a particular class cat, dog.

00:44:41 How do you think we can move away from that?

00:44:44 Yeah, some researchers at Brain Zurich

00:44:46 actually just released a really great paper

00:44:49 on semi supervised GANs

00:44:51 where their goal isn’t to classify,

00:44:53 it’s to make recognizable objects

00:44:56 despite not having a lot of labeled data.

00:44:58 They were working off of DeepMind’s BigGAN project

00:45:02 and they showed that they can match the performance

00:45:05 of BigGAN using only 10%, I believe,

00:45:08 of the labels.

00:45:10 BigGAN was trained on the ImageNet data set,

00:45:12 which is about 1.2 million images

00:45:14 and had all of them labeled.

00:45:17 This latest project from Brain Zurich

00:45:19 shows that they’re able to get away

00:45:20 with only having about 10% of the images labeled.

00:45:25 And they do that essentially using a clustering algorithm

00:45:29 where the discriminator learns

00:45:31 to assign the objects to groups

00:45:34 and then this understanding that objects can be grouped

00:45:38 into similar types helps it to form more realistic ideas

00:45:43 of what should be appearing in the image

00:45:45 because it knows that every image it creates

00:45:47 has to come from one of these archetypal groups

00:45:50 rather than just being some arbitrary image.

00:45:53 If you train a GAN with no class labels,

00:45:54 you tend to get things that look sort of like grass

00:45:57 or water or brick or dirt,

00:46:00 but without necessarily a lot going on in them.

00:46:04 And I think that’s partly because

00:46:05 if you look at a large ImageNet image,

00:46:07 the object doesn’t necessarily occupy the whole image.

00:46:11 And so you learn to create realistic sets of pixels,

00:46:15 but you don’t necessarily learn

00:46:17 that the object is the star of the show

00:46:20 and you want it to be in every image you make.

00:46:22 Yeah, I’ve heard you talk about the horse

00:46:25 to zebra CycleGAN mapping

00:46:26 and how it turns out, again, thought provoking

00:46:31 that horses are usually on grass

00:46:33 and zebras are usually on drier terrain.

00:46:35 So when you’re doing that kind of generation,

00:46:38 you’re going to end up generating greener horses

00:46:41 or whatever, so those are connected together.

00:46:45 It’s not just, you’re not able to segment,

00:46:49 to be able to generate in a segmented way.

00:46:52 So are there other types of games you come across

00:46:54 in your mind that neural networks can play

00:46:59 with each other to be able to solve problems?

00:47:04 Yeah, the one that I spend most of my time on

00:47:07 is in security.

00:47:09 You can model most interactions as a game

00:47:13 where there’s attackers trying to break your system

00:47:15 and you’re the defender trying to build a resilient system.

00:47:20 There’s also domain adversarial learning,

00:47:23 which is an approach to domain adaptation

00:47:25 that looks really a lot like GANs.

00:47:28 The authors had the idea before the GAN paper came out,

00:47:31 their paper came out a little bit later

00:47:33 and they’re very nice and cited the GAN paper,

00:47:38 but I know that they actually had the idea

00:47:40 before it came out.

00:47:42 Domain adaptation is when you want to train

00:47:44 a machine learning model in one setting called a domain

00:47:47 and then deploy it in another domain later.

00:47:50 And you would like it to perform well in the new domain,

00:47:52 even though the new domain is different

00:47:53 from how it was trained.

00:47:55 So for example, you might want to train

00:47:58 on a really clean image data set like ImageNet,

00:48:01 but then deploy on users phones

00:48:03 where the user is taking pictures in the dark

00:48:05 and pictures while moving quickly

00:48:07 and just pictures that aren’t really centered

00:48:09 or composed all that well.

00:48:13 When you take a normal machine learning model,

00:48:15 it often degrades really badly

00:48:17 when you move to the new domain

00:48:18 because it looks so different

00:48:20 from what the model was trained on.

00:48:22 Domain adaptation algorithms try to smooth out that gap

00:48:25 and the domain adversarial approach

00:48:27 is based on training a feature extractor

00:48:29 where the features have the same statistics

00:48:32 regardless of which domain you extracted them on.

00:48:35 So in the domain adversarial game,

00:48:36 you have one player that’s a feature extractor

00:48:39 and another player that’s a domain recognizer.

00:48:42 The domain recognizer wants to look at the output

00:48:44 of the feature extractor

00:48:45 and guess which of the two domains the features came from.

00:48:49 So it’s a lot like the real versus fake discriminator

00:48:51 in GANs and then the feature extractor,

00:48:54 you can think of as loosely analogous

00:48:56 to the generator in GANs,

00:48:57 except what it’s trying to do here

00:48:59 is both fool the domain recognizer

00:49:02 into not knowing which domain the data came from

00:49:05 and also extract features that are good for classification.

00:49:09 So at the end of the day,

00:49:12 in the cases where it works out,

00:49:13 you can actually get features

00:49:16 that work about the same in both domains.

00:49:20 Sometimes this has a drawback

00:49:21 where in order to make things work the same in both domains,

00:49:24 it just gets worse at the first one.

00:49:26 But there are a lot of cases

00:49:27 where it actually works out well on both.
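
A rough PyTorch sketch of that game: a feature extractor trained against a domain recognizer through a gradient reversal layer, plus a task head trained on the labeled source domain. Layer sizes and the trade-off weight lambda are illustrative assumptions.

```python
# Rough sketch of domain adversarial training: a gradient reversal layer makes
# the feature extractor maximize the domain recognizer's loss while still
# minimizing the task loss on the labeled source domain. Layer sizes and the
# trade-off weight lambda are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GradReverse(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lam * grad_output, None   # flip the gradient on the way back

features = nn.Sequential(nn.Linear(100, 64), nn.ReLU(), nn.Linear(64, 32))
task_head = nn.Linear(32, 10)    # e.g. class prediction, trained on the source domain
domain_head = nn.Linear(32, 2)   # which domain did these features come from?

params = (list(features.parameters()) + list(task_head.parameters())
          + list(domain_head.parameters()))
opt = torch.optim.Adam(params, lr=1e-3)

def train_step(x_src, y_src, x_tgt, lam=1.0):
    h_src, h_tgt = features(x_src), features(x_tgt)
    task_loss = F.cross_entropy(task_head(h_src), y_src)
    h_all = torch.cat([h_src, h_tgt])
    d_labels = torch.cat([torch.zeros(len(x_src), dtype=torch.long),
                          torch.ones(len(x_tgt), dtype=torch.long)])
    # domain_head learns to recognize the domain; through the reversal layer,
    # features learns to make the two domains statistically indistinguishable.
    domain_loss = F.cross_entropy(domain_head(GradReverse.apply(h_all, lam)), d_labels)
    loss = task_loss + domain_loss
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()

print(train_step(torch.randn(8, 100), torch.randint(0, 10, (8,)), torch.randn(8, 100)))
```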

00:49:30 So do you think of GANs being useful

00:49:32 in the context of data augmentation?

00:49:35 Yeah, one thing you could hope for with GANs

00:49:38 is you could imagine I’ve got a limited training set

00:49:41 and I’d like to make more training data

00:49:43 to train something else like a classifier.

00:49:47 You could train the GAN on the training set

00:49:50 and then create more data

00:49:52 and then maybe the classifier

00:49:54 would perform better on the test set

00:49:55 after training on this bigger GAN generated data set.

00:49:58 So that’s the simplest version

00:50:00 of something you might hope would work.

00:50:03 I’ve never heard of that particular approach working,

00:50:05 but I think there’s some closely related things

00:50:08 that I think could work in the future

00:50:11 and some that actually already have worked.

00:50:14 So if we think a little bit about what we’d be hoping for

00:50:15 if we use the GAN to make more training data,

00:50:18 we’re hoping that the GAN will generalize to new examples

00:50:22 better than the classifier would have generalized

00:50:24 if it was trained on the same data.

00:50:25 And I don’t know of any reason to believe

00:50:27 that the GAN would generalize better

00:50:28 than the classifier would,

00:50:31 but what we might hope for

00:50:33 is that the GAN could generalize differently

00:50:35 from a specific classifier.

00:50:37 So one thing I think is worth trying

00:50:39 that I haven’t personally tried but someone could try is

00:50:41 what if you trained a whole lot of different

00:50:44 generative models on the same training set,

00:50:46 create samples from all of them

00:50:48 and then train a classifier on that?

00:50:50 Because each of the generative models

00:50:52 might generalize in a slightly different way.

00:50:54 They might capture many different axes of variation

00:50:56 that one individual model wouldn’t

00:50:58 and then the classifier can capture all of those ideas

00:51:01 by training on all of their data.

00:51:03 So it’d be a little bit like making

00:51:04 an ensemble of classifiers.

00:51:06 And I think that…

00:51:07 Ensemble of GANs in a way.

00:51:08 I think that could generalize better.
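
A tiny sketch of that untried idea: pool samples from several class-conditional generative models (assumed already trained; the stand-ins below just emit noise images) together with the real data, and train one classifier on the combined set.

```python
# Sketch of the untried idea above: pool samples from several class-conditional
# generative models (assumed already trained; the stand-ins below just emit
# noise images) into one augmented set, then train a single classifier on it.
import torch
from torch.utils.data import ConcatDataset, DataLoader, TensorDataset

def augment_with_generators(real_images, real_labels, generators, n_per_model=1000):
    """Each generator is assumed to map (noise, label) -> images of the same
    shape as the real ones; each may generalize along different axes."""
    datasets = [TensorDataset(real_images, real_labels)]
    num_classes = int(real_labels.max().item()) + 1
    for gen in generators:
        labels = torch.randint(0, num_classes, (n_per_model,))
        noise = torch.randn(n_per_model, 128)
        with torch.no_grad():
            images = gen(noise, labels)
        datasets.append(TensorDataset(images, labels))
    return ConcatDataset(datasets)

# Usage sketch with hypothetical stand-in generators:
fake_gen = lambda z, y: torch.randn(len(z), 3, 32, 32)
augmented = augment_with_generators(torch.randn(100, 3, 32, 32),
                                    torch.randint(0, 10, (100,)),
                                    generators=[fake_gen, fake_gen])
loader = DataLoader(augmented, batch_size=64, shuffle=True)  # feed to the classifier
print(len(augmented))  # real + generated examples
```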

00:51:10 The other thing that GANs are really good for

00:51:12 is not necessarily generating new data

00:51:17 that’s exactly like what you already have,

00:51:19 but generating new data that has different properties

00:51:23 from the data you already had.

00:51:25 One thing that you can do is you can create

00:51:27 differentially private data.

00:51:29 So suppose that you have something like medical records

00:51:31 and you don’t want to train a classifier

00:51:33 on the medical records and then publish the classifier

00:51:36 because someone might be able to reverse engineer

00:51:38 some of the medical records you trained on.

00:51:40 There’s a paper from Casey Greene’s lab

00:51:42 that shows how you can train a GAN

00:51:45 using differential privacy.

00:51:47 And then the samples from the GAN

00:51:49 still have the same differential privacy guarantees

00:51:51 as the parameters of the GAN.

00:51:52 So you can make fake patient data

00:51:55 for other researchers to use.

00:51:57 And they can do almost anything they want with that data

00:51:59 because it doesn’t come from real people.

00:52:02 And the differential privacy mechanism

00:52:04 gives you clear guarantees

00:52:06 on how much the original people’s data has been protected.
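
Setting that paper's exact mechanism aside, the generic recipe for training any network, including a GAN's discriminator, with differential privacy is DP-SGD: clip each example's gradient and add calibrated noise before the update. A minimal sketch, with the clipping norm and noise multiplier as illustrative assumptions:

```python
# Generic DP-SGD sketch (not the specific mechanism from the paper mentioned):
# clip each example's gradient and add Gaussian noise before the update, so the
# trained parameters, and anything later sampled from the resulting GAN, carry
# a differential privacy guarantee. Clip norm and noise scale are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

model = nn.Linear(20, 1)                       # stand-in for a discriminator
opt = torch.optim.SGD(model.parameters(), lr=0.1)
CLIP, NOISE = 1.0, 1.1                         # per-example clip norm, noise multiplier

def dp_sgd_step(batch_x, batch_y):
    summed = [torch.zeros_like(p) for p in model.parameters()]
    for x, y in zip(batch_x, batch_y):         # per-example gradients
        model.zero_grad()
        loss = F.binary_cross_entropy_with_logits(model(x.unsqueeze(0)), y.view(1, 1))
        loss.backward()
        grads = [p.grad.detach().clone() for p in model.parameters()]
        norm = torch.sqrt(sum(g.pow(2).sum() for g in grads))
        scale = torch.clamp(CLIP / (norm + 1e-12), max=1.0)   # clip to norm <= CLIP
        for s, g in zip(summed, grads):
            s.add_(g * scale)
    # Noisy average gradient drives the actual parameter update.
    for p, s in zip(model.parameters(), summed):
        p.grad = (s + NOISE * CLIP * torch.randn_like(s)) / len(batch_x)
    opt.step()

dp_sgd_step(torch.randn(8, 20), torch.randint(0, 2, (8,)).float())
print("one differentially private discriminator update done")
```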

00:52:09 That’s really interesting, actually.

00:52:11 I haven’t heard you talk about that before.

00:52:13 In terms of fairness, I’ve seen from AAAI,

00:52:17 your talk, how can adversarial machine learning

00:52:21 help models be more fair with respect to sensitive variables?

00:52:25 Yeah, so there’s a paper from Amos Storkey’s lab

00:52:28 about how to learn machine learning models

00:52:31 that are incapable of using specific variables.

00:52:34 So say, for example, you wanted to make predictions

00:52:36 that are not affected by gender.

00:52:39 It isn’t enough to just leave gender

00:52:41 out of the input to the model.

00:52:42 You can often infer gender

00:52:44 from a lot of other characteristics.

00:52:45 Like say that you have the person’s name,

00:52:47 but you’re not told their gender.

00:52:48 Well, if their name is Ian, they’re kind of obviously a man.

00:52:53 So what you’d like to do is make a machine learning model

00:52:55 that can still take in a lot of different attributes

00:52:59 and make a really accurate informed prediction,

00:53:02 but be confident that it isn’t reverse engineering gender

00:53:05 or another sensitive variable internally.

00:53:08 You can do that using something very similar

00:53:10 to the domain adversarial approach,

00:53:12 where you have one player that’s a feature extractor

00:53:16 and another player that’s a feature analyzer.

00:53:19 And you want to make sure that the feature analyzer

00:53:21 is not able to guess the value of the sensitive variable

00:53:24 that you’re trying to keep private.
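
In the same spirit, here is a rough alternating-update sketch, not the exact method from that paper: the adversary tries to recover the sensitive variable from the features, and the encoder is penalized whenever it succeeds. Network sizes and the penalty weight are assumptions.

```python
# Rough alternating-update sketch of adversarial fairness training (a cousin of
# the domain adversarial sketch above, not the exact method from the paper
# mentioned). Network sizes and the penalty weight are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

encoder = nn.Sequential(nn.Linear(50, 32), nn.ReLU())  # feature extractor
predictor = nn.Linear(32, 1)                           # task head, e.g. a risk score
adversary = nn.Linear(32, 2)                           # tries to guess the sensitive variable

opt_main = torch.optim.Adam(list(encoder.parameters()) + list(predictor.parameters()), lr=1e-3)
opt_adv = torch.optim.Adam(adversary.parameters(), lr=1e-3)
LAMBDA = 1.0  # how strongly to punish leakage of the sensitive variable

def train_step(x, y, sensitive):
    # 1) Adversary: learn to recover the sensitive variable from frozen features.
    adv_loss = F.cross_entropy(adversary(encoder(x).detach()), sensitive)
    opt_adv.zero_grad()
    adv_loss.backward()
    opt_adv.step()

    # 2) Encoder + predictor: do the task well while making the adversary fail.
    h = encoder(x)
    task_loss = F.mse_loss(predictor(h).squeeze(1), y)
    leak_loss = F.cross_entropy(adversary(h), sensitive)
    main_loss = task_loss - LAMBDA * leak_loss
    opt_main.zero_grad()
    main_loss.backward()
    opt_main.step()
    return task_loss.item(), leak_loss.item()

print(train_step(torch.randn(16, 50), torch.randn(16), torch.randint(0, 2, (16,))))
```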

00:53:26 Right, yeah, I love this approach.

00:53:29 So with the features,

00:53:31 you’re not able to infer the sensitive variables.

00:53:36 That’s quite brilliant and simple, actually.

00:53:39 Another way I think that GANs in particular

00:53:42 could be used for fairness

00:53:44 would be to make something like a CycleGAN,

00:53:46 where you can take data from one domain

00:53:49 and convert it into another.

00:53:51 We’ve seen CycleGAN turning horses into zebras.

00:53:53 We’ve seen other unsupervised GANs made by Ming-Yu Liu

00:53:59 doing things like turning day photos into night photos.

00:54:03 I think for fairness,

00:54:04 you could imagine taking records for people in one group

00:54:08 and transforming them into analogous people in another group

00:54:11 and testing to see if they’re treated equitably

00:54:14 across those two groups.

00:54:16 There’s a lot of things that’d be hard to get right

00:54:18 to make sure that the conversion process itself is fair.

00:54:21 And I don’t think it’s anywhere near

00:54:23 something that we could actually use yet,

00:54:25 but if you could design that conversion process

00:54:27 very carefully, it might give you a way of doing audits

00:54:30 where you say, what if we took people from this group,

00:54:33 converted them into equivalent people in another group,

00:54:35 does the system actually treat them how it ought to?
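
A sketch of what such an audit could look like, assuming you already had a trusted converter between groups; the converter and decision model below are hypothetical stand-ins, not real trained systems.

```python
# Hypothetical audit sketch: run each record and its converted counterpart
# through the decision model and flag large disagreements. The converter and
# decision model below are stand-ins, not real trained systems.
import torch

def audit(decision_model, converter, records, tolerance=0.05):
    """Fraction of records whose decision changes by more than `tolerance`
    after converting the record into the analogous record in the other group."""
    with torch.no_grad():
        original = decision_model(records)
        converted = decision_model(converter(records))
        gap = (original - converted).abs()
    return (gap > tolerance).float().mean().item(), gap.max().item()

decision_model = torch.nn.Sequential(torch.nn.Linear(20, 1), torch.nn.Sigmoid())
converter = lambda x: x  # a real one might be a carefully validated CycleGAN-style model
records = torch.randn(100, 20)
flagged, worst = audit(decision_model, converter, records)
print(f"{flagged:.1%} of records flagged; worst decision gap {worst:.3f}")
```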

00:54:38 That’s also really interesting.

00:54:41 You know, in popular press and in general,

00:54:47 in our imagination, you think,

00:54:49 well, GANs are able to generate data

00:54:51 and you start to think about deep fakes

00:54:54 or being able to sort of maliciously generate data

00:54:57 that fakes the identity of other people.

00:55:01 Is this something of a concern to you?

00:55:03 Is this something, if you look 10, 20 years into the future,

00:55:06 is that something that pops up in your work,

00:55:10 in the work of the community

00:55:11 that’s working on generating models?

00:55:13 I’m a lot less concerned about 20 years from now

00:55:15 than the next few years.

00:55:17 I think there’ll be a kind of bumpy cultural transition

00:55:20 as people encounter this idea

00:55:23 that there can be very realistic videos

00:55:24 and audio that aren’t real.

00:55:26 I think 20 years from now,

00:55:28 people will mostly understand

00:55:30 that you shouldn’t believe something is real

00:55:31 just because you saw a video of it.

00:55:34 People will expect to see

00:55:35 that it’s been cryptographically signed

00:55:38 or have some other mechanism to make them believe

00:55:41 that the content is real.

00:55:44 There’s already people working on this.

00:55:45 Like there’s a startup called Truepic

00:55:47 that provides a lot of mechanisms

00:55:50 for authenticating that an image is real.

00:55:52 They’re maybe not quite up to having a state actor

00:55:56 try to evade their verification techniques,

00:55:59 but it’s something that people are already working on

00:56:02 and I think we’ll get right eventually.

00:56:04 So you think authentication will eventually win out.

00:56:08 So being able to authenticate that this is real

00:56:10 and this is not.

00:56:11 Yeah.

00:56:13 As opposed to GANs just getting better and better

00:56:15 or generative models being able to get better and better

00:56:18 to where the nature of what is real becomes unclear.

00:56:21 I don’t think we’ll ever be able

00:56:22 to look at the pixels of a photo

00:56:25 and tell you for sure that it’s real or not real.

00:56:28 And I think it would actually be somewhat dangerous

00:56:32 to rely on that approach too much.

00:56:35 If you make a really good fake detector

00:56:36 and then someone’s able to fool your fake detector

00:56:38 and your fake detector says this image is not fake,

00:56:42 then it’s even more credible

00:56:43 than if you’ve never made a fake detector

00:56:45 in the first place.

00:56:46 What I do think we’ll get to is systems

00:56:50 that we can kind of use behind the scenes

00:56:53 to make estimates of what’s going on

00:56:55 and maybe not like use them in court

00:56:57 for a definitive analysis.

00:56:59 I also think we will likely get better authentication systems

00:57:04 where, imagine that every phone cryptographically signs

00:57:08 everything that comes out of it.

00:57:10 You wouldn’t be able to conclusively tell

00:57:12 that an image was real,

00:57:14 but you would be able to tell somebody

00:57:17 who knew the appropriate private key for this phone

00:57:21 was actually able to sign this image

00:57:24 and upload it to this server at this timestamp.

00:57:27 Okay, so you could imagine maybe you make phones

00:57:31 that have the private keys hardware embedded in them.

00:57:35 If like a state security agency

00:57:37 really wants to infiltrate the company,

00:57:39 they could probably plant a private key of their choice

00:57:42 or break open the chip and learn the private key

00:57:45 or something like that.

00:57:46 But it would make it a lot harder

00:57:47 for an adversary with fewer resources to fake things.

00:57:51 For most of us it would be okay.
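
A small sketch of what that per-device signing could look like with an Ed25519 key, using Python's cryptography package; the key handling and upload flow here are simplified assumptions.

```python
# Sketch of per-device signing: the device signs (image bytes + timestamp) with
# a private key, and anyone holding the matching public key can later verify
# that exactly this image was signed by that key at that time. The key handling
# and upload flow here are simplified assumptions.
import time
from cryptography.exceptions import InvalidSignature
from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey

device_key = Ed25519PrivateKey.generate()   # in practice: embedded in the phone's hardware
public_key = device_key.public_key()        # registered with the upload server

def sign_capture(image_bytes: bytes):
    timestamp = str(int(time.time())).encode()
    message = image_bytes + b"|" + timestamp
    return message, device_key.sign(message)

def verify_capture(message: bytes, signature: bytes) -> bool:
    try:
        public_key.verify(signature, message)   # raises if the content was altered
        return True
    except InvalidSignature:
        return False

message, signature = sign_capture(b"\x89PNG...raw image bytes here")
print(verify_capture(message, signature))                # True
print(verify_capture(message + b"tampered", signature))  # False
```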

00:57:53 So you mentioned the beer and the bar and the new ideas.

00:57:58 You were able to implement this

00:57:59 or come up with this new idea pretty quickly

00:58:02 and implement it pretty quickly.

00:58:04 Do you think there’s still many such groundbreaking ideas

00:58:07 in deep learning that could be developed so quickly?

00:58:10 Yeah, I do think that there are a lot of ideas

00:58:12 that can be developed really quickly.

00:58:15 GANs were probably a little bit of an outlier

00:58:17 on the whole like one hour timescale.

00:58:20 But just in terms of like low resource ideas

00:58:24 where you do something really different

00:58:25 on the algorithm scale and get a big payback,

00:58:30 I think it’s not as likely that you’ll see that

00:58:31 in terms of things like core machine learning technologies

00:58:34 like a better classifier

00:58:36 or a better reinforcement learning algorithm

00:58:38 or a better generative model.

00:58:41 If I had the GAN idea today,

00:58:42 it would be a lot harder to prove that it was useful

00:58:45 than it was back in 2014

00:58:46 because I would need to get it running

00:58:49 on something like ImageNet or CelebA at high resolution.

00:58:54 You know, those take a while to train.

00:58:55 You couldn’t train it in an hour

00:58:57 and know that it was something really new and exciting.

00:59:01 Back in 2014, training on MNIST was enough.

00:59:04 But there are other areas of machine learning

00:59:06 where I think a new idea

00:59:09 could actually be developed really quickly

00:59:11 with low resources.

00:59:13 What’s your intuition about what areas

00:59:15 of machine learning are ripe for this?

00:59:17 Yeah, so I think fairness and interpretability

00:59:23 are areas where we just really don’t have any idea

00:59:27 how anything should be done yet.

00:59:29 Like for interpretability,

00:59:30 I don’t think we even have the right definitions.

00:59:32 And even just defining a really useful concept,

00:59:36 you don’t even need to run any experiments,

00:59:38 could have a huge impact on the field.

00:59:40 We’ve seen that, for example, in differential privacy

00:59:42 that Cynthia Dwork and her collaborators

00:59:45 made this technical definition of privacy

00:59:48 where before a lot of things were really mushy.

00:59:50 And then with that definition,

00:59:51 you could actually design randomized algorithms

00:59:54 for accessing databases and guarantee

00:59:56 that they preserved individual people’s privacy

00:59:58 in like a mathematical quantitative sense.
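
For reference, the definition being alluded to can be stated precisely: a randomized mechanism M is epsilon-differentially private if, for any two datasets D and D' differing in one person's record and for every set of outputs S,

```latex
% epsilon-differential privacy (Dwork et al.): for all neighboring datasets
% D, D' differing in a single person's record, and every set of outputs S,
\Pr[\,\mathcal{M}(D) \in S\,] \;\le\; e^{\epsilon}\,\Pr[\,\mathcal{M}(D') \in S\,]
```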

01:00:03 Right now, we all talk a lot about

01:00:05 how interpretable different machine learning algorithms are,

01:00:07 but it’s really just people’s opinion.

01:00:09 And everybody probably has a different idea

01:00:11 of what interpretability means in their head.

01:00:13 If we could define some concept related to interpretability

01:00:16 that’s actually measurable,

01:00:18 that would be a huge leap forward

01:00:20 even without a new algorithm that increases that quantity.

01:00:24 And also once we had the definition of differential privacy,

01:00:28 it was fast to get the algorithms that guaranteed it.

01:00:31 So you could imagine once we have definitions

01:00:33 of good concepts and interpretability,

01:00:35 we might be able to provide the algorithms

01:00:37 that have the interpretability guarantees quickly too.

01:00:40 So what do you think it takes to build a system

01:00:46 with human level intelligence

01:00:48 as we quickly venture into the philosophical?

01:00:51 So artificial general intelligence, what do you think it takes?

01:00:55 I think that it definitely takes better environments

01:01:01 than we currently have for training agents

01:01:03 where we want them to have

01:01:05 a really wide diversity of experiences.

01:01:08 I also think it’s gonna take really a lot of computation.

01:01:11 It’s hard to imagine exactly how much.

01:01:13 So you’re optimistic about simulation,

01:01:16 simulating a variety of environments as the path forward?

01:01:19 I think it’s a necessary ingredient.

01:01:21 Yeah, I don’t think that we’re going to get

01:01:24 to artificial general intelligence

01:01:27 by training on fixed data sets

01:01:29 or by thinking really hard about the problem.

01:01:32 I think that the agent really needs to interact

01:01:35 and have a variety of experiences within the same lifespan.

01:01:41 And today we have many different models

01:01:44 that can each do one thing.

01:01:45 And we tend to train them on one data set

01:01:47 or one RL environment.

01:01:50 Sometimes there are actually papers

01:01:51 about getting one set of parameters to perform well

01:01:54 in many different RL environments.

01:01:56 But we don’t really have anything like an agent

01:01:59 that goes seamlessly from one type of experience to another

01:02:02 and really integrates all the different things

01:02:05 that it does over the course of its life.

01:02:08 When we do see multi environment agents,

01:02:10 the environments

01:02:12 they’re trained on

01:02:14 tend to be fairly similar.

01:02:16 Like all of them are playing like an action based video game.

01:02:20 We don’t really have an agent that goes from

01:02:23 playing a video game to like reading the Wall Street Journal

01:02:27 to predicting how effective a molecule will be as a drug

01:02:31 or something like that.

01:02:33 What do you think is a good test for intelligence

01:02:35 in your view?

01:02:37 There’s been a lot of benchmarks started with the,

01:02:40 with Alan Turing,

01:02:41 natural conversation being a good benchmark for intelligence.

01:02:46 What would Ian Goodfellow sit back

01:02:51 and be really damn impressed

01:02:53 if a system was able to accomplish?

01:02:56 Something that doesn’t take a lot of glue

01:02:58 from human engineers.

01:02:59 So imagine that instead of having to

01:03:03 go to the CIFAR website and download CIFAR-10

01:03:07 and then write a Python script to parse it and all that,

01:03:11 you could just point an agent at the CIFAR-10 problem

01:03:16 and it downloads and extracts the data

01:03:19 and trains a model and starts giving you predictions.

01:03:22 I feel like something that doesn’t need to have

01:03:25 every step of the pipeline assembled for it,

01:03:28 definitely understands what it’s doing.

01:03:30 Is AutoML moving into that direction

01:03:32 or are you thinking way even bigger?

01:03:34 AutoML has mostly been moving toward,

01:03:38 once we’ve built all the glue,

01:03:39 can the machine learning system

01:03:42 design the architecture really well?

01:03:44 And so I’m more of saying like,

01:03:47 if something knows how to preprocess the data

01:03:49 so that it successfully accomplishes the task,

01:03:52 then it would be very hard to argue

01:03:53 that it doesn’t truly understand the task

01:03:56 in some fundamental sense.

01:03:58 And I don’t necessarily know that that’s like

01:04:00 the philosophical definition of intelligence,

01:04:02 but that’s something that would be really cool to build

01:04:03 that would be really useful and would impress me

01:04:05 and would convince me that we’ve made a step forward

01:04:08 in real AI.

01:04:09 So you give it like the URL for Wikipedia

01:04:13 and then next day expect it to be able to solve CIFAR-10.

01:04:18 Or like you type in a paragraph

01:04:20 explaining what you want it to do

01:04:22 and it figures out what web searches it should run

01:04:24 and downloads all the necessary ingredients.

01:04:28 So you have a very clear, calm way of speaking,

01:04:34 no ums, easy to edit.

01:04:37 I’ve seen comments where both you and I

01:04:40 have been identified as potentially being robots.

01:04:44 If you have to prove to the world that you are indeed human,

01:04:47 how would you do it?

01:04:48 I can understand thinking that I’m a robot.

01:04:55 It’s the flip side of the Turing test, I think.

01:04:57 Yeah, yeah, the prove you’re human test.

01:05:01 Intellectually, so you have to…

01:05:04 Is there something that’s truly unique in your mind?

01:05:08 Does it go back to just natural language again?

01:05:11 Just being able to talk your way out of it.

01:05:13 Proving that I’m not a robot with today’s technology.

01:05:17 Yeah, that’s pretty straightforward.

01:05:18 Like my conversation today hasn’t veered off

01:05:20 into talking about the stock market or something

01:05:24 because of my training data.

01:05:25 But I guess more generally trying to prove

01:05:28 that something is real from the content alone

01:05:30 is incredibly hard.

01:05:31 That’s one of the main things I’ve gotten

01:05:32 out of my GAN research,

01:05:33 that you can simulate almost anything.

01:05:37 And so you have to really step back to a separate channel

01:05:41 to prove that something is real.

01:05:42 So like, I guess I should have had myself

01:05:45 stamped on a blockchain when I was born or something,

01:05:47 but I didn’t do that.

01:05:48 So according to my own research methodology,

01:05:50 there’s just no way to know at this point.

01:05:52 So, last question: what problem stands out for you

01:05:56 that you’re really excited about challenging

01:05:58 in the near future?

01:05:59 So I think resistance to adversarial examples,

01:06:02 figuring out how to make machine learning secure

01:06:05 against an adversary who wants to interfere

01:06:07 and control it, that is one of the most important things

01:06:10 researchers today could solve.

01:06:12 In all domains, image, language, driving, and everything.

01:06:17 I guess I’m most concerned about domains

01:06:19 we haven’t really encountered yet.

01:06:21 Like imagine 20 years from now,

01:06:24 when we’re using advanced AIs to do things

01:06:26 we haven’t even thought of yet.

01:06:28 Like if you ask people,

01:06:30 what are the important problems in security of phones

01:06:35 in like 2002?

01:06:37 I don’t think we would have anticipated

01:06:38 that we’re using them for nearly as many things

01:06:42 as we’re using them for today.

01:06:43 I think it’s gonna be like that with AI

01:06:44 that you can kind of try to speculate

01:06:46 about where it’s going,

01:06:47 but really the business opportunities

01:06:49 that end up taking off would be hard

01:06:52 to predict ahead of time.

01:06:54 What you can predict ahead of time

01:06:55 is that almost anything you can do with machine learning,

01:06:58 you would like to make sure

01:06:59 that people can’t get it to do what they want

01:07:03 rather than what you want,

01:07:04 just by showing it a funny QR code

01:07:06 or a funny input pattern.

01:07:08 And you think that the set of methodology to do that

01:07:10 can be bigger than any one domain?

01:07:13 I think so, yeah.

01:07:14 Yeah, like one methodology that I think is,

01:07:19 not a specific methodology,

01:07:20 but like a category of solutions

01:07:22 that I’m excited about today is making dynamic models

01:07:25 that change every time they make a prediction.

01:07:28 So right now we tend to train models

01:07:31 and then after they’re trained, we freeze them

01:07:33 and we just use the same rule

01:07:35 to classify everything that comes in from then on.

01:07:38 That’s really a sitting duck from a security point of view.

01:07:41 If you always output the same answer for the same input,

01:07:45 then people can just run inputs through

01:07:48 until they find a mistake that benefits them.

01:07:50 And then they use the same mistake

01:07:51 over and over and over again.

01:07:54 I think having a model that updates its predictions

01:07:56 so that it’s harder to predict what you’re gonna get

01:08:00 will make it harder for an adversary

01:08:02 to really take control of the system

01:08:04 and make it do what they want it to do.
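
One very simple member of that category, as a hedged sketch: answer each query with a randomly chosen ensemble member on a slightly noised input, so a repeated identical input doesn't map to one fixed, probe-able decision rule. Ensemble size and noise scale are assumptions.

```python
# Minimal sketch of a "moving target" classifier: each query is answered by a
# randomly chosen ensemble member on a slightly noised input, so repeated
# identical queries don't expose one fixed decision rule to probe.
# Ensemble size and noise scale are illustrative assumptions.
import random
import torch
import torch.nn as nn

class DynamicClassifier:
    def __init__(self, num_members=5, noise_scale=0.05):
        # Stand-ins for independently trained models (different seeds / data splits).
        self.members = [nn.Linear(20, 10) for _ in range(num_members)]
        self.noise_scale = noise_scale

    def predict(self, x):
        with torch.no_grad():
            member = random.choice(self.members)      # a different rule on each call
            x_noisy = x + self.noise_scale * torch.randn_like(x)
            return member(x_noisy).argmax(dim=-1)

clf = DynamicClassifier()
query = torch.randn(1, 20)
# The same input can legitimately get different answers on different calls,
# which makes a single adversarial input harder to find and reuse.
print([clf.predict(query).item() for _ in range(5)])
```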

01:08:06 Yeah, models that maintain a bit of a sense of mystery

01:08:09 about them, because they always keep changing.

01:08:12 Ian, thanks so much for talking today, it was awesome.

01:08:14 Thank you for coming in, it’s great to see you.