Rohit Prasad: Amazon Alexa and Conversational AI #57

Transcript

00:00:00 The following is a conversation with Rohit Prasad.

00:00:02 He’s the vice president and head scientist of Amazon Alexa

00:00:06 and one of its original creators.

00:00:08 The Alexa team embodies some of the most challenging,

00:00:12 incredible, impactful, and inspiring work

00:00:14 that is done in AI today.

00:00:17 The team has to both solve problems

00:00:19 at the cutting edge of natural language processing

00:00:21 and provide a trustworthy, secure, and enjoyable experience

00:00:25 to millions of people.

00:00:27 This is where state of the art methods

00:00:29 in computer science meet the challenges

00:00:31 of real world engineering.

00:00:33 In many ways, Alexa and the other voice assistants

00:00:37 are the voices of artificial intelligence

00:00:39 to millions of people and an introduction to AI

00:00:43 for people who have only encountered it in science fiction.

00:00:46 This is an important and exciting opportunity.

00:00:49 So the work that Rohit and the Alexa team are doing

00:00:52 is an inspiration to me and to many researchers

00:00:55 and engineers in the AI community.

00:00:58 This is the Artificial Intelligence Podcast.

00:01:01 If you enjoy it, subscribe on YouTube,

00:01:04 give it five stars on Apple Podcast, support it on Patreon,

00:01:07 or simply connect with me on Twitter,

00:01:09 at Lex Fridman, spelled F R I D M A N.

00:01:13 If you leave a review on Apple Podcasts especially,

00:01:16 but also Castbox, or comment on YouTube,

00:01:20 consider mentioning topics, people, ideas, questions,

00:01:22 quotes in science, tech, or philosophy

00:01:25 that you find interesting,

00:01:26 and I’ll read them on this podcast.

00:01:28 I won’t call out names, but I love comments

00:01:31 with kindness and thoughtfulness in them,

00:01:33 so I thought I’d share them.

00:01:35 Someone on YouTube highlighted a quote

00:01:37 from the conversation with Ray Dalio,

00:01:40 where he said that you have to appreciate

00:01:41 all the different ways that people can be A players.

00:01:45 This connected with me. On teams of engineers,

00:01:48 it’s easy to think that raw productivity

00:01:50 is the measure of excellence, but there are others.

00:01:53 I’ve worked with people who brought a smile to my face

00:01:55 every time I got to work in the morning.

00:01:57 Their contribution to the team is immeasurable.

00:02:01 I recently started doing podcast ads

00:02:03 at the end of the introduction.

00:02:04 I’ll do one or two minutes after introducing the episode,

00:02:07 and never any ads in the middle

00:02:09 that break the flow of the conversation.

00:02:11 I hope that works for you

00:02:13 and doesn’t hurt the listening experience.

00:02:15 This show is presented by Cash App,

00:02:17 the number one finance app in the App Store.

00:02:20 I personally use Cash App to send money to friends,

00:02:23 but you can also use it to buy, sell,

00:02:24 and deposit Bitcoin in just seconds.

00:02:27 Cash App also has a new investing feature.

00:02:30 You can buy fractions of a stock, say $1 worth,

00:02:33 no matter what the stock price is.

00:02:35 Brokerage services are provided by Cash App Investing,

00:02:38 a subsidiary of Square and member SIPC.

00:02:42 I’m excited to be working with Cash App

00:02:44 to support one of my favorite organizations called FIRST,

00:02:47 best known for their FIRST Robotics and LEGO competitions.

00:02:50 They educate and inspire hundreds of thousands of students

00:02:54 in over 110 countries, and have a perfect rating

00:02:57 on Charity Navigator, which means that donated money

00:03:00 is used to maximum effectiveness.

00:03:03 When you get Cash App from the App Store or Google Play

00:03:06 and use code LexPodcast, you’ll get $10,

00:03:10 and Cash App will also donate $10 to FIRST,

00:03:13 which again, is an organization that I’ve personally seen

00:03:16 inspire girls and boys to dream

00:03:19 of engineering a better world.

00:03:20 This podcast is also supported by ZipRecruiter.

00:03:24 Hiring great people is hard, and to me,

00:03:26 is one of the most important elements

00:03:28 of a successful mission driven team.

00:03:31 I’ve been fortunate to be a part of,

00:03:33 and lead several great engineering teams.

00:03:35 The hiring I’ve done in the past was mostly through tools

00:03:38 we built ourselves, but reinventing the wheel was painful.

00:03:42 ZipRecruiter is a tool that’s already available for you.

00:03:45 It seeks to make hiring simple, fast, and smart.

00:03:49 For example, Codable cofounder, Gretchen Huebner,

00:03:52 used ZipRecruiter to find a new game artist

00:03:55 to join her education tech company.

00:03:57 By using ZipRecruiter’s screening questions

00:03:59 to filter candidates, Gretchen found it easier

00:04:02 to focus on the best candidates,

00:04:03 and finally hired the perfect person for the role,

00:04:06 in less than two weeks, from start to finish.

00:04:10 ZipRecruiter, the smartest way to hire.

00:04:13 See why ZipRecruiter is effective for businesses

00:04:15 of all sizes by signing up,

00:04:17 as I did, for free, at ziprecruiter.com slash lexpod.

00:04:23 That’s ziprecruiter.com slash lexpod.

00:04:27 And now, here’s my conversation with Rohit Prasad.

00:04:33 In the movie Her, I’m not sure if you’ve ever seen it.

00:04:36 A human falls in love with the voice of an AI system.

00:04:39 Let’s start at the highest philosophical level

00:04:42 before we get to deep learning and some of the fun things.

00:04:45 Do you think this, what the movie Her shows,

00:04:48 is within our reach?

00:04:51 I think not specifically about Her,

00:04:54 but I think what we are seeing is a massive increase

00:04:59 in adoption of AI assistants, or AI,

00:05:02 in all parts of our social fabric.

00:05:05 And I think it’s, what I do believe,

00:05:08 is that the utility these AIs provide,

00:05:11 some of the functionalities that are shown

00:05:14 are absolutely within reach.

00:05:18 So some of the functionality

00:05:19 in terms of the interactive elements,

00:05:21 but in terms of the deep connection,

00:05:24 that’s purely voice based.

00:05:26 Do you think such a close connection is possible

00:05:29 with voice alone?

00:05:30 It’s been a while since I saw Her,

00:05:32 but I would say in terms of interactions

00:05:36 which are both human like and in these AI systems,

00:05:40 you have to value what is also superhuman.

00:05:44 We as humans can be in only one place.

00:05:47 AI assistants can be in multiple places at the same time.

00:05:51 One with you on your mobile device,

00:05:53 one at your home, one at work.

00:05:56 So you have to respect these superhuman capabilities too.

00:06:00 Plus as humans, we have certain attributes

00:06:03 we are very good at, very good at reasoning.

00:06:05 AI assistants are not yet there,

00:06:07 but in the realm of AI assistants,

00:06:10 what they’re great at is computation, memory,

00:06:12 it’s infinite and pure.

00:06:14 These are the attributes you have to start respecting.

00:06:16 So I think the comparison with human like

00:06:18 versus the other aspect, which is also superhuman,

00:06:21 has to be taken into consideration.

00:06:22 So I think we need to elevate the discussion

00:06:25 to not just human like.

00:06:27 So there’s certainly elements,

00:06:28 we just mentioned, Alexa is everywhere,

00:06:32 computationally speaking.

00:06:33 So this is a much bigger infrastructure

00:06:35 than just the thing that sits there in the room with you.

00:06:38 But it certainly feels to us mere humans

00:06:43 that there’s just another little creature there

00:06:47 when you’re interacting with it.

00:06:48 You’re not interacting with the entirety

00:06:49 of the infrastructure, you’re interacting with the device.

00:06:52 The feeling is, okay, sure, we anthropomorphize things,

00:06:56 but that feeling is still there.

00:06:58 So what do you think we as humans,

00:07:02 the purity of the interaction with a smart device,

00:07:04 interaction with a smart assistant,

00:07:06 what do you think we look for in that interaction?

00:07:10 I think certain interactions

00:07:12 will feel very much like a human

00:07:15 because it has a persona of its own.

00:07:19 And in certain ones it wouldn’t be.

00:07:20 So I think a simple example to think of it

00:07:23 is if you’re walking through the house

00:07:25 and you just wanna turn your lights on and off

00:07:27 and you’re issuing a command,

00:07:29 that’s not very much a human like interaction

00:07:32 and that’s where the AI shouldn’t come back

00:07:33 and have a conversation with you,

00:07:35 just it should simply complete that command.

00:07:38 So I think the blend is,

00:07:40 we have to remember this is not human to human alone.

00:07:43 It is a human machine interaction,

00:07:45 and certain aspects of it need to be human like,

00:07:48 and certain situations

00:07:49 demand it to be like a machine.

00:07:51 So I told you, it’s gonna be philosophical in parts.

00:07:55 What’s the difference between human and machine

00:07:57 in that interaction?

00:07:58 When we interact with humans,

00:08:00 especially those who are friends and loved ones,

00:08:04 versus you and a machine that you are also close with.

00:08:10 I think you have to think about the roles

00:08:12 the AI plays, right?

00:08:13 So, and it differs from different customer to customer,

00:08:16 different situation to situation,

00:08:18 especially I can speak from Alexa’s perspective.

00:08:21 It is a companion, a friend at times,

00:08:25 an assistant, an advisor down the line.

00:08:27 So I think most AIs will have this kind of attributes

00:08:31 and it will be very situational in nature.

00:08:33 So where is the boundary?

00:08:34 I think the boundary depends on exact context

00:08:37 in which you’re interacting with the AI.

00:08:39 So the depth and the richness

00:08:41 of natural language conversation

00:08:42 has been used by Alan Turing to try to define

00:08:48 what it means to be intelligent.

00:08:50 There’s a lot of criticism of that kind of test,

00:08:52 but what do you think is a good test of intelligence

00:08:55 in your view, in the context of the Turing test

00:08:58 and Alexa or the Alexa prize, this whole realm,

00:09:03 do you think about this human intelligence,

00:09:07 what it means to define it,

00:09:08 what it means to reach that level?

00:09:10 I do think the ability to converse

00:09:12 is a sign of an ultimate intelligence.

00:09:15 I think that there’s no question about it.

00:09:18 So if you think about all aspects of humans,

00:09:20 there are sensors we have,

00:09:22 and those are basically a data collection mechanism.

00:09:26 And based on that,

00:09:27 we make some decisions with our sensory brains, right?

00:09:30 And from that perspective,

00:09:32 I think there are elements we have to talk about

00:09:35 how we sense the world

00:09:37 and then how we act based on what we sense.

00:09:40 Those elements clearly machines have,

00:09:43 but then there’s the other aspects of computation

00:09:46 that is way better.

00:09:48 I also mentioned memory again,

00:09:50 in terms of being near infinite,

00:09:51 depending on the storage capacity you have,

00:09:54 and the retrieval can be extremely fast and pure

00:09:58 in terms of like, there’s no ambiguity

00:09:59 of who did I see when, right?

00:10:02 I mean, machines can remember that quite well.

00:10:04 So again, on a philosophical level,

00:10:06 I do subscribe to the fact that to be able to converse

00:10:10 and as part of that, to be able to reason

00:10:13 based on the world knowledge you’ve acquired

00:10:15 and the sensory knowledge that is there

00:10:18 is definitely very much the essence of intelligence.

00:10:23 But intelligence can go beyond human level intelligence

00:10:26 based on what machines are getting capable of.

00:10:29 So what do you think maybe stepping outside of Alexa

00:10:33 broadly as an AI field,

00:10:35 what do you think is a good test of intelligence?

00:10:38 Put it another way outside of Alexa,

00:10:41 because so much of Alexa is a product,

00:10:43 is an experience for the customer.

00:10:44 On the research side,

00:10:46 what would impress the heck out of you if you saw,

00:10:49 you know, what is the test where you said,

00:10:50 wow, this thing is now starting to encroach

00:10:57 into the realm of what we loosely think

00:10:59 of as human intelligence?

00:11:00 So, well, we think of it as AGI

00:11:02 and human intelligence altogether, right?

00:11:04 So in some sense, and I think we are quite far from that.

00:11:08 I think an unbiased view I have

00:11:11 is that Alexa’s intelligence capability is a great test.

00:11:17 I think of it as there are many other proof points

00:11:20 like self driving cars, game playing like go or chess.

00:11:26 Let’s take those two as an example,

00:11:28 clearly requires a lot of data driven learning

00:11:31 and intelligence, but it’s not as hard a problem

00:11:35 as conversing as an AI with humans

00:11:39 to accomplish certain tasks or open domain chat,

00:11:42 as you mentioned, Alexa prize.

00:11:44 In those settings, the key difference is

00:11:47 that the end goal is not defined, unlike game playing.

00:11:51 You also do not know exactly what state you are in

00:11:55 in a particular goal completion scenario.

00:11:58 In certain sense, sometimes you can,

00:12:00 if it’s a simple goal, but even in certain examples

00:12:04 like planning a weekend, you can imagine

00:12:07 how many things change along the way,

00:12:09 like whether you may change your mind

00:12:11 and you change the destination,

00:12:14 or you want to catch a particular event

00:12:17 and then you decide, no, I want this other event

00:12:19 I want to go to.

00:12:20 So these dimensions of how many different steps

00:12:24 are possible when you’re conversing as a human

00:12:26 with a machine makes it an extremely daunting problem.

00:12:29 And I think it is the ultimate test for intelligence.

00:12:32 And don’t you think that natural language is enough to prove

00:12:37 that conversation, just pure conversation?

00:12:40 From a scientific standpoint,

00:12:42 natural language is a great test,

00:12:45 but I would go beyond, I don’t want to limit it

00:12:47 to natural language as simply understanding an intent

00:12:51 or parsing for entities and so forth.

00:12:52 We are really talking about dialogue.

00:12:54 Dialogue, yeah.

00:12:55 So I would say human machine dialogue

00:12:58 is definitely one of the best tests of intelligence.
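As a rough editorial illustration of the distinction Prasad draws here, the following sketch shows what "simply understanding an intent or parsing for entities" looks like, as opposed to full dialogue. All intent names, patterns, and entities are hypothetical placeholders, not any real assistant's model:

```python
# Toy sketch of the intent-plus-entities layer of NLU, contrasted
# in the conversation with true dialogue. A real assistant uses
# learned models; this lookup table just shows the shape of the task.
import re

# Hypothetical intent patterns (illustrative only).
INTENT_PATTERNS = {
    "TurnOnLight": re.compile(r"\bturn on\b.*\blight", re.I),
    "GetWeather": re.compile(r"\bweather\b", re.I),
    "PlayMusic": re.compile(r"\bplay\b", re.I),
}

# Hypothetical entity gazetteer.
CITY_ENTITIES = {"seattle", "boston", "austin"}

def parse(utterance: str):
    """Return (intent, entities) for a single utterance."""
    intent = next(
        (name for name, pat in INTENT_PATTERNS.items() if pat.search(utterance)),
        "Unknown",
    )
    entities = [
        w for w in re.findall(r"[a-z]+", utterance.lower())
        if w in CITY_ENTITIES
    ]
    return intent, entities

print(parse("What's the weather in Seattle?"))  # ('GetWeather', ['seattle'])
```

Note how nothing here carries state across turns; that statelessness is exactly why intent parsing alone falls short of the human machine dialogue Prasad calls one of the best tests of intelligence.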

00:13:02 So can you briefly speak to the Alexa Prize

00:13:06 for people who are not familiar with it,

00:13:08 and also just maybe where things stand

00:13:12 and what have you learned and what’s surprising?

00:13:15 What have you seen that’s surprising

00:13:16 from this incredible competition?

00:13:18 Absolutely, it’s a very exciting competition.

00:13:20 Alexa Prize is essentially a grand challenge

00:13:24 in conversational artificial intelligence,

00:13:26 where we threw the gauntlet to the universities

00:13:29 who do active research in the field,

00:13:31 to say, can you build what we call a social bot

00:13:35 that can converse with you coherently

00:13:37 and engagingly for 20 minutes?

00:13:39 That is an extremely hard challenge,

00:13:42 talking to someone who you’re meeting for the first time,

00:13:46 or even if you’ve met them quite often,

00:13:49 to speak for 20 minutes on any topic,

00:13:53 an evolving nature of topics is super hard.

00:13:57 We have completed two successful years of the competition.

00:14:01 The first was won by the University of Washington,

00:14:03 the second by the University of California.

00:14:05 We are in our third instance.

00:14:06 We have an extremely strong cohort of 10 teams,

00:14:09 and the third instance of the Alexa Prize is underway now.

00:14:14 And we are seeing a constant evolution.

00:14:17 First year was definitely a learning experience.

00:14:18 It was a lot of things to be put together.

00:14:21 We had to build a lot of infrastructure

00:14:23 to enable these universities

00:14:25 to be able to build magical experiences

00:14:28 and do high quality research.

00:14:31 Just a few quick questions, sorry for the interruption.

00:14:33 What does failure look like in the 20 minute session?

00:14:37 So what does it mean to fail,

00:14:38 not to reach the 20 minute mark?

00:14:39 Oh, awesome question.

00:14:41 So first of all,

00:14:43 I forgot to mention one more detail.

00:14:45 It’s not just 20 minutes,

00:14:46 but the quality of the conversation too that matters.

00:14:49 And the beauty of this competition

00:14:51 before I answer that question on what failure means

00:14:53 is first that you actually converse

00:14:56 with millions and millions of customers

00:14:59 as the social bots.

00:15:00 So during the judging phases, there are multiple phases,

00:15:05 before we get to the finals,

00:15:06 which is a very controlled judging in a situation

00:15:08 where we bring in judges

00:15:10 and we have interactors who interact with these social bots,

00:15:14 that is a much more controlled setting.

00:15:15 But till the point we get to the finals,

00:15:18 all the judging is essentially by the customers of Alexa.

00:15:22 And there you basically rate on a simple question,

00:15:26 how good your experience was.

00:15:28 So that’s where we are not testing

00:15:29 for a 20 minute boundary being crossed,

00:15:32 because you do want a clear cut

00:15:36 winner to be chosen, and it’s an absolute bar.

00:15:40 So whether you really broke that 20 minute barrier

00:15:42 is why we have to test it in a more controlled setting

00:15:45 with actors, essentially interactors.

00:15:48 And see how the conversation goes.

00:15:50 So this is why it’s a subtle difference

00:15:54 between how it’s being tested in the field

00:15:57 with real customers versus in the lab to award the prize.

00:16:00 So on the latter one, what it means is that

00:16:03 essentially there are three judges

00:16:08 and two of them have to say

00:16:09 this conversation has stalled, essentially.

00:16:13 Got it.

00:16:13 And the judges are human experts.

00:16:15 Judges are human experts.

00:16:16 Okay, great.

00:16:17 So this is in the third year.

00:16:19 So what’s been the evolution?

00:16:20 How far, so the DARPA challenge in the first year,

00:16:24 the autonomous vehicles, nobody finished.

00:16:26 In the second year, a few more finished in the desert.

00:16:30 So how far along in this,

00:16:33 I would say much harder challenge are we?

00:16:36 This challenge has come a long way

00:16:37 to the extent that we’re definitely not close

00:16:40 to the 20 minute barrier being crossed with coherent

00:16:42 and engaging conversation.

00:16:44 I think we are still five to 10 years away

00:16:46 in that horizon to complete that.

00:16:49 But the progress is immense.

00:16:51 Like what you’re finding is the accuracy

00:16:54 and what kind of responses these social bots generate

00:16:57 is getting better and better.

00:16:59 What’s even amazing is to see that now there’s humor coming in.

00:17:03 The bots are quite…

00:17:04 Awesome.

00:17:05 You know, you’re talking about

00:17:07 ultimate science of intelligence.

00:17:09 I think humor is a very high bar

00:17:11 in terms of what it takes to create humor.

00:17:14 And I don’t mean just being goofy.

00:17:16 I really mean good sense of humor

00:17:19 is also a sign of intelligence in my mind

00:17:21 and something very hard to do.

00:17:23 So these social bots are now exploring

00:17:25 not only what we think of as natural language abilities,

00:17:28 but also personality attributes

00:17:30 and aspects of when to inject an appropriate joke,

00:17:34 when you don’t know the domain,

00:17:38 how you come back with something more intelligible

00:17:41 so that you can continue the conversation.

00:17:43 If you and I are talking about AI

00:17:45 and we are domain experts, we can speak to it.

00:17:47 But if you suddenly switch a topic to that I don’t know of,

00:17:50 how do I change the conversation?

00:17:52 So you’re starting to notice these elements as well.

00:17:55 And that’s coming partly from the nature

00:17:58 of the 20 minute challenge

00:18:00 that people are getting quite clever

00:18:02 on how to really converse

00:18:05 and essentially mask some of the understanding defects

00:18:08 if they exist.

00:18:09 So some of this, this is not Alexa, the product.

00:18:12 This is somewhat for fun, for research,

00:18:16 for innovation and so on.

00:18:17 I have a question sort of in this modern era,

00:18:20 there’s a lot of, if you look at Twitter and Facebook

00:18:24 and so on, there’s discourse, public discourse going on

00:18:27 and some things that are a little bit too edgy,

00:18:28 people get blocked and so on.

00:18:30 I’m just out of curiosity,

00:18:32 are people in this context pushing the limits?

00:18:35 Is anyone using the F word?

00:18:37 Is anyone sort of pushing back

00:18:41 sort of arguing, I guess I should say,

00:18:45 as part of the dialogue to really draw people in?

00:18:48 First of all, let me just back up a bit

00:18:50 in terms of why we are doing this, right?

00:18:52 So you said it’s fun.

00:18:54 I think fun is more part of the engaging part for customers.

00:18:59 It is one of the most used skills as well

00:19:02 in our skill store.

00:19:04 But that apart, the real goal was essentially

00:19:07 what was happening is with a lot of AI research

00:19:10 moving to industry, we felt that academia has the risk

00:19:14 of not being able to have the same resources

00:19:16 at disposal that we have, which is lots of data,

00:19:20 massive computing power, and clear ways

00:19:24 to test these AI advances with real customer benefits.

00:19:28 So we brought all these three together in the Alexa Prize.

00:19:30 That’s why it’s one of my favorite projects in Amazon.

00:19:33 And with that, the secondary effect is yes,

00:19:37 it has become engaging for our customers as well.

00:19:40 We’re not there in terms of where we want it to be, right?

00:19:43 But it’s a huge progress.

00:19:45 But coming back to your question on

00:19:47 how do the conversations evolve?

00:19:48 Yes, there are some natural attributes of what you said

00:19:51 in terms of argument and some amount of swearing.

00:19:54 The way we take care of that is that there is

00:19:57 a sensitive filter we have built that detects keywords.

00:20:00 It’s more than keywords, a little more in terms of,

00:20:03 of course, there’s keyword based too,

00:20:04 but there’s more in terms of these words can be

00:20:07 very contextual, as you can see,

00:20:09 and also the topic can be something

00:20:12 that you don’t want a conversation to happen on,

00:20:15 because this is a communal device as well.

00:20:17 A lot of people use these devices.

00:20:19 So we have put a lot of guardrails for the conversation

00:20:22 to be more useful for advancing AI

00:20:25 and not so much about these other issues you attributed

00:20:31 to what’s happening in the AI field as well.
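The guardrail Prasad describes, a keyword pass combined with a contextual check on the topic, can be sketched roughly as below. The word lists, topic labels, and weighting are hypothetical placeholders for illustration, not Amazon's actual filter:

```python
# Minimal sketch of a keyword-plus-context sensitive-content filter.
# Hypothetical placeholder lists; a production filter would use
# learned classifiers, not hard-coded sets.
BLOCKED_KEYWORDS = {"swearword1", "swearword2"}
SENSITIVE_TOPICS = {"violence", "medical_advice"}

def is_blocked(utterance: str, predicted_topic: str) -> bool:
    words = set(utterance.lower().split())
    # Keyword pass: fast, but blind to context.
    if words & BLOCKED_KEYWORDS:
        return True
    # Contextual pass: individually harmless words can still form a
    # conversation on a topic unsuited to a communal device, so a
    # topic classifier's label is checked as well.
    return predicted_topic in SENSITIVE_TOPICS

print(is_blocked("tell me a joke", "humor"))       # False
print(is_blocked("tell me about it", "violence"))  # True
```

The two-pass structure mirrors the point made in the conversation: keywords alone are not enough, because "these words can be very contextual."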

00:20:32 Right, so this is actually a serious opportunity.

00:20:35 I didn’t use the right word, fun.

00:20:36 I think it’s an open opportunity to do

00:20:39 some of the best innovation

00:20:42 in conversational agents in the world.

00:20:44 Absolutely.

00:20:45 Why just universities?

00:20:49 Why just universities?

00:20:49 Because as I said, I really felt

00:20:51 Young minds.

00:20:52 Young minds, it’s also to,

00:20:55 if you think about the other aspect

00:20:57 of where the whole industry is moving with AI,

00:21:01 there’s a dearth of talent given the demands.

00:21:04 So you do want universities to have a clear place

00:21:09 where they can invent and research

00:21:11 and not fall behind such that they can’t motivate students.

00:21:13 Imagine if all grad students left for industry like us,

00:21:19 or faculty members, which has happened too.

00:21:22 So this is a way that if you’re so passionate

00:21:25 about the field where you feel industry and academia

00:21:28 need to work well, this is a great example

00:21:31 and a great way for universities to participate.

00:21:35 So what do you think it takes to build a system

00:21:37 that wins the Alexa Prize?

00:21:39 I think you have to start focusing on aspects of reasoning.

00:21:46 As it is, there are still more lookups

00:21:50 of what intents customers are asking for

00:21:54 and responding to those rather than really reasoning

00:21:58 about the elements of the conversation.

00:22:02 For instance, if you’re playing,

00:22:06 if the conversation is about games

00:22:08 and it’s about a recent sports event,

00:22:11 there’s so much context involved

00:22:13 and you have to understand the entities

00:22:15 that are being mentioned

00:22:17 so that the conversation is coherent

00:22:19 rather than you suddenly just switch to knowing some fact

00:22:23 about a sports entity and you’re just relaying that

00:22:26 rather than understanding the true context of the game.

00:22:28 Like if you just said, I learned this fun fact

00:22:32 about Tom Brady rather than really say

00:22:36 how he played the game the previous night,

00:22:39 then the conversation is not really that intelligent.

00:22:42 So you have to go to more reasoning elements

00:22:46 of understanding the context of the dialogue

00:22:49 and giving more appropriate responses,

00:22:51 which tells you that we are still quite far

00:22:53 because a lot of times it’s more facts being looked up

00:22:57 and something that’s close enough as an answer,

00:22:59 but not really the answer.

00:23:02 So that is where the research needs to go more

00:23:05 and actual true understanding and reasoning.

00:23:08 And that’s why I feel it’s a great way to do it

00:23:10 because you have an engaged set of users

00:23:13 working to help these AI advances happen in this case.
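The gap Prasad describes between relaying a looked-up fact and staying in the context of the dialogue can be made concrete with a toy sketch. The entity, fact, and topic logic here are hypothetical illustrations built around his own Tom Brady example:

```python
# Toy sketch of tracking entities and topic across turns so a reply
# uses dialogue context instead of dumping an unrelated fact.
# Hypothetical fact table for illustration.
FACTS = {"Tom Brady": "has won multiple Super Bowls"}

class DialogueContext:
    def __init__(self):
        self.entities = []   # entities mentioned so far, in order
        self.topic = None

    def update(self, utterance: str):
        for entity in FACTS:
            if entity in utterance:
                self.entities.append(entity)
        if "game" in utterance.lower():
            self.topic = "last_night_game"

    def respond(self) -> str:
        if not self.entities:
            return "Tell me more."
        entity = self.entities[-1]
        # Context-aware branch: stay on the game the user raised.
        if self.topic == "last_night_game":
            return f"How do you think {entity} played last night?"
        # Fallback: the "close enough but not really the answer" relay.
        return f"Fun fact: {entity} {FACTS[entity]}."

ctx = DialogueContext()
ctx.update("Did you watch the game? Tom Brady was on fire.")
print(ctx.respond())  # stays on the game rather than reciting a fact
```

Even this crude state machine shows why the problem is hard: real conversations shift entities and topics constantly, which is the reasoning burden the transcript says lookups cannot carry.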

00:23:18 You mentioned customers, they’re quite a bit,

00:23:20 and there’s a skill.

00:23:22 What is the experience for the user that’s helping?

00:23:26 So just to clarify, this isn’t, as far as I understand,

00:23:30 the Alexa, so this skill is a standalone

00:23:32 for the Alexa Prize.

00:23:33 I mean, it’s focused on the Alexa Prize.

00:23:35 It’s not you ordering certain things on Amazon.

00:23:37 Like, oh, we’re checking the weather

00:23:39 or playing Spotify, right?

00:23:40 This is a separate skill.

00:23:42 And so you’re focused on helping that,

00:23:45 I don’t know, how do people, how do customers think of it?

00:23:48 Are they having fun?

00:23:49 Are they helping teach the system?

00:23:52 What’s the experience like?

00:23:53 I think it’s both actually.

00:23:54 And let me tell you how you invoke this skill.

00:23:57 So all you have to say, Alexa, let’s chat.

00:24:00 And then the first time you say, Alexa, let’s chat,

00:24:03 it comes back with a clear message

00:24:04 that you’re interacting with one of those

00:24:06 university social bots.

00:24:08 And there’s a clear,

00:24:09 so you know exactly how you interact, right?

00:24:11 And that is why it’s very transparent.

00:24:14 You are being asked to help, right?

00:24:16 And we have a lot of mechanisms

00:24:18 where, as we are in the first feedback phase,

00:24:23 we send a lot of emails to our customers

00:24:26 and then they know that the team needs a lot of interactions

00:24:31 to improve the accuracy of the system.

00:24:33 So we know we have a lot of customers

00:24:35 who really want to help these university bots

00:24:38 and they’re conversing with that.

00:24:40 And some are just having fun with just saying,

00:24:42 Alexa, let’s chat.

00:24:44 And also some adversarial behavior to see whether,

00:24:47 how much do you understand as a social bot?

00:24:50 So I think we have a good,

00:24:51 healthy mix of all three situations.

00:24:53 So what is the,

00:24:55 if we talk about solving the Alexa challenge,

00:24:58 the Alexa prize,

00:25:00 what’s the data set of really engaging,

00:25:05 pleasant conversations look like?

00:25:07 Because if we think of this

00:25:08 as a supervised learning problem,

00:25:10 I don’t know if it has to be,

00:25:12 but if it does, maybe you can comment on that.

00:25:15 Do you think there needs to be a data set

00:25:17 of what it means to be an engaging, successful,

00:25:21 fulfilling conversation?

00:25:22 I think that’s part of the research question here.

00:25:24 This was, I think, we at least got the first part right,

00:25:29 which is have a way for universities to build

00:25:33 and test in a real world setting.

00:25:35 Now you’re asking in terms of the next phase of questions,

00:25:38 which we are still, we’re also asking, by the way,

00:25:41 what does success look like from an optimization function?

00:25:45 That’s what you’re asking in terms of,

00:25:47 we as researchers are used to having a great corpus

00:25:49 of annotated data and then

00:25:53 sort of tuning our algorithms on those, right?

00:25:57 And fortunately and unfortunately,

00:26:00 in this world of Alexa prize,

00:26:02 that is not the way we are going after it.

00:26:05 So you have to focus more on learning

00:26:07 based on live feedback.

00:26:10 That is another element that’s unique,

00:26:12 where just not to,

00:26:15 I started with giving you how you ingress

00:26:17 and experience this capability as a customer.

00:26:21 What happens when you’re done?

00:26:23 So they ask you a simple question on a scale of one to five,

00:26:27 how likely are you to interact with this social bot again?

00:26:31 That is a good feedback

00:26:33 and customers can also leave more open ended feedback.

00:26:37 And I think partly that to me

00:26:40 is one part of the question you’re asking,

00:26:42 which I’m saying is a mental model shift

00:26:44 that as researchers also,

00:26:47 you have to change your mindset

00:26:48 that this is not a DARPA evaluation or NSF funded study

00:26:52 and you have a nice corpus.

00:26:54 This is where it’s real world.

00:26:56 You have real data.

00:26:58 The scale is amazing and that’s a beautiful thing.

00:27:01 And then the customer,

00:27:02 the user can quit the conversation at any time.

00:27:06 Exactly, the user can,

00:27:07 that is also a signal for how good you were at that point.

00:27:11 So, and then on a scale one to five, one to three,

00:27:15 do they say how likely are you

00:27:16 or is it just a binary?

00:27:18 One to five.

00:27:18 One to five.

00:27:20 Wow, okay, that’s such a beautifully constructed challenge.

00:27:22 Okay.
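The live signals described above, a one-to-five "would you interact with this social bot again?" rating plus the fact that a user can quit at any time, can be folded into a single score for comparing bots. The 0.7/0.3 weighting and the turn threshold below are hypothetical choices for illustration, not the actual Alexa Prize formula:

```python
# Sketch of combining live feedback signals into one bot score:
# explicit 1-to-5 ratings plus early quits as an implicit signal.
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Session:
    rating: Optional[int]   # 1..5, or None if the user left no rating
    turns: int              # dialogue turns before the user quit

def bot_score(sessions: List[Session], min_turns: int = 10) -> float:
    rated = [s.rating for s in sessions if s.rating is not None]
    avg_rating = sum(rated) / len(rated) if rated else 0.0
    # An early quit is itself feedback: count sessions that lasted
    # past a minimal number of turns.
    stay_rate = sum(s.turns >= min_turns for s in sessions) / len(sessions)
    # Hypothetical weighting of explicit vs implicit signal.
    return 0.7 * (avg_rating / 5.0) + 0.3 * stay_rate

sessions = [Session(5, 30), Session(3, 12), Session(None, 2)]
print(round(bot_score(sessions), 2))  # prints 0.76
```

This is the mental model shift Prasad points to: instead of tuning against a fixed annotated corpus, the objective itself is assembled from noisy, real-world interaction signals.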

00:27:24 You said the only way to make a smart assistant really smart

00:27:30 is to give it eyes and let it explore the world.

00:27:34 I’m not sure it might’ve been taken out of context,

00:27:36 but can you comment on that?

00:27:38 Can you elaborate on that idea?

00:27:40 Is that I personally also find that idea super exciting

00:27:43 from a social robotics, personal robotics perspective.

00:27:46 Yeah, a lot of things do get taken out of context.

00:27:48 This particular one was just

00:27:50 a philosophical discussion we were having

00:27:53 in terms of what does intelligence look like?

00:27:55 And the context was in terms of learning,

00:27:59 I think just we said we as humans are empowered

00:28:03 with many different sensory abilities.

00:28:05 I do believe that eyes are an important aspect of it

00:28:09 in terms of if you think about how we as humans learn,

00:28:14 it is quite complex and it’s also not unimodal

00:28:18 that you are fed a ton of text or audio

00:28:22 and you just learn that way.

00:28:23 No, you learn by experience, you learn by seeing,

00:28:27 you’re taught by humans

00:28:30 and we are very efficient in how we learn.

00:28:33 Machines on the contrary are very inefficient

00:28:35 on how they learn, especially these AIs.

00:28:38 I think the next wave of research is going to be

00:28:42 with less data,

00:28:46 not just with less human labeled data,

00:28:48 but also with a lot of weak supervision

00:28:51 and where you can increase the learning rate.

00:28:55 I don’t mean less data

00:28:56 in terms of not having a lot of data to learn from

00:28:58 that we are generating so much data,

00:29:00 but it is more about from an aspect

00:29:02 of how fast can you learn?

00:29:04 So improving the quality of the data,

00:29:07 the quality of data and the learning process.

00:29:09 I think more on the learning process.

00:29:11 I think we have to, we as humans learn

00:29:13 with a lot of noisy data, right?

00:29:15 And I think that’s the part

00:29:18 that I don’t think should change.

00:29:21 What should change is how we learn, right?

00:29:23 So if you look at, you mentioned supervised learning,

00:29:26 we are making transformative shifts

00:29:27 from moving to more unsupervised, more weak supervision.

00:29:31 Those are the key aspects of how to learn.
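The shift described above, from fully supervised labels toward weak supervision, can be pictured with a toy "labeling functions" sketch. This is purely an editor's illustration of the general technique, not how any Alexa pipeline actually works.

```python
# Toy weak-supervision sketch: instead of hand-labeling every utterance,
# several cheap, noisy "labeling functions" vote, and the majority vote
# becomes a training label. Purely illustrative; production systems
# model each function's accuracy rather than counting raw votes.
from collections import Counter

ABSTAIN = None

def lf_music(utterance):
    return "music" if any(w in utterance for w in ("play", "song", "music")) else ABSTAIN

def lf_weather_words(utterance):
    return "weather" if any(w in utterance for w in ("rain", "weather", "forecast")) else ABSTAIN

def lf_weather_question(utterance):
    return "weather" if utterance.startswith("will it") else ABSTAIN

def weak_label(utterance, lfs):
    votes = [lab for lf in lfs if (lab := lf(utterance)) is not ABSTAIN]
    return Counter(votes).most_common(1)[0][0] if votes else ABSTAIN

lfs = [lf_music, lf_weather_words, lf_weather_question]
print(weak_label("will it rain tomorrow", lfs))  # two weather votes
print(weak_label("play my morning song", lfs))   # one music vote
```

The point of the sketch is the trade: each function is noisy, but many cheap noisy signals can replace expensive hand annotation.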

00:29:34 And I think in that setting, I hope you agree with me

00:29:37 that having other senses is very crucial

00:29:41 in terms of how you learn.

00:29:43 So absolutely.

00:29:44 And from a machine learning perspective,

00:29:46 which I hope we get a chance to talk about, a few aspects

00:29:49 that are fascinating there,

00:29:51 but to stick on the point of sort of a body,

00:29:55 an embodiment.

00:29:56 So Alexa has a body.

00:29:57 It has a very minimalistic, beautiful interface

00:30:01 where there’s a ring and so on.

00:30:02 I mean, I’m not sure of all the flavors

00:30:04 of the devices that Alexa lives on,

00:30:07 but there’s a minimalistic basic interface.

00:30:13 And nevertheless, we humans, so I have a Roomba,

00:30:15 I have all kinds of robots all over everywhere.

00:30:18 So what do you think the Alexa of the future looks like

00:30:24 if it begins to shift what its body looks like?

00:30:29 Maybe beyond the Alexa,

00:30:30 what do you think are the different devices in the home

00:30:33 as they start to embody their intelligence more and more?

00:30:36 What do you think that looks like?

00:30:38 Philosophically, a future, what do you think that looks like?

00:30:41 I think let’s look at what’s happening today.

00:30:43 You mentioned, I think, our devices as in Amazon devices,

00:30:46 but I also wanted to point out Alexa is already integrated

00:30:49 into a lot of third party devices,

00:30:51 which also come in lots of forms and shapes,

00:30:54 some in robots, some in microwaves,

00:30:58 some in appliances that you use in everyday life.

00:31:02 So I think it’s not just the shape Alexa takes

00:31:07 in terms of form factors,

00:31:09 but it’s also all the places it’s available.

00:31:13 And it’s getting in cars,

00:31:14 it’s getting in different appliances in homes,

00:31:16 even toothbrushes, right?

00:31:18 So I think you have to think about it

00:31:20 as not a physical assistant.

00:31:25 It will be in some embodiment, as you said,

00:31:28 we already have these nice devices,

00:31:31 but I think it’s also important to think of it,

00:31:33 it is a virtual assistant.

00:31:35 It is superhuman in the sense that it is in multiple places

00:31:38 at the same time.

00:31:40 So I think the actual embodiment in some sense,

00:31:45 to me doesn’t matter.

00:31:47 I think you have to think of it not as human like

00:31:52 and more of what its capabilities are

00:31:56 that derive a lot of benefit for customers

00:31:58 and how there are different ways to delight

00:32:00 customers with different experiences.

00:32:03 And I think I’m a big fan of it not being just human like,

00:32:09 it should be human like in certain situations.

00:32:11 Alexa Prize social bot in terms of conversation

00:32:13 is a great way to look at it,

00:32:14 but there are other scenarios where human like,

00:32:18 I think is underselling the abilities of this AI.

00:32:22 So if I could trivialize what we’re talking about.

00:32:26 So if you look at the way Steve Jobs thought

00:32:29 about the interaction with the device that Apple produced,

00:32:33 there was an extreme focus on controlling the experience

00:32:36 by making sure there’s only this Apple produced devices.

00:32:40 You see the voice of Alexa taking all kinds of forms

00:32:45 depending on what the customers want.

00:32:47 And that means it could be anywhere

00:32:49 from the microwave to a vacuum cleaner to the home

00:32:53 and so on. The voice is the essential element

00:32:56 of the interaction.

00:32:57 I think voice is an essence, it’s not all,

00:33:01 but it’s a key aspect.

00:33:02 I think to your question in terms of,

00:33:05 you should be able to recognize Alexa

00:33:08 and that’s a huge problem.

00:33:10 I think in terms of a huge scientific problem,

00:33:12 I should say like, what are the traits?

00:33:13 What makes it look like Alexa,

00:33:16 especially in different settings

00:33:17 and especially if it’s primarily voice, what it is,

00:33:20 but Alexa is not just voice either, right?

00:33:22 I mean, we have devices with a screen.

00:33:25 Now you’re seeing just other behaviors of Alexa.

00:33:28 So I think we’re in very early stages of what that means

00:33:31 and this will be an important topic for the following years.

00:33:34 But I do believe that being able to recognize

00:33:38 and tell when it’s Alexa versus it’s not

00:33:40 is going to be important from an Alexa perspective.

00:33:43 I’m not speaking for the entire AI community,

00:33:46 but I think attribution, as we go more into

00:33:51 understanding who did what,

00:33:54 that identity of the AI is crucial in the coming world.

00:33:58 I think from the broad AI community perspective,

00:34:00 that’s also a fascinating problem.

00:34:02 So basically if I close my eyes and listen to the voice,

00:34:05 what would it take for me to recognize that this is Alexa?

00:34:08 Exactly.

00:34:08 Or at least the Alexa that I’ve come to know

00:34:10 from my personal experience in my home

00:34:13 through my interactions that come through.

00:34:14 Yeah, and the Alexa here in the US is very different

00:34:16 than Alexa in UK and the Alexa in India,

00:34:19 even though they are all speaking English

00:34:21 or the Australian version.

00:34:23 So again, so now think about when you go

00:34:26 into a different culture, a different community,

00:34:28 but you travel there, how do you recognize Alexa?

00:34:31 I think these are super hard questions actually.

00:34:34 So there’s a team that works on personality.

00:34:36 So if we talk about those different flavors

00:34:39 of what it means culturally speaking,

00:34:41 India, UK, US, what does it mean to adapt?

00:34:44 So the problem that we just stated,

00:34:46 it’s just fascinating, how do we make it purely recognizable

00:34:51 that it’s Alexa, assuming that the qualities

00:34:55 of the voice are not sufficient?

00:34:58 It’s also the content of what is being said.

00:35:01 How do we do that?

00:35:02 How does the personality come into play?

00:35:04 What’s that research gonna look like?

00:35:06 I mean, it’s such a fascinating subject.

00:35:08 We have some very fascinating folks

00:35:11 who from both the UX background and human factors

00:35:13 are looking at these aspects and these exact questions.

00:35:16 But I’ll definitely say it’s not just how it sounds,

00:35:21 the choice of words, the tone, not just, I mean,

00:35:25 the voice identity of it, but the tone matters,

00:35:28 the speed matters, how you speak,

00:35:30 how you enunciate words, what choice of words

00:35:34 are you using, how terse are you,

00:35:37 or how lengthy in your explanations you are.

00:35:40 All of these are factors.

00:35:42 And you also, you mentioned something crucial

00:35:45 that you may have personalized Alexa,

00:35:49 to some extent in your homes

00:35:51 or in the devices you are interacting with.

00:35:53 So you, as an individual, how you prefer Alexa sounds

00:35:59 can be different than how I prefer.

00:36:01 And the amount of customizability you want to give

00:36:04 is also a key debate we always have.

00:36:07 But I do want to point out it’s more than the voice actor

00:36:10 that recorded and it sounds like that actor.

00:36:14 It is more about the choices of words,

00:36:16 the attributes of tonality, the volume

00:36:19 in terms of how you raise your pitch and so forth.

00:36:22 All of that matters.

00:36:23 This is such a fascinating problem

00:36:25 from a product perspective.

00:36:27 I could see those debates just happening

00:36:29 inside of the Alexa team of how much personalization

00:36:32 do you do for the specific customer?

00:36:34 Because you’re taking a risk if you over personalize.

00:36:38 Because you don’t, if you create a personality

00:36:42 for a million people, you can test that better.

00:36:46 You can create a rich, fulfilling experience

00:36:48 that will do well.

00:36:50 But the more you personalize it, the less you can test it,

00:36:53 the less you can know that it’s a great experience.

00:36:56 So how much personalization, what’s the right balance?

00:36:59 I think the right balance depends on the customer.

00:37:01 Give them the control.

00:37:02 So I’ll say, I think the more control you give customers,

00:37:07 the better it is for everyone.

00:37:09 And I’ll give you some key personalization features.

00:37:13 I think we have a feature called Remember This,

00:37:15 which is where you can tell Alexa to remember something.

00:37:19 There you have an explicit sort of control

00:37:23 in customer’s hand because they have to say,

00:37:24 Alexa, remember X, Y, Z.

00:37:26 What kind of things would that be used for?

00:37:28 So you can, like, I have stored my tire specs

00:37:32 for my car because it’s so hard to go and find

00:37:34 and see what it is, right?

00:37:36 When you’re having some issues.

00:37:38 I store my mileage plan numbers

00:37:41 for all the frequent flyer ones

00:37:43 where I’m sometimes just looking at it and it’s not handy.

00:37:46 So those are my own personal choices I’ve made

00:37:49 for Alexa to remember something on my behalf, right?

00:37:52 So again, I think the choice was to be explicit

00:37:56 about how you provide that to a customer as a control.

00:38:00 So I think these are the aspects of what you do.

00:38:03 Like think about where we can use speaker recognition

00:38:07 capabilities, that if you taught Alexa

00:38:11 that you are Lex and this person in your household

00:38:14 is person two, then you can personalize the experiences.

00:38:17 Again, in these, the CX, the customer experience patterns,

00:38:22 are very clear and transparent about

00:38:26 when a personalization action is happening.

00:38:30 And then you have other ways like you go

00:38:32 through explicit control right now through your app

00:38:34 that your multiple service providers,

00:38:36 let’s say for music, which one is your preferred one.

00:38:39 So when you say, play Sting, depending on

00:38:42 whether you have preferred Spotify or Amazon Music

00:38:44 or Apple Music, the decision is made

00:38:47 where to play it from.
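The routing just described, resolving "play Sting" against whichever music provider the customer prefers, amounts to a simple preference lookup with a fallback. The names below are illustrative, an editor's sketch rather than Alexa's real configuration model.

```python
# Toy sketch of preferred-provider routing: app settings store a
# default provider per service category, and a request like "play
# Sting" is dispatched accordingly. All names are hypothetical.
DEFAULT_PROVIDERS = {"music": "amazon_music"}

def route_request(category, query, preferences):
    """Pick the provider for a request, falling back to a default."""
    provider = preferences.get(category, DEFAULT_PROVIDERS.get(category))
    return {"provider": provider, "query": query}

# A customer who set Spotify as their preferred music service:
print(route_request("music", "sting", {"music": "spotify"}))
# A customer with no explicit preference falls back to the default:
print(route_request("music", "sting", {}))
```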

00:38:49 So what’s Alexa’s backstory from her perspective?

00:38:52 Is there, I remember just asking as probably a lot

00:38:58 of us are just the basic questions about love

00:39:00 and so on of Alexa, just to see what the answer would be.

00:39:03 It feels like there’s a little bit of a personality

00:39:10 but not too much.

00:39:12 Does Alexa have a metaphysical presence

00:39:18 in this human universe we live in

00:39:21 or is it something more ambiguous?

00:39:23 Is there a past?

00:39:25 Is there a birth?

00:39:26 Is there a family kind of idea

00:39:28 even for joking purposes and so on?

00:39:31 I think, well, it does tell you,

00:39:34 I should double check this, but if you said,

00:39:36 when were you born, I think we do respond.

00:39:39 I need to double check that

00:39:40 but I’m pretty positive about it.

00:39:41 I think you do actually because I think I’ve tested that.

00:39:44 But that’s like how I was born in your brand of champagne

00:39:49 and whatever the year kind of thing, yeah.

00:39:51 So in terms of the metaphysical, I think it’s early.

00:39:55 Does it have the historic knowledge about herself

00:40:00 to be able to do that?

00:40:01 Maybe, have we crossed that boundary?

00:40:03 Not yet, right?

00:40:04 In terms of being, thank you.

00:40:06 Have we thought about it quite a bit

00:40:08 but I wouldn’t say that we have come to a clear decision

00:40:11 in terms of what it should look like.

00:40:13 But you can imagine though, and I bring this back

00:40:16 to the Alexa Prize social bot one,

00:40:19 there you will start seeing some of that.

00:40:21 Like these bots have their identity

00:40:23 and in terms of that, you may find,

00:40:26 this is such a great research topic

00:40:28 that some academia team may think of these problems

00:40:32 and start solving them too.

00:40:35 So let me ask a question.

00:40:38 It’s kind of difficult, I think,

00:40:41 but it feels, and fascinating to me

00:40:43 because I’m fascinated with psychology.

00:40:45 It feels that the more personality you have,

00:40:48 the more dangerous it is

00:40:50 in terms of a customer perspective of product.

00:40:54 If you want to create a product that’s useful.

00:40:57 By dangerous, I mean creating an experience that upsets me.

00:41:02 And so how do you get that right?

00:41:06 Because if you look at the relationships,

00:41:10 maybe I’m just a screwed up Russian,

00:41:11 but if you look at the human to human relationship,

00:41:15 some of our deepest relationships have fights,

00:41:18 have tension, have the push and pull,

00:41:21 have a little flavor in them.

00:41:22 Do you want to have such flavor in an interaction with Alexa?

00:41:26 How do you think about that?

00:41:28 So there’s one other common thing that you didn’t say,

00:41:31 but we think of it as paramount for any deep relationship.

00:41:35 That’s trust.

00:41:36 Trust, yeah.

00:41:37 So I think if you trust every attribute you said,

00:41:40 a fight, some tension, is all healthy.

00:41:44 But what is sort of non-negotiable in this instance is trust.

00:41:49 And I think the bar to earn customer trust for AI

00:41:52 is very high, in some sense, more than a human.

00:41:56 It’s not just about personal information or your data.

00:42:02 It’s also about your actions on a daily basis.

00:42:05 How trustworthy are you in terms of consistency,

00:42:07 in terms of how accurate are you in understanding me?

00:42:11 Like if you’re talking to a person on the phone,

00:42:13 if you have a problem with your,

00:42:14 let’s say your internet or something,

00:42:16 if the person’s not understanding,

00:42:17 you lose trust right away.

00:42:19 You don’t want to talk to that person.

00:42:20 That whole example gets amplified by a factor of 10,

00:42:24 because when you’re a human interacting with an AI,

00:42:28 you have a certain expectation.

00:42:29 Either you expect it to be very intelligent

00:42:31 and then you get upset, why is it behaving this way?

00:42:34 Or you expect it to be not so intelligent

00:42:37 and when it surprises you, you’re like,

00:42:38 really, you’re trying to be too smart?

00:42:40 So I think we grapple with these hard questions as well.

00:42:43 But I think the key is actions need to be trustworthy.

00:42:47 From these AIs, not just about data protection,

00:42:50 your personal information protection,

00:42:53 but also from how accurately it accomplishes

00:42:57 all commands or all interactions.

00:42:59 Well, it’s tough to hear because trust,

00:43:02 you’re absolutely right,

00:43:03 but trust is such a high bar with AI systems

00:43:05 because people, and I see this

00:43:07 because I work with autonomous vehicles.

00:43:08 I mean, the bar that’s placed on AI systems

00:43:11 is unreasonably high.

00:43:13 Yeah, that is going to be, I agree with you.

00:43:16 And I think of it as it’s a challenge

00:43:19 and it’s also keeps my job, right?

00:43:23 So from that perspective, I totally,

00:43:26 I think of it at both sides as a customer

00:43:28 and as a researcher.

00:43:30 I think as a researcher, yes, occasionally it will frustrate

00:43:33 me that why is the bar so high for these AIs?

00:43:36 And as a customer, then I say,

00:43:38 absolutely, it has to be that high, right?

00:43:40 So I think that’s the trade off we have to balance,

00:43:44 but it doesn’t change the fundamentals.

00:43:46 That trust has to be earned and the question then becomes

00:43:50 is are we holding the AIs to a different bar

00:43:53 in accuracy and mistakes than we hold humans?

00:43:56 That’s going to be a great societal question

00:43:58 for years to come, I think for us.

00:44:00 Well, one of the questions that we grapple with as a society now,

00:44:04 that I think about a lot,

00:44:05 that a lot of people in the AI community think about a lot,

00:44:07 and that Alexa is taking on head on, is privacy.

00:44:11 The reality is us giving over data to any AI system

00:44:20 can be used to enrich our lives in profound ways.

00:44:25 So basically, for any product that does anything awesome

00:44:28 for you, the more data it has,

00:44:31 the more awesome things it can do.

00:44:34 And yet on the other side,

00:44:37 people imagine the worst case possible scenario

00:44:39 of what can you possibly do with that data?

00:44:42 It goes down to trust, as you said before.

00:44:45 There’s a fundamental distrust of,

00:44:48 in certain groups of governments and so on.

00:44:50 And depending on the government,

00:44:51 depending on who’s in power,

00:44:52 depending on all these kinds of factors.

00:44:55 And so here’s Alexa in the middle of all of it in the home,

00:44:59 trying to do good things for the customers.

00:45:02 So how do you think about privacy in this context,

00:45:05 the smart assistance in the home?

00:45:06 How do you maintain, how do you earn trust?

00:45:08 Absolutely.

00:45:09 So as you said, trust is the key here.

00:45:12 So you start with trust

00:45:13 and then privacy is a key aspect of it.

00:45:16 It has to be designed from very beginning about that.

00:45:20 And we believe in two fundamental principles.

00:45:23 One is transparency and second is control.

00:45:26 So by transparency, I mean,

00:45:28 when we build what is now called smart speaker

00:45:32 or the first echo,

00:45:34 we were quite judicious about making these right trade offs

00:45:38 on customer’s behalf,

00:45:40 that it is pretty clear

00:45:41 when the audio is being sent to cloud,

00:45:44 the light ring comes on

00:45:45 when it has heard you say the word wake word,

00:45:48 and then the streaming happens, right?

00:45:49 So when the light ring comes up,

00:45:51 we also had, we put a physical mute button on it,

00:45:55 just so if you didn’t want it to be listening,

00:45:57 even for the wake word,

00:45:58 then you turn the power button or the mute button on,

00:46:01 and that disables the microphones.
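The transparency and control behavior just described, mute disables the microphones entirely, and audio streams to the cloud only after the wake word lights the ring, can be summarized as a small state machine. This is an editor's illustrative sketch, not the actual Echo firmware.

```python
# Illustrative state machine for the behavior described: the mute
# button disables the microphones, and audio streams to the cloud
# (light ring on) only after the wake word is heard locally.
class EchoDeviceSketch:
    def __init__(self):
        self.muted = False
        self.streaming = False  # True only between wake word and end of request

    def press_mute(self):
        self.muted = not self.muted
        if self.muted:
            self.streaming = False  # mics off: nothing can be sent

    def hear(self, utterance):
        """All audio is handled locally; stream only on the wake word."""
        if not self.muted and utterance.lower().startswith("alexa"):
            self.streaming = True  # light ring on, audio goes to cloud
        return self.streaming

    def end_of_request(self):
        self.streaming = False  # light ring off

d = EchoDeviceSketch()
print(d.hear("what time is it"))        # False: no wake word, stays local
print(d.hear("alexa what time is it"))  # True: wake word, streaming begins
d.end_of_request()
d.press_mute()
print(d.hear("alexa hello"))            # False: muted, mics disabled
```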

00:46:04 That’s just the first decision on essentially transparency

00:46:08 and control.

00:46:09 Oh, then even when we launched,

00:46:11 we gave the control in the hands of the customers

00:46:13 that you can go and look at any of your individual utterances

00:46:16 that are recorded and delete them anytime.

00:46:19 And we’ve held true to that promise, right?

00:46:22 So, and that is super, again,

00:46:25 a great instance of showing how you have the control.

00:46:29 Then we made it even easier.

00:46:30 You can say, like I said, delete what I said today.

00:46:33 So that is now making it even just more control

00:46:36 in your hands with what’s most convenient

00:46:39 about this technology is voice.

00:46:42 You delete it with your voice now.

00:46:44 So these are the types of decisions we continually make.

00:46:48 We just recently launched this feature called,

00:46:51 what we think of it as,

00:46:52 if you wanted humans not to review your data,

00:46:56 because you’ve mentioned supervised learning, right?

00:46:59 So in supervised learning,

00:47:01 humans have to give some annotation.

00:47:03 And that also is now a feature

00:47:06 where you can essentially, if you’ve selected that flag,

00:47:09 your data will not be reviewed by a human.

00:47:11 So these are the types of controls

00:47:13 that we have to constantly offer to customers.

00:47:18 So why do you think it bothers people so much that,

00:47:23 so everything you just said is really powerful.

00:47:26 So the control, the ability to delete,

00:47:28 cause we collect, we have studies here running at MIT

00:47:31 that collects huge amounts of data

00:47:32 and people consent and so on.

00:47:34 The ability to delete that data is really empowering

00:47:38 and almost nobody ever asked to delete it,

00:47:40 but the ability to have that control is really powerful.

00:47:44 But still, there’s this popular

00:47:47 anecdotal evidence that people say,

00:47:49 they like to tell that,

00:47:51 them and a friend were talking about something,

00:47:53 I don’t know, sweaters for cats.

00:47:56 And all of a sudden they’ll have advertisements

00:47:58 for cat sweaters on Amazon.

00:48:01 That’s a popular anecdote

00:48:02 as if something is always listening.

00:48:05 What, can you explain that anecdote,

00:48:07 that experience that people have?

00:48:09 What’s the psychology of that?

00:48:11 What’s that experience?

00:48:13 And can you, you’ve answered it,

00:48:15 but let me just ask, is Alexa listening?

00:48:18 No, Alexa listens only for the wake word on the device.

00:48:22 And the wake word is?

00:48:23 The words like Alexa, Amazon, Echo,

00:48:28 but you only choose one at a time.

00:48:29 So you choose one and it listens only

00:48:31 for that on our devices.

00:48:34 So that’s first.

00:48:35 From a listening perspective,

00:48:36 we have to be very clear that it’s just the wake word.

00:48:38 So you said, why is there this anxiety, if you may?

00:48:41 Yeah, exactly.

00:48:42 It’s because there’s a lot of confusion,

00:48:43 what it really listens to, right?

00:48:45 And I think it’s partly on us to keep educating

00:48:49 our customers and the general media more

00:48:52 in terms of like how, what really happens.

00:48:54 And we’ve done a lot of it.

00:48:56 And our pages on information are clear,

00:49:00 but still people have to have more,

00:49:04 there’s always a hunger for information and clarity.

00:49:06 And we’ll constantly look at how best to communicate.

00:49:09 If you go back and read everything,

00:49:10 yes, it states exactly that.

00:49:13 And then people could still question it.

00:49:15 And I think that’s absolutely okay to question.

00:49:17 What we have to make sure is that we are,

00:49:21 because our fundamental philosophy is customer first,

00:49:24 customer obsession is our leadership principle.

00:49:27 If you put, as researchers, I put myself

00:49:31 in the shoes of the customer,

00:49:33 and all decisions in Amazon are made with that.

00:49:35 And trust has to be earned,

00:49:38 and we have to keep earning the trust

00:49:39 of our customers in this setting.

00:49:41 And to your other point on like,

00:49:44 is there something showing up

00:49:45 based on your conversations?

00:49:46 No, I think the answer is like,

00:49:49 a lot of times when those experiences happen,

00:49:51 you have to also know that, okay,

00:49:52 it may be a winter season,

00:49:54 people are looking for sweaters, right?

00:49:56 And it shows up on your amazon.com because it is popular.

00:49:59 So there are many of these,

00:50:02 you mentioned that personality or personalization,

00:50:06 turns out we are not that unique either, right?

00:50:09 So those things we as humans start thinking,

00:50:12 oh, must be because something was heard,

00:50:14 and that’s why this other thing showed up.

00:50:16 The answer is no,

00:50:17 probably it is just the season for sweaters.

00:50:21 I’m not gonna ask you this question

00:50:23 because people have so much paranoia.

00:50:27 But let me just say from my perspective,

00:50:29 I hope there’s a day when a customer can ask Alexa

00:50:33 to listen all the time,

00:50:35 to improve the experience,

00:50:36 to improve because I personally don’t see the negative

00:50:40 because if you have the control and if you have the trust,

00:50:43 there’s no reason why I shouldn’t be listening

00:50:45 all the time to the conversations to learn more about you.

00:50:48 Because ultimately,

00:50:49 as long as you have control and trust,

00:50:52 every piece of data you provide to the device,

00:50:55 that the device wants, is going to be useful.

00:51:00 And so to me, as a machine learning person,

00:51:03 I think it worries me how sensitive people are

00:51:08 about their data relative to how empowering it could be

00:51:19 for the devices around them,

00:51:21 how enriching it could be for their own life

00:51:23 to improve the product.

00:51:25 So I just, it’s something I think about sort of a lot,

00:51:28 how do we make that work with devices,

00:51:29 obviously the Alexa team thinks about it a lot as well.

00:51:32 I don’t know if you wanna comment on that,

00:51:34 sort of, okay, have you seen,

00:51:35 let me ask it in the form of a question, okay.

00:51:38 Have you seen an evolution in the way people think about

00:51:42 their private data in the previous several years?

00:51:46 So as we as a society get more and more comfortable

00:51:48 with the benefits we get by sharing more data.

00:51:53 First, let me answer that part

00:51:55 and then I’ll wanna go back

00:51:55 to the other aspect you were mentioning.

00:51:58 So as a society, in general,

00:52:01 we are getting more comfortable.

00:52:03 Doesn’t mean that everyone is,

00:52:05 and I think we have to respect that.

00:52:07 I don’t think one size fits all

00:52:10 is always gonna be the answer for all, right?

00:52:13 By definition.

00:52:14 So I think that’s something to keep in mind in these.

00:52:17 Going back to your point on what more

00:52:21 magical experiences can be launched

00:52:23 in these kinds of AI settings.

00:52:26 I think again, if you give the control,

00:52:29 it’s possible for certain parts of it.

00:52:32 So we have a feature called follow up mode

00:52:33 where you, if you turn it on

00:52:37 and Alexa, after you’ve spoken to it,

00:52:40 will open the mics again,

00:52:42 thinking you will answer something again.

00:52:44 Like if you’re adding items to your shopping list,

00:52:47 right, or a to do list,

00:52:50 you’re not done.

00:52:51 You want to keep, so in that setting,

00:52:53 it’s awesome that it opens the mic

00:52:54 for you to say eggs and milk and then bread, right?
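The follow-up-mode interaction just described, the mic reopening after each turn so the customer can keep saying "eggs", "milk", "bread" without repeating the wake word, can be caricatured as a short loop. The helper below is hypothetical, an editor's sketch rather than the real Alexa API.

```python
# Illustrative follow-up mode: after Alexa responds, the mic reopens so
# the customer can keep adding items without repeating the wake word;
# silence (empty turn) or "stop" ends the session. Hypothetical helper.
def follow_up_session(utterances, stop_words=("stop", "")):
    shopping_list = []
    for u in utterances:  # mic reopens after each turn
        if u.strip().lower() in stop_words:
            break  # silence or "stop" closes the mic
        shopping_list.append(u.strip())
    return shopping_list

print(follow_up_session(["eggs", "milk", "bread", ""]))  # trailing silence ends it
```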

00:52:57 So these are the kinds of things which you can empower.

00:52:59 So, and then another feature we have,

00:53:02 which is called Alexa Guard.

00:53:04 I said it only listens for the wake word, right?

00:53:07 But if you have, let’s say you’re going to say,

00:53:10 like you leave your home and you want Alexa to listen

00:53:13 for a couple of sound events like smoke alarm going off

00:53:17 or someone breaking your glass, right?

00:53:19 So it’s like just to keep your peace of mind.

00:53:22 So you can say Alexa on guard or I’m away

00:53:26 and then it can be listening for these sound events.

00:53:29 And when you’re home, you come out of that mode, right?

00:53:33 So this is another one where you again gave controls

00:53:35 in the hands of the user or the customer

00:53:38 and to enable some experience that is high utility

00:53:42 and maybe even more delightful in the certain settings

00:53:44 like follow up mode and so forth.

00:53:46 And again, this general principle is the same,

00:53:48 control in the hands of the customer.

00:53:52 So I know we kind of started with a lot of philosophy

00:53:55 and a lot of interesting topics

00:53:56 and we’re just jumping all over the place,

00:53:58 but really some of the fascinating things

00:54:00 that the Alexa team and Amazon is doing

00:54:03 is in the algorithm side, the data side,

00:54:05 the technology, the deep learning, machine learning

00:54:07 and so on.

00:54:08 So can you give a brief history of Alexa

00:54:13 from the perspective of just innovation,

00:54:15 the algorithms, the data of how it was born,

00:54:18 how it came to be, how it’s grown, where it is today?

00:54:22 Yeah, it starts with, in Amazon,

00:54:24 everything starts with the customer

00:54:27 and we have a process called working backwards.

00:54:30 Alexa, and more specifically the product Echo,

00:54:35 there was a working backwards document essentially

00:54:37 that reflected what it would be,

00:54:38 started with a very simple vision statement for instance

00:54:44 that morphed into a full fledged document

00:54:47 along the way changed into what all it can do, right?

00:54:51 But the inspiration was the Star Trek computer.

00:54:54 So when you think of it that way,

00:54:56 everything is possible, but when you launch a product,

00:54:58 you have to start with some place.

00:55:01 And when I joined, the product was already in conception

00:55:05 and we started working on the far field speech recognition

00:55:08 because that was the first thing to solve.

00:55:10 By that we mean that you should be able to speak

00:55:12 to the device from a distance.

00:55:15 And in those days, that wasn’t a common practice.

00:55:18 And even in the previous research world I was in,

00:55:22 it was considered an unsolvable problem then,

00:55:24 in terms of whether you can converse from a distance.

00:55:28 And here I’m still talking about the first part

00:55:30 of the problem where you say,

00:55:32 get the attention of the device

00:55:34 as in by saying what we call the wake word,

00:55:37 which means the word Alexa has to be detected

00:55:40 with a very high accuracy because it is a very common word.

00:55:44 It has sound units that map with words like I like you

00:55:48 or Alec, Alex, right?

00:55:51 So it’s an undoubtedly hard problem to detect

00:55:56 the right mentions of Alexa addressed to the device

00:56:00 versus I like Alexa.

00:56:02 So you have to pick up that signal

00:56:04 when there’s a lot of noise.

00:56:06 Not only noise but a lot of conversation in the house,

00:56:09 right?

00:56:09 You remember on the device,

00:56:10 you’re simply listening for the wake word, Alexa.

00:56:13 And there’s a lot of words being spoken in the house.

00:56:15 How do you know it’s Alexa and directed at Alexa?

00:56:21 Because I could say, I love my Alexa, I hate my Alexa.

00:56:25 I want Alexa to do this.

00:56:27 And in all these three sentences, I said, Alexa,

00:56:29 I didn’t want it to wake up.
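As an aside for the reader: the wake-word problem described here is essentially keyword spotting under a strict false-accept budget. Below is a minimal illustrative sketch of the thresholding idea only — the scores, threshold, and frame counts are invented, and the production detector is of course a neural model over audio features, not this toy logic:

```python
def detect_wake_word(frame_scores, threshold=0.85, min_consecutive=3):
    """Flag a wake-word detection when the model's per-frame posterior
    for 'Alexa' stays above a high threshold for several frames in a row.
    A high threshold keeps false accepts low for confusable audio
    ('I like...', 'Alec', 'I love my Alexa')."""
    run = 0
    for i, p in enumerate(frame_scores):
        run = run + 1 if p >= threshold else 0
        if run >= min_consecutive:
            return i  # frame index where the detection fires
    return None

# Simulated posteriors: background chatter, then a clear 'Alexa'.
scores = [0.1, 0.2, 0.15, 0.9, 0.92, 0.95, 0.3]
print(detect_wake_word(scores))  # fires at index 5
print(detect_wake_word([0.1] * 10))  # None: chatter never crosses the bar
```

The consecutive-frame requirement is one common way to trade a little latency for far fewer spurious wake-ups.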

00:56:32 Can I just pause on that for a second?

00:56:33 What would be your advice, that I should probably

00:56:36 in the introduction of this conversation give to people

00:56:39 in terms of them turning off their Alexa device

00:56:43 if they’re listening to this podcast conversation out loud?

00:56:49 Like what’s the probability that an Alexa device

00:56:51 will go off because we mentioned Alexa like a million times.

00:56:55 So it will, we have done a lot of different things

00:56:58 where we can figure out, at the device,

00:57:03 whether the speech is coming from a human versus over the air.

00:57:08 Also, I mean, think about ads.

00:57:11 So we have also launched a technology

00:57:14 for watermarking kind of approaches

00:57:16 in terms of filtering it out.

00:57:18 But yes, if this kind of a podcast is happening,

00:57:21 it’s possible your device will wake up a few times.

00:57:24 It’s an unsolved problem,

00:57:25 but it is definitely something we care very much about.

00:57:31 But the idea is you wanna detect Alexa.

00:57:33 Meant for the device.

00:57:36 First of all, just even hearing Alexa versus I like something.

00:57:40 I mean, that’s a fascinating part.

00:57:41 So that was the first relief.

00:57:43 That’s the first.

00:57:43 The world’s best detector of Alexa.

00:57:45 Yeah, the world’s best wake word detector

00:57:48 in a far field setting,

00:57:49 not like something where the phone is sitting on the table.

00:57:53 This is like people have devices 40 feet away

00:57:56 like in my house or 20 feet away and you still get an answer.

00:58:00 So that was the first part.

00:58:02 The next is, okay, you’re speaking to the device.

00:58:05 Of course, you’re gonna issue many different requests.

00:58:09 Some may be simple, some may be extremely hard,

00:58:11 but it’s a large vocabulary speech recognition problem

00:58:13 essentially, where the audio is now not coming

00:58:17 onto your phone or a handheld mic like this

00:58:20 or a close talking mic, but it’s from 20 feet away

00:58:23 where if you’re in a busy household,

00:58:26 your son may be listening to music,

00:58:28 your daughter may be running around with something

00:58:31 and asking your mom something and so forth, right?

00:58:33 So this is like a common household setting

00:58:36 where the words you’re speaking to Alexa

00:58:40 need to be recognized with very high accuracy, right?

00:58:43 Now we are still just in the recognition problem.

00:58:45 We haven’t yet come to the understanding one, right?

00:58:48 And if I may pause there, sorry, once again,

00:58:50 what year was this?

00:58:51 Is this before neural networks began to start

00:58:56 to seriously prove themselves in the audio space?

00:59:00 Yeah, this is around, so I joined in 2013 in April, right?

00:59:05 So the early research and neural networks coming back

00:59:08 and showing some promising results

00:59:11 in speech recognition space had started happening,

00:59:13 but it was very early.

00:59:15 But we built on that

00:59:17 in the very first thing we did when I joined the team.

00:59:23 And remember, it was a very much of a startup environment,

00:59:25 which is great about Amazon.

00:59:28 And we doubled down on deep learning right away.

00:59:31 And we knew we’ll have to improve accuracy fast.

00:59:36 And because of that, we worked on it.

00:59:38 And the scale of data, once you have a device like this,

00:59:41 if it is successful, will improve big time.

00:59:44 Like you’ll suddenly have large volumes of data

00:59:48 to learn from to make the customer experience better.

00:59:51 So how do you scale deep learning?

00:59:52 So we did one of the first works

00:59:54 in training with distributed GPUs

00:59:57 and where the training time was linear

01:00:01 in terms of the amount of data.

01:00:03 So that was quite important work

01:00:06 where it was algorithmic improvements

01:00:07 as well as a lot of engineering improvements

01:00:09 to be able to train on thousands and thousands of hours of speech.

01:00:14 And that was an important factor.
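For readers unfamiliar with the technique being described: training time stays roughly linear in data volume when you parallelize across workers with synchronous gradient averaging. A toy sketch of that pattern, using a least-squares loss and plain NumPy in place of real GPUs and a real all-reduce (frameworks like Horovod or NCCL do this in practice; everything here is invented for illustration):

```python
import numpy as np

def data_parallel_step(shards, weights, lr=0.1):
    """One synchronous data-parallel step: each 'GPU' computes a gradient
    on its own shard, gradients are averaged (an all-reduce), and every
    replica applies the same update."""
    grads = []
    for X, y in shards:  # in reality these run concurrently on separate GPUs
        err = X @ weights - y
        grads.append(X.T @ err / len(y))  # local gradient on this shard
    g = np.mean(grads, axis=0)  # all-reduce: average across workers
    return weights - lr * g

rng = np.random.default_rng(0)
w_true = np.array([2.0, -1.0])
X = rng.normal(size=(400, 2))
y = X @ w_true
shards = [(X[i::4], y[i::4]) for i in range(4)]  # split data across 4 workers
w = np.zeros(2)
for _ in range(200):
    w = data_parallel_step(shards, w)
print(np.round(w, 2))  # ≈ [ 2. -1.]
```

Because each step touches every shard in parallel, doubling the data while doubling the workers keeps wall-clock time roughly flat — the property being described in the conversation.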

01:00:15 So if you ask me like back in 2013 and 2014,

01:00:19 when we launched Echo,

01:00:22 the combination of large scale data,

01:00:25 deep learning progress, near infinite GPUs

01:00:29 we had available on AWS even then,

01:00:33 all came together for us to be able

01:00:35 to solve the far field speech recognition

01:00:38 to the extent it could be useful to the customers.

01:00:40 It’s still not solved.

01:00:41 Like, I mean, it’s not that we are perfect

01:00:43 at recognizing speech, but we are great at it

01:00:45 in terms of the settings that are in homes, right?

01:00:48 So, and that was important even in the early stages.

01:00:50 So first of all, just even,

01:00:51 I’m trying to look back at that time.

01:00:54 If I remember correctly,

01:00:57 it was, it seems like the task would be pretty daunting.

01:01:01 So like, so we kind of take it for granted

01:01:04 that it works now.

01:01:06 Yes, you’re right.

01:01:07 So let me, like how, first of all, you mentioned startup.

01:01:10 I wasn’t familiar how big the team was.

01:01:12 I kind of, cause I know there’s a lot

01:01:14 of really smart people working on it.

01:01:16 So now it’s a very, very large team.

01:01:19 How big was the team?

01:01:20 How likely were you to fail in the eyes of everyone else?

01:01:24 And ourselves?

01:01:26 And yourself?

01:01:27 So like what?

01:01:28 I’ll give you a very interesting anecdote on that.

01:01:31 When I joined the team,

01:01:33 the speech recognition team was six people.

01:01:37 My first meeting, and we had hired a few more people,

01:01:40 it was 10 people.

01:01:42 Nine out of 10 people thought it can’t be done.

01:01:48 Who was the one?

01:01:50 The one was me. Actually, I should say,

01:01:52 one was semi optimistic.

01:01:56 And eight were trying to convince,

01:01:59 let’s go to the management and say,

01:02:01 let’s not work on this problem.

01:02:03 Let’s work on some other problem,

01:02:05 like either telephony speech for customer service calls

01:02:09 and so forth.

01:02:10 But this was the kind of belief you must have.

01:02:12 And I had experience with far field speech recognition

01:02:14 and my eyes lit up when I saw a problem like that saying,

01:02:17 okay, we have been in speech recognition,

01:02:20 always looking for that killer app.

01:02:23 And this was a killer use case

01:02:25 to bring something delightful in the hands of customers.

01:02:28 So you mentioned the way you kind of think of it

01:02:31 in the product way in the future,

01:02:32 have a press release and an FAQ and you think backwards.

01:02:35 Did you have, did the team have the echo in mind?

01:02:41 So this far field speech recognition,

01:02:43 actually putting a thing in the home that works,

01:02:45 that it’s able to interact with,

01:02:46 was that the press release?

01:02:48 What was the?

01:02:49 Very close, I would say, in terms of the,

01:02:51 as I said, the vision was Star Trek computer, right?

01:02:55 Or the inspiration.

01:02:56 And from there, I can’t divulge

01:02:59 all the exact specifications,

01:03:00 but one of the first things that was magical on Alexa

01:03:07 was music.

01:03:08 It brought me to back to music

01:03:11 because my taste was still from when I was an undergrad.

01:03:14 So I still listened to those songs and I,

01:03:17 it was too hard for me to be a music fan with a phone, right?

01:03:21 So I, and I don’t, I hate things in my ears.

01:03:24 So from that perspective, it was quite hard

01:03:28 and music was part of the,

01:03:32 at least the documents I have seen, right?

01:03:33 So from that perspective, I think, yes,

01:03:36 in terms of how far are we from the original vision?

01:03:40 I can’t reveal that, but it’s,

01:03:42 that’s why I have a ton of fun at work

01:03:44 because every day we go in and thinking like,

01:03:47 these are the new set of challenges to solve.

01:03:49 Yeah, that’s a great way to do great engineering

01:03:51 as you think of the press release.

01:03:53 I like that idea actually.

01:03:55 Maybe we’ll talk about it a bit later,

01:03:56 but it’s just a super nice way to have a focus.

01:03:59 I’ll tell you this, you’re a scientist

01:04:01 and a lot of my scientists have adopted that.

01:04:03 They have now, they love it as a process

01:04:07 because it was very, as scientists,

01:04:09 you’re trained to write great papers,

01:04:10 but they are all after you’ve done the research

01:04:13 or you’ve proven that and your PhD dissertation proposal

01:04:16 is something that comes closest

01:04:18 or a DARPA proposal or a NSF proposal

01:04:21 is the closest that comes to a press release.

01:04:23 But that process is now ingrained in our scientists,

01:04:27 which is like delightful for me to see.

01:04:30 You write the paper first and then make it happen.

01:04:33 That’s right.

01:04:33 In fact, it’s not.

01:04:34 State of the art results.

01:04:36 Or you leave the results section open

01:04:38 where you have a thesis about here’s what I expect, right?

01:04:41 And here’s what it will change, right?

01:04:44 So I think it is a great thing.

01:04:46 It works for researchers as well.

01:04:48 Yeah.

01:04:49 So far field recognition.

01:04:50 Yeah.

01:04:52 What was the big leap?

01:04:53 What were the breakthroughs

01:04:55 and what was that journey like to today?

01:04:58 Yeah, I think the, as you said first,

01:05:00 there was a lot of skepticism

01:05:01 on whether far field speech recognition

01:05:03 will ever work to be good enough, right?

01:05:06 And what we first did was got a lot of training data

01:05:10 in a far field setting.

01:05:11 And that was extremely hard to get

01:05:14 because none of it existed.

01:05:16 So how do you collect data in far field setup, right?

01:05:20 With no customer base at this time.

01:05:21 With no customer base, right?

01:05:22 So that was first innovation.

01:05:24 And once we had that, the next thing was,

01:05:27 okay, if you have the data,

01:05:29 first of all, we didn’t talk about like,

01:05:31 what would magical mean in this kind of a setting?

01:05:35 What is good enough for customers, right?

01:05:37 That’s always, since you’ve never done this before,

01:05:40 what would be magical?

01:05:41 So it wasn’t just a research problem.

01:05:44 You had to put some in terms of accuracy

01:05:47 and customer experience features,

01:05:49 some stakes on the ground saying,

01:05:51 here’s where I think it should get to.

01:05:55 So you established a bar

01:05:56 and then how do you measure progress

01:05:57 towards given you have no customer right now.

01:06:01 So from that perspective, we went,

01:06:04 so first was the data without customers.

01:06:07 Second was doubling down on deep learning

01:06:10 as a way to learn.

01:06:11 And I can just tell you that the combination of the two

01:06:16 cut our error rates by a factor of five.

01:06:19 From where we were when I started

01:06:21 to within six months of having that data,

01:06:24 we, at that point, I got the conviction

01:06:28 that this will work, right?

01:06:29 So, because that was magical

01:06:31 in terms of when it started working and.

01:06:34 That reached the magical bar.

01:06:36 That came close to the magical bar.

01:06:38 To the bar, right?

01:06:39 That we felt would be where people will use it.

01:06:44 That was critical.

01:06:45 Because you really have one chance at this.

01:06:48 If we had launched in November 2014 is when we launched,

01:06:51 if it was below the bar,

01:06:53 I don’t think this category exists

01:06:56 if you don’t meet the bar.

01:06:58 Yeah, and just having looked at voice based interactions

01:07:02 like in the car or earlier systems,

01:07:06 it’s a source of huge frustration for people.

01:07:08 In fact, we use voice based interaction

01:07:10 for collecting data on subjects to measure frustration.

01:07:14 So, as a training set for computer vision,

01:07:16 for face data, so we can get a data set

01:07:19 of frustrated people.

01:07:20 That’s the best way to get frustrated people

01:07:22 is having them interact with a voice based system

01:07:24 in the car.

01:07:25 So, that bar I imagine is pretty high.

01:07:28 It was very high.

01:07:29 And we talked about how also errors are perceived

01:07:32 from AIs versus errors by humans.

01:07:35 But we are not done with the problems we ended up

01:07:38 having to solve to get it to launch.

01:07:39 So, do you want the next one?

01:07:41 Yeah, the next one.

01:07:42 So, the next one was what I think of as

01:07:47 multi domain natural language understanding.

01:07:50 It’s very, I wouldn’t say easy,

01:07:53 but it is during those days,

01:07:56 solving it, understanding in one domain,

01:07:59 a narrow domain was doable,

01:08:02 but for these multiple domains like music,

01:08:06 like information, other kinds of household productivity,

01:08:10 alarms, timers, even though it wasn’t as big as it is

01:08:14 in terms of the number of skills Alexa has

01:08:15 and the confusion space has like grown

01:08:17 by three orders of magnitude,

01:08:20 it was still daunting even those days.

01:08:22 And again, no customer base yet.

01:08:24 Again, no customer base.

01:08:26 So, now you’re looking at meaning understanding

01:08:28 and intent understanding and taking actions

01:08:30 on behalf of customers.

01:08:31 Based on their requests.

01:08:33 And that is the next hard problem.

01:08:36 Even if you have gotten the words recognized,

01:08:39 how do you make sense of them?

01:08:42 In those days, there was still a lot of emphasis

01:08:47 on rule based systems for writing grammar patterns

01:08:50 to understand the intent.

01:08:52 But we had a statistical first approach even then,

01:08:55 where for our language understanding we had,

01:08:58 and even those starting days,

01:09:00 an entity recognizer and an intent classifier,

01:09:03 which was all trained statistically.

01:09:06 In fact, we had to build the deterministic matching

01:09:09 as a follow up to fix bugs that statistical models have.

01:09:14 So, it was just a different mindset

01:09:16 where we focused on data driven statistical understanding.
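To make the "statistical first" approach concrete for the reader: the intent classifier being described can be illustrated with a tiny bag-of-words Naive Bayes model trained from example utterances rather than hand-written grammar rules. The intents, utterances, and smoothing choice below are all invented; the production system is far richer:

```python
from collections import Counter, defaultdict
import math

# Invented training utterances labeled with intents.
TRAIN = [
    ("play songs by the stones", "PlayMusic"),
    ("play some jazz music", "PlayMusic"),
    ("set an alarm for seven", "SetAlarm"),
    ("wake me up at six", "SetAlarm"),
    ("what is the weather today", "GetWeather"),
    ("will it rain tomorrow", "GetWeather"),
]

def train(examples):
    word_counts = defaultdict(Counter)
    intent_counts = Counter()
    for text, intent in examples:
        intent_counts[intent] += 1
        word_counts[intent].update(text.split())
    return word_counts, intent_counts

def classify(text, word_counts, intent_counts):
    vocab = {w for c in word_counts.values() for w in c}
    best, best_lp = None, -math.inf
    for intent, n in intent_counts.items():
        total = sum(word_counts[intent].values())
        lp = math.log(n / sum(intent_counts.values()))
        for w in text.split():
            # add-one smoothing so unseen words don't zero out an intent
            lp += math.log((word_counts[intent][w] + 1) / (total + len(vocab)))
        if lp > best_lp:
            best, best_lp = intent, lp
    return best

wc, ic = train(TRAIN)
print(classify("play the stones", wc, ic))    # PlayMusic
print(classify("set alarm for nine", wc, ic))  # SetAlarm
```

The key property — the model improves as you add labeled data, with no rule-writing — is exactly why the conversation keeps returning to how the data was collected.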

01:09:20 It wins in the end if you have a huge data set.

01:09:22 Yes, it is contingent on that.

01:09:24 And that’s why it came back to how do you get the data.

01:09:27 Before customers, the fact that this is why data

01:09:30 becomes crucial to get to the point

01:09:33 that you have the understanding system built up.

01:09:37 And notice that for you,

01:09:40 we were talking about human machine dialogue,

01:08:42 and even in those early days,

01:08:44 it was very much transactional,

01:08:47 do one thing, one shot utterances, in a great way.

01:09:50 There was a lot of debate on how much should Alexa talk back

01:09:52 in terms of if it misunderstood you.

01:09:55 If it misunderstood you, or you said play songs by the Stones,

01:10:01 and let’s say it doesn’t know early days,

01:10:04 knowledge can be sparse, who are the stones?

01:10:09 It’s the Rolling Stones.

01:10:12 And you don’t want the match to be Stone Temple Pilots

01:10:16 or Rolling Stones.

01:10:17 So, you don’t know which one it is.

01:10:18 So, these kind of other signals,

01:10:22 now there we had great assets from Amazon in terms of…

01:10:27 UX, like what is it, what kind of…

01:10:29 Yeah, how do you solve that problem?

01:10:31 In terms of what we think of it

01:10:32 as an entity resolution problem, right?

01:10:34 So, because which one is it, right?

01:10:36 I mean, even if you figured out the stones as an entity,

01:10:40 you have to resolve it to whether it’s the stones

01:10:42 or the Stone Temple Pilots or some other stones.

01:10:44 Maybe I misunderstood, is the resolution

01:10:47 the job of the algorithm or is the job of UX

01:10:50 communicating with the human to help the resolution?

01:10:52 Well, there is both, right?

01:10:54 It is, you want 90% or high 90s to be done

01:10:58 without any further questioning or UX, right?

01:11:01 So, but it’s absolutely okay, just like as humans,

01:11:05 we ask the question, I didn’t understand you, Lex.

01:11:09 It’s fine for Alexa to occasionally say,

01:11:10 I did not understand you, right?

01:11:12 And that’s an important way to learn.

01:11:14 And I’ll talk about where we have come

01:11:16 with more self learning with these kind of feedback signals.

01:11:20 But in those days, just solving the ability

01:11:23 of understanding the intent and resolving to an action

01:11:26 where action could be play a particular artist

01:11:28 or a particular song was super hard.
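One simple way to picture the entity resolution problem just described ("the stones" → which catalog entry?) is blending string similarity with a popularity prior. The catalog, priors, and blend weight below are all invented; the real resolver uses many more signals:

```python
from difflib import SequenceMatcher

# Hypothetical catalog with made-up popularity priors.
CATALOG = {
    "The Rolling Stones": 0.9,
    "Stone Temple Pilots": 0.6,
    "Angus & Julia Stone": 0.3,
}

def resolve(mention, catalog, alpha=0.7):
    """Pick the catalog entity maximizing a blend of surface similarity
    to the spoken mention and a popularity prior."""
    def score(name, prior):
        sim = SequenceMatcher(None, mention.lower(), name.lower()).ratio()
        return alpha * sim + (1 - alpha) * prior
    return max(catalog.items(), key=lambda kv: score(*kv))[0]

print(resolve("the stones", CATALOG))  # The Rolling Stones
```

The prior is what breaks ties when the mention alone is ambiguous — which matches the point that resolution should succeed without a clarifying question most of the time.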

01:11:31 Again, the bar was high as we were talking about, right?

01:11:35 So, while we launched it in sort of 13 big domains,

01:11:40 I would say in terms of,

01:11:42 we think of it as 13, the big skills we had,

01:11:44 like music is a massive one when we launched it.

01:11:47 And now we have 90,000 plus skills on Alexa.

01:11:51 So, what are the big skills?

01:11:52 Can you just go over them?

01:11:53 Because the only thing I use it for

01:11:55 is music, weather and shopping.

01:11:58 So, we think of it as music information, right?

01:12:02 So, weather is a part of information, right?

01:12:05 So, when we launched, we didn’t have smart home,

01:12:08 but within, by smart home I mean,

01:12:10 you connect your smart devices,

01:12:12 you control them with voice.

01:12:13 If you haven’t done it, it’s worth,

01:12:15 it will change your life.

01:12:15 Like turning on the lights and so on.

01:12:16 Turning on your light to anything that’s connected

01:12:20 and has a, it’s just that.

01:12:21 What’s your favorite smart device for you?

01:12:23 My light.

01:12:24 Light.

01:12:24 And now you have the smart plug with,

01:12:26 and you don’t, we also have this echo plug, which is.

01:12:29 Oh yeah, you can plug in anything.

01:12:30 You can plug in anything

01:12:31 and now you can turn that one on and off.

01:12:33 I use this conversation motivation to get one.

01:12:35 Garage door, you can check your status of the garage door

01:12:39 and things like, and we have gone,

01:12:41 make Alexa more and more proactive,

01:12:43 where it even has hunches now,

01:12:45 that, oh, looks like you left your light on.

01:12:50 Let’s say you’ve gone to your bed

01:12:51 and you left the garage light on.

01:12:52 So it will help you out in these settings, right?

01:12:56 That’s smart devices, information, smart devices.

01:13:00 You said music.

01:13:01 Yeah, so I don’t remember everything we had,

01:13:02 but alarms, timers were the big ones.

01:13:05 Like that was, you know,

01:13:06 the timers were very popular right away.

01:13:09 Music also, like you could play song, artist, album,

01:13:13 everything, and so that was like a clear win

01:13:17 in terms of the customer experience.

01:13:19 So that’s, again, this is language understanding.

01:13:22 Now things have evolved, right?

01:13:24 So where we want Alexa definitely to be more accurate,

01:13:28 competent, trustworthy,

01:13:29 based on how well it does these core things,

01:13:33 but we have evolved in many different dimensions.

01:13:35 First is what I think of are doing more conversational

01:13:38 for high utility, not just for chat, right?

01:13:40 And there at Remars this year, which is our AI conference,

01:13:44 we launched what is called Alexa Conversations.

01:13:48 That is providing the ability for developers

01:13:51 to author multi turn experiences on Alexa

01:13:55 with no code, essentially,

01:13:57 in terms of the dialogue code.

01:13:58 Initially it was like, you know, all these IVR systems,

01:14:02 you have to fully author if the customer says this,

01:14:06 do that, right?

01:14:07 So the whole dialogue flow is hand authored.

01:14:11 And with Alexa Conversations,

01:14:13 the way it is that you just provide

01:14:15 a sample interaction data with your service or your API,

01:14:18 let’s say your Atom tickets that provides a service

01:14:21 for buying movie tickets.

01:14:23 You provide a few examples of how your customers

01:14:25 will interact with your APIs.

01:14:27 And then the dialogue flow is automatically constructed

01:14:29 using a recurrent neural network trained on that data.

01:14:33 So that simplifies the developer experience.

01:14:35 We just launched our preview for the developers

01:14:38 to try this capability out.
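To give the reader an intuition for learning dialogue flow from sample interactions rather than hand-authoring it: here is a toy stand-in that estimates a first-order transition model over dialogue acts from a handful of example conversations. The dialogue acts are invented, and as stated above the real system trains a recurrent network over much richer state, not a Markov chain:

```python
from collections import Counter, defaultdict

# Invented sample dialogues a developer might provide for a
# movie-ticket skill.
EXAMPLES = [
    ["FindMovies", "PickMovie", "AskPartySize", "BuyTickets"],
    ["FindMovies", "PickMovie", "BuyTickets"],
    ["FindMovies", "AskPartySize", "PickMovie", "BuyTickets"],
]

def learn_flow(dialogs):
    """Count observed transitions between consecutive dialogue acts."""
    trans = defaultdict(Counter)
    for d in dialogs:
        for a, b in zip(d, d[1:]):
            trans[a][b] += 1
    return trans

def next_action(state, trans):
    """Predict the most frequently observed follow-up action."""
    return trans[state].most_common(1)[0][0]

flow = learn_flow(EXAMPLES)
print(next_action("PickMovie", flow))  # BuyTickets
```

The appeal of this family of approaches is exactly what the conversation highlights: the developer supplies examples, and the flow is induced rather than enumerated as if/then rules.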

01:14:40 And then the second part of it,

01:14:42 which shows even increased utility for customers

01:14:45 is you and I, when we interact with Alexa or any customer,

01:14:50 as I’m coming back to our initial part of the conversation,

01:14:53 the goal is often unclear or unknown to the AI.

01:14:58 If I say, Alexa, what movies are playing nearby?

01:15:02 Am I trying to just buy movie tickets?

01:15:07 Am I actually even,

01:15:09 do you think I’m looking for just movies for curiosity,

01:15:12 whether the Avengers is still in theater or when is it?

01:15:15 Maybe it’s gone and maybe it will come back, or maybe I missed it.

01:15:17 So I may watch it on Prime, right?

01:15:20 Which happened to me.

01:15:21 So from that perspective now,

01:15:24 you’re looking into what is my goal?

01:15:27 And let’s say I now complete the movie ticket purchase.

01:15:31 Maybe I would like to get dinner nearby.

01:15:35 So what is really the goal here?

01:15:38 Is it night out or is it movies?

01:15:41 As in just go watch a movie?

01:15:44 The answer is, we don’t know.

01:15:46 So can Alexa now figure out, have the intelligence

01:15:50 that I think this meta goal is really night out

01:15:53 or at least say to the customer

01:15:55 when you’ve completed the purchase of movie tickets

01:15:58 from Atom tickets or Fandango,

01:16:00 or pick your anyone.

01:16:01 Then the next thing is,

01:16:02 do you want to get an Uber to the theater, right?

01:16:09 Or do you want to book a restaurant next to it?

01:16:12 And then not ask the same information over and over again,

01:16:17 what time, how many people in your party, right?

01:16:22 So this is where you shift the cognitive burden

01:16:26 from the customer to the AI.

01:16:29 Where it’s thinking of what is your,

01:16:32 it anticipates your goal

01:16:34 and takes the next best action to complete it.

01:16:37 Now that’s the machine learning problem.

01:16:40 But essentially the way we solve this first instance,

01:16:43 and we have a long way to go to make it scale

01:16:46 to everything possible in the world.

01:16:48 But at least for this situation,

01:16:50 it is from at every instance,

01:16:53 Alexa is making the determination,

01:16:54 whether it should stick with the experience

01:16:56 with Atom tickets or not.

01:16:58 Or offer you based on what you say,

01:17:03 whether either you have completed the interaction,

01:17:06 or you said, no, get me an Uber now.

01:17:07 So it will shift context into another experience or skill

01:17:12 or another service.

01:17:12 So that’s a dynamic decision making.

01:17:15 That’s making Alexa, you can say more conversational

01:17:18 for the benefit of the customer,

01:17:20 rather than simply complete transactions,

01:17:22 which are well thought through.

01:17:24 You as a customer have fully specified

01:17:27 what you want to be accomplished.

01:17:29 It’s accomplishing that.

01:17:30 So it’s kind of as we do this with pedestrians,

01:17:34 like intent modeling is predicting

01:17:36 what your possible goals are and what’s the most likely goal

01:17:40 and switching that depending on the things you say.

01:17:42 So my question is there,

01:17:44 it seems maybe it’s a dumb question,

01:17:46 but it would help a lot if Alexa remembered me,

01:17:51 what I said previously.

01:17:53 Right.

01:17:53 Is it trying to use some memories for the customer?

01:17:58 Yeah, it is using a lot of memory within that.

01:18:00 So right now, not so much in terms of,

01:18:02 okay, which restaurant do you prefer, right?

01:18:05 That is a more longterm memory,

01:18:06 but within the short term memory, within the session,

01:18:09 it is remembering how many people did you,

01:18:11 so if you said buy four tickets,

01:18:13 now it has made an implicit assumption

01:18:15 that you were gonna have,

01:18:18 you need at least four seats at a restaurant, right?

01:18:21 So these are the kind of context it’s preserving

01:18:24 between these skills, but within that session.
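A small sketch of the session-scoped context carryover just described — a slot filled in one skill (party size from a ticket purchase) pre-filling the same slot for the next skill within the session. Skill and slot names are invented for illustration:

```python
class Session:
    """Toy short-term memory shared across skills within one session."""

    def __init__(self):
        self.context = {}

    def handle(self, skill, slots):
        merged = {**self.context, **slots}  # explicit slots win over context
        self.context.update(merged)         # remember for later turns
        return skill, merged

s = Session()
print(s.handle("movie_tickets", {"party_size": 4, "time": "7pm"}))
print(s.handle("restaurant_booking", {"cuisine": "italian"}))
# the restaurant request inherits party_size=4 and time from the session
```

This mirrors the example in the conversation: buying four tickets implicitly sets the party size for the follow-on restaurant booking, without asking again.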

01:18:26 But you’re asking the right question

01:18:28 in terms of for it to be more and more useful,

01:18:32 it has to have more longterm memory

01:18:33 and that’s also an open question

01:18:35 and again, these are still early days.

01:18:37 So for me, I mean, everybody’s different,

01:18:40 but yeah, I’m definitely not representative

01:18:43 of the general population in the sense

01:18:45 that I do the same thing every day.

01:18:47 Like I eat the same,

01:18:48 I do everything the same, the same thing,

01:18:51 wear the same thing clearly, this or the black shirt.

01:18:55 So it’s frustrating when Alexa doesn’t get what I’m saying

01:18:59 because I have to correct her every time

01:19:01 in the exact same way.

01:19:02 This has to do with certain songs,

01:19:05 like she doesn’t know certain weird songs I like

01:19:08 and doesn’t know, I’ve complained to Spotify about this,

01:19:11 talked to the head of R&D at Spotify.

01:19:13 It’s Stairway to Heaven.

01:19:15 I have to correct it every time.

01:19:16 It doesn’t play Led Zeppelin correctly.

01:19:18 It plays a cover of Stairway to Heaven.

01:19:22 So I’m.

01:19:22 You should figure, you should send me your,

01:19:24 next time it fails, feel free to send it to me,

01:19:27 we’ll take care of it.

01:19:28 Okay, well.

01:19:29 Because Led Zeppelin is one of my favorite bands,

01:19:31 it works for me, so I’m like shocked it doesn’t work for you.

01:19:34 This is an official bug report.

01:19:35 I’ll put it, I’ll make it public,

01:19:37 I’ll make everybody retweet it.

01:19:39 We’re gonna fix the Stairway to Heaven problem.

01:19:40 Anyway, but the point is,

01:19:43 you know, I’m pretty boring and do the same things,

01:19:45 but I’m sure most people do the same set of things.

01:19:48 Do you see Alexa sort of utilizing that in the future

01:19:51 for improving the experience?

01:19:52 Yes, and not only utilizing,

01:19:54 it’s already doing some of it.

01:19:56 We call it, where Alexa is becoming more self learning.

01:19:59 So, Alexa is now auto correcting millions and millions

01:20:04 of utterances in the US

01:20:06 without any human supervision involved.

01:20:08 The way it does it is,

01:20:10 let’s take an example of a particular song

01:20:13 didn’t work for you.

01:20:14 What do you do next?

01:20:15 Either it played the wrong song

01:20:17 and you said, Alexa, no, that’s not the song I want.

01:20:20 Or you say, Alexa play that, you try it again.

01:20:25 And that is a signal to Alexa

01:20:27 that she may have done something wrong.

01:20:30 And from that perspective,

01:20:31 we can learn if there’s that failure pattern

01:20:35 or that action of song A was played

01:20:38 when song B was requested.

01:20:41 And it’s very common with station names

01:20:43 because play NPR, you can have N be confused as an M.

01:20:47 And then you, for a certain accent like mine,

01:20:51 people confuse my N and M all the time.

01:20:54 And because I have an Indian accent,

01:20:57 they’re confusable to humans.

01:20:59 It is for Alexa too.

01:21:01 And in that part, it starts auto correcting,

01:21:05 and we correct a lot of these automatically

01:21:09 without a human looking at the failures.
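The unsupervised auto-correction idea described here can be pictured as mining rephrase pairs from interaction logs: if customers frequently follow request A with a quick retry B, then A → B becomes a candidate rewrite. A minimal sketch — the log format, sessions, and support threshold are all invented:

```python
from collections import Counter

def mine_rewrites(sessions, min_support=2):
    """Treat consecutive, differing utterances in a session as a
    possible (failed request, retry) pair; keep pairs seen often
    enough to be a reliable auto-correction."""
    pairs = Counter()
    for utterances in sessions:
        for a, b in zip(utterances, utterances[1:]):
            if a != b:
                pairs[(a, b)] += 1
    return {a: b for (a, b), n in pairs.items() if n >= min_support}

# Invented logs: 'NPR' misrecognized as 'empty are', then retried.
sessions = [
    ["play empty are", "play npr"],
    ["play empty are", "play npr"],
    ["play jazz"],
]
print(mine_rewrites(sessions))  # {'play empty are': 'play npr'}
```

No labels and no human review are involved, which is the point: the customer's own retry is the training signal.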

01:21:12 So one of the things that’s for me missing in Alexa,

01:21:17 I don’t know if I’m a representative customer,

01:21:19 but every time I correct it,

01:21:22 it would be nice to know that that made a difference.

01:21:26 Yes.

01:21:26 You know what I mean?

01:21:27 Like the sort of like, I heard you like a sort of.

01:21:31 Some acknowledgement of that.

01:21:33 We work a lot with Tesla, we study autopilot and so on.

01:21:37 And a large amount of the customers

01:21:39 that use Tesla autopilot,

01:21:40 they feel like they’re always teaching the system.

01:21:43 They’re almost excited

01:21:43 by the possibility that they’re teaching.

01:21:45 I don’t know if Alexa customers generally think of it

01:21:48 as they’re teaching to improve the system.

01:21:51 And that’s a really powerful thing.

01:21:52 Again, I would say it’s a spectrum.

01:21:55 Some customers do think that way

01:21:57 and some would be annoyed by Alexa acknowledging that.

01:22:02 So there’s, again, no one,

01:22:04 while there are certain patterns,

01:22:05 not everyone is the same in this way.

01:22:08 But we believe that, again, customers helping Alexa

01:22:13 is a tenet for us in terms of improving it.

01:22:15 And some more self learning is by, again,

01:22:18 this is like fully unsupervised, right?

01:22:20 There is no human in the loop and no labeling happening.

01:22:23 And based on your actions as a customer,

01:22:27 Alexa becomes smarter.

01:22:29 Again, it’s early days,

01:22:31 but I think this whole area of teachable AI

01:22:35 is gonna get bigger and bigger in the whole space,

01:22:38 especially in the AI assistant space.

01:22:40 So that’s the second part

01:22:41 where I mentioned more conversational.

01:22:44 This is more self learning.

01:22:46 The third is more natural.

01:22:48 And the way I think of more natural

01:22:50 is we talked about how Alexa sounds.

01:22:53 And we have done a lot of advances in our text to speech

01:22:58 by using, again, neural network technology

01:23:00 for it to sound very humanlike.

01:23:03 From the individual texture of the sound to the timing,

01:23:07 the tonality, the tone, everything, the whole thing.

01:23:09 I would think in terms of,

01:23:11 there’s a lot of controls in each of the places

01:23:13 for how, I mean, the speed of the voice,

01:23:16 the prosodic patterns,

01:23:19 the actual smoothness of how it sounds,

01:23:23 all of those are factored

01:23:24 and we do a ton of listening tests to make sure.

01:23:27 But naturalness, how it sounds should be very natural.

01:23:30 How it understands requests is also very important.

01:23:33 And in terms of, we have 95,000 skills.

01:23:37 And if we have, imagine that in many of these skills,

01:23:41 you have to remember the skill name

01:23:43 and say, Alexa, ask the Tide skill to tell me X.

01:23:51 Now, if you have to remember the skill name,

01:23:52 that means the discovery and the interaction is unnatural.

01:23:56 And we are trying to solve that

01:23:58 by what we think of as, again,

01:24:03 you don’t have to have the app metaphor here.

01:24:05 These are not individual apps, right?

01:24:07 Even though they’re,

01:24:08 so you’re not sort of opening one at a time and interacting.

01:24:11 So it should be seamless because it’s voice.

01:24:14 And when it’s voice,

01:24:15 you have to be able to understand these requests

01:24:17 independent of the specificity, like a skill name.

01:24:20 And to do that,

01:24:21 what we have done is again,

01:24:22 built a deep learning based capability

01:24:24 where we shortlist a bunch of skills

01:24:27 when you say, Alexa, get me a car.

01:24:28 And then we figure it out, okay,

01:24:30 it’s meant for an Uber skill versus a Lyft

01:24:33 or based on your preferences.

01:24:34 And then you can rank the responses from the skill

01:24:38 and then choose the best response for the customer.
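[The two-stage idea described here, shortlist candidate skills for an utterance, then rank the responses, can be sketched in miniature. The keyword scorer and skill names below are hypothetical stand-ins for the deep learning models and real skills:]

```python
# Toy shortlister: score each skill by keyword overlap with the
# utterance, then keep the top k. In the real system this scoring is
# done by learned models, not keyword sets.
SKILL_KEYWORDS = {
    "ride-hailing-a": {"car", "ride", "taxi"},
    "ride-hailing-b": {"car", "ride"},
    "pizza-ordering": {"pizza", "order"},
}

def shortlist(utterance, k=2):
    words = set(utterance.lower().split())
    scored = [(len(words & kws), name) for name, kws in SKILL_KEYWORDS.items()]
    scored = [(s, n) for s, n in scored if s > 0]   # drop zero-score skills
    return [name for _, name in sorted(scored, reverse=True)[:k]]

# "Alexa, get me a car" shortlists both ride-hailing skills; a ranking
# stage (or user preference) would then pick between their responses.
candidates = shortlist("Alexa get me a car")
```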

01:24:41 So that’s on the more natural,

01:24:43 other examples of more natural is like,

01:24:46 we were talking about lists, for instance,

01:24:49 and you don’t wanna say, Alexa, add milk,

01:24:51 Alexa, add eggs, Alexa, add cookies.

01:24:55 No, Alexa, add cookies, milk, and eggs

01:24:57 and that in one shot, right?

01:24:59 So that works, that helps with the naturalness.
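[Handling "add cookies, milk, and eggs" in one shot means splitting a single utterance into multiple list items. A toy sketch; the production system uses trained language understanding rather than regular expressions:]

```python
import re

def parse_items(utterance):
    """Split a one-shot 'add X, Y, and Z' request into items (toy sketch)."""
    m = re.match(r"add (.+)", utterance.lower())
    if not m:
        return []
    # Split the item list on commas and the word "and", dropping blanks.
    parts = re.split(r",|\band\b", m.group(1))
    return [p.strip() for p in parts if p.strip()]

items = parse_items("Add cookies, milk, and eggs")
```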

01:25:01 We talked about memory, like if you said,

01:25:05 you can say, Alexa, remember I have to go to mom’s house,

01:25:09 or you may have entered a calendar event

01:25:11 through your calendar that’s linked to Alexa.

01:25:13 You don’t wanna remember whether it’s in my calendar

01:25:15 or did I tell you to remember something

01:25:18 or some other reminder, right?

01:25:20 So you have to now, independent of how customers

01:25:25 create these events, it should just say,

01:25:28 Alexa, when do I have to go to mom’s house?

01:25:29 And it tells you when you have to go to mom’s house.
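[The point here, answering "when do I have to go to mom's house" regardless of whether the event came from a calendar or a spoken reminder, amounts to a lookup that is independent of the source. A minimal sketch with a hypothetical data layout:]

```python
# Two event sources that the customer should not have to distinguish.
calendar = [{"what": "dentist", "when": "Tue 3pm"}]
reminders = [{"what": "go to mom's house", "when": "Sat 10am"}]

def when_is(query):
    """Answer a 'when' question by searching all sources uniformly."""
    for event in calendar + reminders:
        if query in event["what"]:
            return event["when"]
    return None

answer = when_is("mom's house")
```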

01:25:32 Now that’s a fascinating problem.

01:25:33 Who’s that problem on?

01:25:35 So there’s people who create skills.

01:25:38 Who’s tasked with integrating all of that knowledge together

01:25:42 so the skills become seamless?

01:25:44 Is it the creators of the skills

01:25:46 or is it an infrastructure that Alexa provides problem?

01:25:51 It’s both.

01:25:52 I think the large problem in terms of making sure

01:25:54 your skill quality is high,

01:25:58 that has to be done by our tools,

01:26:01 because it’s just, so these skills,

01:26:03 just to put the context,

01:26:04 they are built through Alexa Skills Kit,

01:26:06 which is a self serve way of building

01:26:09 an experience on Alexa.

01:26:11 This is like any developer in the world

01:26:13 could go to Alexa Skills Kit

01:26:14 and build an experience on Alexa.

01:26:16 Like if you’re a Domino’s, you can build a Domino’s Skills.

01:26:20 For instance, that does pizza ordering.

01:26:22 When you have authored that,

01:26:25 you do want to now,

01:26:28 if people say, Alexa, open Domino’s

01:26:30 or Alexa, ask Domino’s to get a particular type of pizza,

01:26:35 that will work, but the discovery is hard.

01:26:37 You can’t just say, Alexa, get me a pizza.

01:26:39 And then Alexa figures out what to do.

01:26:42 That latter part is definitely our responsibility

01:26:45 in terms of when the request is not fully specific,

01:26:48 how do you figure out what’s the best skill

01:26:51 or a service that can fulfill the customer’s request?

01:26:56 And it can keep evolving.

01:26:57 Imagine going to the situation I said,

01:26:59 which was the night out planning,

01:27:00 that the goal could be more than that individual request

01:27:03 that came up.

01:27:05 A pizza ordering could mean a night in,

01:27:08 where you’re having an event with your kids

01:27:10 in their house, and you’re, so this is,

01:27:12 welcome to the world of conversational AI.

01:27:16 This is super exciting because it’s not

01:27:18 the academic problem of NLP,

01:27:20 of natural language processing, understanding, dialogue.

01:27:23 This is like real world.

01:27:24 And the stakes are high in the sense

01:27:27 that customers get frustrated quickly,

01:27:30 people get frustrated quickly.

01:27:31 So you have to get it right,

01:27:33 you have to get that interaction right.

01:27:35 So it’s, I love it.

01:27:36 But so from that perspective,

01:27:39 what are the challenges today?

01:27:41 What are the problems that really need to be solved

01:27:45 in the next few years?

01:27:45 What’s the focus?

01:27:46 First and foremost, as I mentioned,

01:27:48 that get the basics right is still true.

01:27:53 Basically, even the one shot requests,

01:27:57 which we think of as transactional requests,

01:27:58 need to work magically, no question about that.

01:28:01 If it doesn’t turn your light on and off,

01:28:03 you’ll be super frustrated.

01:28:05 Even if I can complete the night out for you

01:28:07 and not do that, that is unacceptable as a customer, right?

01:28:10 So that you have to get the foundational understanding

01:28:14 going very well.

01:28:15 The second aspect when I said more conversational

01:28:17 is as you imagine is more about reasoning.

01:28:20 It is really about figuring out what the latent goal is

01:28:24 of the customer based on what I have the information now

01:28:28 and the history, what’s the next best thing to do.

01:28:31 So that’s a complete reasoning and decision making problem.

01:28:35 Just like your self driving car,

01:28:37 but the goal is still more finite.

01:28:38 Here it evolves, your environment is super hard

01:28:41 and self driving and the cost of a mistake is huge here,

01:28:46 but there are certain similarities.

01:28:48 But if you think about how many decisions Alexa is making

01:28:52 or evaluating at any given time,

01:28:54 it’s a huge hypothesis space.

01:28:56 And we’ve only talked so far

01:28:59 about what I think of as reactive decisions

01:29:02 in terms of you asked for something

01:29:03 and Alexa is reacting to it.

01:29:05 If you bring the proactive part,

01:29:07 which is Alexa having hunches.

01:29:10 So any given instance then it’s really a decision

01:29:14 at any given point based on the information.

01:29:17 Alexa has to determine what’s the best thing it needs to do.

01:29:20 So these are the ultimate AI problems

01:29:22 about decisions based on the information you have.

01:29:25 Do you think, just from my perspective,

01:29:27 I work a lot with sensing of the human face.

01:29:31 Do you think they’ll, and we touched this topic

01:29:33 a little bit earlier, but do you think it’ll be a day soon

01:29:36 when Alexa can also look at you to help improve the quality

01:29:41 of the hunch it has, or at least detect frustration

01:29:46 or detect, improve the quality of its perception

01:29:51 of what you’re trying to do?

01:29:54 I mean, let me again bring back to what it already does.

01:29:57 We talked about how based on you barge in over Alexa,

01:30:01 clearly it’s a very high probability

01:30:04 it must have done something wrong.

01:30:06 That’s why you barged in.

01:30:08 The next extension of whether frustration is a signal or not,

01:30:13 of course, is a natural thought

01:30:15 in terms of how that should be in a signal to it.

01:30:18 You can get that from voice.

01:30:19 You can get from voice, but it’s very hard.

01:30:21 Like, I mean, frustration as a signal historically,

01:30:25 if you think about emotions of different kinds,

01:30:29 there’s a whole field of affective computing,

01:30:31 something that MIT has also done a lot of research in,

01:30:34 is super hard.

01:30:35 And you are now talking about a far field device,

01:30:39 as in you’re talking to a distance noisy environment.

01:30:41 And in that environment,

01:30:44 it needs to have a good sense for your emotions.

01:30:47 This is a very, very hard problem.

01:30:49 Very hard problem, but you haven’t shied away

01:30:50 from hard problems.

01:30:51 So, Deep Learning has been at the core

01:30:55 of a lot of this technology.

01:30:57 Are you optimistic

01:30:58 about the current Deep Learning approaches

01:30:59 to solving the hardest aspects of what we’re talking about?

01:31:03 Or do you think there will come a time

01:31:05 where new ideas need to further,

01:31:07 if we look at reasoning,

01:31:09 so OpenAI, DeepMind,

01:31:10 a lot of folks are now starting to work in reasoning,

01:31:13 trying to see how we can make neural networks reason.

01:31:16 Do you see that new approaches need to be invented

01:31:20 to take the next big leap?

01:31:23 Absolutely, I think there has to be a lot more investment.

01:31:27 And I think in many different ways,

01:31:29 and there are these, I would say,

01:31:31 nuggets of research forming in a good way,

01:31:33 like learning with less data

01:31:36 or like zero shot learning, one shot learning.

01:31:39 And the active learning stuff you’ve talked about

01:31:41 is incredible stuff.

01:31:43 So, transfer learning is also super critical,

01:31:45 especially when you’re thinking about applying knowledge

01:31:48 from one task to another,

01:31:49 or one language to another, right?

01:31:52 It’s really ripe.

01:31:52 So, these are great pieces.

01:31:55 Deep learning has been useful too.

01:31:56 And now we are sort of marrying deep learning

01:31:58 with transfer learning and active learning.

01:32:02 Of course, that’s more straightforward

01:32:04 in terms of applying deep learning

01:32:05 and an active learning setup.

01:32:06 But I do think in terms of now looking

01:32:12 into more reasoning based approaches

01:32:14 is going to be key for our next wave of the technology.

01:32:19 But there is a good news.

01:32:20 The good news is that I think for keeping on

01:32:23 to delight customers, that a lot of it

01:32:25 can be done by prediction tasks.

01:32:27 So, we haven’t exhausted that.

01:32:30 So, we don’t need to give up

01:32:34 on the deep learning approaches for that.

01:32:37 So, that’s just I wanted to sort of point that out.

01:32:39 Creating a rich, fulfilling, amazing experience

01:32:42 that makes Amazon a lot of money

01:32:44 and everybody a lot of money

01:32:46 because it does awesome things, deep learning is enough.

01:32:49 The point.

01:32:51 I don’t think, I wouldn’t say deep learning is enough.

01:32:54 I think for the purposes of Alexa

01:32:56 accomplished the task for customers.

01:32:58 I’m saying there are still a lot of things we can do

01:33:02 with prediction based approaches that do not reason.

01:33:05 I’m not saying that and we haven’t exhausted those.

01:33:08 But for the kind of high utility experiences

01:33:12 that I’m personally passionate about

01:33:14 of what Alexa needs to do, reasoning has to be solved

01:33:18 to the same extent as you can think

01:33:21 of natural language understanding and speech recognition

01:33:24 to the extent of understanding intents

01:33:27 has been how accurate it has become.

01:33:30 But reasoning, we have very, very early days.

01:33:32 Let me ask it another way.

01:33:34 How hard of a problem do you think that is?

01:33:36 Hardest of them.

01:33:39 I would say hardest of them because again,

01:33:42 the hypothesis space is really, really large.

01:33:47 And when you go back in time, like you were saying,

01:33:50 I wanna, I want Alexa to remember more things

01:33:53 that once you go beyond a session of interaction,

01:33:56 which is by session, I mean a time span,

01:33:59 which is today, versus remembering which restaurant I like.

01:34:03 And then when I’m planning a night out to say,

01:34:05 do you wanna go to the same restaurant?

01:34:07 Now you’re up the stakes big time.

01:34:09 And this is where the reasoning dimension

01:34:12 also goes way, way bigger.

01:34:14 So you think the space, we’ll be elaborating that

01:34:17 a little bit, just philosophically speaking,

01:34:20 do you think when you reason about trying to model

01:34:24 what the goal of a person is in the context

01:34:28 of interacting with Alexa, you think that space is huge?

01:34:31 It’s huge, absolutely huge.

01:34:32 Do you think, so like another sort of devil’s advocate

01:34:35 would be that we human beings are really simple

01:34:38 and we all want like just a small set of things.

01:34:41 And so do you think it’s possible?

01:34:44 Cause we’re not talking about

01:34:47 a fulfilling general conversation.

01:34:49 Perhaps actually the Alexa prize is a little bit after that.

01:34:53 Creating a customer, like there’s so many

01:34:56 of the interactions, it feels like are clustered

01:35:01 in groups that are, don’t require general reasoning.

01:35:06 I think you’re right in terms of the head

01:35:09 of the distribution of all the possible things

01:35:11 customers may wanna accomplish.

01:35:13 But the tail is long and it’s diverse, right?

01:35:18 So from that.

01:35:19 There’s many, many long tails.

01:35:21 So from that perspective, I think you have

01:35:24 to solve that problem otherwise,

01:35:27 and everyone’s very different.

01:35:28 Like, I mean, we see this already

01:35:30 in terms of the skills, right?

01:35:32 I mean, if you’re an average surfer, which I am not, right?

01:35:36 But somebody is asking Alexa about surfing conditions, right?

01:35:41 And there’s a skill that is there for them to get to, right?

01:35:45 That tells you that the tail is massive.

01:35:47 Like in terms of like what kind of skills

01:35:50 people have created, it’s humongous in terms of it.

01:35:54 And which means there are these diverse needs.

01:35:56 And when you start looking at the combinations

01:36:00 of these, right?

01:36:00 Even if you had pairs of skills and 90,000 choose two,

01:36:05 it’s still a big set of combinations.
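[The "90,000 choose two" figure mentioned here can be checked directly:]

```python
from math import comb

# Even restricting to pairs of skills, the combination space is huge:
# 90,000 choose 2 is roughly four billion pairs.
pairs = comb(90_000, 2)  # = 4,049,955,000
```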

01:36:07 So I’m saying there’s a huge to do here now.

01:36:11 And I think customers are, you know,

01:36:14 wonderfully frustrated with things.

01:36:18 And they have to keep getting to do better things for them.

01:36:20 So.

01:36:21 And they’re not known to be super patient.

01:36:23 So you have to.

01:36:24 Do it fast.

01:36:25 You have to do it fast.

01:36:26 So you’ve mentioned the idea of a press release,

01:36:29 the research and development, Amazon Alexa

01:36:33 and Amazon general, you kind of think of what

01:36:35 the future product will look like.

01:36:37 And you kind of make it happen.

01:36:38 You work backwards.

01:36:40 So can you draft for me, you probably already have one,

01:36:43 but can you make up one for 10, 20, 30, 40 years out

01:36:48 that you see the Alexa team putting out

01:36:52 just in broad strokes, something that you dream about?

01:36:56 I think let’s start with the five years first, right?

01:37:00 So, and I’ll get to the 40 years too.

01:37:03 Cause I’m pretty sure you have a real five year one.

01:37:06 That’s why I didn’t want to, but yeah,

01:37:08 in broad strokes, let’s start with five years.

01:37:10 I think the five year is where, I mean,

01:37:11 I think of in these spaces, it’s hard,

01:37:14 especially if you’re in the thick of things

01:37:16 to think beyond the five year space,

01:37:17 because a lot of things change, right?

01:37:20 I mean, if you ask me five years back,

01:37:22 will Alexa will be here?

01:37:24 I wouldn’t have, I think it has surpassed

01:37:26 my imagination of that time, right?

01:37:29 So I think from the next five years perspective,

01:37:33 from a AI perspective, what we’re gonna see

01:37:37 is that notion, which you said goal oriented dialogues

01:37:40 and open domain like Alexa prize.

01:37:42 I think that bridge is gonna get closed.

01:37:45 They won’t be different.

01:37:46 And I’ll give you why that’s the case.

01:37:48 You mentioned shopping.

01:37:50 How do you shop?

01:37:52 Do you shop in one shot?

01:37:55 Sure, your double A batteries, paper towels.

01:37:59 Yes, how long does it take for you to buy a camera?

01:38:04 You do ton of research, then you make a decision.

01:38:07 So is that a goal oriented dialogue

01:38:11 when somebody says, Alexa, find me a camera?

01:38:15 Is it simply inquisitiveness, right?

01:38:18 So even in the something that you think of it as shopping,

01:38:20 which you said you yourself use a lot of,

01:38:23 if you go beyond where it’s reorders

01:38:27 or items where you sort of are not brand conscious

01:38:32 and so forth.

01:38:33 So that was just in shopping.

01:38:35 Just to comment quickly,

01:38:36 I’ve never bought anything through Alexa

01:38:38 that I haven’t bought before on Amazon on the desktop

01:38:41 after I clicked in a bunch of read a bunch of reviews,

01:38:44 that kind of stuff.

01:38:44 So it’s repurchase.

01:38:45 So now you think in,

01:38:47 even for something that you felt like is a finite goal,

01:38:51 I think the space is huge because even products,

01:38:54 the attributes are many,

01:38:56 and you wanna look at reviews,

01:38:58 some on Amazon, some outside,

01:39:00 some you wanna look at what CNET is saying

01:39:01 or another consumer forum is saying

01:39:05 about even a product for instance, right?

01:39:06 So that’s just shopping where you could argue

01:39:11 the ultimate goal is sort of known.

01:39:13 And we haven’t talked about Alexa,

01:39:15 what’s the weather in Cape Cod this weekend, right?

01:39:18 So why am I asking that weather question, right?

01:39:22 So I think of it as how do you complete goals

01:39:27 with minimum steps for our customers, right?

01:39:30 And when you think of it that way,

01:39:32 the distinction between goal oriented and conversations

01:39:35 for open domain say goes away.

01:39:38 I may wanna know what happened

01:39:41 in the presidential debate, right?

01:39:43 And is it I’m seeking just information

01:39:45 or I’m looking at who’s winning the debates, right?

01:39:49 So these are all quite hard problems.

01:39:53 So even the five year horizon problem,

01:39:55 I’m like, I sure hope we’ll solve these.

01:39:59 And you’re optimistic because that’s a hard problem.

01:40:03 Which part?

01:40:04 The reasoning enough to be able to help explore

01:40:09 complex goals that are beyond something simplistic.

01:40:12 That feels like it could be, well, five years is a nice.

01:40:16 Is a nice bar for it, right?

01:40:18 I think you will, it’s a nice ambition

01:40:21 and do we have press releases for that?

01:40:23 Absolutely, can I tell you what specifically

01:40:25 the roadmap will be?

01:40:26 No, right?

01:40:28 And what, and will we solve all of it

01:40:30 in the five year space?

01:40:31 No, this is, we’ll work on this forever actually.

01:40:35 This is the hardest of the AI problems

01:40:37 and I don’t see that being solved even in a 40 year horizon

01:40:42 because even if you limit to the human intelligence,

01:40:45 we know we are quite far from that.

01:40:47 In fact, every aspects of our sensing to neural processing,

01:40:52 to how brain stores information and how it processes it,

01:40:56 we don’t yet know how to represent knowledge, right?

01:40:59 So we are still in those early stages.

01:41:02 So I wanted to start, that’s why at the five year,

01:41:06 because the five year success would look like that

01:41:09 in solving these complex goals.

01:41:11 And the 40 year would be where it’s just natural

01:41:14 to talk to these in terms of more of these complex goals.

01:41:18 Right now, we’ve already come to the point

01:41:20 where these transactions you mentioned

01:41:22 of asking for weather or reordering something

01:41:25 or listening to your favorite tune,

01:41:28 it’s natural for you to ask Alexa.

01:41:30 It’s now unnatural to pick up your phone, right?

01:41:33 And that I think is the first five year transformation.

01:41:36 The next five year transformation would be,

01:41:38 okay, I can plan my weekend with Alexa

01:41:40 or I can plan my next meal with Alexa

01:41:43 or my next night out with seamless effort.

01:41:47 So just to pause and look back at the big picture of it all.

01:41:51 It’s a, you’re a part of a large team

01:41:55 that’s creating a system that’s in the home

01:41:58 that’s not human, that gets to interact with human beings.

01:42:02 So we human beings, we these descendants of apes

01:42:06 have created an artificial intelligence system

01:42:09 that’s able to have conversations.

01:42:10 I mean, that to me, the two most transformative robots

01:42:18 of this century, I think will be autonomous vehicles,

01:42:23 but they’re a little bit transformative

01:42:24 in a more boring way.

01:42:26 It’s like a tool.

01:42:28 I think conversational agents in the home

01:42:32 is like an experience.

01:42:34 How does that make you feel?

01:42:36 That you’re at the center of creating that?

01:42:38 Do you sit back in awe sometimes?

01:42:42 What is your feeling about the whole mess of it?

01:42:47 Can you even believe that we’re able

01:42:49 to create something like this?

01:42:50 I think it’s a privilege.

01:42:52 I’m so fortunate like where I ended up, right?

01:42:57 And it’s been a long journey.

01:43:00 Like I’ve been in this space for a long time in Cambridge,

01:43:03 right, and it’s so heartwarming to see

01:43:07 the kind of adoption conversational agents are having now.

01:43:12 Five years back, it was almost like,

01:43:14 should I move out of this because we are unable

01:43:17 to find this killer application that customers would love

01:43:21 that would not simply be a good to have thing

01:43:24 in research labs.

01:43:26 And it’s so fulfilling to see it make a difference

01:43:29 to millions and billions of people worldwide.

01:43:32 The good thing is that it’s still very early.

01:43:34 So I have another 20 years of job security

01:43:37 doing what I love.

01:43:38 Like, so I think from that perspective,

01:43:42 I tell every researcher that joins

01:43:44 or every member of my team,

01:43:46 that this is a unique privilege.

01:43:47 Like I think, and we have,

01:43:49 and I would say not just launching Alexa in 2014,

01:43:52 which was first of its kind.

01:43:54 Along the way we have, when we launched Alexa Skills Kit,

01:43:57 it was about democratizing AI.

01:43:59 When before that there was no good evidence

01:44:02 of an SDK for speech and language.

01:44:04 Now we are coming to this where you and I

01:44:06 are having this conversation where I’m not saying,

01:44:10 oh, Lex, planning a night out with an AI agent, impossible.

01:44:14 I’m saying it’s in the realm of possibility

01:44:17 and not only possibility, we’ll be launching this, right?

01:44:19 So some elements of that, it will keep getting better.

01:44:23 We know that is a universal truth.

01:44:25 Once you have these kinds of agents out there being used,

01:44:30 they get better for your customers.

01:44:32 And I think that’s where,

01:44:34 I think the amount of research topics

01:44:36 we are throwing out at our budding researchers

01:44:39 is just gonna be exponentially hard.

01:44:41 And the great thing is you can now get immense satisfaction

01:44:45 by having customers use it,

01:44:47 not just a paper in NeurIPS or another conference.

01:44:51 I think everyone, myself included,

01:44:53 are deeply excited about that future.

01:44:54 So I don’t think there’s a better place to end, Rohit.

01:44:58 Thank you so much for talking to us.

01:44:58 Thank you so much.

01:44:59 This was fun.

01:45:00 Thank you, same here.

01:45:02 Thanks for listening to this conversation

01:45:04 with Rohit Prasad.

01:45:05 And thank you to our presenting sponsor, Cash App.

01:45:08 Download it, use code LexPodcast,

01:45:11 you’ll get $10 and $10 will go to FIRST,

01:45:14 a STEM education nonprofit

01:45:16 that inspires hundreds of thousands of young minds

01:45:19 to learn and to dream of engineering our future.

01:45:23 If you enjoy this podcast, subscribe on YouTube,

01:45:26 give it five stars on Apple Podcast,

01:45:28 support it on Patreon, or connect with me on Twitter.

01:45:31 And now let me leave you with some words of wisdom

01:45:34 from the great Alan Turing.

01:45:37 Sometimes it is the people no one can imagine anything of

01:45:41 who do the things no one can imagine.

01:45:44 Thank you for listening and hope to see you next time.