Source: Microsoft Research
Episode 64, February 20, 2019
Layla El Asri: In a video game, most of the time you only have a few actions that you can take. You just need to learn when you should go right, when you should go left, when you should go up, when you should go down. But when it comes to dialogue, you need to learn how to make a sentence that is grammatically correct, and then you need to learn how to make a sentence that makes sense in the global context of the dialogue, or a sentence that brings new information in the dialogue that is going to make the person you are talking to satisfied with the sentence. Your action space is just huge because it’s not just up/down, right/left, it’s all the sentences you could imagine!
Host: You’re listening to the Microsoft Research Podcast, a show that brings you closer to the cutting edge of technology research and the scientists behind it. I’m your host, Gretchen Huizinga.
Host: Humans are unique in their ability to learn from, understand the world through and communicate with language… Or are they? Perhaps not for long, if Dr. Layla El Asri, a Research Manager at Microsoft Research Montreal, has a say in it. She wants you to be able to talk to your machine just like you’d talk to another person. That’s the easy part. The hard part is getting your machine to understand and talk back to you like it’s that other person.
Today, Dr. El Asri talks about the particular challenges she and other scientists face in building sophisticated dialogue systems that lay the foundation for talking machines. She also explains how reinforcement learning, in the form of a text game generator called TextWorld, is helping us get there, and relates a fascinating story from more than fifty years ago that reveals some of the safeguards necessary to ensure that when we design machines specifically to pass the Turing test, we design them in an ethical and responsible way. That and much more on this episode of the Microsoft Research Podcast.
Host: Layla El Asri, welcome to the podcast.
Layla El Asri: Thank you.
Host: I like to start each show by asking my guests a general “what gets you up in the morning” question. So, as a research manager at Microsoft Research in Montreal, in broad strokes, what do you do for a living and why do you do it? What does a day in the life of Layla El Asri look like?
Layla El Asri: So, I have been with Microsoft Research Montreal for about two years now. And, I’ve been working mostly on what we call dialogue systems. So, these systems have been used since the 90s, but they became really popular in 2011 when Siri was released with the iPhone 4S, I think. So now they are used for doing tasks that personal assistants would do for you, like setting reminders or looking for restaurants nearby. And, I am working on trying to make those systems work for even more complex tasks, for instance, something like helping you with your finances, helping you with a problem you have with your computer, helping you with anything that you don’t really know how to navigate. A dialogue system could help you navigate it, and you can talk to it naturally like you would talk to a person and then get a response that you can understand because it’s said in natural language as well.
Host: You know, right out of the gate, I hope you’re successful because that is exactly what I am looking for! So, in terms of what you are trying to accomplish with dialogue systems, let’s talk about how reinforcement learning, writ large, is playing a role here. Give us a brief description of reinforcement learning, how it works, how it’s different from other kinds of machine learning techniques and what it’s good for.
Layla El Asri: When you hear about machine learning, you’ll hear, very often, about supervised learning, and what supervised learning is, is, I give you data and I give you a label for all of your data points, and then I ask you to learn to map the data points to the labels. So, for instance, I give you a bunch of images. I tell you those images are cat images, those images are dog images, and then what I expect is, if I give you a new image that you’ve never seen before, you should be able to tell me if that’s a cat or a dog. So, reinforcement learning is a bit different in that, in this case, we put an agent in an environment. Think about a robot. We put a robot in an environment and then we let this robot act. And what we want from this robot is to learn how to interact with its environment and most of the time we would tell it to accomplish a certain task and we would explain that task to the robot with a numerical reward. So, let’s say you’re a robot and I want you to learn to open a door. You will need to walk in the environment and then identify the object that is a door and then try to open it, and then when you have accomplished this, you will get a reward telling you yes, this is what you needed to learn. So, this is reinforcement learning. Putting an agent in an environment, letting it act in this environment and then letting it learn how to act based on numerical feedback that we give it.
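The loop described above (an agent acts, the environment answers with a numerical reward, and the agent adjusts) can be sketched in a few lines of Python. This is only an illustration, not anything from the conversation: the four-cell corridor, the door position, the hyperparameters and the choice of tabular Q-learning as the learning rule are all assumptions made for the example.

```python
import random

ACTIONS = ["left", "right", "open"]
DOOR_POSITION = 3  # assumption: the door sits at the right end of a 4-cell corridor


def step(position, action):
    """Environment: return (next_position, reward, done) for one action."""
    if action == "open":
        if position == DOOR_POSITION:
            return position, 1.0, True  # numerical reward: the door was opened
        return position, 0.0, False  # nothing to open here
    if action == "right":
        return min(position + 1, DOOR_POSITION), 0.0, False
    return max(position - 1, 0), 0.0, False


def choose_action(q, position, epsilon, rng):
    """Mostly pick the best-known action, sometimes explore at random."""
    if rng.random() < epsilon:
        return rng.choice(ACTIONS)
    best = max(q[(position, a)] for a in ACTIONS)
    return rng.choice([a for a in ACTIONS if q[(position, a)] == best])


def train(episodes=500, alpha=0.5, gamma=0.9, epsilon=0.2, seed=0):
    """Tabular Q-learning: action values are learned only from reward feedback."""
    rng = random.Random(seed)
    q = {(s, a): 0.0 for s in range(DOOR_POSITION + 1) for a in ACTIONS}
    for _ in range(episodes):
        position = 0
        for _ in range(50):  # cap the episode length
            action = choose_action(q, position, epsilon, rng)
            next_position, reward, done = step(position, action)
            best_next = 0.0 if done else max(q[(next_position, a)] for a in ACTIONS)
            target = reward + gamma * best_next
            q[(position, action)] += alpha * (target - q[(position, action)])
            position = next_position
            if done:
                break
    return q


def greedy_policy(q):
    """Read off the best-known action for every position."""
    return {s: max(ACTIONS, key=lambda a: q[(s, a)]) for s in range(DOOR_POSITION + 1)}


if __name__ == "__main__":
    print(greedy_policy(train()))
```

Nobody tells the agent which action is correct at each step, as a supervised label would. It only ever sees the reward for opening the door, and the walk toward the door is learned backwards from that signal.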
Host: How do you evaluate whether it learned right or not?
Layla El Asri: Yeah, so you have a goal in mind when you train a reinforcement learning agent, opening a door or beating the game Ms. Pacman, for instance, and then you evaluate what your agent has done based on this goal. If your agent learns to accomplish this task, then it has succeeded. And then another thing that you want to look at is how fast it learns to accomplish this task because you don’t want – especially if you work with robots – you don’t want to let them just walk around forever. You know, you just want them to learn as fast as possible so that they are efficient and then they can learn more and more complex tasks.
Host: One of the cool success stories that you just mentioned in reinforcement learning comes out of Montreal where Harm van Seijen’s team was able to beat Ms. Pacman, as you mentioned, with a computer. This was no small feat, but as you noted, video games deal with a small set of basic actions, like up/down, left/right. So, give us a sense, in the context of language, why conquering dialogue is so much more difficult for a computer and how reinforcement learning is advancing the state of the art for dialogue systems.
Layla El Asri: As you said in a video game, most of the time you only have a few actions that you can take. You just need to learn when you should go right, when you should go left, when you should go up, when you should go down. But when it comes to dialogue, you need to learn how to make a sentence that is grammatically correct, and then you need to learn how to make a sentence that makes sense in the global context of the dialogue or a sentence that brings new information in the dialogue that is going to make the person you are talking to satisfied with the sentence. Your action space is just huge because it’s not just up/down, right/left, it’s all the sentences you could imagine! So, this is why it’s hard because you have a space to search for actions that is much larger than what you would have in a video game.
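To put rough numbers on that gap, here is a back-of-the-envelope count in Python. The vocabulary size and the sentence-length cap are assumptions picked for illustration, and the count includes every word sequence, most of which are not valid sentences, so it is an upper bound on what the agent has to search rather than a measurement.

```python
GAME_ACTIONS = 4  # up, down, left, right
VOCABULARY_SIZE = 10_000  # assumption: a modest working vocabulary
MAX_SENTENCE_LENGTH = 10  # assumption: sentences of at most ten words


def count_word_sequences(vocabulary_size, max_length):
    """Count all word sequences of length 1..max_length."""
    return sum(vocabulary_size ** n for n in range(1, max_length + 1))


sentence_space = count_word_sequences(VOCABULARY_SIZE, MAX_SENTENCE_LENGTH)

if __name__ == "__main__":
    print(f"game actions per step:      {GAME_ACTIONS}")
    print(f"word sequences per turn:    {sentence_space:.2e}")
    print(f"ratio (dialogue / game):    {sentence_space / GAME_ACTIONS:.2e}")
```

Even under these modest assumptions, a single dialogue turn has more than 10^40 candidate word sequences, against four moves in the game.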
Host: Hmmm. So, we’re going to come back to that later and talk about that complexity, but let’s talk for a second about the fact that you’ve worked on both spoken- and text-based dialogue systems. So, at the outset here, I’d like you to tell us how they’re different, in terms of the science and research involved, and what are the challenges in getting these different kinds of systems up to scale?
Layla El Asri: So, for spoken dialogue systems, one of the big challenges was, for a long time, to get automatic speech recognition to work. Automatic speech recognition takes the utterance that is pronounced by the user and then it tries to transcribe this utterance into words. So, that was hard for a very long time and we really started making progress on this in the 80s at CMU, at AT&T labs and then it started working really, really well with deep learning in the early 2000s. So, the challenge with spoken dialogue systems is that you have uncertainty on what the user says. So, you have to have automatic speech recognition that works for different accents, for different ages, you have to have it work with children, with adults, with teenagers who will speak differently. So, that was the big challenge there. And then when you go to text, it’s a whole other challenge because then you have to deal with how people spell, and people might make typos, they might spell a word as if they are texting somebody so it’s not necessarily the official spelling of the word. So, you need to deal with that, and you need to be able to recognize what the user is trying to say based on what they’re typing. So, the set of challenges are pretty different. For us, it’s really easy to understand, oh, okay she’s just correcting that word that she misspelled before, but a dialogue system doesn’t have that type of common sense that we have. So, it can be challenging.
Host: Can you ever put that common sense into a machine? Is there enough data, enough breadth, to do that? How are you guys thinking about that?
Layla El Asri: One of the biggest challenges in machine learning is, how do we infuse machines with common sense? I don’t think we have a good answer for this right now. We can’t write rules for everything, you know, tell a machine, this is how gravity works, this is how language works, this is how objects work. You know, we can’t just put the entire taxonomy of the world into a machine. So, it’s very complicated. One approach that we could take is just have the machine interact with the world. Read text, look at images, hear sound and then try to build this common sense just because common sense is implicit. But it’s also very complicated because machines don’t really learn the way we do. This is why it’s extremely hard to infuse common sense into machines right now. But it’s a very active thread of research and a lot of exciting work is coming out of it.
Host: Let’s dig in a little bit more deeply on the technical side of things. You know, most of us are familiar with popular conversational AI agents like the frenemies, Cortana, Siri and Alexa, and as you mentioned before, they’re good with simple exchanges and commands and reminders and answering my questions, but when you get to the interactions where machines can actually comprehend and reason, what are the scientific challenges here?
Layla El Asri: Let me tell you first how we build current dialogue systems, like Cortana, like Siri, like Alexa, personal assistants that can do simple tasks for you. The way we build them is very modular. So, first you will have a module that does automatic speech recognition, trying to transcribe whatever the user says. Then you have another module that we call natural language understanding. And what this module is going to do is try to understand the intent of the user. So, if you say, look for a restaurant downtown, it will try to understand the intent, first of all. You’re not asking a question here, you’re informing the system of something, so it will label this whole sentence as information. And then it needs to understand the domain that you’re talking about. Here, you are talking about restaurants. That’s natural language understanding, understanding the intent, the domain and then all the entities. And then you have a module that we call state tracking and what state tracking does is keep track of the entire dialogue because maybe earlier you said that you were looking for something Italian. So, it needs to remember that, and it needs to keep taking this into consideration as it tries to help you. So, then what you do is you look in your database of restaurants. You look for Italian restaurants downtown and then the database is going to return some results. So, let’s say the database returns about twenty results. What do you do with this, right? That’s the next question, and that’s the next module, and it’s called dialogue management. Dialogue management is going to look at the state of the dialogue and decide what to say next. So, maybe there are twenty results from the database, so it’s going to try to narrow that down. Maybe it will decide, okay, I can ask for a budget, if the user has a budget, then I can have fewer results and I can present that to the user, and it will be more efficient.
So, that’s dialogue management, it’s going to decide what to say next. And it’s also going to make this decision in a compact form. It’s going to say request, I’m asking a question, and the entity is budget. And then you need to take that decision and transform it into a sentence. That’s natural language generation. And finally, you have text-to-speech which is going to read that sentence and speak it back to the user. So, you have all these different modules that are all specialized in one aspect of the dialogue and they all work together in order to have a seamless experience for the user. The user just says something, and they get something back. So, those systems work, but you can see that they are limited by the structure, right? Because there are only so many intents that you can understand. You have to define that in advance. I am going to understand information, requests, orders, if the user asks me to do something in particular… You have to define that all in advance. So, that limits what your system can do. So, reinforcement learning applies very well to the dialogue case because the agent is interacting with an environment. In this case, the environment is the user, and it needs to perform a series of actions. It needs to say a series of sentences in order to accomplish a certain task. For instance, finding an Italian restaurant downtown for the user that the user likes. So, it’s very much the reinforcement learning setting. But in 2014, what we were capable of doing with deep learning was have one neural network that would build its own representation, its own meaning of what was happening, and then decide what to do next, learn what to do next. And, this is what we’re trying to do with dialogue systems now. We’re trying to have one model that just reads the dialogue that has happened so far and then learn what to say next. So, we’re trying to have one neural network that just takes in dialogue history and outputs the next utterance to say to the user.
And one of the big problems is evaluation. In a dialogue, as we were saying before, there are so many things to take into account. There is grammar, there is syntax, there is clarity, there is the choice of words that you make. There is a lot to take into account.
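The modular pipeline described in this answer can be sketched as a chain of small Python functions. This is a toy under stated assumptions, not how Cortana, Siri or Alexa are built: speech recognition and text-to-speech are left out, the restaurant rows and slot values are invented, and keyword matching stands in for a real natural language understanding model.

```python
RESTAURANTS = [  # hypothetical database rows
    {"name": "Trattoria Uno", "cuisine": "italian", "area": "downtown", "budget": "cheap"},
    {"name": "Casa Due", "cuisine": "italian", "area": "downtown", "budget": "expensive"},
    {"name": "Sushi Tre", "cuisine": "japanese", "area": "downtown", "budget": "cheap"},
]
SLOT_VALUES = {  # the entities this toy system was told about in advance
    "cuisine": ["italian", "japanese"],
    "area": ["downtown"],
    "budget": ["cheap", "expensive"],
}


def understand(utterance):
    """Natural language understanding: intent, domain and entities."""
    words = utterance.lower().replace(",", " ").replace(".", " ").split()
    entities = {slot: v for slot, values in SLOT_VALUES.items() for v in values if v in words}
    intent = "request" if utterance.strip().endswith("?") else "inform"
    return {"intent": intent, "domain": "restaurants", "entities": entities}


def track_state(state, nlu):
    """State tracking: remember what was said in earlier turns."""
    return {**state, **nlu["entities"]}


def search(state):
    """Database lookup constrained by everything known so far."""
    return [r for r in RESTAURANTS if all(r[k] == v for k, v in state.items())]


def manage(state, results):
    """Dialogue management: decide what to say next, in compact form."""
    if not results:
        return ("inform", "no_match")
    if len(results) > 1 and "budget" not in state:
        return ("request", "budget")  # narrow the results down
    return ("offer", results[0]["name"])


def generate(act):
    """Natural language generation: turn the compact decision into a sentence."""
    kind, value = act
    if kind == "request":
        return f"What is your {value}?"
    if kind == "offer":
        return f"How about {value}?"
    return "Sorry, I could not find anything."


def respond(state, utterance):
    """Run one user turn through every module in order."""
    state = track_state(state, understand(utterance))
    return state, generate(manage(state, search(state)))


if __name__ == "__main__":
    state = {}
    for turn in ["I want something Italian.", "Look for a restaurant downtown.", "Cheap, please."]:
        state, reply = respond(state, turn)
        print(f"user: {turn}\nbot:  {reply}")
```

The limitation mentioned above shows up directly: anything outside `SLOT_VALUES` and the two hard-coded intents is invisible to the system, which is the structure an end-to-end model tries to do away with.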
Host: Let’s talk a little bit about an interesting thread of research that you have going on called Dialogue Driven Design, and it deals with generative models. So, tell us about this approach. What’s the goal and how is it working so far?
Layla El Asri: So, this thread of research is about having a dialogue interface to a generative model. So, dialogue systems have been traditionally used for accessibility purposes because they make technology accessible to those who are not necessarily proficient with technology because you just need to speak to the system and then it does what you want it to do. And in machine learning we have those models called generative models that are working better and better. We hear a lot about GANs (Generative Adversarial Networks) and these systems are capable of generating images that really look fantastic right now. If you look at images like flowers or buildings, they are capable of generating photo-realistic images. But these models are pretty obscure because they are neural networks. If you want to use them, you need to be well-versed in this literature. You need to understand how they work, how to train them. So, we thought, can we put a dialogue interface to those generative models because they are really starting to work well? They could be useful. So, can we make it so that people can talk to them and can control what they generate? And that’s a project that we call ChatPainter. It’s a project that we did with my team, with Shikhar Sharma and Hannes Schulz, and we had an intern, Alaa El-Nouby, this summer who contributed a lot to it. And what we have right now is a model that you can talk to. You can ask it to generate an image based on different objects. You can say, add a tree in the middle, put a sun on the top left, add a little boy on the grass. And then the model will modify the image every time you ask it to add something new and then, at the end, you will have the image that you had envisioned. So, right now, our model is working pretty well and we’re working on having a demo that we can make available publicly so that anybody can use it. So, it really is an example of accessibility because all you will need to do is just talk to the model.
Host: Let’s talk about another cool thing you are using in your research. It’s a tool called TextWorld and it’s yet another amazing project coming out of Montreal. Your team calls it an open-source, extensible engine that generates and simulates text games, and it’s also a sandbox learning environment. So, tell us a bit more. These are provocative descriptions that leave me craving more details.
Layla El Asri: Yes. TextWorld is a very exciting project. It allows you to generate text games as you were saying. So, text games were really popular back in the 80s before graphics were working really well. So, it was games like Zork, for instance, where you had to navigate a world just by entering sentences. And TextWorld is an environment that allows you to generate such games. So, you would generate an environment, you would say put bedroom and then attach to it a bathroom and a living room and then you can put objects in each of these rooms; put a bed in the bedroom, put a chest in the bathroom, put the key that opens this chest in the living room. And then you would define a quest, so you would say, if you want to succeed in this game, you need to go fetch the key, open the chest and then collect whatever is in this chest. And so that’s a very good environment for training dialogue systems because it is a dialogue. You’re talking to the game engine. You are saying, I want to open the door, I want to take the key, I want to unlock the chest with the key. And at the same time – so, I was saying earlier when you try to evaluate what a dialogue system does, you need to look at, broadly, two things. One is the quality of the language. Is it grammatically correct? Is the syntax good? And then, also, you need to look at the content of the sentence. Is it moving the dialogue forward? Is it a good sentence to be said at this point, given what was said before? Is it consistent? And TextWorld gives you that because you generate a game and then you need to output sentences that make sense if you want to get anywhere. If you say, go door, the parser will come back and say, you can’t do that, I have no idea what that means. You need to output a real sentence. So, you have this parser that, first of all, is going to tell you if your sentence is correct. And then you have this quest which evaluates your progression in the dialogue. 
So, you have these two objectives that you need to output a sentence that is correct, and then you need to also output a sentence that is going to move you forward in this quest that you are trying to accomplish. So, this is why I think it’s a really good environment to train dialogue systems right now because it gives you a really clear signal as to what you are generating. Is it correct and is it moving you forward?
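The two signals described here, a parser that rejects malformed commands and a quest that rewards progress, can be mimicked with a very small stand-in. The sketch below does not use the TextWorld library or its API. The `ToyTextGame` class, its command list and its feedback strings are all hypothetical, and the game is collapsed into a single room to keep it short.

```python
class ToyTextGame:
    """A hypothetical one-room text game: fetch the key, open the chest, collect the coin."""

    QUEST = ["take key", "unlock chest with key", "open chest", "take coin"]
    # Commands the parser accepts; anything else is rejected outright.
    VALID_COMMANDS = set(QUEST) | {"look"}

    def __init__(self):
        self.progress = 0  # number of quest steps completed so far

    def step(self, command):
        """Return (feedback, reward, done) for one typed command."""
        command = " ".join(command.lower().split())
        if self.progress == len(self.QUEST):
            return "The quest is already complete.", 0, True
        if command not in self.VALID_COMMANDS:
            # Signal 1: is the sentence well formed?
            return "I don't understand that.", 0, False
        if command == self.QUEST[self.progress]:
            # Signal 2: does the sentence move the quest forward?
            self.progress += 1
            return "Done.", 1, self.progress == len(self.QUEST)
        return "Nothing happens.", 0, False


if __name__ == "__main__":
    game = ToyTextGame()
    commands = ["go door", "open chest", "take key",
                "unlock chest with key", "open chest", "take coin"]
    for command in commands:
        feedback, reward, done = game.step(command)
        print(f"> {command}\n{feedback} (reward={reward}, done={done})")
```

In this run "go door" is rejected by the parser, the first "open chest" is well formed but earns nothing because the chest is still locked, and only the four quest steps in order produce reward.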
Host: Well, let’s talk about data for a minute. Every machine learning model needs loads of data to train, and the more complex the model, the more and more complex the data that’s needed. So, what are the unique data requirements for dialogue systems that you’re trying to build and how are you going about data gathering?
Layla El Asri: It’s particularly challenging for dialogue systems because we can’t really find this data on the internet. Because if you think about dialogues that are happening on the internet, you don’t necessarily want to reproduce those dialogues, right?
Host: No, no. Please, no.
Layla El Asri: So, um, a few things we’ve done. We’ve collected our own data sets of goal-oriented dialogues. So, we have a data set called Frames. It’s about vacation-booking, and it was two people talking to each other. One was pretending to be a bot and the other was pretending to be a user. And we gave a task to the user and then we told the bot to try to help the user as much as they could and then we collected those dialogues that they had and we recorded everything, so, the database searches that the bot performed, everything that they told each other. So, that’s one way of collecting data. And then another way is having a simulated environment like TextWorld, for instance. That’s why I really like TextWorld because you can generate as much data as you want, and you can also generate dialogues. You can just roll this out and then record this dialogue and use it as training data as if you had a real dialogue that you could train from. But it’s cleaner data because the language is a bit less complicated than when you have two people talking to each other because, in a simulated environment, you’ll have fewer words. You’ll have templates so the sentences will almost always look alike. So, it’s a cleaner way to train your model and really analyze what it can do and what it cannot do. So right now, I’m really focusing on the simulated environment and on using TextWorld for dialogue systems because I think this is how we’ll make the most progress because right now we really need to analyze our models, what they can do, what they cannot do, and how we can push them further.
Host: I want to talk a bit about evaluation right now. Training models is one thing, getting the data, but with conversation agents, as you indicated before, a bigger problem is evaluating. As an ex-English teacher, I did this for years with teenage humans that came bundled with language learning infrastructure preinstalled, basically. And that was amazingly difficult. But machines don’t have the preexisting wetware, so how are you tackling the problem of evaluation in your dialogue systems?
Layla El Asri: It is one of the biggest challenges that we have right now. Because we could have English teachers evaluate whatever our models do, right? But this is quite expensive and sometimes, you know, you’re just developing your model, you’re trying out different architectures. You just want to know, quickly, which one is better. So, you really need some sort of automatic metric for this and then really use human evaluation at the end to validate your hypothesis and then compare to other models that were proposed previously in the literature. And having this automatic metric is extremely difficult. There are a few automatic metrics that the community is using, and that we all know are bad. (laughter) Let me tell you about that.
Layla El Asri: So, very often we will train a dialogue system based on dialogues. We will give it a data set of dialogues and we will train it so that for each dialogue history that we give it, it should try to generate whatever comes next in the data set. So, now the way we evaluate the dialogue system is we will give it this context. We will let it generate what it thinks is a good next utterance and then we will compare that next utterance to the one that was in the data set and we will look if they are the same or not. The metric that we’re using right now, it’s called the BLEU score, and if they are not the same, it will say that this utterance was really bad and that your dialogue system did a really bad job. So, it’s a really bad metric because for different contexts, there are so many things that you can say and they are all equally good and they are all equally consistent, they make sense, and we should be able to say if a sentence makes sense, given the context and not just say, did you guess whatever that person said next in this dialogue? We shouldn’t be looking at this, which is what the BLEU score does, we should be looking at the meaning of the sentence. But it’s very complicated because, if you think about it, if you have a model that is capable of telling you, based on dialogue history, if the sentence that you just generated makes sense, is consistent with the dialogue, moves it forward, then you’ve kind of solved the problem of the dialogue system, right? You have a model that should be able to, itself, generate a good utterance because it understands what a good utterance is. So, it’s this chicken and egg problem that we’ve had forever. And it’s really challenging.
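The weakness of word-overlap metrics can be seen with a small Python example. The function below is a deliberately simplified stand-in, clipped n-gram precision against a single reference, and not the full BLEU score, which also combines several n-gram orders and applies a brevity penalty. The example sentences are invented.

```python
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams (as tuples of words) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def clipped_precision(candidate, reference, n):
    """Fraction of candidate n-grams that also appear in the reference."""
    cand = ngrams(candidate.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)
    total = sum(cand.values())
    if total == 0:
        return 0.0
    # Each candidate n-gram is credited at most as often as it occurs in the reference.
    overlap = sum(min(count, ref[gram]) for gram, count in cand.items())
    return overlap / total


# Invented example: what the data set says the next utterance was.
reference = "what is your budget for dinner"
exact_copy = "what is your budget for dinner"
paraphrase = "how much would you like to spend"  # equally sensible in context

if __name__ == "__main__":
    for name, sentence in [("exact copy", exact_copy), ("paraphrase", paraphrase)]:
        score = clipped_precision(sentence, reference, 1)
        print(f"{name}: unigram precision = {score:.2f}")
```

The paraphrase asks the same question and would serve the dialogue just as well, yet it shares no words with the reference and scores zero, which is the failure described above.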
Host: Listen, as we talked before, I mentioned, the road to artificial general intelligence is paved with language, and that’s always been the domain of humans. But the goal of MSR Montreal is to teach machines to learn and understand the world through language. And as you said, which made me laugh and cry at the same time, building a good machine learning model for a dialogue system amounts to building a small person. This is very, very ambitious research. Uh, as they say, what could possibly go wrong? Is there anything… anything that keeps you up at night about the work you’re doing, Layla? Anything we should be mindful of, and if so, how are you addressing it?
Layla El Asri: So, you know, Cortana and Siri and the systems that you can talk to right now, if you start talking to them about politics or about something personal, if you anthropomorphize them, they’re going to answer something like, let’s not go there, let’s not talk about this, this is not what I’m meant to talk about. And that’s possible because of this modular architecture that I was telling you about before. You can tell natural language understanding, these are the intents you can recognize, if something is outside of this list of intents, say you don’t understand, say you don’t want to talk about it. But, as I was saying, we’re trying to get away from this structure, we’re trying to have something that’s much more fluid, something that could potentially understand anything. And that is what scares me. Let me tell you the story of one of the first dialogue systems ever created. It was in the 60s. It was created by Joseph Weizenbaum at MIT, and it was called Eliza. And Eliza was a very simple dialogue system. It was trying to imitate what a Rogerian therapist would do. And that means, make the user speak more about whatever they are saying. So, if you said something like, I’m having problems with my parents right now, it would just say, tell me more about your parents, or how does it make you feel to have problems with your parents? And what happened is that people opened up to it in a way that was completely unexpected. They said very intimate things about themselves. There’s this famous story that Weizenbaum’s secretary was speaking to Eliza and she asked Weizenbaum, do you mind going out? Because she was saying things that were very personal and that very much scared Joseph Weizenbaum.
And ever since, he tried to raise awareness of this potential harm, you know, that if you have a dialogue system that could potentially understand anything, how do you limit it so that it doesn’t have people talk too much about themselves or the system doesn’t say anything that could be detrimental to this person? So, this is something we should be aware of and this is something we should be working on. If we’re trying to get away from that structure of modular systems, we should still have safeguards as to what our systems should talk about. We should still try to understand when a user is talking about something personal and have our system answer in an ethical and responsible way. So, this is exactly what we’re doing at Microsoft, and I think this is something that more and more companies and research labs are definitely aware of and trying to establish as a standard.
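How little machinery an Eliza-style reply needs can be shown with a single pattern rule in Python. This is not Weizenbaum's program or script. The `eliza_reply` function and its one rule are hypothetical, written only to reproduce the "my parents" example from the story above.

```python
import re


def eliza_reply(utterance):
    """Reflect the user's own words back to keep them talking (one toy rule)."""
    # Look for "my <something>" and turn it into a prompt about "your <something>".
    match = re.search(r"\bmy (\w+)", utterance.lower())
    if match:
        return f"Tell me more about your {match.group(1)}."
    return "Please go on."  # generic fallback when no rule matches


if __name__ == "__main__":
    print(eliza_reply("I'm having problems with my parents right now"))
    print(eliza_reply("It has been a long week"))
```

The rule has no understanding of parents or problems. It only copies a word back, which is part of why the strength of people's reactions to Eliza was so unexpected.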
Host: Tell us a bit about yourself and your background, educationally and otherwise. What got you interested in becoming a computer science researcher and how did you end up at Microsoft Research?
Layla El Asri: I grew up in France, and it’s a bit different in France! We have a different path to get into engineering school. And when the time came to choose an engineering school, I decided to go for computer science because I was very intrigued by it. I didn’t know much about computer science. I knew how to use computers for basic things, but I didn’t know how to program, and I wanted to learn about it. So, this is why I chose this path. And then I chose machine learning because of luck actually. I was looking for an internship, and there was this posting at a lab near my university and what they were doing was fantastic. I didn’t know we were capable of doing this. They were building software for dyslexic children, and what the software was doing was, it was presenting exercises to make the children increase their attention span. And I had no idea this was possible. And when I saw that, I was completely mind-blown. And I decided to apply for this internship, and I got it. And this is how I got into machine learning and I discovered a whole new world.
Host: So, after the internship, how did you end up here?
Layla El Asri: So, after the internship, I did a PhD thesis. I did it in France as well and that’s when I started working on dialogue systems because I was working with a telecommunication company in France called Orange and they were very much interested in dialogue systems for customer service. So, I did my PhD with them and I worked on machine learning models, and more specifically reinforcement learning, actually, for dialogue systems. And after that, I heard about this Canadian start-up called Maluuba and they were really interested in doing research on dialogue systems and it just worked out. I applied for the job and I got it and then I moved to Montreal and it was a really good choice because Montreal is growing so much. Many labs have come here. The AI labs at the universities are growing in an incredible way, so I just decided to stay in Montreal. And then Microsoft acquired Maluuba and this is how I ended up here at Microsoft Research Montreal.
Host: And here we are today teaching machines to speak.
Layla El Asri: Yes.
Host: You know, this particular field of research seems like one that needs a really broad spectrum of expertise, not the least of which is language or linguistics.
Layla El Asri: Yes.
Host: So, as we close, give us a picture of who might be a good fit in this kind of research. People that might be interested or inspired by what you are doing and say, but I don’t have this background or that background. What would you say to those people that might be interested in getting involved here?
Layla El Asri: Dialogue systems involve a lot of skills, and we were only capable of building dialogue systems in the first place because of the progress in different fields: in automatic speech recognition, in linguistics. We had better machine learning models. We have more compute power, so it was really when everything came together that we were capable of building dialogue systems. So, I would say to anybody, if you have a background in machine learning, in linguistics, in natural language processing or even in building more efficient algorithms, then you can contribute to dialogue systems for sure because we need all of this to be working very well and we need to make progress in all of those fields if we want to make better dialogue systems.
Host: So, what would be an encouraging path for a potential dialogue systems researcher to take?
Layla El Asri: Oh, that’s a great question. There is this one dialogue system that was built a few years ago by a student at Stanford University and this dialogue system helps people fight parking tickets. And it’s been very successful because nobody knows how to fight a parking ticket. So, a dialogue system is very useful for that. You can just say I got a parking ticket in this part of town. Can you help me with this? And so, I was saying dialogue systems have been used for accessibility to technology. That’s a great example. That’s accessibility to knowledge. But it would be great to be able to have a dialogue about it, so you could ask about some legal procedure and then get clarification about it and then continue learning about what you should do. So, this is a path forward that I’m excited about. I call it accessibility to knowledge, having a dialogue system that can speak about documents that are not necessarily very easy to understand. So, that means a lot of things. That means being able to understand the documents and then being able to explain them, so use simpler words most of the time, so that users can use this knowledge and then do whatever they want to do with it. This is something that I think is going to get better and better in the future and I’m very excited about it.
Host: Layla El Asri, thank you so much for coming on the podcast.
Layla El Asri: Thank you for having me.
To learn more about Dr. Layla El Asri and the quest for talking machines, visit Microsoft.com/research.