6 Lessons Learned from 660M+ Bot Conversations

What would you learn from talking to millions of people? What can a chatbot learn? Since its release in 2014, Microsoft’s chatbot XiaoIce (“Little Ice”) has communicated with over 660 million users, and over the last 5 years it has developed 230 different “skills” ranging from answering questions to recommending movies. The team also tried to give XiaoIce a personality, which they defined as “a characteristic set of behavior, cognition, and emotional patterns” and modeled after an 18-year-old girl who is reliable, sympathetic, affectionate, and has a wonderful sense of humor (ahem... as defined by the research team; see Exhibit A for the researchers’ example of a good sense of humor).

The team behind XiaoIce recently published a paper reviewing the nuances of natural language processing (NLP), how they optimized their bot, and the chatbot experience overall, among many other conversational strategies and frameworks. Here are some of the interesting lessons I discovered (and tried to break down for conversational marketers):

1.)    Choosing a Metric of Success: Conversation-turns Per Session (CPS)

XiaoIce is a social chatbot with the goal of engaging in empathetic conversations with humans. The success metric the team chose is Conversation-turns per Session (CPS): the average number of conversation turns between the chatbot and the user in a conversational session. The larger the CPS, the better the bot. How does this work? A human engages with the chatbot, the chatbot responds, and it is rewarded when the user responds again. The chatbot then evaluates the state of the conversation and creates another prompt, and it is rewarded again if the user responds.
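As a rough illustration of how CPS could be computed from chat logs, here is a minimal Python sketch. The log format, the `sessions` sample data, and the rule that one turn equals one user message answered by the bot are all assumptions made for this example; they are not taken from the paper.

```python
from statistics import mean

# Hypothetical log format: each session is a list of (speaker, text) messages.
sessions = [
    [("user", "Hi"), ("bot", "Hello! How are you?"),
     ("user", "Good"), ("bot", "Glad to hear it!")],
    [("user", "Weather?"), ("bot", "Sunny today.")],
]

def conversation_turns(session):
    """Count turns, assuming one turn = one user message answered by the bot."""
    return sum(1 for speaker, _ in session if speaker == "user")

def cps(sessions):
    """Average number of conversation turns per session."""
    if not sessions:
        return 0.0
    return mean(conversation_turns(s) for s in sessions)

print(cps(sessions))
```

With the two sample sessions above (2 turns and 1 turn), the average works out to 1.5 turns per session.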

Having built hundreds of chatbots and seen the results with the team, we know that CPS has some weaknesses as a success metric. When we originally built Instabot, we also tracked CPS. However, we saw that while CPS may show the amount of engagement, it does not always show the user’s satisfaction with that engagement. CPS can be high without the conversation being effective: users may be testing the limits of the available conversations, or they may be looking for specific information or affirmation, failing to find it, and continuing to search (hence longer conversations). We found that the conversations where the user was most satisfied were often relatively short, giving the person the information or action they needed (e.g. buying a movie ticket).

(Side note: It was interesting that the paper never considered whether a certain type or demographic of person tends to have a high CPS. Is it the bot that drives the conversation, or a type of human who needs affirmation from a bot that they cannot get elsewhere in their daily life? This is unknown.)

CPS is an interesting metric, but when using it as a success metric, it should be paired with additional information.
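One way to pair CPS with additional information is to report it next to a goal-completion rate. The sketch below is only illustrative: the `Session` record, its `goal_completed` flag, and the sample numbers are hypothetical, not Instabot's actual data model.

```python
from dataclasses import dataclass

@dataclass
class Session:
    turns: int            # conversation turns in the session
    goal_completed: bool  # hypothetical flag, e.g. ticket bought or demo booked

def summarize(sessions):
    """Report CPS together with the share of sessions that reached a goal."""
    n = len(sessions)
    if n == 0:
        return {"cps": 0.0, "completion_rate": 0.0}
    return {
        "cps": sum(s.turns for s in sessions) / n,
        "completion_rate": sum(s.goal_completed for s in sessions) / n,
    }

# Made-up sample: the long sessions are the ones that did not reach a goal.
sessions = [Session(3, True), Session(12, False), Session(2, True), Session(15, False)]
report = summarize(sessions)
print(report)
```

In this made-up sample the CPS is 8.0, yet only half of the sessions reached a goal, and the longest sessions are the unsuccessful ones. CPS alone would hide that.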

2.)    Diversity of Conversation Makes Conversations More Successful

The XiaoIce team’s research showed that promoting diversity of conversation made conversations more successful. For example, XiaoIce was able to have 3 types of conversation:

a.) General Chat: this is non-specific chit-chat, e.g. How’s it going?
b.) Domain Chat: conversation based on a particular topic, e.g. music
c.) Skills: task-oriented requests, e.g. buying a movie ticket

The team found that switching among general chat, domain chat, and skills allowed the bot to attain a higher CPS. This is also fairly intuitive: people are more interested when the topics you discuss are more diverse.

We find the same for the bots on our platform. The more topics you can cover (in an organic way), the better the conversation. We find that even users in a business context seeking a specific action or information will engage favorably with additional content when prompted by the bot, and those who do engage with additional content are more likely to achieve one of our success metrics (email given, trial started, or demo booked) in the process.

3.)    Exploring vs. Exploiting

The team found that in order to achieve a higher CPS, they needed to find a delicate balance of getting new information to train the bot versus using language they already knew would be successful. This is particularly helpful when thinking about using natural language processing (NLP) in your bot. Let me explain.

XiaoIce would improve its NLP by learning through a trial-and-error process called “exploring”. As it learned, it would develop “knowledge” of effective conversational tactics, which the team termed “exploit conversation”. The team didn’t want the bot to stop learning, but in order to achieve a high CPS they needed to use tactics that were already successful. So finding a balance between gathering more information to create new “exploit conversation” and leveraging known “exploit conversation” was a challenge.

This is a common challenge for any bot builder. When do you collect free-text data to learn from (for the purpose of NLP), and when do you direct users to conversations you know are effective (such as through an effective decision tree)? There are no right answers, and this is still more of an art than a science. Currently, we recommend that all of our clients have at least one conversation path for “exploring conversation”.
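To make the trade-off concrete, here is a minimal sketch of epsilon-greedy selection, a standard textbook strategy for balancing exploring and exploiting. It is not the XiaoIce team's method. The class name, the prompt labels, and the reward signal (1.0 if the user replied, 0.0 otherwise) are assumptions made for illustration.

```python
import random

class EpsilonGreedyPrompts:
    """Pick a prompt: explore at random with probability epsilon, else exploit."""

    def __init__(self, prompts, epsilon=0.1, rng=None):
        self.epsilon = epsilon
        self.rng = rng or random.Random()
        self.counts = {p: 0 for p in prompts}    # times each prompt was rewarded/updated
        self.values = {p: 0.0 for p in prompts}  # running mean reward per prompt

    def choose(self):
        if self.rng.random() < self.epsilon:
            return self.rng.choice(list(self.counts))  # explore: try any prompt
        return max(self.values, key=self.values.get)   # exploit: best known prompt

    def update(self, prompt, reward):
        """Fold a new reward (assumed: 1.0 if the user replied) into the mean."""
        self.counts[prompt] += 1
        n = self.counts[prompt]
        self.values[prompt] += (reward - self.values[prompt]) / n

# Hypothetical prompt labels for illustration.
policy = EpsilonGreedyPrompts(["ask_about_music", "offer_demo"], epsilon=0.1)
chosen = policy.choose()
policy.update(chosen, reward=1.0)
```

A higher `epsilon` collects more new data at the cost of showing users less proven prompts. A lower value leans on what already works.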

4.)    Creating a Hierarchical Dialogue Policy

XiaoIce used a hierarchical policy to manage conversations. The conversation went as follows: First, the bot would start with “Core Chat”, defined as general chitchat (e.g. “How are you?”). Over time the bot learned “Domain Chat”, or particular topics such as music and movies. If the user typed something covered by a domain chat (e.g. “have you seen a movie lately?”), the bot would automatically switch to that domain chat (e.g. domain conversations about movies). Likewise, if the user said “What is the weather?”, a phrase associated with a skill, XiaoIce would tell the user the weather. One of the lessons the researchers learned is that switching too often negatively impacts the conversation, and that it is key to find the right time to switch topics. This is true for any bot builder: when to switch topics, and which topics to switch to, need to be optimized over time and with careful analysis.
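The two-level idea can be sketched in a few lines of Python. This is a deliberately simplified, keyword-based illustration, not how XiaoIce implements its policy. The trigger tables and the canned placeholder replies are hypothetical.

```python
import re

# Hypothetical trigger tables; a production bot would use trained classifiers.
SKILL_TRIGGERS = {"weather": "weather_skill"}
DOMAIN_KEYWORDS = {"movie": "movies", "music": "music"}

def top_level_policy(user_text):
    """Top level: choose a mode (skill, domain chat, or core chat)."""
    words = re.findall(r"[a-z']+", user_text.lower())
    for word in words:
        if word in SKILL_TRIGGERS:
            return ("skill", SKILL_TRIGGERS[word])
    for word in words:
        if word in DOMAIN_KEYWORDS:
            return ("domain", DOMAIN_KEYWORDS[word])
    return ("core", None)

def respond(user_text):
    """Lower level: produce a reply within the chosen mode (canned placeholders)."""
    mode, name = top_level_policy(user_text)
    if mode == "skill":
        return f"[running {name}]"
    if mode == "domain":
        return f"[chatting about {name}]"
    return "[general chitchat]"

print(respond("Have you seen a movie lately?"))
```

This sketch switches modes the moment a keyword appears. As the lesson above notes, a real policy would also need to decide whether now is a good time to switch at all.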

5.)    Triggering Changes in Conversation: Short Responses

Related to the topic of changing conversations, the researchers found that good triggers for switching topics are short, bland user inputs such as “OK”, “I see”, and “Go on”. This is helpful as you start using conditional logic and NLP in our platform, Instabot. These inputs can be triggers either to change topics or to send a notification to connect the user to a real person via live chat.
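A minimal sketch of this kind of trigger is shown below. The list of bland inputs comes from the examples above. The action names and the rule of handing off to a human after two bland replies in a row are assumptions made for illustration, not findings from the paper or Instabot features.

```python
import string

BLAND_INPUTS = {"ok", "i see", "go on"}
HANDOFF_AFTER = 2  # assumed threshold: bland replies in a row before live chat

def is_bland(user_text):
    """True if the input, ignoring case and edge punctuation, is a bland reply."""
    normalized = user_text.strip().lower().strip(string.punctuation + " ")
    return normalized in BLAND_INPUTS

def next_action(user_text, bland_streak):
    """Return (action, new_streak) for the latest user input."""
    if not is_bland(user_text):
        return ("continue", 0)
    streak = bland_streak + 1
    if streak >= HANDOFF_AFTER:
        return ("handoff_to_human", streak)
    return ("switch_topic", streak)

action, streak = next_action("OK.", 0)
print(action, streak)
```

The first bland reply leads to a topic switch, and a second one in a row escalates to a person. The threshold is a knob to tune against your own conversation data.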


6.)    Pairing Data Sources for Training Data

The researchers also found that a single source of training data was often less helpful than a diversity of sources. For example, when training the bot on “Domain Chat” (chat related to specific topics such as movies and music), the team used training data from conversations as well as content from the internet. This allowed the training to be accurate from a personal perspective while also capturing the topical nature that is only available in internet news.

The lesson here is that one source of training data for your NLP might not be enough, and that when leveraging NLP, you should seek to diversify sources over time to improve the accuracy and power of your bot.

Exhibit A: Example Provided by the Researchers of XiaoIce’s Sense of Humor


Want to use some of these new skills? Log in to Instabot.

Again, I highly recommend reading the full report here. Still have questions? Email us at info@instabot.io