6 Technical Approaches For Building Conversational AI

How do you order toilet paper online?

If you were using a modern graphical user interface (GUI), you would:

1. Go to your computer

2. Open up a browser

3. Type in Amazon, then type “toilet paper” into the search window

4. Suffer analysis paralysis over the gazillion selections that pop up

5. Make a choice but then be confronted with more choices over how
many packs to get

6. Sign in if you haven’t already

7. Put in your payment information if you haven’t already

8. Be confronted with yet another choice on whether you should
subscribe for regular deliveries or go for a one-time order

9. Review your order details

10. Confirm your order, then question that decision in a spectacular
display of buyer’s remorse.

Or, you can avoid the struggle and just tell your Amazon Echo to order
you some toilet paper.


The rise of conversational AI has been made possible by recent
breakthroughs in human-parity-level speech detection and smarter
sentiment analysis. Though humans have been speaking and writing
for a lot longer than they’ve been using GUIs, systems that rely on
language as their medium of interaction are difficult to build because
the computer has to be able to handle user commands that are ambiguous
or hard to interpret.

There are countless commercial applications for conversational AI.
Conversational AI powers the customer engagement in chatbots, voice
experiences, and digital assistants like Google Assistant, Siri, and
Amazon’s Alexa. Popular user-facing applications include brand
engagement and storytelling, like Disney’s hugely successful
reengagement campaign for Zootopia, and customer service, where
conversational AI has lowered the cost per service ticket while
scaling up the number of customer support requests a business can
handle. You can browse our bot directory for more inspiration on how
conversational AI can be used in business.

This article provides an overview of the six primary ways that
conversational AI systems are built today, including both traditional
approaches and novel, state-of-the-art techniques.


Let’s say that you live in a parallel universe where conversational AI
already exists but bots for the commercial trafficking of toilet paper do
not.

One of the first decisions that you’d need to make is how your bot will
process dialogue inputs and produce replies, with each option taking a
potentially different approach to NLP and NLU. Most current
production systems use rule-based or retrieval-based methods, while
generative methods, grounded learning, and interactive learning are
active areas of research.

Rule-based systems are built on a predefined hierarchy of rules that
govern how to transform user input into output dialogue or actions.
Rules can range from simple to complex, and a rule-based system is
relatively straightforward to create. However, these systems aren’t able
to respond to input patterns or keywords that don’t match existing
rules.


Remember Microsoft DOS and how painful it was to use? MS-DOS and
other terminal interfaces are actually examples of rule-based
conversational interfaces. Though the user has to learn a terse and
difficult-to-learn array of commands, the system responds in a
predictable manner if the user provides the correct command. As older
users may recall, MS-DOS offered no error handling; if the
commands and associated syntax weren’t entered exactly as directed,
then the system simply threw an error message and did nothing.

Rule-based conversational systems don’t have to suck. Eliza, an MIT
chatbot created in the 1960s, fooled many users into thinking that it
was a real therapist with its sophisticated rule-based dialogue
generation. Eliza first scanned the input text for keywords, assigned
each keyword a programmer-designated rank, decomposed and
reassembled the input sentence based on the highest-ranking keyword,
and if it encountered remarks that didn’t match any known keyword,
prompted the user to provide more input (“Tell me more about that”).
Apparently that was enough to make some people think that Eliza was
a better listener than their human acquaintances!
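
The keyword-ranking loop described above can be sketched in a few lines
of Python. This is only an illustration, not Eliza's actual script: the
rules, ranks, and templates below are invented for the example, and the
real program also swapped pronouns (for instance "me" to "you"), which
this sketch skips.

```python
import re

# Hypothetical rules: (keyword, rank, decomposition pattern, reassembly template).
RULES = [
    ("mother", 3, r".*\bmy mother (.*)", "Why do you say your mother {0}?"),
    ("i feel", 2, r".*\bi feel (.*)", "How long have you felt {0}?"),
    ("i am", 1, r".*\bi am (.*)", "Why do you think you are {0}?"),
]
FALLBACK = "Tell me more about that."


def respond(user_input):
    text = user_input.lower().strip(" .!?")
    # Keep only the rules whose keyword appears, highest rank first.
    matches = sorted((r for r in RULES if r[0] in text),
                     key=lambda r: r[1], reverse=True)
    for _keyword, _rank, pattern, template in matches:
        found = re.match(pattern, text)
        if found:
            # Decompose the sentence, then reassemble it around the keyword.
            return template.format(*found.groups())
    # No known keyword matched, so prompt the user for more input.
    return FALLBACK
```

Calling `respond("I feel sad because my mother ignores me.")` picks the
"mother" rule because it has the highest rank, while an input with no
known keyword falls through to the generic prompt.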

Retrieval-based methods power the bulk of production systems in use
today. When given user input, the system uses heuristics to locate the
best response from its database of pre-defined responses. Dialogue
selection is essentially a prediction problem, and using heuristics to
identify the most appropriate response template may involve simple
algorithms like keyword matching or it may require more complex
processing with machine learning or deep learning. Regardless of the
heuristic used, these systems only regurgitate pre-defined responses
and do not generate new output.
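
As a rough sketch of the simplest case, keyword matching, the snippet
below scores each stored prompt by word overlap (Jaccard similarity)
and returns the paired response. The response table, the threshold, and
the fallback message are all assumptions made up for this example; a
production system would typically use a learned ranking model instead.

```python
import re

# Hypothetical database of (stored prompt, pre-defined response) pairs.
RESPONSES = [
    ("order toilet paper", "Sure, how many packs would you like?"),
    ("track my order", "Your order is on its way."),
    ("cancel my order", "Okay, I have cancelled that order."),
]


def tokens(text):
    # Lowercase the text and split it into a set of words.
    return set(re.findall(r"[a-z']+", text.lower()))


def jaccard(a, b):
    # Shared words divided by all words seen in either set.
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def retrieve(user_input, threshold=0.2):
    query = tokens(user_input)
    # Pick the stored prompt with the highest overlap score.
    score, response = max((jaccard(query, tokens(p)), r) for p, r in RESPONSES)
    return response if score >= threshold else "Sorry, I did not understand that."
```

Note that the function can only ever return one of the stored strings
or the fallback, which is exactly the limitation described above.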

Mitsuku, one of the world’s most popular open domain conversation
chatbots, contains over 300,000 hand-coded AIML response patterns
and a knowledge base containing over 3,000 objects. Using these
response patterns, Mitsuku can even construct poems and songs about
a given topic. Here’s a rhyme she created about chatbots.

Retrieval-based systems need a lot of data preprocessing and custom
application logic. For example, the original IBM Watson was
built for the sole purpose of competing on Jeopardy!, and it had
sophisticated modules to preprocess questions, generate answers, and
score hypotheses.

However, all of these functionalities require specific domain expertise
and a lot of hand engineering. This also means that their knowledge
bases can quickly become outdated and have to be manually updated,
in turn limiting their ability to adapt to new domains, languages, or use
cases. All of these obstacles also make retrieval-based systems difficult
to personalize or scale.

Overcoming the limitations of the previous two approaches requires
that the conversational AI be smart enough and creative enough to
generate new content. Instead of drawing upon pre-defined responses,
conversational AI that uses generative methods is given a large amount
of conversational training data in order to learn how to generate new
dialogue that resembles it.

While Mitsuku or other retrieval-based systems must follow a series of
steps to prepare data and to define functionality, generative models
are trained end-to-end rather than step-by-step. In other words,
developers should be able to build this type of conversational AI using
only machine learning and training data, and it would require no
manual engineering or domain expertise. In the ideal case, these
features make the system much more scalable and adaptable in the
long run.

Supervised learning, reinforcement learning, and adversarial learning
currently dominate in the building of generative systems, and
developers can combine all three approaches to do multi-step training
for conversational agents.

Supervised learning frames conversation as a sequence-to-sequence
problem, where user input is mapped to a computer-generated
response. However, sequence-to-sequence learning tends to favor
generic, high-probability responses (e.g., “I don’t know”).
Such systems also have trouble incorporating proper nouns into their
speech because proper nouns occur at a much lower frequency in
dialogue than other classes of words. All of these issues add up to
systems that are boring and repetitive to talk to and likely would not
promote sustained human engagement.


To address that issue, developers augment supervised learning with
reinforcement learning, which models how agents should take actions
that optimize for some cumulative reward, which in this case would be
sustained human interest and engagement with the conversational AI.

Though it sounds promising, Andrew Ng ranked reinforcement learning
dead last in terms of its ability to provide economic value to businesses.
Reinforcement learning works well when the decision process can be
modeled as a Markov Decision Process (MDP), in which all of the
information that the system needs to choose the optimal next action is
contained in the present state, making the preceding states irrelevant
to the decision-making process. While Go is a great example of a finite
game of perfect information, conversation is not, and there is no
guarantee even after simulating sample responses millions and millions
of times that a system will come close to generating a “perfect” or even
an “acceptable” response.

The compromise for conversational AI was to model the decision-making
process as a partially observable Markov decision process (POMDP), in
which system dynamics are determined by an MDP, but the actual state
cannot be observed directly. Instead, the agent observes the system’s
current conditions and then formulates a probabilistic belief about
what the system’s state may be. However, finding the optimal policy in
a POMDP is generally considered to be a very difficult problem, which
also makes this alternative impractical for commercial deployment.
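
To make the idea of a probabilistic belief concrete, here is a minimal
sketch of the standard belief update (a Bayes filter) for a toy
dialogue problem. The two hidden user goals, the observed words, and
every probability in the tables are invented for illustration, and the
sketch ignores the agent's own actions to stay short. It shows only how
the belief is tracked; finding a good policy on top of that belief is
the hard part mentioned above.

```python
# Hypothetical hidden states: what the user actually wants.
STATES = ["order", "cancel"]

# TRANSITION[s][s2]: probability that the user's goal moves from s to s2
# between turns (assumed values).
TRANSITION = {
    "order": {"order": 0.9, "cancel": 0.1},
    "cancel": {"order": 0.1, "cancel": 0.9},
}

# OBSERVATION[s][o]: probability of hearing word o when the true goal is s
# (assumed values).
OBSERVATION = {
    "order": {"buy": 0.7, "stop": 0.1, "other": 0.2},
    "cancel": {"buy": 0.1, "stop": 0.7, "other": 0.2},
}


def update_belief(belief, observation):
    # Predict step: push the old belief through the transition model.
    predicted = {
        s2: sum(belief[s] * TRANSITION[s][s2] for s in STATES) for s2 in STATES
    }
    # Correct step: weight each state by how well it explains the observation.
    unnormalized = {s: OBSERVATION[s][observation] * predicted[s] for s in STATES}
    total = sum(unnormalized.values())
    # Normalize so that the belief is a probability distribution again.
    return {s: p / total for s, p in unnormalized.items()}


belief = {"order": 0.5, "cancel": 0.5}
belief = update_belief(belief, "buy")
```

Starting from an even split, hearing the word "buy" shifts most of the
probability mass toward the "order" goal without ever revealing the
true state directly.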


The solution for contemporary production systems is to split training
into two stages. In the observational phase, the conversational model
uses supervised learning on existing dialogue to imitate human
behavior. Then, during the trial-and-error phase, the model uses
reinforcement learning to adapt to new situations and dialogue inputs
that did not exist in the training data.

Adversarial learning has also been used to improve neural dialog
output. With adversarial training, conversational agents learn via a
miniature Turing Test where a generator network creates plausible
human-like responses while a discriminator network judges whether
they are real human conversations or computer-generated output.

While adversarial methods have worked well for images (such as with
the use of GANs, generative adversarial networks), they aren’t as
productive for use in dialog systems. Unlike pixel values, words are
discrete and cannot be infinitesimally perturbed, so the
discriminator’s gradient cannot be passed back to the generator
directly.

Additionally, teaching our conversational agents to mimic humans
may not be the ideal training approach. Anyone who has ever observed
two grown men getting into a Twitter fight knows that even
human-level intelligence is not always a sufficient condition for
productive conversation.

Recent, state-of-the-art conversational AI such as Alexa Prize bots,
which were designed to be conversational bots that could talk about
any subject (a very difficult problem!), have been built with ensemble
methods, which use some combination of rule-based, retrieval-based,
and generative approaches as dictated by context.

They may use a rule-based approach to sing a song, a retrieval-based
approach to talk about the news, and a generative approach to handle
other, unspecified use cases. The most advanced systems use
hierarchical reinforcement learning, which uses a low-level dialogue
policy to address the immediate task, while a higher-level policy
coordinates model selection or other strategic goals.
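
The routing idea can be sketched as a small dispatcher. Everything
below is a placeholder: the three responder functions stand in for real
rule-based, retrieval-based, and generative components, and the
selector is a hand-written keyword check rather than the learned
higher-level policy that the most advanced systems use.

```python
def rule_based(text):
    # Placeholder for a scripted behavior such as singing a song.
    return "La la la, here is your song!"


def retrieval_based(text):
    # Placeholder for looking up a stored answer such as a news item.
    return "Here is today's top headline."


def generative(text):
    # Placeholder for a trained generative model.
    return "Interesting, tell me more about " + text.rstrip("?.!") + "."


def select_model(text):
    # Stand-in for the higher-level policy: choose a responder by context.
    lowered = text.lower()
    if "sing" in lowered:
        return rule_based
    if "news" in lowered:
        return retrieval_based
    return generative


def reply(text):
    # The chosen lower-level component produces the actual response.
    return select_model(text)(text)
```

Keeping selection separate from response generation means each
component can be replaced or retrained without touching the others.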

Though the use of ensemble methods seems promising from a
methodological perspective, two-star reviews on Amazon suggest that
conversational AI systems that use this approach still have a long way
to go before they can replace humans for your conversational needs.

Human dialogue relies extensively on context and external knowledge.
For example, if you told a chatbot that you were going to the Swan
Oyster Depot, that chatbot would probably recognize Swan Oyster
Depot as a restaurant, possibly a seafood restaurant, and it may tell
you to have a good time. Telling a local may result in a
recommendation for the Sicilian sashimi, but telling someone who
watches a lot of CNN may instead get you a monologue about Anthony
Bourdain’s fervent love of the place.

Your human conversational partner drew upon personal knowledge to
tell you something novel. The chatbot would probably not have,
because that data, like the bulk of human knowledge, was probably not
in its training dataset. That inability to incorporate real-world
knowledge also means that generative models are still very bad at
creating useful or meaningful chatter.

Most human knowledge does not reside in structured datasets and
continues to exist as vast quantities of unstructured data, in the form
of text and images. Mitsuku, which has won the Loebner prize three times
for being the most “human-like” chatbot, is interesting to talk to
because its dialogue can draw upon related knowledge about subjects
in its knowledge base. While generative models may be more inventive
when creating dialogue, Mitsuku is actually better “grounded” because
of its ability to learn and to use real-world knowledge representations.

The problem becomes more difficult if logical reasoning is involved. If
you asked your conversational AI to identify the piece of sashimi next
to the green leaf in a picture, it probably wouldn’t be able to do so.
What is a fairly intuitive process for humans is difficult for the AI,
as it would have to (1) identify which object is the green leaf,
(2) know what sashimi is, (3) understand the concept of “next to,”
(4) identify the correct piece of sashimi if there were several on the
plate, and (5) match the texture and color of that piece of sashimi to
the correct fish candidate.

Grounded learning is an area of active research. A potential solution to
the sashimi identification task above is to use a modular neural network
architecture. Much like how the sashimi identification task was broken
into its conceptual parts above, small neural networks that each
understand a single concept are each assigned to one component of the
task. As the supervising system parses the input sentence, it generates
a larger neural network on the fly that is customized to that particular
sentence and task.


Grounded learning still faces many problems and challenges, one of
which is the challenge of accessing knowledge bases in the context of
end-to-end differentiable training for neural networks. For
backpropagation to be used to train an entire network, the
mechanisms which access external knowledge bases must also be fully
differentiable.

Novel architectures, such as Neural Turing Machines, employ fully
differentiable addressing mechanisms to enable neural networks to
access and manipulate external memory. In the coming years, we
expect to see increased integration between neural networks and
knowledge graphs to enable relevant references while maintaining the
scalable, data-driven approach of neural dialog models.
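
One such mechanism is content-based addressing, where the network
emits a key vector and reads a blend of memory rows weighted by how
similar each row is to the key. The sketch below uses cosine similarity
followed by a softmax, which keeps every step smooth and therefore
differentiable. The memory contents, the key, and the sharpness value
`beta` are assumptions for the example, and a real implementation would
use a tensor library with automatic differentiation rather than plain
lists.

```python
import math


def cosine(u, v):
    # Cosine similarity between two vectors of equal length.
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def content_address(memory, key, beta=5.0):
    # Score each memory row against the key; beta sharpens the focus.
    scores = [beta * cosine(row, key) for row in memory]
    # Softmax turns the scores into soft attention weights that sum to 1.
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def read(memory, weights):
    # The read vector is a weighted average of all memory rows.
    width = len(memory[0])
    return [sum(w * row[i] for w, row in zip(weights, memory)) for i in range(width)]


memory = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
weights = content_address(memory, key=[1.0, 0.0])
vector = read(memory, weights)
```

Because every row receives a nonzero weight instead of a hard lookup,
a small change in the key produces a small change in what is read,
which is what lets gradients flow through the memory access.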

Language is inherently interactive. Humans use language to facilitate
cooperation when they need to solve problems together, and
practical needs influence how language continues to develop.

For conversational AI, interactive learning remains an area of active
study despite decades of continued development. Terry Winograd’s
SHRDLU (1968-1970) and Percy Liang’s more modern version
SHRDLURN (2016) are two examples of simple cooperative learning
games.


In SHRDLURN, the human operator knows the desired goal of the game
but has no direct control over the game pieces; the computer has
control but does not understand language. The human player’s goal is
to iteratively instruct the computer to map language to concepts until it
can perform the correct actions to complete the task.
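
A toy version of this loop is sketched below. It is not the model used
in SHRDLURN: the action names are hypothetical, and the learner is a
simple perceptron-style scorer over (word, action) pairs that starts
with no knowledge of language and only improves through the human's
corrections.

```python
from collections import defaultdict


class Learner:
    def __init__(self, actions):
        self.actions = list(actions)
        # (word, action) -> score; everything starts at zero.
        self.weights = defaultdict(float)

    def score(self, utterance, action):
        return sum(self.weights[(w, action)] for w in utterance.lower().split())

    def guess(self, utterance):
        # Pick the action whose words score highest; ties go to the first.
        return max(self.actions, key=lambda a: self.score(utterance, a))

    def feedback(self, utterance, correct_action):
        # The human indicates which action was actually wanted.
        guessed = self.guess(utterance)
        for w in utterance.lower().split():
            self.weights[(w, correct_action)] += 1.0
            if guessed != correct_action:
                # Penalize the words' link to the wrong guess.
                self.weights[(w, guessed)] -= 1.0


learner = Learner(["remove_red", "add_blue"])
learner.feedback("remove red", "remove_red")
learner.feedback("add blue", "add_blue")
```

After these two rounds of feedback, the learner maps "add blue block"
to the `add_blue` action even though it has never seen the word
"block", because the words it has seen carry enough signal.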

Again, though this task looks intuitive to humans, it is hard for
computers because they have no prior conception of language. The
computer does not understand the difference between red and blue, or
whether an object is a pyramid or a cube.


As it turns out, the actual language that human players used to teach
the computer was less important than their ability to issue
clear and consistent commands. Based on his experiences with
SHRDLURN, Liang observed, “How do we represent knowledge, context,
memory? Maybe we shouldn’t be focused on creating better models,
but rather better environments for interactive learning.”

For a more detailed overview of interactive learning, read my article on
approaches in natural language processing and understanding.


Conversational UI is becoming increasingly ubiquitous in everyday life.
This trend will only become more pronounced as we become more used to
talking to our phones and intelligent speakers. Eventually we expect
all graphical user interfaces to be replaced or augmented by
conversational agents.

The bulk of production conversational AI systems that power chatbots,
digital assistants, and customer support experiences use
retrieval-based methods, as do most third-party platforms that enable
you to develop conversational bots quickly.

If you want to use a more novel approach, such as grounded or
interactive learning, you’ll likely be limited by the research capabilities
of your machine learning engineering team, since these are less proven
R&D directions.

Deciding on a technical approach is only the first step to building a
successful bot. Unfortunately, many chatbots with impressive
architectures still fall prey to common user experience failures or fail to
perform against your important user engagement metrics.

In my extended talk below on “The State of Conversational Artificial
Intelligence”, I give a technical overview of the approaches highlighted
above but also dive into important business and design considerations
that you need to weigh when building successful conversational AI.

