Problem 1 Proposal


[GEEK-AI-MANIA ‘18 COMPETITION

WRITE-UP TEMPLATE V1.0]


[College Details]

[Member 1] – [Role]

[Member 2] – [Role]

Name Document Revision Date Comments


Contents
1. Abstract
2. Problem Statement
3. Background Search
4. Approach to the Solution
5. Solution Description
6. Technology Stack (Software and Hardware)
7. Technical Solution Architecture
8. Validate Model & Result
9. Results & Trained Model
10. Future Scope of Solution
11. Conclusion
12. Appendix (as applicable)
Abstract
Chatbots are “computer programs which conduct conversation through auditory or textual
methods”. Apple’s Siri, Microsoft’s Cortana, Google Assistant, and Amazon’s Alexa are four of
the most popular conversational agents today. They can help you get directions, check the
scores of sports games, call people in your address book, and can accidentally make you order a
$170 dollhouse. These products all have auditory interfaces, where the agent converses with you
through audio messages. In this write-up, we look at chatbots that operate solely on the
textual front. Facebook has been investing heavily in Messenger bots, which allow small
businesses and organizations to create bots to help with customer support and frequently asked
questions. Chatbots have been around for a decent amount of time (Siri was released in 2011), but
only recently has deep learning become the go-to approach to the task of creating realistic and
effective chatbot interaction. In this work, we have used a Sequence to Sequence
(Seq2Seq) model built on recurrent neural networks (RNNs). The Seq2Seq method is
implemented using the TensorFlow library in Python.

Problem Statement
From a high level, the job of a chatbot is to be able to determine the best response for any
given message that it receives. This “best” response should either (1) answer the sender’s
question, (2) give the sender relevant information, (3) ask follow-up questions, or (4) continue
the conversation in a realistic way. This is a pretty tall order. The chatbot needs to be able to
understand the intentions of the sender’s message, determine what type of response message
(a follow-up question, direct response, etc.) is required, and follow correct grammatical and
lexical rules while forming the response.

It is safe to say that modern chatbots have trouble accomplishing all these tasks. For all the
progress we have made in the field, we still too often get frustrating chatbot experiences.

Chatbots too often fail to understand our intentions, have trouble getting us the
correct information, and are sometimes just exasperatingly difficult to deal with. The use of
machine learning concepts is one of the most effective methods for tackling this tough task.

The use of machine learning algorithms also improves the identification of anomalies in real time
at a faster rate. In this work, we have additionally used a machine learning algorithm to detect
anomalies in the UCSD Anomaly Detection Dataset.
Background Search
A recurrent neural network (RNN) is a class of artificial neural network where connections between
nodes form a directed graph along a sequence. This allows it to exhibit temporal dynamic behavior
for a time sequence. Unlike feedforward neural networks, RNNs can use their internal state
(memory) to process sequences of inputs. This makes them applicable to tasks such as unsegmented,
connected handwriting recognition or speech recognition.

The recurrent neural network is represented as shown in the figure above. Each node at a time step
takes an input from the previous node, and this can be represented using a feedback loop. We can
unroll this feedback loop and represent it as shown in the figure below. At each time step, we take an
input x_i and a_(i-1) (the output of the previous node), perform a computation on them, and produce an
output h_i. This output is passed on to the next node. This process continues until all the time
steps are evaluated.

The equations describing how the outputs are calculated at each time step are given below.
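The original equation figure is not reproduced here; as a sketch of the standard RNN formulation in the notation above (the weight matrices W and biases b are assumptions, not taken from the original), the per-time-step computation is:

\[
a_i = \tanh\!\left(W_{aa}\, a_{i-1} + W_{ax}\, x_i + b_a\right), \qquad
h_i = g\!\left(W_{ha}\, a_i + b_h\right)
\]

where g is an output activation (for example softmax), a_i is the hidden state passed to the next time step, and h_i is the output at step i.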
Backpropagation in recurrent neural networks occurs in the opposite direction of the arrows drawn in
the figure above. Like all other backpropagation techniques, we evaluate a loss function and obtain
gradients to update our weight parameters. The interesting part of backpropagation in an RNN is that it
occurs from right to left: since the parameters are updated from the final time steps back to the
initial time steps, this is termed backpropagation through time (BPTT).

Approach to the Solution


In this work, we have used a Seq2Seq model built with RNNs for training.

The Sequence to Sequence (Seq2Seq) model was introduced and has since become the go-to model for dialogue
systems and machine translation. It consists of two RNNs (Recurrent Neural Networks): an encoder
and a decoder. The encoder takes a sequence (sentence) as input and processes one symbol (word)
at each time step. Its objective is to convert the sequence of symbols into a fixed-size feature vector
that encodes only the important information in the sequence while losing the unnecessary
information. You can visualize the data flow in the encoder along the time axis as the flow of local
information from one end of the sequence to the other.
Each hidden state influences the next hidden state, and the final hidden state can be seen as the
summary of the sequence. This state is called the context or thought vector, as it represents the
intention of the sequence. From the context, the decoder generates another sequence, one
symbol (word) at a time. Here, at each time step, the decoder is influenced by the context and the
previously generated symbols.

There are a few challenges in using this model. The most pressing one is that the model cannot
handle variable-length sequences on its own, and almost all sequence-to-sequence
applications involve variable-length sequences. The next one is the vocabulary size: the decoder has
to run a softmax over a large vocabulary of, say, 20,000 words for each word in the output. That is
going to slow down the training process, even if your hardware is capable of handling it.
Representation of words is also of great importance. How do you represent the words in the sequence?
Using one-hot vectors means we need to deal with large sparse vectors due to the large vocabulary, and
there is no semantic meaning encoded into one-hot vectors. Let’s look at how we can face
these challenges, one by one.
Padding
Before training, we work on the dataset to convert the variable length sequences into fixed length
sequences, by padding. We use a few special symbols to fill in the sequence.
1. EOS : End of sentence
2. PAD : Filler
3. GO : Start decoding
4. UNK : Unknown; word not in vocabulary
Consider the following query-response pair.
Q : How are you?
A : I am fine.
Assuming that we would like our sentences (queries and responses) to be of fixed length, 10, this pair
will be converted to:
Q : [ PAD, PAD, PAD, PAD, PAD, PAD, “?”, “you”, “are”, “How” ]
A : [ GO, “I”, “am”, “fine”, “.”, EOS, PAD, PAD, PAD, PAD ]
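As an illustrative sketch (these helper functions are not part of the original code), the conversion above could be done as follows:

# Illustrative padding helpers: queries are reversed and left-padded,
# responses get GO/EOS markers and are right-padded.
PAD, GO, EOS = "PAD", "GO", "EOS"

def pad_query(tokens, length):
    tokens = list(reversed(tokens))
    return [PAD] * (length - len(tokens)) + tokens

def pad_response(tokens, length):
    seq = [GO] + tokens + [EOS]
    return seq + [PAD] * (length - len(seq))

print(pad_query(["How", "are", "you", "?"], 10))
print(pad_response(["I", "am", "fine", "."], 10))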

Bucketing
Introduction of padding did solve the problem of variable length sequences, but consider the case of
large sentences. If the largest sentence in our dataset is of length 100, we need to encode all our
sentences to be of length 100, in order to not lose any words. Now, what happens to “How are you?” ?
There will be 97 PAD symbols in the encoded version of the sentence. This will overshadow the actual
information in the sentence.

Bucketing kind of solves this problem, by putting sentences into buckets of different sizes. Consider this
list of buckets : [ (5,10), (10,15), (20,25), (40,50) ]. If the length of a query is 4 and the length of its
response is 4 (as in our previous example), we put this sentence in the bucket (5,10). The query will be
padded to length 5 and the response will be padded to length 10. While running the model (training or
predicting), we use a different model for each bucket, compatible with the lengths of query and
response. All these models share the same parameters and hence function exactly the same way.
If we are using the bucket (5,10), our sentences will be encoded to :
Q : [ PAD, “?”, “you”, “are”, “How” ]
A : [ GO, “I”, “am”, “fine”, “.”, EOS, PAD, PAD, PAD, PAD ]
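A minimal sketch of how a bucket might be chosen for a query-response pair (this helper is illustrative and not taken from the original code):

# Pick the smallest bucket that fits both the query and the response.
BUCKETS = [(5, 10), (10, 15), (20, 25), (40, 50)]

def pick_bucket(query_len, response_len, buckets=BUCKETS):
    for q_max, a_max in buckets:
        if query_len <= q_max and response_len <= a_max:
            return (q_max, a_max)
    return None  # the pair is too long for every bucket

print(pick_bucket(4, 4))  # -> (5, 10), as in the example above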
Word Embedding
Word embedding is a technique for learning dense representations of words in a low-dimensional vector
space. Each word can be seen as a point in this space, represented by a fixed-length vector. Semantic
relations between words are captured by this technique. The word vectors have some interesting
properties.
paris – france + poland = warsaw.
The vector difference between paris and france captures the concept of capital city.

Word embedding is typically done in the first layer of the network: an embedding layer that maps a word
(an index into the vocabulary) to a dense vector of a given size. In the seq2seq model, the
weights of the embedding layer are jointly trained with the other parameters of the model.
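As a minimal TensorFlow 1.x sketch (the sizes here are hypothetical), an embedding layer is just a trainable matrix indexed by word ids:

import tensorflow as tf  # TF 1.x

vocab_size, embed_dim = 20000, 128
word_ids = tf.placeholder(tf.int32, [None, None], name='word_ids')   # [batch, time]
embedding = tf.get_variable('embedding', [vocab_size, embed_dim])    # trained jointly with the model
word_vectors = tf.nn.embedding_lookup(embedding, word_ids)           # [batch, time, embed_dim]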

Attention Mechanism
One of the limitations of the seq2seq framework is that the entire information in the input sentence has
to be encoded into a fixed-length vector, the context. As the length of the sequence gets larger, we start
losing a considerable amount of information. This is why the basic seq2seq model does not work well when
decoding large sequences. The attention mechanism, introduced in the paper Neural Machine Translation by
Jointly Learning to Align and Translate, allows the decoder to selectively look at the input sequence while
decoding. This takes the pressure off the encoder to encode all the useful information from the input into
the context vector.
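Attention is not wired into the model described in the rest of this write-up, but as a sketch of the idea, in TensorFlow 1.x it can be added by wrapping the decoder cell (the tensor names and sizes below are assumptions):

import tensorflow as tf  # TF 1.x

rnn_size = 128
encoder_outputs = tf.placeholder(tf.float32, [None, None, rnn_size])  # [batch, src_len, rnn_size]
source_sequence_length = tf.placeholder(tf.int32, [None])

# the attention mechanism scores the encoder outputs at every decoding step
attention = tf.contrib.seq2seq.BahdanauAttention(
    num_units=rnn_size,
    memory=encoder_outputs,
    memory_sequence_length=source_sequence_length)
decoder_cell = tf.contrib.rnn.LSTMCell(rnn_size)
attn_cell = tf.contrib.seq2seq.AttentionWrapper(
    decoder_cell, attention, attention_layer_size=rnn_size)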

Seq2seq in TensorFlow
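The original code screenshot is not reproduced here. A minimal sketch, assuming the TF 1.x tf.contrib.legacy_seq2seq module is used (the sizes and placeholder names are hypothetical):

import tensorflow as tf  # TF 1.x
from tensorflow.contrib import rnn, legacy_seq2seq

vocab_size, embed_size, rnn_size, seq_len = 20000, 128, 256, 10

# one int32 placeholder of shape [batch] per time step, as the legacy API expects
encoder_inputs = [tf.placeholder(tf.int32, [None], name='enc_%d' % i) for i in range(seq_len)]
decoder_inputs = [tf.placeholder(tf.int32, [None], name='dec_%d' % i) for i in range(seq_len)]

# embeds the symbols, runs the encoder RNN, and feeds its final state to the decoder RNN
outputs, state = legacy_seq2seq.embedding_rnn_seq2seq(
    encoder_inputs, decoder_inputs, rnn.LSTMCell(rnn_size),
    num_encoder_symbols=vocab_size,
    num_decoder_symbols=vocab_size,
    embedding_size=embed_size,
    feed_previous=False)  # set True at inference time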

Wrapper for seq2seq with buckets
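Again only an assumed sketch (the original screenshot is not reproduced here): tf.contrib.legacy_seq2seq.model_with_buckets builds one copy of the model per bucket while sharing all parameters between them.

import tensorflow as tf  # TF 1.x
from tensorflow.contrib import rnn, legacy_seq2seq

vocab_size, embed_size, rnn_size = 20000, 128, 256
buckets = [(5, 10), (10, 15)]          # (query length, response length)
max_enc, max_dec = buckets[-1]

encoder_inputs = [tf.placeholder(tf.int32, [None], name='enc_%d' % i) for i in range(max_enc)]
decoder_inputs = [tf.placeholder(tf.int32, [None], name='dec_%d' % i) for i in range(max_dec)]
# targets are the decoder inputs shifted left by one step (the model predicts the next symbol)
targets = decoder_inputs[1:] + [tf.zeros_like(decoder_inputs[0])]
weights = [tf.placeholder(tf.float32, [None], name='weight_%d' % i) for i in range(max_dec)]

def seq2seq_f(enc, dec):
    return legacy_seq2seq.embedding_rnn_seq2seq(
        enc, dec, rnn.LSTMCell(rnn_size),
        num_encoder_symbols=vocab_size, num_decoder_symbols=vocab_size,
        embedding_size=embed_size)

# one model per bucket, all sharing the same variables
outputs, losses = legacy_seq2seq.model_with_buckets(
    encoder_inputs, decoder_inputs, targets, weights, buckets, seq2seq_f)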


Solution Description
The entire model is divided into two small sub-models. The first sub-model is called the [E] Encoder, and the
second sub-model is called the [D] Decoder. [E] takes raw input text data just like any other RNN
architecture does. At the end, [E] outputs a neural representation. This is a very typical setup, but you need
to pay attention to what this output really is. The output of [E] is going to be the input data for [D].
That is why we call [E] the Encoder and [D] the Decoder. [E] produces an output encoded in a neural
representational form, and we don’t know what it really is; it is somewhat encrypted. [D] has the ability to
look inside [E]’s output, and from it [D] creates a totally different output sequence.

Fig 1. Neural Machine Translation / Training Phase


In order to build such a model, there are 8 steps overall. I noted which functions to be implemented
relate to each step.
(1) define input parameters to the encoder model
enc_dec_model_inputs
(2) build encoder model
encoding_layer
(3) define input parameters to the decoder model
enc_dec_model_inputs, process_decoder_input, decoding_layer
(4) build decoder model for training
decoding_layer_train
(5) build decoder model for inference
decoding_layer_infer
(6) put (4) and (5) together
decoding_layer
(7) connect encoder and decoder models
seq2seq_model
(8) define loss function, optimizer, and apply gradient clipping
Encoder Input (1), (3)
The enc_dec_model_inputs function creates and returns the parameters (TF placeholders) related to building
the model.
def enc_dec_model_inputs():
    inputs = tf.placeholder(tf.int32, [None, None], name='input')
    targets = tf.placeholder(tf.int32, [None, None], name='targets')
    target_sequence_length = tf.placeholder(tf.int32, [None], name='target_sequence_length')
    max_target_len = tf.reduce_max(target_sequence_length)
    return inputs, targets, target_sequence_length, max_target_len

The inputs placeholder will be fed with English sentence data, and its shape is [None, None]. The
first None means the batch size, and the batch size is unknown since the user can set it. The
second None means the lengths of the sentences. The maximum sentence length differs from batch to
batch, so it cannot be set with an exact number.
• One option is to pad every sentence to the maximum length across the sentences in its batch;
another is to pad to the maximum length across the whole dataset. No matter which method you choose,
you need to add the special <PAD> character in the empty positions. However, with the latter option,
there could be unnecessarily more <PAD> characters.
The targets placeholder is similar to the inputs placeholder except that it will be fed with French sentence data.
The target_sequence_length placeholder represents the length of each sentence, so its shape is [None], a
1-D tensor whose size equals the batch size. This particular value is required as an
argument of TrainingHelper to build the decoder model for training. We will see this in (4).
max_target_len gets the maximum value out of the lengths of all the target sentences (sequences). Since we
have the lengths of all the sentences in the target_sequence_length parameter, the way to get the
maximum value from it is to use tf.reduce_max.

Process Decoder Input (3)


On the decoder side, we need two different kinds of input, for training and inference purposes respectively.
During the training phase, the input is provided as the target labels, but they still need to be embedded. In the
inference phase, however, the output of each time step becomes the input for the next time step. The outputs
also need to be embedded, and the embedding vectors should be shared between the two phases.

Fig 2. <GO> insertion


In this section, I am going to preprocess the target label data for the training phase. It is nothing special:
all you need to do is add the <GO> special token in front of all target data. The <GO> token is a kind of
guide token, saying "this is the start of the translation". For this process, you need to know three
TensorFlow operations.
def process_decoder_input(target_data, target_vocab_to_int, batch_size):
    # get '<GO>' id
    go_id = target_vocab_to_int['<GO>']

    after_slice = tf.strided_slice(target_data, [0, 0], [batch_size, -1], [1, 1])
    after_concat = tf.concat([tf.fill([batch_size, 1], go_id), after_slice], 1)
    return after_concat

TF strided_slice
• extracts a strided slice of a tensor (generalized Python array indexing)
• can be thought of as splitting a tensor with a striding window from begin to end
• arguments: TF Tensor, begin, end, strides
TF fill
• creates a tensor filled with a scalar value
• arguments: dims (an int32/int64 tensor), the value to fill with
TF concat
• concatenates tensors along one dimension
• arguments: a list of TF Tensors (tf.fill and after_slice in this case), axis=1
After preprocessing the target label data, we will embed it later when implementing the decoding_layer
function.
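A tiny numeric illustration (the values are made up) of how the three ops above produce the <GO>-prefixed decoder input:

import tensorflow as tf  # TF 1.x

# drop the last column (the final token of each row) and prepend <GO> (id 1 here)
target_data = tf.constant([[12, 7, 3],
                           [ 4, 9, 3]], dtype=tf.int32)
after_slice = tf.strided_slice(target_data, [0, 0], [2, -1], [1, 1])
after_concat = tf.concat([tf.fill([2, 1], 1), after_slice], 1)

with tf.Session() as sess:
    print(sess.run(after_concat))   # [[ 1 12  7]
                                    #  [ 1  4  9]]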
Encoding (2)

Fig 3. Encoding model highlighted — Embedding/RNN layers


As depicted in Fig 3, the encoding model consists of two different parts. The first part is the embedding
layer: each word in a sentence will be represented with the number of features specified
as encoding_embedding_size. This layer gives a much richer representational power for the words than
one-hot encoding. The second part is the RNN layer(s). You can make use of any kind of RNN-related technique
or algorithm. For example, in this project, multiple LSTM cells are stacked together after the dropout
technique is applied. You can use different kinds of RNN cells, such as GRU.
def encoding_layer(rnn_inputs, rnn_size, num_layers, keep_prob,
                   source_vocab_size, encoding_embedding_size):
    """
    :return: tuple (RNN output, RNN state)
    """
    embed = tf.contrib.layers.embed_sequence(rnn_inputs,
                                             vocab_size=source_vocab_size,
                                             embed_dim=encoding_embedding_size)

    stacked_cells = tf.contrib.rnn.MultiRNNCell(
        [tf.contrib.rnn.DropoutWrapper(tf.contrib.rnn.LSTMCell(rnn_size), keep_prob)
         for _ in range(num_layers)])

    outputs, state = tf.nn.dynamic_rnn(stacked_cells, embed, dtype=tf.float32)
    return outputs, state
Embedding layer
• TF contrib.layers.embed_sequence
RNN layers
• TF contrib.rnn.LSTMCell: simply specifies how many internal units it has
• TF contrib.rnn.DropoutWrapper: wraps a cell with a keep-probability value
• TF contrib.rnn.MultiRNNCell: stacks multiple RNN (type) cells; see the encoding_layer function above
for how this API is used in action
Encoding model
• TF nn.dynamic_rnn: puts the embedding layer and the RNN layer(s) together
Decoding — Training process (4)
The decoding model can be thought of as two separate processes, training and inference. It is not that they
have different architectures: they share the same architecture and its parameters. Rather, they have
different strategies for feeding the shared model. For this (training) and the next (inference) section, Fig 4
clearly shows what they are.

Fig 4. Decoder shifted inputs


While the encoder uses TF contrib.layers.embed_sequence, it is not applicable to the decoder, even though the
decoder also requires its input to be embedded. That is because the same embedding vectors should be shared
between the training and inference phases, and TF contrib.layers.embed_sequence can only embed data that is
prepared before running. What the inference process needs is a dynamic embedding capability: it is impossible
to embed the output of the inference process before running the model, because the output of the current time
step becomes the input of the next time step.
How can we embed it then? We will see soon. For now, what you need to remember is that the training and
inference processes share the same embedding parameters. For the training part, the embedded input should
be delivered. For the inference part, only the embedding parameters used in the training part should be
delivered.

Let’s see the training part first.

• tf.contrib.seq2seq.TrainingHelper
: TrainingHelper is where we pass the embedded input. As the name indicates, this is only a helper
instance. This instance should be delivered to BasicDecoder, which is the actual process of
building the decoder model.
• tf.contrib.seq2seq.BasicDecoder
: BasicDecoder builds the decoder model. That is, it connects the RNN layer(s) on the decoder side
to the input prepared by TrainingHelper.
• tf.contrib.seq2seq.dynamic_decode
: dynamic_decode unrolls the decoder model so that the actual prediction from BasicDecoder can be
retrieved for each time step.
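A minimal sketch of how these three pieces could fit together in the decoding_layer_train function from step (4) (the body below is a reconstruction, not the original code):

import tensorflow as tf  # TF 1.x

def decoding_layer_train(encoder_state, dec_cell, dec_embed_input,
                         target_sequence_length, max_target_len,
                         output_layer, keep_prob):
    dec_cell = tf.contrib.rnn.DropoutWrapper(dec_cell, output_keep_prob=keep_prob)
    helper = tf.contrib.seq2seq.TrainingHelper(dec_embed_input, target_sequence_length)
    decoder = tf.contrib.seq2seq.BasicDecoder(dec_cell, helper, encoder_state, output_layer)
    outputs, _, _ = tf.contrib.seq2seq.dynamic_decode(
        decoder, impute_finished=True, maximum_iterations=max_target_len)
    return outputs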

Decoding — Inference process (5)

• tf.contrib.seq2seq.GreedyEmbeddingHelper
: GreedyEmbeddingHelper dynamically takes the output of the current step and feeds it as the input of the
next time step. In order to embed each output dynamically, the embedding parameters (just a
bunch of weight values) must be provided. Along with them, GreedyEmbeddingHelper asks for
the start_of_sequence_id, repeated batch_size times, and the end_of_sequence_id.
• tf.contrib.seq2seq.BasicDecoder
: same as described in the training process section
• tf.contrib.seq2seq.dynamic_decode
: same as described in the training process section
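Similarly, a reconstruction sketch of the decoding_layer_infer function from step (5):

import tensorflow as tf  # TF 1.x

def decoding_layer_infer(encoder_state, dec_cell, dec_embeddings, start_of_sequence_id,
                         end_of_sequence_id, max_target_len, output_layer,
                         batch_size, keep_prob):
    dec_cell = tf.contrib.rnn.DropoutWrapper(dec_cell, output_keep_prob=keep_prob)
    helper = tf.contrib.seq2seq.GreedyEmbeddingHelper(
        dec_embeddings,
        tf.fill([batch_size], start_of_sequence_id),   # start tokens, repeated batch_size times
        end_of_sequence_id)
    decoder = tf.contrib.seq2seq.BasicDecoder(dec_cell, helper, encoder_state, output_layer)
    outputs, _, _ = tf.contrib.seq2seq.dynamic_decode(
        decoder, impute_finished=True, maximum_iterations=max_target_len)
    return outputs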

Build the Decoding Layer (3), (6)


Embed the target sequences
• TF contrib.layers.embed_sequence creates an internal representation of the embedding parameters, so
we cannot look into or retrieve them. Instead, you need to create an embedding parameter manually
as a TF Variable.
• The manually created embedding parameter is used in the training phase to convert the provided target
data (sequences of sentences) via TF nn.embedding_lookup before the training is run. TF
nn.embedding_lookup with a manually created embedding parameter returns a result similar to that of
TF contrib.layers.embed_sequence. For the inference process, whenever the output of the
current time step is calculated by the decoder, it is embedded by the shared embedding parameter
and becomes the input for the next time step. You only need to provide the embedding parameter
to the GreedyEmbeddingHelper, and it will take care of the process.
• How does embedding_lookup work?
: In short, it selects the specified rows of the embedding matrix.
• Note: Please be careful about setting the variable scope. As mentioned previously,
parameters/variables are shared between the training and inference processes. Sharing can be
specified via tf.variable_scope.
Construct the decoder RNN layer(s)
• As depicted in Fig 3 and Fig 4, the number of RNN layers in the decoder model has to be equal to
the number of RNN layer(s) in the encoder model.
Create an output layer to map the outputs of the decoder to the elements of our vocabulary
• This is just a fully connected layer used to get the probability of occurrence of each word at the end.
A sketch putting these pieces together follows below.
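Putting the pieces of this section together, a sketch of the decoding_layer function (a reconstruction under the assumptions above, reusing the decoding_layer_train and decoding_layer_infer sketches):

import tensorflow as tf  # TF 1.x

def decoding_layer(dec_input, encoder_state, target_sequence_length, max_target_len,
                   rnn_size, num_layers, target_vocab_to_int, target_vocab_size,
                   batch_size, keep_prob, decoding_embedding_size):
    # manually created embedding parameter, shared by the training and inference decoders
    dec_embeddings = tf.Variable(
        tf.random_uniform([target_vocab_size, decoding_embedding_size]))
    dec_embed_input = tf.nn.embedding_lookup(dec_embeddings, dec_input)

    cells = tf.contrib.rnn.MultiRNNCell(
        [tf.contrib.rnn.LSTMCell(rnn_size) for _ in range(num_layers)])
    output_layer = tf.layers.Dense(target_vocab_size)   # maps decoder outputs to vocabulary logits

    with tf.variable_scope("decode"):
        train_output = decoding_layer_train(
            encoder_state, cells, dec_embed_input, target_sequence_length,
            max_target_len, output_layer, keep_prob)
    with tf.variable_scope("decode", reuse=True):        # share variables with the training decoder
        infer_output = decoding_layer_infer(
            encoder_state, cells, dec_embeddings, target_vocab_to_int['<GO>'],
            target_vocab_to_int['<EOS>'], max_target_len, output_layer,
            batch_size, keep_prob)
    return train_output, infer_output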
Build the Seq2Seq model (7)
In this section, the previously defined functions, encoding_layer, process_decoder_input,
and decoding_layer, are put together to build the big picture: the Sequence to Sequence model.
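A sketch of this composition (again a reconstruction, not the original code, using the functions defined earlier):

def seq2seq_model(input_data, target_data, keep_prob, batch_size, target_sequence_length,
                  max_target_len, source_vocab_size, target_vocab_size,
                  enc_embedding_size, dec_embedding_size, rnn_size, num_layers,
                  target_vocab_to_int):
    enc_outputs, enc_states = encoding_layer(
        input_data, rnn_size, num_layers, keep_prob,
        source_vocab_size, enc_embedding_size)
    dec_input = process_decoder_input(target_data, target_vocab_to_int, batch_size)
    train_output, infer_output = decoding_layer(
        dec_input, enc_states, target_sequence_length, max_target_len,
        rnn_size, num_layers, target_vocab_to_int, target_vocab_size,
        batch_size, keep_prob, dec_embedding_size)
    return train_output, infer_output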
Build Graph + Define Loss, Optimizer w/ Gradient Clipping
The seq2seq_model function creates the model. It defines how the feed-forward pass and backpropagation should
flow. The last step for this model to be trainable is deciding and applying which optimization algorithm to
use. In this section, TF contrib.seq2seq.sequence_loss is used to calculate the loss, and then TF
train.AdamOptimizer is applied to perform gradient descent on that loss, with gradient clipping.
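A sketch of these training ops, assuming train_output, targets, target_sequence_length, max_target_len, and learning_rate come from the earlier steps:

# cross-entropy over the unpadded positions only, then Adam with gradient clipping
training_logits = tf.identity(train_output.rnn_output, name='logits')
masks = tf.sequence_mask(target_sequence_length, max_target_len,
                         dtype=tf.float32, name='masks')

cost = tf.contrib.seq2seq.sequence_loss(training_logits, targets, masks)
optimizer = tf.train.AdamOptimizer(learning_rate)

gradients = optimizer.compute_gradients(cost)
capped_gradients = [(tf.clip_by_value(grad, -1., 1.), var)
                    for grad, var in gradients if grad is not None]
train_op = optimizer.apply_gradients(capped_gradients)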

Technology Stack (Software and Hardware)


Python
TensorFlow library
Seq2Seq model
Intel i7 processor, NVIDIA GTX 1080 Ti GPU (8 GB), 16 GB RAM

Technical Solution Architecture


Working Model & Accuracy Test Result
A chatbot is implemented via the seq2seq model.
1. Run 'data.py' to produce the data files we need.
2. Run 'train.py' to train the model.
3. Run 'test_model.py' to predict.
Validate Model & Result
Accuracy score for the model: 95.54
Confidence score: 0.976

Results & Trained Model

Future Scope of Solution


Chatbots are being taught to assist people in dealing with mental health issues such as anxiety and
depression. The bots do not treat or diagnose, but human therapists have some reservations about
the technology.
Conclusion
The proposed work has been successfully implemented using the Seq2Seq model in Python. The use of RNNs
leads to faster and more accurate results.
