MC.AI – Aggregated news about artificial intelligence
10 January 2020
Source: Deep Learning on Medium
Introduction to Language Modelling and Deep Neural Network Based Text Generation
Introduction
NLP involves a number of important tasks such as text classification, sentiment analysis, machine translation and text summarization. Another core NLP task is language modelling, which involves generating text conditioned on some input information. Before the recent advances in deep neural network models, the most commonly used methods for text generation were either template- or rule-based systems, or probabilistic language models such as n-gram or log-linear models [Chen and Goodman, 1996; Koehn et al., 2003]. A language model is a system that predicts what word comes next or, more generally, assigns a probability to a piece of text. The n-gram is the simplest language model, and its performance is limited by its lack of complexity: such simplistic models cannot achieve fluency, sufficient language variation or a consistent writing style over long texts. For these reasons, neural networks (NN) have been explored as the new standard despite their complexity, and Recurrent Neural Networks (RNN) became a fundamental architecture for sequences of any kind. The RNN is nowadays considered the default architecture for text, but RNNs have problems of their own: they cannot remember the content of the distant past, and they struggle to create long, relevant text sequences because of exploding or vanishing gradients. For these reasons, other architectures such as Long Short-Term Memory (LSTM) [Graves, 2014] and Gated Recurrent Units (GRU) [Cho et al., 2014] were developed and became the state-of-the-art solution for many language generation tasks. In this post, we will use an LSTM to generate sequences of text.
Language Model
Models that assign probabilities to sequences of words are called language models. There are primarily
two types of Language Models:
1) Statistical Language Models: These models use traditional statistical techniques like N-grams, Hidden
Markov Models (HMM) and certain linguistic rules to learn the probability distribution of words.
2) Neural Language Models: They use different kinds of Neural Networks to model language and have
surpassed the statistical language models in their effectiveness.
N-Gram Models
We have described language models as computing the probability of the next word given a sequence of words. Let's begin with the task of computing P(w|h), the probability of a word w given some history h. Suppose the history h is "its water is so transparent that" and we want to know the probability that the next word is "the":

P(the | its water is so transparent that)

Instead of computing the probability of a word given its entire history, we can approximate the history by just the last few words. An n-gram model conditions only on the preceding N-1 words:

P(w_n | w_1 … w_{n-1}) ≈ P(w_n | w_{n-N+1} … w_{n-1})

N-gram models have some important properties and limitations:
• The n-gram is the simplest language model, and its performance is limited by its lack of complexity: such simplistic models cannot achieve fluency, sufficient language variation or a consistent writing style over long texts.
• The higher the N, the better the model usually is. But larger N leads to substantial computational overhead and large memory (RAM) requirements.
• N-grams are a sparse representation of language, because the model is built from the probabilities of words co-occurring. It assigns zero probability to any word sequence that does not appear in the training corpus.
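To make the counting-based approach concrete, here is a minimal sketch of a maximum-likelihood bigram model (not the model trained in this post); the tiny corpus and the function name are illustrative:

```python
from collections import Counter, defaultdict

def train_bigram_model(tokens):
    """Count bigrams and convert counts to conditional probabilities
    P(w_n | w_{n-1}) by maximum-likelihood estimation."""
    unigram_counts = Counter(tokens)
    bigram_counts = defaultdict(Counter)
    for prev, word in zip(tokens, tokens[1:]):
        bigram_counts[prev][word] += 1
    return {
        prev: {w: c / unigram_counts[prev] for w, c in nexts.items()}
        for prev, nexts in bigram_counts.items()
    }

corpus = "its water is so transparent that the water is so clear".split()
model = train_bigram_model(corpus)
print(model["is"]["so"])  # P(so | is) = 1.0: "is" is always followed by "so"
```

Note how the sparsity problem shows up immediately: any word pair absent from this tiny corpus simply has no entry, i.e. probability zero.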
Due to these drawbacks, we will build our character-based text generation model on a neural network architecture.
A major characteristic of most neural networks, such as densely connected networks and convnets, is that they have no memory. Each input shown to them is processed independently, with no state kept between inputs. With such networks, in order to process a sequence or a temporal series of data points, you have to show the entire sequence to the network at once, turning it into a single data point.
A recurrent neural network (RNN) processes sequences by iterating through the sequence elements and
maintaining a state containing information relative to what it has seen so far. In effect, an RNN is a type of
neural network that has an internal loop.
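The internal loop described above can be sketched in a few lines of numpy; this is an illustrative toy forward pass, not production code, and the weight shapes and names are assumptions:

```python
import numpy as np

def rnn_forward(inputs, W_xh, W_hh, b_h):
    """Process a sequence step by step, carrying a hidden state h
    that summarises everything seen so far."""
    h = np.zeros(W_hh.shape[0])
    for x in inputs:                            # iterate over timesteps
        h = np.tanh(W_xh @ x + W_hh @ h + b_h)  # the internal loop: new state
    return h                                    # final state after the sequence

rng = np.random.default_rng(0)
seq = [rng.standard_normal(4) for _ in range(6)]  # 6 timesteps, 4 features each
W_xh = rng.standard_normal((8, 4)) * 0.1          # input-to-hidden weights
W_hh = rng.standard_normal((8, 8)) * 0.1          # hidden-to-hidden (recurrent) weights
b_h = np.zeros(8)
print(rnn_forward(seq, W_xh, W_hh, b_h).shape)    # (8,)
```

The key point is that `h` is reused across timesteps: the output at each step depends on the entire prefix of the sequence, not just the current input.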
Image from Andrej Karpathy Blog “Unreasonable Effectiveness of Recurrent Neural Networks”
The figure illustrates five modes of processing with (or without) an RNN:
(1) Vanilla processing without an RNN, from fixed-size input to fixed-size output (e.g. image
classification).
(2) Sequence output (e.g. image captioning takes an image and outputs a sentence of words).
(3) Sequence input (e.g. sentiment analysis where a given sentence is classified as expressing positive or
negative sentiment).
(4) Sequence input and sequence output (e.g. Machine Translation: an RNN reads a sentence in English
and then outputs a sentence in French).
(5) Synced sequence input and output (e.g. video classification where we wish to label each frame of the
video).
RNNs suffer from two well-known training problems:
• Vanishing gradients (this problem can be mitigated by using LSTMs or GRUs)
• Exploding gradients (this problem can be mitigated by truncating or clipping the gradients)
In summary, plain RNNs are unable to keep track of long-term dependencies and cannot process very long sequences. That is why we will use an improved variant of the RNN, the LSTM.
LSTM:
Long Short-Term Memory networks (usually just called "LSTMs") are a special kind of RNN capable of learning long-term dependencies. They were introduced by Hochreiter & Schmidhuber (1997) and were refined and popularized by many people in subsequent work. They work tremendously well on a large variety of problems and are now widely used.
An LSTM cell controls its memory through three gates:
1) Input gate: decides how much of the new candidate information, computed from the present input, is added to the cell state.
2) Forget gate: given the previous hidden state h(t-1) and the current input, decides what should be removed from the cell state, keeping only the relevant content.
3) Output gate: decides what part of the cell state to expose as output, via a sigmoid function.
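The three gates can be made concrete with a single LSTM timestep in numpy. This is a minimal sketch of the standard LSTM equations, with the four weight blocks stacked into one matrix for brevity; the names and shapes are illustrative:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, W, U, b):
    """One LSTM timestep. W, U and b each stack four blocks, for the
    forget (f), input (i), candidate (g) and output (o) paths."""
    n = h_prev.size
    z = W @ x + U @ h_prev + b
    f = sigmoid(z[0*n:1*n])   # forget gate: what to drop from c_prev
    i = sigmoid(z[1*n:2*n])   # input gate: how much new content to admit
    g = np.tanh(z[2*n:3*n])   # candidate values for the cell state
    o = sigmoid(z[3*n:4*n])   # output gate: what part of the cell to expose
    c = f * c_prev + i * g    # new cell state
    h = o * np.tanh(c)        # new hidden state
    return h, c

rng = np.random.default_rng(1)
n, d = 8, 4                                   # hidden size, input size
W = rng.standard_normal((4 * n, d)) * 0.1
U = rng.standard_normal((4 * n, n)) * 0.1
b = np.zeros(4 * n)
h, c = lstm_step(rng.standard_normal(d), np.zeros(n), np.zeros(n), W, U, b)
print(h.shape, c.shape)  # (8,) (8,)
```

Because the cell state `c` is updated additively (`f * c_prev + i * g`) rather than through repeated matrix multiplication, gradients flow through it far more easily than through a plain RNN's hidden state.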
The universal way to generate sequence data in deep learning is to train a network to predict the next token, or next few tokens, in a sequence, using the previous tokens as input. The simplest sampling strategy, greedy sampling, always picks the most likely next character, which produces repetitive, predictable text. A more interesting approach makes slightly more surprising choices: it introduces randomness into the sampling process by sampling from the probability distribution over the next character. This is called stochastic sampling.
Sampling probabilistically from the softmax output of the model is neat: it allows even unlikely characters
to be sampled some of the time, generating more interesting looking sentences and sometimes showing
creativity by coming up with new, realistic sounding words that didn’t occur in the training data.
In order to control the amount of stochasticity in the sampling process, I have used a parameter called the
softmax temperature that characterizes the entropy of the probability distribution used for sampling. It
characterizes how surprising or predictable the choice of the next character will be. Given a temperature
value, a new probability distribution is computed from the original one (the softmax output of the model)
by reweighting it in the following way.
Sampling Strategy:
Image from François Chollet’s book of Deep Learning with Python
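The reweighting described above can be sketched as follows (after Chollet's description; the function name is his, the example distribution is illustrative):

```python
import numpy as np

def reweight_distribution(original_distribution, temperature=0.5):
    """Reweight a softmax output: take logs, divide by the temperature,
    re-exponentiate and renormalise so the result sums to 1 again."""
    distribution = np.log(original_distribution) / temperature
    distribution = np.exp(distribution)
    return distribution / np.sum(distribution)

probs = np.array([0.5, 0.3, 0.2])
print(reweight_distribution(probs, temperature=0.2))  # sharper, near one-hot
print(reweight_distribution(probs, temperature=1.0))  # unchanged
```

Low temperatures concentrate the mass on the most likely characters (predictable text), while high temperatures flatten the distribution (more surprising, sometimes nonsensical text); a temperature of 1.0 leaves the model's distribution untouched.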
Training Data
In this study, I have used some of the writings (İnce Memed 1 and İnce Memed 2) of Yaşar Kemal
(a modern Turkish author) to train an LSTM network.
The language model we’ll learn will be specifically a model of Yaşar Kemal’s writing style and topics of
choice, rather than a more generic model of the Turkish language.
Unique characters: 56
[‘\n’, ‘ ‘, ‘!’, ‘“‘, “‘“, ‘,’, ‘-’, ‘.’, ‘0’, ‘1’, ‘2’, ‘3’, ‘4’, ‘5’, ‘6’, ‘7’, ‘8’, ‘9’, ‘:’, ‘;’, ‘?’, ‘a’, ‘b’, ‘c’, ‘d’, ‘e’, ‘f’, ‘g’, ‘h’,‘i’, ‘j’, ‘k’, ‘l’, ‘m’, ’n’, ‘o’,
‘p’, ‘r’, ‘s’, ‘t’, ‘u’, ‘v’, ‘x’, ‘y’, ‘z’, ‘â’, ‘ç’, ‘ë’, ‘î’, ‘ö’, ‘ü’, ‘ğ’, ‘ı’, ‘ş’, ‘ ’,̇ ‘ ̈’]
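The post does not show the preprocessing step, but character-level vectorization for this kind of model typically looks like the following sketch; the short repeated ASCII string is a toy stand-in for the actual novels, and the window length and stride are illustrative:

```python
import numpy as np

# toy stand-in for the corpus; the post trains on Yaşar Kemal's novels
text = "memed de kapiya dogru yurudu. " * 20
maxlen, step = 10, 3                     # window length and stride

chars = sorted(set(text))                # the unique-character vocabulary
char_indices = {c: i for i, c in enumerate(chars)}

sentences, next_chars = [], []
for i in range(0, len(text) - maxlen, step):
    sentences.append(text[i:i + maxlen])   # input window of maxlen characters
    next_chars.append(text[i + maxlen])    # the character the model must predict

# one-hot encode inputs and targets
x = np.zeros((len(sentences), maxlen, len(chars)), dtype=bool)
y = np.zeros((len(sentences), len(chars)), dtype=bool)
for s, sentence in enumerate(sentences):
    for t, ch in enumerate(sentence):
        x[s, t, char_indices[ch]] = True
    y[s, char_indices[next_chars[s]]] = True
print(x.shape, y.shape)
```

Each training example is thus a (maxlen, vocabulary-size) one-hot matrix, and the target is the one-hot vector of the single character that follows the window.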
Results:
I have used the Keras library to train an LSTM network on the training data described above.
— — Generating with seed: “u. halbuki memed onun tam aksi. sevinç içinde. memed de kapı»
— — — temperature: 0.2
u. halbuki memed onun tam aksi. sevinç içinde. memed de kapıya da ali safa bey de onun at da bir
gelecek bir atları da atın da bir sara bir tarladı. bu da da ali safa bey olursun da bir de bir at başından da
atın arak bir beni de bir sesle kara da bir tarla da yanında sara da bir at bir de da bir de beni da bir de da
kara saran söylemedi. bu kararak da da bir de da da da başını bir de bir serinden bir de kadar bir içinde bir
at sağın bir at başını da da da
— — — temperature: 0.5
bir de kadar bir içinde bir at sağın bir at başını da da da size çakırdikenlik bir çuyunların bulutuları benim
kim bilir ki sız soradın başını bir sapa baktı. da de karan da kalanlarının üstüne karastır. olmadan
kalıyordu. düşünü kara var. onun kadar sustun da yapıların da oturdular kadar bir bularak kalınıyordu. ali
safa bey seri da at geliyordu. sesi gibi memed de kimeler bir de köylerinden vardı. da yakıp kapıya da
çalına da sonra başını gibi de s
— — — temperature: 1.0
n vardı. da yakıp kapıya da çalına da sonra başını gibi de sağsız da attı yukaları seni gibi. artadağı suylarını
aldırsın memed atlar oradamiz diye mit daha tartısı ova toprakasın öldürdü. çazar delikleri mi dalartdı.
sonra bir korkayıp geçorme, dermiyor, dal vardı. döndü. sarkeni candarman ötlemiyordu. de sesle sarttı,
gizsen. bazların battı. geliyor, vurmayı boşu bir iğ tış, esme na da kadar allah sauk olyaradılar,
izrisinlediğini ata ya. da onun şah
— — — temperature: 0.2
bacakları üstüne ancak dikilebilen koca osman, atlılar geçirdi. memed bir de karanlığa karartı bir karanlık
kaldı. sonra bir türlü bir karanlık bir kurşun ali safa bey bu ali safa bey de kalabalık kaldı. kadınların bir
karanlık karışının bu sabahları bir de ali safa bey de bir karanlık karanlık duruyordu. bir toprak karanlık bir
baban gelirdi. arkasında bir karanlık gibi değildi. bu kadar bir kurşun karanlık bir yanını kaldı. bir anlar
kaldı. bir anla
— — — temperature: 0.5
kurşun karanlık bir yanını kaldı. bir anlar kaldı. bir anların başını kalmasın başında döndü. bu sevin ana bu
ben bu büyük çekiyor. ali safa bey onu sonunu geçmiyordu. bu köylüler çok yaşardı. bana gelir. adam
kokusu durduğu kara bir kurşun ağamızı düşündü. ayrıldı. memed bu yanda kırmızı kalmış, insan
karanlığın altında senin bir yarasını olmaz. o da bağırarak en yaşanıyordu. o düşünür. karanlık bir
konuşuyordu. ne desin işlerinin atların içind
— — — temperature: 1.0
. karanlık bir konuşuyordu. ne desin işlerinin atların içinden. ı ̇şte: gözündeki verme göne gözlerini
banambandı. uyukoysunu turadan kalmış, bu doyucuları hiç gelken “yerlerde çıkardı… dimli kayanın bir
diyü geçiyor. birlik seler ne yaptılam da idiyordu. karanlarca durdu. sizili bir kap şeyi gelir tenk içinde
insanın altındaki devaşın dinini yüz ağılda… süleyman: a, diye dikte sızanın saza ğeni patı gittiği çizdi.
Result Analysis:
As can be seen from the outputs, a low temperature value results in repetitive and predictable text, but its local structure is highly realistic: in particular, all words are real Turkish words.
With higher temperatures, the generated text becomes more interesting, surprising, even creative; it
sometimes invents completely new words that sound somewhat plausible (such as banambandı and
karanlarca).
By training a bigger model for longer on more data, you can achieve generated samples that look much more realistic than these. But, of course, you should not expect to generate meaningful text other than by random chance: all you are doing is sampling from a statistical model of which characters come after which characters.
Conclusion:
You can generate discrete sequence data by training a model to predict the next token(s) given the previous tokens. In the case of text, such a model is called a language model, and it can be based on either words or characters. Sampling the next token requires a balance between adhering to what the model judges likely and introducing randomness. One way to handle this is the notion of softmax temperature; you should experiment with different temperatures to find the right one.
References:
[1] Alex Graves, "Generating Sequences With Recurrent Neural Networks", 2014.
[2] Sepp Hochreiter and Jürgen Schmidhuber, "Long Short-Term Memory", Neural Computation 9(8): 1735–1780, 1997.
[3] Andrej Karpathy, "The Unreasonable Effectiveness of Recurrent Neural Networks", blog post, May 21, 2015.
[4] François Chollet, "Deep Learning with Python", 1st Edition.
mc.ai aggregates articles from different sources - copyright remains at original authors