MC.AI – Aggregated news about artificial intelligence
10 January 2020
Source: Deep Learning on Medium
Introduction to Language Modelling and Deep Neural Network Based Text Generation
Introduction
NLP involves a number of important tasks such as text classification, sentiment analysis, machine translation and text summarization. Another core NLP task is language modelling, which involves generating text conditioned on some input information. Before the recent advances in deep neural network models, the most commonly used methods for text generation were either template- or rule-based systems, or probabilistic language models such as n-gram or log-linear models [Chen and Goodman, 1996; Koehn et al., 2003]. A language model is a system that predicts what word comes next or, more generally, assigns a probability to a piece of text. The n-gram is the simplest language model, and its performance is limited by its lack of complexity: such simplistic models cannot achieve fluency, sufficient language variation or a consistent writing style over long texts. For these reasons, neural networks (NN) have been explored as the new standard despite their complexity, and Recurrent Neural Networks (RNN) became a fundamental architecture for sequences of any kind. The RNN is nowadays considered the default architecture for text, but RNNs have problems of their own: they cannot remember the content of the distant past, and they struggle to create long, relevant text sequences because of exploding or vanishing gradients. For these reasons, other architectures such as Long Short-Term Memory (LSTM) [Graves, 2014] and Gated Recurrent Units (GRU) [Cho et al., 2014] were developed and became the state-of-the-art solution for many language generation tasks. In this post, we will use an LSTM to generate sequences of text.
Language Model
Models that assign probabilities to sequences of words are called language models. There are primarily
two types of Language Models:
1) Statistical Language Models: These models use traditional statistical techniques like N-grams, Hidden
Markov Models (HMM) and certain linguistic rules to learn the probability distribution of words.
2) Neural Language Models: They use different kinds of Neural Networks to model language and have
surpassed the statistical language models in their effectiveness.
N-Gram Models
We have described language models as computing the probability of the next word given a sequence of words. Let's begin with the task of computing P(w|h), the probability of a word w given some history h. Suppose the history h is "its water is so transparent that" and we want to know the probability that the next word is "the":

P(the | its water is so transparent that)

Instead of computing the probability of a word given its entire history, we can approximate the history by just the last few words. An n-gram model conditions only on the preceding N-1 words:

P(w_n | w_1 … w_{n-1}) ≈ P(w_n | w_{n-N+1} … w_{n-1})

N-gram models have some important properties and limitations:
• The n-gram is the simplest language model, and its performance is limited by its lack of complexity: such simplistic models cannot achieve fluency, sufficient language variation or a consistent writing style over long texts.
• The higher the N, the better the model usually is. But larger N leads to substantial computational overhead and large memory (RAM) requirements.
• N-grams are a sparse representation of language, because the model is built from the probabilities of words co-occurring. It assigns zero probability to any word sequence that does not appear in the training corpus.
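To make the counting-based approach concrete, here is a minimal sketch of a maximum-likelihood bigram model (not the model trained in this post); the tiny corpus and the function name are illustrative:

```python
from collections import Counter, defaultdict

def train_bigram_model(tokens):
    """Count bigrams and convert counts to conditional probabilities
    P(w_n | w_{n-1}) by maximum-likelihood estimation."""
    unigram_counts = Counter(tokens)
    bigram_counts = defaultdict(Counter)
    for prev, word in zip(tokens, tokens[1:]):
        bigram_counts[prev][word] += 1
    return {
        prev: {w: c / unigram_counts[prev] for w, c in nexts.items()}
        for prev, nexts in bigram_counts.items()
    }

corpus = "its water is so transparent that the water is so clear".split()
model = train_bigram_model(corpus)
print(model["is"]["so"])  # P(so | is) = 1.0: "is" is always followed by "so"
```

Note how the sparsity problem shows up immediately: any word pair absent from this tiny corpus simply has no entry, i.e. probability zero.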
Due to these drawbacks, we will build our character-based text generation model on a neural network architecture.
A major characteristic of most neural networks, such as densely connected networks and convnets, is that they have no memory. Each input shown to them is processed independently, with no state kept between inputs. With such networks, in order to process a sequence or a temporal series of data points, you have to show the entire sequence to the network at once, turning it into a single data point.
A recurrent neural network (RNN) processes sequences by iterating through the sequence elements and
maintaining a state containing information relative to what it has seen so far. In effect, an RNN is a type of
neural network that has an internal loop.
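The internal loop described above can be sketched in a few lines of numpy; this is an illustrative toy forward pass, not production code, and the weight shapes and names are assumptions:

```python
import numpy as np

def rnn_forward(inputs, W_xh, W_hh, b_h):
    """Process a sequence step by step, carrying a hidden state h
    that summarises everything seen so far."""
    h = np.zeros(W_hh.shape[0])
    for x in inputs:                            # iterate over timesteps
        h = np.tanh(W_xh @ x + W_hh @ h + b_h)  # the internal loop: new state
    return h                                    # final state after the sequence

rng = np.random.default_rng(0)
seq = [rng.standard_normal(4) for _ in range(6)]  # 6 timesteps, 4 features each
W_xh = rng.standard_normal((8, 4)) * 0.1          # input-to-hidden weights
W_hh = rng.standard_normal((8, 8)) * 0.1          # hidden-to-hidden (recurrent) weights
b_h = np.zeros(8)
print(rnn_forward(seq, W_xh, W_hh, b_h).shape)    # (8,)
```

The key point is that `h` is reused across timesteps: the output at each step depends on the entire prefix of the sequence, not just the current input.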
Image from Andrej Karpathy Blog “Unreasonable Effectiveness of Recurrent Neural Networks”
The figure illustrates five modes of processing with (or without) an RNN:
(1) Vanilla processing without an RNN, from fixed-size input to fixed-size output (e.g. image
classification).
(2) Sequence output (e.g. image captioning takes an image and outputs a sentence of words).
(3) Sequence input (e.g. sentiment analysis where a given sentence is classified as expressing positive or
negative sentiment).
(4) Sequence input and sequence output (e.g. Machine Translation: an RNN reads a sentence in English
and then outputs a sentence in French).
(5) Synced sequence input and output (e.g. video classification where we wish to label each frame of the
video).
RNNs suffer from two well-known training problems:
• Vanishing gradients (this problem can be mitigated by using LSTMs or GRUs)
• Exploding gradients (this problem can be mitigated by truncating or clipping the gradients)
In summary, plain RNNs are unable to keep track of long-term dependencies and cannot process very long sequences. That is why we will use an improved variant of the RNN, the LSTM.
LSTM:
Long Short-Term Memory networks (usually just called "LSTMs") are a special kind of RNN capable of learning long-term dependencies. They were introduced by Hochreiter & Schmidhuber (1997) and were refined and popularized by many people in subsequent work. They work tremendously well on a large variety of problems and are now widely used.
An LSTM cell controls its memory through three gates:
1) Input gate: decides how much of the new candidate information, computed from the present input, is added to the cell state.
2) Forget gate: given the previous hidden state h(t-1) and the current input, decides what should be removed from the cell state, keeping only the relevant content.
3) Output gate: decides what part of the cell state to expose as output, via a sigmoid function.
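The three gates can be made concrete with a single LSTM timestep in numpy. This is a minimal sketch of the standard LSTM equations, with the four weight blocks stacked into one matrix for brevity; the names and shapes are illustrative:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, W, U, b):
    """One LSTM timestep. W, U and b each stack four blocks, for the
    forget (f), input (i), candidate (g) and output (o) paths."""
    n = h_prev.size
    z = W @ x + U @ h_prev + b
    f = sigmoid(z[0*n:1*n])   # forget gate: what to drop from c_prev
    i = sigmoid(z[1*n:2*n])   # input gate: how much new content to admit
    g = np.tanh(z[2*n:3*n])   # candidate values for the cell state
    o = sigmoid(z[3*n:4*n])   # output gate: what part of the cell to expose
    c = f * c_prev + i * g    # new cell state
    h = o * np.tanh(c)        # new hidden state
    return h, c

rng = np.random.default_rng(1)
n, d = 8, 4                                   # hidden size, input size
W = rng.standard_normal((4 * n, d)) * 0.1
U = rng.standard_normal((4 * n, n)) * 0.1
b = np.zeros(4 * n)
h, c = lstm_step(rng.standard_normal(d), np.zeros(n), np.zeros(n), W, U, b)
print(h.shape, c.shape)  # (8,) (8,)
```

Because the cell state `c` is updated additively (`f * c_prev + i * g`) rather than through repeated matrix multiplication, gradients flow through it far more easily than through a plain RNN's hidden state.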
The universal way to generate sequence data in deep learning is to train a network to predict the next token, or next few tokens, in a sequence, using the previous tokens as input. The simplest sampling strategy, greedy sampling, always picks the most likely next character, which produces repetitive, predictable text. A more interesting approach makes slightly more surprising choices: it introduces randomness into the sampling process by sampling from the probability distribution over the next character. This is called stochastic sampling.
Sampling probabilistically from the softmax output of the model is neat: it allows even unlikely characters
to be sampled some of the time, generating more interesting looking sentences and sometimes showing
creativity by coming up with new, realistic sounding words that didn’t occur in the training data.
In order to control the amount of stochasticity in the sampling process, I have used a parameter called the
softmax temperature that characterizes the entropy of the probability distribution used for sampling. It
characterizes how surprising or predictable the choice of the next character will be. Given a temperature
value, a new probability distribution is computed from the original one (the softmax output of the model)
by reweighting it in the following way.
Sampling Strategy:
Image from François Chollet’s book of Deep Learning with Python
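The reweighting described above can be sketched as follows (after Chollet's description; the function name is his, the example distribution is illustrative):

```python
import numpy as np

def reweight_distribution(original_distribution, temperature=0.5):
    """Reweight a softmax output: take logs, divide by the temperature,
    re-exponentiate and renormalise so the result sums to 1 again."""
    distribution = np.log(original_distribution) / temperature
    distribution = np.exp(distribution)
    return distribution / np.sum(distribution)

probs = np.array([0.5, 0.3, 0.2])
print(reweight_distribution(probs, temperature=0.2))  # sharper, near one-hot
print(reweight_distribution(probs, temperature=1.0))  # unchanged
```

Low temperatures concentrate the mass on the most likely characters (predictable text), while high temperatures flatten the distribution (more surprising, sometimes nonsensical text); a temperature of 1.0 leaves the model's distribution untouched.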
Training Data
In this study, I have used some of the writings (İnce Memed 1 and İnce Memed 2) of Yaşar Kemal
(a modern Turkish author) to train an LSTM network.
The language model we’ll learn will be specifically a model of Yaşar Kemal’s writing style and topics of
choice, rather than a more generic model of the Turkish language.
Unique characters: 56
[‘\n’, ‘ ‘, ‘!’, ‘“‘, “‘“, ‘,’, ‘-’, ‘.’, ‘0’, ‘1’, ‘2’, ‘3’, ‘4’, ‘5’, ‘6’, ‘7’, ‘8’, ‘9’, ‘:’, ‘;’, ‘?’, ‘a’, ‘b’, ‘c’, ‘d’, ‘e’, ‘f’, ‘g’, ‘h’,‘i’, ‘j’, ‘k’, ‘l’, ‘m’, ’n’, ‘o’,
‘p’, ‘r’, ‘s’, ‘t’, ‘u’, ‘v’, ‘x’, ‘y’, ‘z’, ‘â’, ‘ç’, ‘ë’, ‘î’, ‘ö’, ‘ü’, ‘ğ’, ‘ı’, ‘ş’, ‘ ’,̇ ‘ ̈’]
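The post does not show the preprocessing step, but character-level vectorization for this kind of model typically looks like the following sketch; the short repeated ASCII string is a toy stand-in for the actual novels, and the window length and stride are illustrative:

```python
import numpy as np

# toy stand-in for the corpus; the post trains on Yaşar Kemal's novels
text = "memed de kapiya dogru yurudu. " * 20
maxlen, step = 10, 3                     # window length and stride

chars = sorted(set(text))                # the unique-character vocabulary
char_indices = {c: i for i, c in enumerate(chars)}

sentences, next_chars = [], []
for i in range(0, len(text) - maxlen, step):
    sentences.append(text[i:i + maxlen])   # input window of maxlen characters
    next_chars.append(text[i + maxlen])    # the character the model must predict

# one-hot encode inputs and targets
x = np.zeros((len(sentences), maxlen, len(chars)), dtype=bool)
y = np.zeros((len(sentences), len(chars)), dtype=bool)
for s, sentence in enumerate(sentences):
    for t, ch in enumerate(sentence):
        x[s, t, char_indices[ch]] = True
    y[s, char_indices[next_chars[s]]] = True
print(x.shape, y.shape)
```

Each training example is thus a (maxlen, vocabulary-size) one-hot matrix, and the target is the one-hot vector of the single character that follows the window.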
Results:
I have used the Keras library to train an LSTM network on the training data described above.
— — Generating with seed: “u. halbuki memed onun tam aksi. sevinç içinde. memed de kapı»
— — — temperature: 0.2
u. halbuki memed onun tam aksi. sevinç içinde. memed de kapıya da ali safa bey de onun at da bir
gelecek bir atları da atın da bir sara bir tarladı. bu da da ali safa bey olursun da bir de bir at başından da
atın arak bir beni de bir sesle kara da bir tarla da yanında sara da bir at bir de da bir de beni da bir de da
kara saran söylemedi. bu kararak da da bir de da da da başını bir de bir serinden bir de kadar bir içinde bir
at sağın bir at başını da da da
— — — temperature: 0.5
bir de kadar bir içinde bir at sağın bir at başını da da da size çakırdikenlik bir çuyunların bulutuları benim
kim bilir ki sız soradın başını bir sapa baktı. da de karan da kalanlarının üstüne karastır. olmadan
kalıyordu. düşünü kara var. onun kadar sustun da yapıların da oturdular kadar bir bularak kalınıyordu. ali
safa bey seri da at geliyordu. sesi gibi memed de kimeler bir de köylerinden vardı. da yakıp kapıya da
çalına da sonra başını gibi de s
— — — temperature: 1.0
n vardı. da yakıp kapıya da çalına da sonra başını gibi de sağsız da attı yukaları seni gibi. artadağı suylarını
aldırsın memed atlar oradamiz diye mit daha tartısı ova toprakasın öldürdü. çazar delikleri mi dalartdı.
sonra bir korkayıp geçorme, dermiyor, dal vardı. döndü. sarkeni candarman ötlemiyordu. de sesle sarttı,
gizsen. bazların battı. geliyor, vurmayı boşu bir iğ tış, esme na da kadar allah sauk olyaradılar,
izrisinlediğini ata ya. da onun şah
— — — temperature: 0.2
bacakları üstüne ancak dikilebilen koca osman, atlılar geçirdi. memed bir de karanlığa karartı bir karanlık
kaldı. sonra bir türlü bir karanlık bir kurşun ali safa bey bu ali safa bey de kalabalık kaldı. kadınların bir
karanlık karışının bu sabahları bir de ali safa bey de bir karanlık karanlık duruyordu. bir toprak karanlık bir
baban gelirdi. arkasında bir karanlık gibi değildi. bu kadar bir kurşun karanlık bir yanını kaldı. bir anlar
kaldı. bir anla
— — — temperature: 0.5
kurşun karanlık bir yanını kaldı. bir anlar kaldı. bir anların başını kalmasın başında döndü. bu sevin ana bu
ben bu büyük çekiyor. ali safa bey onu sonunu geçmiyordu. bu köylüler çok yaşardı. bana gelir. adam
kokusu durduğu kara bir kurşun ağamızı düşündü. ayrıldı. memed bu yanda kırmızı kalmış, insan
karanlığın altında senin bir yarasını olmaz. o da bağırarak en yaşanıyordu. o düşünür. karanlık bir
konuşuyordu. ne desin işlerinin atların içind
— — — temperature: 1.0
. karanlık bir konuşuyordu. ne desin işlerinin atların içinden. ı ̇şte: gözündeki verme göne gözlerini
banambandı. uyukoysunu turadan kalmış, bu doyucuları hiç gelken “yerlerde çıkardı… dimli kayanın bir
diyü geçiyor. birlik seler ne yaptılam da idiyordu. karanlarca durdu. sizili bir kap şeyi gelir tenk içinde
insanın altındaki devaşın dinini yüz ağılda… süleyman: a, diye dikte sızanın saza ğeni patı gittiği çizdi.
Result Analysis:
As can be seen from the outputs, a low temperature value results in repetitive and predictable text, but its local structure is highly realistic: in particular, all words are real Turkish words.
With higher temperatures, the generated text becomes more interesting, surprising, even creative; it
sometimes invents completely new words that sound somewhat plausible (such as banambandı and
karanlarca).
By training a bigger model for longer on more data, you can achieve generated samples that look much more realistic than these. But, of course, you should not expect to generate meaningful text other than by random chance: all you are doing is sampling from a statistical model of which characters come after which characters.
Conclusion:
You can generate discrete sequence data by training a model to predict the next token(s) given the previous tokens. In the case of text, such a model is called a language model, and it can be based on either words or characters. Sampling the next token requires a balance between adhering to what the model judges likely and introducing randomness. One way to handle this is the notion of softmax temperature; you should experiment with different temperatures to find the right one.
References:
[1] Alex Graves, "Generating Sequences With Recurrent Neural Networks", 2014.
[2] Sepp Hochreiter and Jürgen Schmidhuber, "Long Short-Term Memory", Neural Computation 9(8): 1735–1780, 1997.
[3] Andrej Karpathy, "The Unreasonable Effectiveness of Recurrent Neural Networks", blog post, May 21, 2015.
[4] François Chollet, "Deep Learning with Python", 1st Edition.
mc.ai aggregates articles from different sources - copyright remains at original authors