
DEGREE PROJECT IN INFORMATION AND COMMUNICATION TECHNOLOGY,
SECOND CYCLE, 30 CREDITS
STOCKHOLM, SWEDEN 2020

Handwritten Recognition for Ethiopic (Ge'ez) Ancient Manuscript Documents
ADISU WAGAW TEREFE

KTH ROYAL INSTITUTE OF TECHNOLOGY


SCHOOL OF ELECTRICAL ENGINEERING AND COMPUTER SCIENCE
Handwritten Recognition for Ethiopic (Ge'ez) Ancient Manuscript Documents

ADISU WAGAW TEREFE

Master in Software Engineering of Distributed Systems


Date: November 20, 2020
Supervisor: Anne H
Examiner: Mihhail Matskin, Professor, Kungliga Tekniska Högskolan
Swedish title: Handskrivet erkännande för etiopiska (Ge’ez) Forntida
manuskriptdokument
School of Electrical Engineering and Computer Science

Abstract
A handwriting recognition system learns patterns from given images of text. The recognition process usually combines a computer vision task with sequence learning techniques. Transcribing text from scanned images remains a challenging problem, especially when the documents are highly degraded or carry heavy dust noise. Nowadays there are several handwriting recognition systems, both commercial and free, especially for Latin-based languages. However, no prior system has been built for ancient handwritten Ge'ez manuscript documents, even though the language preserves many records of past human achievement in science, architecture, medicine and astronomy.

In this thesis, we present two separate recognition systems: (1) a character-level recognition system which combines computer vision for character segmentation from ancient books with a vanilla Convolutional Neural Network (CNN) to recognize characters, and (2) an end-to-end, segmentation-free handwriting recognition system using a CNN and a Multi-Dimensional Recurrent Neural Network (MDRNN) with Connectionist Temporal Classification (CTC) for Ethiopic (Ge'ez) manuscript documents.

The proposed character-level recognition model achieves 97.78% test accuracy. The second model yields encouraging results that motivate further study of the language's properties towards recognizing all the ancient books.

Keywords
CNN, CTC, Ge’ez, Handwritten Recognition, LSTM, MDRNN

Sammanfattning
A handwriting recognition system is a process for learning a pattern from a given image of text. The recognition process usually combines a computer vision task with sequence learning techniques. Transcribing texts from scanned images remains a challenging problem, especially when the documents are highly degraded or carry heavy dust noise. Nowadays there are several handwriting recognition systems, both commercial and free, especially for Latin-based languages. However, there is no prior study built for ancient handwritten Ge'ez manuscript documents, even though the language holds many mysteries of the past in the human history of science, architecture, medicine and astronomy.

In this thesis we present two separate recognition systems: (1) a character-level recognition system that combines image recognition for character segmentation from ancient books with a vanilla Convolutional Neural Network (CNN) to recognize characters, and (2) an end-to-end, segmentation-free handwriting recognition system using a CNN and a Multi-Dimensional Recurrent Neural Network (MDRNN) with Connectionist Temporal Classification (CTC) for Ethiopic (Ge'ez) manuscript documents.

The proposed character recognition model achieves 97.78% accuracy, while the second model gives encouraging results that indicate further study of the language's properties for better recognition of all the ancient books.

Keywords
CNN, CTC, Ge'ez, Handwriting recognition, LSTM, MDRNN
Acronyms

BLSTM Bi-directional Long Short-Term Memory

CNN Convolutional Neural Network

CTC Connectionist Temporal Classification

LSTM Long Short-Term Memory

MDLSTM Multi-dimensional Long Short-Term Memory

MLP Multilayer Perceptron

OCR Optical Character Recognition

RNN Recurrent Neural Network

SDG Sustainable Development Goals

SVM Support Vector Machine

UN United Nations
List of Figures

1.1 An ancient handwritten Ge'ez manuscript document page; the snippet is taken from the book of the Miracle of Jesus. The books are written from left to right in multiple columns, and they usually contain old paintings of saints or other figures to illustrate the content. The painting in this page depicts the birth of Jesus Christ.

2.1 A recurrent neural network architecture showing a rolled network (left) and an unrolled one (right); image taken from Colah's blog post (2015-08), Understanding LSTMs, see [here for the original image], accessed November 20, 2020.

2.2 LSTM gates and their interacting layers; image taken from Colah's blog post (2015-08), Understanding LSTMs, see [here for the original image], accessed November 20, 2020.

3.1 Unlabelled 32x32 cropped character images (syllables) generated by the segmenter algorithm.

3.2 The count of individual characters in the dataset generated from 134 books. The most dominant letter in the collection is the sixth character of the vowel row in Ge'ez script, symbolised as ("እ"). The second most dominant character is the script ("ል"), which is also the sixth character of the consonants found in the second row.

4.1 Model 1, character-level recognition with CNN. The model requires an explicit segmentation of the characters into a specific size (32x32).

4.2 The character recognition model.

4.3 A single line of text formed by the Listing 2 script, where the individual characters, including the spaces, are taken randomly from different handwritten documents.

4.4 A single line of text taken from an original book, presented to show the differences between the synthetic data and the original document.

4.5 Model 2, an end-to-end recognition model architecture which uses CNN, RNN (specifically MDLSTM) and CTC.

4.6 An end-to-end model capable of predicting a line of text without an explicit segmentation of characters from the line.

4.7 The first few CNN layers of the network, which extract the essential features of the input images. The Reshape and Dense layers reduce dimensions; the Gaussian noise layer adds noise to the data to protect against overfitting.

4.8 The middle layers of the model: BLSTM layers built to learn the deeper features of the sequences. The sequences of text encoded here are decoded later in the CTC layer.

4.9 The final layer of the model, a CTC layer responsible for transcribing the texts. The CTC layer mainly calculates the loss of the model from the input label lengths, output label lengths, input labels, output labels and the softmax layer.

5.1 The training and validation accuracies for the character-level recognition; the model achieves 0.9875 (98.75%), 0.9777 (97.77%) and 0.9778 (97.78%) training, validation and test accuracy, respectively.

5.2 The losses of the training and validation sets during the training of the model.

5.3 The training and validation loss for the model containing bidirectional LSTM layers. Although we have minimal training resources and dataset size, the model keeps improving over time.

5.4 The training and validation loss for the model containing unidirectional LSTM layers. This model showed no improvement when new data was introduced during training; the likely reason is that the forward pass alone was not enough to learn the sequences and predict accurately.

5.5 Predictions during training on different lines of text from the dataset; the encoded texts are predicted with 99% accuracy.

A.1 The accuracy of the end-to-end model on new data; as the snippet output shows, the model makes mistakes on very similar characters. For example, on the ninth row one can see how hard it is to tell apart the first character of the word in the label and in the prediction.

A.2 The overall network of the end-to-end model.

List of source codes

1 A segmenter function used to crop characters from a given input image.

2 A text-to-image generator script, which makes images of texts by selecting characters from the corpus and randomly taking the corresponding images from the list of images labelled using the previous model.
Contents

1 Introduction
1.1 Background
1.2 Problem Statement
1.3 Research Question
1.4 Purpose
1.5 Goal
1.6 Research Methodology
1.7 Research Benefits, Sustainability and Ethical Aspects
1.7.1 Research Benefits
1.7.2 Research Sustainability
1.7.3 Research Ethical Aspects
1.8 Delimitation
1.9 Structure of the Thesis

2 Theoretical Study
2.1 Background
2.2 Convolutional Neural Networks
2.3 Recurrent Neural Networks
2.3.1 Long Short Term Memory
2.4 Connectionist Temporal Classification
2.5 Related Works

3 Methodology
3.1 Research Process
3.2 Data Collection Techniques
3.2.1 Dataset Labeling
3.3 Experimental Setup
3.3.1 Software and Hardware
3.3.2 Cloud Configuration
3.4 Data Analysis Tool

4 Implementation
4.1 Character Recognition Model
4.1.1 Character Segmentation
4.1.2 Building CNN for Character Recognition
4.2 Dataset Generation
4.3 The End-to-End Recognition Model

5 Result and Analysis
5.1 Analysis of the Character Level Model
5.2 Analysis of the End-to-end Model

6 Conclusion and Future Work
6.1 Conclusion
6.2 Limitations
6.3 Future Work

Bibliography

A Unnecessary Appended Material


Chapter 1

Introduction

Handwriting recognition may be categorized into two systems: online and offline recognition. In an online recognition system, a sequence of coordinates representing the movement of the pen-tip is captured. In contrast, in the offline case, only a scanned image of the text is available. Due to the ease of extracting relevant features, online recognition usually yields better results [12].

Offline handwriting recognition uses both computer vision and deep-learning methods to transcribe a fully scanned page. The computer vision task extracts valuable pieces of information from the given image of text, at either the character level or the word level. However, segmenting handwritten text is much harder than extracting text from printed or typewritten books. Traditional methods [2] segmented characters from words and scored segmentation hypotheses for correctness, for example by performing a heuristic over-segmentation followed by the scoring of groups of segments [3].

1.1 Background
Ge'ez is an ancient South Semitic language of the Ethiopic branch which has survived as a liturgical language in Ethiopia [1], as well as at some international institutions that teach Semitic studies as a subject (including Uppsala University).

Today, Ge'ez is used only as the primary language of the liturgy of the Ethiopian and Eritrean Orthodox Tewahedo churches, the Ethiopian and Eritrean Catholic churches, and the Beta Israel Jewish community. The closest living languages to Ge'ez are Tigre and Tigrinya, with lexical similarity of 71% and 68%, respectively [1].

Figure 1.1: An ancient handwritten Ge'ez manuscript document page; the snippet is taken from the book of the Miracle of Jesus. The books are written from left to right in multiple columns, and they usually contain old paintings of saints or other figures to illustrate the content. The painting in this page depicts the birth of Jesus Christ.
The Ge'ez language has 26 different consonants, each marked for seven vowels, known as fidel (26 × 7 = 182), plus 4 × 5 additional letters, 20 numerals, and eight punctuation marks; the full script is represented in 230 syllables. The Ge'ez syllabary is one of the rare writing systems among the Semitic languages in which vowels are indicated [1]. As shown in Figure 1.1, the Ge'ez writing system runs from left to right, and a colon separates every word in ancient handwritten documents.

1.2 Problem Statement

Ancient Ethiopian history, beliefs, traditions, science and arts have been documented in the Ge'ez language. Most of these documents are held in the national museum of Ethiopia, in Ethiopian Orthodox churches and in some European countries such as France [8], the United Kingdom, Germany, Italy and Sweden. Although these documents have valuable contents, they are disappearing due to different factors: the manuscripts are exposed to termites, fire, theft, moths and fading. Moreover, these documents should be studied and well documented, since no prior systems are available.

1.3 Research Question

The key questions that we address in this research are:

1. How do we generate a dataset, suitable for deep learning tasks, from the available books?

2. What amount of data is sufficient to effectively train and test the handwriting recognition model?

3. Which approach works best to build an end-to-end recognition model while preserving the sequence-to-sequence nature of the input and output?

4. What combination of deep learning algorithms performs best at recognizing text in images?
1.4 Purpose
This thesis work has the following purposes:

• to preserve these heritage documents from damage by converting them into machine-readable formats,

• to improve retrieval of information via the Internet and other applications, and

• to enable future research by digitizing the documents.

1.5 Goal
The main goal of this thesis work is to develop a handwriting recognition system which can convert Ge'ez manuscript documents into editable, machine-readable text formats (such as Unicode) using current state-of-the-art machine learning and deep learning techniques.

1.6 Research Methodology

The research method describes the data preparation, the software and hardware used, and how the data is collected and analysed, including specifications of the procedures and measurements used for the data analysis. A quantitative method, which is based on numerical data, is appropriate for this research. In the quantitative research method, numerical information is collected and the analysis is done using mathematically-based methods [21]. We used the quantitative research method because we wanted to collect and analyse the numerical results of several experiments, such as the loss and accuracy of training and testing (e.g., in character-level and line-level recognition).

1.7 Research Benefits, Sustainability and Ethical Aspects
We expect that our research will have the following implications regarding research benefits, sustainability and ethical aspects.

1.7.1 Research Benefits

As the thesis is about studying and converting the ancient handwritten texts, the first and foremost beneficiary would be the owner of the documents, that is, the Ethiopian Orthodox Church and its researchers. As mentioned in section 1.2, ancient books are found in several international libraries and museums, and many international universities, such as [10, 22], still study the language, as it is one of the oldest Semitic languages. These international universities and researchers would be the next target group, as they are interested in discovering the contents of the books and in studying the language's properties. Moreover, by converting ancient documents into machine-readable formats, the project saves the language from disappearance, which benefits the community as a whole.

1.7.2 Research Sustainability

Heritage, especially ancient books, has been an extraordinary source of modern human development; hence, studies on such resources have an impact on sustainable development. To better achieve the United Nations (UN)'s Sustainable Development Goals (SDG), such as quality education [15], this kind of research needs to become more straightforward to carry out. Since researchers will gain easy access to the ancient books in digital format, we anticipate a contribution towards fulfilling the UN's SDGs.

1.7.3 Research Ethical Aspects

We performed extensive measurements and detailed analysis of the two models concerning the accuracy and loss calculations. We attempted to be very explicit in stating the setup and tools used in each dataset preparation, the hardware used for training and the platforms employed, with the intent of facilitating the reproducibility of our results while keeping the privacy aspects of the available historical books.

1.8 Delimitation
We train our model to recognize only 200 syllables (198 characters and two punctuation marks) out of 230 syllables, for several reasons. In particular, numerals in Ge'ez are composed of several shapes that are not connected to each other. For instance, the number four (፬) has a 0-like shape in the middle and two curved dashed lines above and below it, which makes it challenging to extract numerals from a given image using character segmentation algorithms.

1.9 Structure of the Thesis

This thesis project is about a handwriting recognition system for the Ge'ez language, which aims to convert input images of ancient handwritten vellum Ge'ez documents into editable text files. It mainly describes the approaches to the design and implementation of the handwriting recognition system using the latest deep learning and machine learning libraries.

Chapter 1 introduces the Ge'ez language, followed by the objectives and goals of the system. Chapter 2 describes the relevant background of handwriting recognition and Optical Character Recognition (OCR) systems. The third chapter discusses the methods and data collection techniques used in this thesis. Chapter 4 presents the implementation of the system, describing how the problem is solved using Long Short Term Memory (LSTM) and Connectionist Temporal Classification (CTC). The fifth chapter discusses the results and findings of the thesis in detail. The final chapter presents the conclusions and future work of the thesis.

This thesis includes two separate models. The first model is a character-level recognition model which combines image processing for the pre-processing tasks and a CNN model for training and prediction. The second model, on the other hand, is an end-to-end recognition model which does not require an explicit segmentation technique in either the pre-processing or the transcription phase. This model uses CNN layers as a feature extractor, Bi-directional Long Short Term Memory (BLSTM) layers to find the patterns and encode sequences of characters, and CTC to calculate the loss of the prediction.


Chapter 2

Theoretical Study

This chapter provides the necessary background information about handwriting recognition, CNN, recurrent neural networks (more specifically Multi-dimensional Long Short Term Memory (MDLSTM)) and CTC. Additionally, the chapter describes the related studies performed on recognizing the Ge'ez (Ethiopic) language.

2.1 Background
Handwriting recognition is the process of converting text in images to machine-readable and editable formats. Unlike [12], traditional approaches used both image processing techniques, to segment the required features from the given image, and artificial neural networks, to recognize the characters [2]. As indicated in the introduction, handwriting recognition systems can be classified into offline and online recognition systems. Offline handwriting recognition is inherently more challenging than online recognition, since the information is available only from the given input image, whereas in the online case we can extract features from both the pen movement and the resulting image [12].

Moreover, the dimensionality of the input images is another challenge in offline handwriting recognition, where the image comes in two or three dimensions. A standard recurrent network expects a one-dimensional sequence, so either the input must be reduced to one dimension or multi-dimensional recurrent neural networks must be used.

2.2 Convolutional Neural Networks

A Convolutional Neural Network (CNN) is a class of deep neural networks usually used for image analysis. CNNs include many more connections than weights, so the architecture itself provides a form of regularization; this regularization means that CNNs can be seen as specialized versions of the Multilayer Perceptron (MLP). These networks automatically provide some degree of translation invariance, a property that allows them to be more robust to variations in the positions of features [4].

The layers (input, hidden, and output) in a CNN are interleaved with sub-sampling (pooling) layers that reduce the computational load, the memory usage and the number of parameters by producing down-sampled versions of the input layers [4].
Moreover, CNNs can be characterized by the following properties:

• Kernels: hyperparameters of the network defined by the width and height of the receptive field.

• Input/Output channels: also hyperparameters of the network, giving the number of input and output channels.

• Depth of the convolution: typically, the number of input channels of the filters must equal the number of channels of the input feature map.

CNNs have been used in many pattern recognition areas, such as handwriting recognition [23, 19]. Moreover, a CNN combined with Recurrent Neural Networks (RNN), more specifically with Long Short Term Memory (LSTM), is widely used in handwriting recognition and Optical Character Recognition (OCR) studies [20, 14].
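
To make these properties concrete, here is a minimal sketch in Keras (the framework introduced in section 3.3.1) showing how kernel size, output channels and pooling determine the shape of the feature maps; the layer sizes are illustrative assumptions, not the thesis architecture.

    import tensorflow as tf

    model = tf.keras.Sequential([
        tf.keras.layers.Input(shape=(32, 32, 1)),        # 32x32 grayscale input, 1 channel
        tf.keras.layers.Conv2D(32, kernel_size=(3, 3),   # 32 output channels, 3x3 kernel
                               padding="same", activation="relu"),
        tf.keras.layers.MaxPooling2D(pool_size=(2, 2)),  # down-sample 32x32 -> 16x16
    ])
    model.summary()  # feature maps: (None, 32, 32, 32) -> (None, 16, 16, 32)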

2.3 Recurrent Neural Networks

A Recurrent Neural Network (RNN) is a class of deep learning model that works best on sequential data. Sequential data is data where the order affects the meaning of the result. For example, when we humans read a text, we do not discard the previous words and start understanding each story from scratch; instead, we keep the words persistently in our brain to get the full meaning of the context. The order of words in a sentence (especially in a written document) can have a significant effect on its meaning.

In traditional neural networks, it is impossible to refer to previous outputs in order to predict the next event. For example, imagine we want to predict tomorrow's weather, where the prediction depends on today's and yesterday's weather; there is no way to refer back to yesterday's and today's outputs to predict tomorrow's event when using feed-forward neural networks.

However, we can model such problems using RNNs. Recurrent neural networks address this issue by using their internal state as a memory to predict the next event. The state allows RNNs to keep information over a longer span of the sequence.

Figure 2.1 shows the overall architecture of RNNs; the left side of the picture represents the folded network, while the right side is the unfolded version. In the network, the inputs are represented by X_0, X_1, ..., X_t; the outputs are represented by h_0, h_1, ..., h_t, while the letter A represents the activation function. The network accepts X_0 as its first input from the sequence of data, together with an initial state, and produces the output h_0 based on some activation function. The output h_0 is then used as an input together with X_1 in the next step. The process continues in the same way, taking a snapshot of the current state and passing the result on to the next step.

Moreover, we can represent RNNs mathematically as follows.

The current state of the network:

h_t = f(h_{t−1}, X_t)

The activation at each step:

h_t = tanh(W_hh · h_{t−1} + W_xh · X_t)
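
A minimal NumPy sketch of this recurrence follows; the dimensions are illustrative assumptions of ours, not taken from the thesis.

    import numpy as np

    input_size, hidden_size = 8, 16
    W_xh = 0.01 * np.random.randn(hidden_size, input_size)   # input-to-hidden weights
    W_hh = 0.01 * np.random.randn(hidden_size, hidden_size)  # hidden-to-hidden weights

    def rnn_step(h_prev, x_t):
        # one step of the recurrence h_t = tanh(W_hh h_{t-1} + W_xh x_t)
        return np.tanh(W_hh @ h_prev + W_xh @ x_t)

    h = np.zeros(hidden_size)                    # initial state
    for x_t in np.random.randn(5, input_size):   # a toy sequence of five inputs
        h = rnn_step(h, x_t)                     # the state carries information forward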
Although RNNs can model sequential data with long-term dependencies, they suffer from the vanishing and exploding gradient problems [13]. The vanishing gradient problem occurs when training a standard RNN with back-propagation: over long dependencies, the gradients that are back-propagated can vanish. To overcome this problem, particular kinds of RNN have been designed, such as the LSTM and the GRU.

Figure 2.1: A recurrent neural network architecture showing a rolled network (left) and an unrolled one (right); image taken from Colah's blog post (2015-08), Understanding LSTMs, see [here for the original image], accessed November 20, 2020.

2.3.1 Long Short Term Memory

Long Short Term Memory (LSTM) networks are special versions of RNNs which are capable of storing and remembering information for a longer time. Many authors, such as [12, 11, 9], have contributed to the development of LSTM networks since they were introduced by [13]. Their main advantage is that they can learn long-term dependencies and avoid the vanishing and exploding gradient problems.

Moreover, LSTM works well for classifying, processing and predicting longer time series with unknown time duration, such as handwriting prediction and OCR systems in general. Standard RNNs have a chain structure of repeating neural network modules, such as a single tanh activation layer. LSTMs have a similar chain-like design, but with a different repeating module architecture [18].

The LSTM repeating structure has four interacting layers and three gates.

As illustrated in Figure 2.2, each line (the cell state) carries a piece of specific information required by the internal computation. The LSTM uses its gates to decide which information to pass forward through the network.

Figure 2.2: LSTM gates and their interacting layers; image taken from Colah's blog post (2015-08), Understanding LSTMs, see [here for the original image], accessed November 20, 2020.

The LSTM gates are explained briefly as follows.

1. Forget gate: the forget gate decides what information passes on to the next cell. The sigmoid (σ) layer (the forget layer) decides by looking at the previous output h_{t−1} and the input X_t. It outputs a number between 0 (ignore the information) and 1 (let the information through to the next state):

f_t = σ(W_f · [h_{t−1}, X_t] + b_f)

2. Input gate: this gate accepts a new value and updates the existing information. Two interacting layers decide the update: first, a sigmoid layer validates the input and decides which values to update; then, a tanh layer creates the list of new candidate values C̃_t that may be added to the next cell state:

i_t = σ(W_i · [h_{t−1}, X_t] + b_i)
C̃_t = tanh(W_C · [h_{t−1}, X_t] + b_C)

The cell state is then updated as C_t = f_t ∗ C_{t−1} + i_t ∗ C̃_t.

3. Output gate: the last layer in the LSTM cell is the output layer, which combines the cell state with the input. The output is based on the cell information:

(a) A sigmoid layer decides what parts of the cell state are transferred to the output.
(b) The cell state is pushed through a tanh layer and multiplied by the output of the sigmoid gate.
(c) The result is thus gated by the sigmoid layer, as shown in Figure 2.2:

O_t = σ(W_o · [h_{t−1}, X_t] + b_o)
h_t = O_t ∗ tanh(C_t)
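
These gate equations translate directly into code; below is a NumPy sketch of a single LSTM step, where the weight and bias containers are our own illustrative assumptions.

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def lstm_step(h_prev, C_prev, x_t, W, b):
        # W and b are dicts holding the weights/biases of the four layers: f, i, C, o
        z = np.concatenate([h_prev, x_t])         # [h_{t-1}, X_t]
        f_t = sigmoid(W["f"] @ z + b["f"])        # forget gate
        i_t = sigmoid(W["i"] @ z + b["i"])        # input gate
        C_tilde = np.tanh(W["C"] @ z + b["C"])    # candidate cell values
        C_t = f_t * C_prev + i_t * C_tilde        # cell state update
        o_t = sigmoid(W["o"] @ z + b["o"])        # output gate
        h_t = o_t * np.tanh(C_t)                  # new hidden state
        return h_t, C_t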

2.4 Connectionist Temporal Classification

Connectionist Temporal Classification (CTC) is a type of neural network layer (usually used at the output) that tackles the problem of alignment differences between the input and the output. Consider a handwriting recognition system which takes image pixel values as input and the corresponding characters as output, where the alignment of pixels to characters is not accurately known.

Mathematically, consider mapping the input pixel value sequence X = [x_1, x_2, ..., x_n] to the corresponding output sequence of characters or words Y = [y_1, y_2, ..., y_t]. We have to find an accurate mapping from X to Y while considering that:

• both the lengths of X and Y can vary,

• the alignment of X to Y is not exactly known, and

• the ratio of the lengths of X and Y can vary.

CTC solves these challenges "by allowing the network to make label predictions at any point in the input sequence, so long as the overall sequence of labels is correct" [12]. For a given input X, CTC generates an output distribution over all possible Y's. Moreover, the CTC algorithm does not require the exact alignment of the input and its corresponding output.
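
To make the alignment-free loss concrete, here is a minimal sketch of how the CTC loss is exposed in Keras (the framework used later in this thesis); the shapes in the comments are assumptions for illustration.

    import tensorflow as tf
    from tensorflow.keras import backend as K

    def ctc_loss(args):
        # y_pred: softmax outputs, shape (batch, time_steps, num_classes + 1 blank)
        # labels: encoded target character indices, shape (batch, max_label_length)
        # input_length / label_length: true sequence lengths, shape (batch, 1)
        y_pred, labels, input_length, label_length = args
        return K.ctc_batch_cost(labels, y_pred, input_length, label_length)

    # At prediction time, greedy decoding collapses repeats and removes blanks:
    # decoded, log_prob = K.ctc_decode(y_pred, input_length=sequence_lengths)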

2.5 Related Works

Nowadays there are plenty of available handwriting recognition systems and tools. These systems are offered both commercially and for free, mostly for Latin-based characters; for example, Google's Tesseract library is freely available for many international languages. However, there are no prior studies of Ge'ez characters. In this thesis, we try to lay a foundation for further studies of the language.

An extensive literature survey shows that there is no prior study on the language that has tried to digitise handwritten Ge'ez documents. There are, however, a few studies on recognising the machine-printed form of the Amharic language. The Amharic language is a descendant of Ethiopic Ge'ez which uses all of the Ge'ez syllables plus some additional characters of its own.

The approaches used to recognise machine-printed Amharic documents also differ from this thesis work. For example, a study by Million Meshesha, Optical Character Recognition of Amharic Documents [17], uses a Support Vector Machine (SVM) to train and recognize texts, while we use a modern and more efficient approach to recognize Ge'ez handwritten characters.
Chapter 3

Methodology

The purpose of this chapter is to provide an overview of the research method used in this thesis. Section 3.1 describes the research process. Section 3.2 focuses on the data collection techniques used for this research. Section 3.3 describes the experimental setup. Finally, section 3.4 describes the tools that we used for data analysis.

3.1 Research Process

We started this thesis because there is currently a need in Ethiopia to study the Ge'ez language in depth, but without any available resources such as books in either printed or digital format. To further investigate the language and save the available ancient handwritten books, we have to convert them into machine-readable and editable formats. Since this research is the first-ever study of handwriting recognition for Ge'ez, we started by collecting available and representative documents from all corners. After collecting about 134 ancient books, we pre-processed the book pages to apply line detection and character segmentation algorithms.

3.2 Data Collection Techniques

Working with machine learning and deep learning requires a very large amount of training and test data to solve the given problem accurately. Data preparation is a critical step in machine learning and deep learning processes: it largely determines the accuracy of the model, and it makes the dataset more suitable for machine learning and deep-learning algorithms. The data formation process also requires finding and using the right data collection mechanism.

As described in section 1.6, we used a quantitative research method, as it complies with the aim of this research. For this research, we classified the data acquisition process into two stages. First, we collected and sampled ancient books and trained a semi-automated labeller model. Then, we used the trained model to generate the dataset for the primary training and to build the solution for the problem that we have stated.

Furthermore, in research like this thesis work on recognizing handwritten documents, the main task is to collect clean data and prepare it for further processing. Since there is no publicly available dataset for Ge'ez handwritten documents, this process took more than three months before we could run the first algorithm.

Considering that, we first built a character cropper algorithm that takes an input image and splits out each character from the document images. As shown in Figure 3.1, the algorithm generates cropped characters from a book, each named with a number prefixed by the word for syllable.

After running the character segmenter algorithm, we need to perform a mapping operation, which groups characters into their classes. We have about 200 classes in this research that need to be labelled. The quality of the character segmentation is a further challenge when mapping characters to their classes: since the segmenter algorithm only detects available pixel values and draws a contour around them, it can mistake an arbitrary stroke of a syllable for a standalone character.

In Ge'ez script there are characters with very similar shapes, as shown in Figure 3.1, row four, columns two and four; they differ from each other only by a small curve. As we can see from the picture, that fourth-column character is very similar to the second-row, fourth-column character as well. Hence, lines and strokes make the task even more challenging. The algorithm used here is also expected to identify the correct letter and map it to its right group, which leads us to use classification algorithms. To train and classify the characters, the dataset has to be labelled correctly. Labelling the data is a very tedious and time-consuming task.

Figure 3.1: Unlabelled 32x32 cropped character images (syllables) generated by the segmenter algorithm.

Another challenge that we faced during data collection was the distribution of the scripts available in the books. The Ge'ez language inherently uses some characters more frequently than others. For example, as shown in Figure 3.2, the counts of a few characters in the dataset are much higher than the rest, which results in poor generalisation of the model.

Figure 3.2: The count of individual characters in the dataset generated from 134 books. As the plot shows, the most dominant letter in the collection is the sixth character of the vowel row in Ge'ez script, symbolised as ("እ"). The second most dominant character is the script ("ል"), which is also the sixth character of the consonants found in the second row.

3.2.1 Dataset Labeling

The dataset labelling task follows data collection. As described in section 3.2, we can pull several characters from the available books in fractions of a second. The question, though, is how we can identify the correctly segmented syllables and map them to their right classes. There are many ways to solve this kind of problem.

The naive mechanism is to label manually with human labour, which we did here for the first 8000 characters. Many engineers still use this technique for problems where automated tools are hard to apply. The problem with this approach is that it takes a lot of time, and it decreases the quality of the data since it is error-prone. Nevertheless, we used this technique for our character-level recognition model.

The second, more efficient way is to use auto-labelling tools or algorithms. Especially when the algorithm is designed for the specific problem, as we did in this project, it provides very accurately labelled data. In this thesis, we used both techniques to label data and train both the character-level and the end-to-end recognition tasks.

For the end-to-end model labelling, we used a Ge'ez bible corpus and the cropped characters. We did the labelling by picking a word from the corpus and then concatenating cropped character images to form a word image. This process is described in more detail in section 4.2.

3.3 Experimental Setup

The following configurations were used to perform the training and testing of the two models.

3.3.1 Software and Hardware

We used several platforms to train and test the work. The character-level model was trained and tested on a MacBook Pro with a 2.6 GHz Quad-Core Intel Core i7 processor, 16 GB 2133 MHz LPDDR3 memory and Radeon Pro 450 2 GB / Intel HD Graphics 530 1536 MB graphics. Although it was possible to use this machine, we encountered very slow processing and an unusual sound from the device. As our dataset grew bigger and bigger, we moved to the Google Cloud Platform.

Here follows a description of some specific APIs and libraries that we used during the development of the project and for the measurements in the quantitative method.

Keras
Keras is an open-source neural-network API written in Python. As described by the Keras documentation, "it offers consistent & simple APIs, it minimizes the number of user actions required for common use cases, and it provides clear & actionable error messages. It also has extensive documentation and developer guides." [6] In 2017, Google's TensorFlow team decided to support Keras in TensorFlow's core library; Keras is therefore capable of running on top of the TensorFlow library. It is mainly designed to enable fast experimentation with deep neural networks, focusing on being user-friendly, modular and extensible. In this thesis, we used TensorFlow Keras version 2. We chose Keras because of its simplicity and its support for Connectionist Temporal Classification (CTC).

Tensorflow
TensorFlow is a free and open-source API that can be used in many areas; it is mainly used for machine learning applications such as deep neural networks. It is used to express the mathematical calculations employed by machine learning algorithms, and to implement and execute such algorithms [16].

To be able to use all the features of TensorFlow, we used the TensorFlow version of Keras in this project. It is also possible to export Keras/TensorFlow models to JavaScript to deploy them in web applications consumed by browsers.

OpenCV
OpenCV (Open Source Computer Vision Library) is an open-source computer vision and machine learning software library [7]. In this thesis we used OpenCV version 3.4.10 for our data pre-processing tasks. The line detection, character segmentation, image resizing and word formation were done using OpenCV algorithms.

Jupyter notebook
The Jupyter Notebook is an open-source web application used to create and share documents that contain live code, equations, visualizations and narrative text [5]. Since it has an interactive and easy web interface for running code and visualisation, we used it to test and visualise the trained models on our local machine.

3.3.2 Cloud Configuration

Google Cloud Platform (GCP) configuration: we used GCP for the training since it offers better computational power. To allow the incoming and outgoing network communications while reading and writing files back and forth, firewall rules were configured; port numbers 9042 and 9160 were opened explicitly on the instances to facilitate the read-write operations on the dataset copied to the instance memory.

Python 3, TensorFlow and Keras were installed on the instance that we used on GCP. OpenCV 3.4 was also installed on the instance, since we used it to load and process the dataset for both models.

3.4 Data Analysis Tool

We used the Jupyter Notebook as a statistical tool, since it provides a wide range of tools for data visualization, simple statistics for generating summaries of metrics, and customisable graphics and figures.

In addition to the libraries mentioned in section 3.3.1, we used a few Python libraries such as matplotlib, a Python 2D plotting library capable of generating production-quality visualizations with a few lines of code. Plotting is the term used in data science and machine learning for visualizing data.
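
For example, the training curves shown later in figures 5.1 and 5.2 can be produced with a few matplotlib calls; this sketch assumes a history object of the kind returned by a Keras model.fit() call.

    import matplotlib.pyplot as plt

    def plot_history(history):
        # accuracy on the data the model trains on vs. data it has not seen
        plt.plot(history.history["accuracy"], label="training accuracy")
        plt.plot(history.history["val_accuracy"], label="validation accuracy")
        plt.xlabel("epoch")
        plt.ylabel("accuracy")
        plt.legend()
        plt.show()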
Chapter 4

Implementation

This chapter covers the implementation of an end-to-end handwriting recognition system: the mechanisms for generating datasets from available sources, the ways of labelling datasets using trained models, and the description of building and testing the models.

4.1 Character Recognition Model

In this section, we explain the first model, shown in Figure 4.1, which we built to recognize Ge'ez characters for several purposes. First, we created the model to see whether it is possible to recognize the handwritten characters individually using the available books as the data source. As we are dealing with a deep learning problem without a prior available dataset, we have to generate an appropriate and sufficient amount of data for the training, testing and validation tasks. Thus, we built this first model, the character-level recognition model, to generate a list of labelled images from a given text. We believe that exhaustively labelled data results in better training data, which in turn results in more accurate predictions.

4.1.1 Character Segmentation

Character segmentation in this context is the operation of chopping the full image of text into Ge'ez letters. The task can be done in several ways using computer vision algorithms. As shown in Listing 1, we define a function that requires the locations of the scanned document and of the output images. The cropped character images are named with the word for syllable, meaning letter, followed by a number; we update the name index after each iteration to avoid overwriting previously named characters.

Figure 4.1: Model 1, character-level recognition with CNN. The model requires an explicit segmentation of the characters into a specific size (32x32).

The process starts by reading the list of input scanned pages from the specified path. While there are more scanned pages in the directory, do the following:

1. Read the scanned page from the list.

2. Make a copy of the original scanned page.

3. Convert the three-channel image (BGR colour) into a single-channel (grayscale) image.

4. Apply a Gaussian filter to the grayscale image using a 1x1 kernel size.

5. Apply thresholding to the blurred image.

6. Find contours on the page, i.e. return the pixel regions which have higher values after applying the threshold and blurring.

7. For each contour on the page:

• get the coordinates of its location together with its width and height

• if the width and height are below some value (23), skip that contour

• draw a bounding box on the copied image (for demonstration purposes only)

• save the image region which has the higher pixel values

• append the image to a file using the specified file name.

The problem with this approach is that any black spot on the book page is considered a character. As described in section 3.2, scripts that have lines and strokes on top are more challenging when computing the bounding box: the algorithm sometimes considers them standalone characters while they are parts of the script.

import os

import cv2


def character_segmenter(input_dir_path, output_dir_path, name_index):
    for scanned_doc in os.listdir(input_dir_path):
        scanned_image = cv2.imread(os.path.join(input_dir_path, scanned_doc))
        # reduce the three BGR channels to a single grayscale channel
        scanned_image_in_gray = cv2.cvtColor(scanned_image, cv2.COLOR_BGR2GRAY)
        # smooth the page before thresholding
        scanned_image_blured = cv2.GaussianBlur(scanned_image_in_gray, (1, 1), 0)
        # Otsu thresholding separates the ink from the page background
        ret, th = cv2.threshold(scanned_image_blured, 0, 255,
                                cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)
        # OpenCV 3.x returns (image, contours, hierarchy)
        image, contours, hierarchy = cv2.findContours(th, cv2.RETR_EXTERNAL,
                                                      cv2.CHAIN_APPROX_SIMPLE)
        for contour in contours:
            [x, y, w, h] = cv2.boundingRect(contour)
            if w < 23 and h < 23:
                continue  # too small to be a character; skip dust and dots
            cropped = scanned_image[y:y + h, x:x + w]
            s = 'fidel_' + str(name_index) + '.png'
            if not os.path.exists(output_dir_path):
                os.makedirs(output_dir_path)
            cv2.imwrite(os.path.join(output_dir_path, s), cropped)
            name_index = name_index + 1

Listing 1: A segmenter function used to crop characters from a given input image.

4.1.2 Building CNN for Character Recognition

Character-level recognition in this thesis is handled in two steps:

1. detecting and identifying the bounding boxes that contain text in the image;

2. identifying the characters.

In this process, the CNN model learns the pattern. The whole pipeline involves pre-processing the data as described in section 4.1.1, that is, converting the image into a suitable size and colour channel, followed by trainable feature extraction and classification.

As shown in Figure 4.1, we feed the network's first layer a 32x32 grayscale image. Most of the characters in the books are written in black, so we used this as an opportunity to optimise the computational cost by reducing the number of channels from RGB to grayscale. The first layer computes weighted sums of the input image pixels; subsequent layers compute weighted sums of the outputs of the previous layers.

The last layer of our character recognition model has 200 neurons. Although the language we are recognising has more than 230 characters, we deal only with the most common ones; the 200 neurons of the last layer mean that we classify handwritten Ge'ez characters into 200 classes. For this reason, we used a softmax activation layer to calculate the probability of each character being predicted.
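
A minimal Keras sketch matching this description (32x32 grayscale input, 200-way softmax output) is given below; the hidden layer sizes are our own illustrative assumptions, not the exact architecture shown in Figure 4.2.

    import tensorflow as tf
    from tensorflow.keras import layers

    model = tf.keras.Sequential([
        layers.Input(shape=(32, 32, 1)),          # 32x32 grayscale character image
        layers.Conv2D(32, (3, 3), activation="relu"),
        layers.MaxPooling2D((2, 2)),
        layers.Conv2D(64, (3, 3), activation="relu"),
        layers.MaxPooling2D((2, 2)),
        layers.Flatten(),
        layers.Dense(128, activation="relu"),
        layers.Dense(200, activation="softmax"),  # probability for each of the 200 classes
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])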

Why is building character-level recognition not enough?

The process we have performed so far recognises characters, and one may ask: why build another model to identify the characters in the image? The question is fair. However, segmentation remains the most demanding task when cropping text out of the image, including the small colons which separate words in Ge'ez. So we have to learn patterns regardless of their alignment in the image in order to detect and identify characters, which is our next task in this thesis.

Figure 4.2: The character recognition model.

4.2 Dataset Generation

This section presents the process of generating a dataset for our next model. Dataset generation in this context means creating a new dataset from scratch, given a set of character images and a corpus, i.e. a collection of different texts.

The previous model that we built, the character recognition model, plays a crucial role in creating the new dataset. As mentioned for Listing 1, the segmenter function only generates a list of small character images; the trained character recognition model was then able to identify these images and place them with their corresponding label values. Thus, the cropped images are now classified into 200 classes.

Now the setup is ready to create the synthetic ancient handwritten lines of text as images. Figures 4.3 and 4.4 show lines of text that are formed by us and taken from the ancient Ge'ez books, respectively. The former line of text is formed by our text_to_image() function; the characters on the line are taken from our set of images generated by the character segmenter script. In contrast, the latter image (Figure 4.4) is cropped directly from one of the original books that we used as a data source.

Figure 4.3: A single line of text formed by the Listing 2 script, where the individual characters, including the spaces, are taken randomly from different handwritten documents.

Figure 4.4: A single line of text taken from an original book, presented to show the differences between the synthetic data and the original document.

The two image lines above have insignificant differences, except for the weight of the pixels in the synthetic image. To this end, the full set of labelled images is produced by our script, as shown in Listing 2.

The core idea of the text_to_image() function is that it makes it possible to generate a dataset that does not require augmentation techniques, since all the images come from different sources with different sizes and pixel weights.

The script requires the paths of both the cropped images and the corpus. The detailed explanation of the text-to-image generator is as follows. While there are characters in the corpus:

1. If the character from the corpus is available in the labels list, get the directory of the images named with the current character.

2. If the character is a newline or a space:

• write the concatenated images as a single image and name it with the characters,

• reset the concatenated word_image to None,

• reset the set of characters to empty to hold the next word.

3. If the directory from step 1 is available:

• get all the image names and store them in a list,

• if there are any other files inside the directory, such as hidden objects, filter them out and only keep file names ending in .jpg,

• select one file name at random from the set,

• read the selected image and resize it to 32x32,

• if word_image is None, save the selected image as the word image,

• otherwise concatenate the resized image with the previous images,

• keep concatenating characters until a space or a newline terminator is found.

Once the text-to-image writer is done, we are ready to build the CRNN model with CTC as the cost function.

import os
import random

import cv2 as cv
import docx2txt
import numpy as np

# allowed_g_chars is assumed to be defined elsewhere: the set of the 200
# labelled character classes.


def text_to_image(lbl_img_path="data", corpus_path="corpus.docx"):
    corpus = docx2txt.process(corpus_path)  # read the corpus as plain text
    word_image = None
    word_chars = ""
    g_char_dir = ""
    for g_char in corpus:
        if g_char != " " and g_char in allowed_g_chars:
            # directory holding the labelled images of the current character
            g_char_dir = os.path.join(".", "c_dataset", lbl_img_path, g_char)
        if g_char == " ":
            # a word is complete: write the concatenated image, named by its characters
            if word_image is not None:
                cv.imwrite(os.path.join(".", "txt_img", word_chars + ".png"),
                           word_image)
            word_image = None
            word_chars = ""
        if os.path.exists(g_char_dir):
            g_char_files = os.listdir(g_char_dir)
            # keep only the .jpg files, ignoring hidden objects
            g_char_files = list(filter(lambda x: x.endswith(".jpg"), g_char_files))
            if not g_char_files:
                continue
            # pick one handwritten sample of this character at random
            selected_file = os.path.join(g_char_dir, random.choice(g_char_files))
            selected_image = cv.imread(selected_file)
            selected_image = cv.resize(selected_image, (32, 32))
            if word_image is None:
                word_image = selected_image
            else:
                # append the character image to the right of the word image
                word_image = np.concatenate((word_image, selected_image), axis=1)
            word_chars += g_char

Listing 2: A text-to-image generator script, which makes images of texts by selecting characters from the corpus and randomly taking the corresponding images from the list of images labelled using the previous model.

4.3 The End-to-End Recognition Model

In this section, we present a model that does not require an explicit segmentation of characters in order to recognize the line of text in an image. The model, shown in Figure 4.5, is built using CNN, RNN and Connectionist Temporal Classification (CTC).

Figure 4.5: Model 2, an end-to-end recognition model architecture which uses CNN, RNN (specifically MDLSTM) and CTC.

In the network:

1. The input image is fed to standard CNN layers. The first few CNN layers extract feature maps from the given image and pass their output to the next layer. The challenge in this approach is that standard CNN layers only accept images of a specified size (width and height), and it is impractical to find lines of text that all have equal length. We therefore took the longest line in the list and padded the rest of the line images with a white pixel value.

2. The outputs of the CNN layers, the feature maps, are fed into an RNN layer, specifically a bidirectional long short-term memory (BLSTM) layer. As described in section 2.3.1, LSTM networks are capable of handling sequences, which lets them identify the relationships between the characters.

3. Finally, the output of the BLSTM layers is fed into a CTC layer, which is a transcription layer. The CTC layer takes the sequence of characters, learns their alignment with the image, including redundant characters, and uses the probability distribution to transcribe the output. (A sketch of this pipeline in code is given below, after Figure 4.6.)

As explained in section 2.3, recurrent neural networks work best for sequence-to-sequence problems. They can preserve the information they have seen in the past and use that knowledge to predict the next element of the output sequence. In addition to carrying knowledge from input sequences to output sequences, they are capable of learning in both the forward and backward directions.

Predicting the output based only on previously seen data is not sufficient. Training the network in two directions can reveal more information, by propagating information from past to future and vice versa; that is what a bidirectional LSTM does. When running in the backward direction, we preserve information from the end of the sequence, and the same analogy applies to the forward direction. Thus, we maintain the information from both directions.

The pictorial representation of the model is shown in Figure 4.6. The model takes an image with a sequence of characters on it and feeds it to the CNN layers. The CNN layers extract useful information such as vertical edges, horizontal edges, etc. As shown in Figure 4.7, we added some noise to protect against overfitting the model, since we have not generated a large enough dataset for the training.

Figure 4.6: An end-to-end model capable of predicting a line of text without an explicit segmentation of characters from the line.

The CNN layers pool features into the bidirectional layers. As shown in Figure 4.8, four bidirectional layers learn in both directions and are then concatenated for the next CTC layer.
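
A hedged Keras sketch of this pipeline follows. The image size, layer widths, the noise level and the use of two (rather than four) bidirectional layers are illustrative assumptions of ours, not the exact thesis network shown in figures 4.7-4.9.

    import tensorflow as tf
    from tensorflow.keras import layers, backend as K

    image = layers.Input(shape=(64, 512, 1), name="image")  # padded line image
    x = layers.Conv2D(32, (3, 3), padding="same", activation="relu")(image)
    x = layers.MaxPooling2D((2, 2))(x)
    x = layers.Conv2D(64, (3, 3), padding="same", activation="relu")(x)
    x = layers.MaxPooling2D((2, 2))(x)                      # -> (16, 128, 64)
    x = layers.Permute((2, 1, 3))(x)                        # width becomes the time axis
    x = layers.Reshape((128, 16 * 64))(x)                   # (time_steps, features)
    x = layers.Dense(64, activation="relu")(x)              # reduce dimensions
    x = layers.GaussianNoise(0.1)(x)                        # noise against overfitting
    x = layers.Bidirectional(layers.LSTM(128, return_sequences=True))(x)
    x = layers.Bidirectional(layers.LSTM(128, return_sequences=True))(x)
    y_pred = layers.Dense(200 + 1, activation="softmax")(x)  # 200 classes + CTC blank

    labels = layers.Input(shape=(None,), name="labels")
    input_len = layers.Input(shape=(1,), name="input_length")
    label_len = layers.Input(shape=(1,), name="label_length")
    ctc = layers.Lambda(lambda a: K.ctc_batch_cost(a[0], a[1], a[2], a[3]),
                        name="ctc")([labels, y_pred, input_len, label_len])

    model = tf.keras.Model(inputs=[image, labels, input_len, label_len], outputs=ctc)
    # the Lambda layer already computes the loss, so compile just passes it through
    model.compile(optimizer="adam", loss=lambda y_true, loss_out: loss_out)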
CHAPTER 4. IMPLEMENTATION 33

Figure 4.7: The first few CNN layers of the network, which are capable
of extracting essential features from the input images. The Reshape and
Dense layers are used to reduce dimensions. The Gaussian noise layer, on
the other hand, adds noise to the data to prevent overfitting.
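
A front-end of this kind can be sketched in Keras as follows. The filter
counts, kernel sizes, noise level and input shape are illustrative
assumptions, not the exact values shown in Figure 4.7.

from tensorflow.keras import Input, Model, layers

inputs = Input(shape=(64, 512, 1))            # height, width, channels
x = layers.GaussianNoise(0.1)(inputs)         # perturb inputs against overfitting
x = layers.Conv2D(32, (3, 3), padding="same", activation="relu")(x)
x = layers.MaxPooling2D((2, 2))(x)
x = layers.Conv2D(64, (3, 3), padding="same", activation="relu")(x)
x = layers.MaxPooling2D((2, 2))(x)            # feature maps: (16, 128, 64)
x = layers.Permute((2, 1, 3))(x)              # make the width the time axis
x = layers.Reshape((128, 16 * 64))(x)         # (time_steps, features)
x = layers.Dense(128, activation="relu")(x)   # reduce the feature dimension
frontend = Model(inputs, x)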

Figure 4.8: The middle layers of the model, which are BLSTM layers
built to learn the deeper features of the sequences. The text sequences
encoded here will be decoded later in the CTC layer.

Figure 4.9: The final layer of the model, a CTC layer responsible for
transcribing the texts. The CTC layer mainly calculates the loss of the
model from the input label lengths, the output label lengths, the labels
and the softmax output.
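
This wiring can be expressed with the Keras backend helper ctc_batch_cost,
wrapped in a Lambda layer so the loss is computed inside the graph. The
tensor shapes and names below are illustrative stand-ins for the real
model tensors.

import tensorflow.keras.backend as K
from tensorflow.keras import Input, layers

# Stand-in for the softmax output of the BLSTM stack: 128 time steps,
# 230 character classes plus one CTC blank class
y_pred = Input(name="softmax_out", shape=(128, 231))
labels = Input(name="labels", shape=(None,))            # padded transcripts
input_length = Input(name="input_length", shape=(1,), dtype="int64")
label_length = Input(name="label_length", shape=(1,), dtype="int64")

def ctc_loss(args):
    y_pred, labels, input_length, label_length = args
    return K.ctc_batch_cost(labels, y_pred, input_length, label_length)

# The Lambda output is the per-sample CTC loss, of shape (batch, 1)
loss_out = layers.Lambda(ctc_loss, output_shape=(1,), name="ctc")(
    [y_pred, labels, input_length, label_length])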
Chapter 5

Results and Analysis

This chapter explains the results collected from both the character-level
and end-to-end models, based on the outputs obtained during training. It
covers the performance of the models and the benchmarking network results
of the end-to-end recognition system. The first section discusses the
accuracy and loss of the first model, while the second section shows the
different outputs of our second model.

5.1 Analysis of the Character-Level Model


The proposed character-level model performs best as the amount of
training data grows. Our benchmarks in this analysis are the size of the
dataset and the batch size during training. Since the images are tiny,
we used a batch size of 256, which works well. As we can see in Figure
5.1, the training converges quickly, in only 20 epochs. After 20 epochs,
the model starts overfitting the data, and we stopped the training using
a technique called early stopping.

The validation accuracy appeared stable, indicating that the model
predicts well when it gets new data; we split our dataset into a training
set and a validation set. The training accuracy measures the model's
performance on the data it has already seen, while the validation
accuracy measures its accuracy on new data that it has not seen yet.

However, when the number of epochs was increased, the validation accuracy
dropped drastically, indicating that the model was becoming weaker at
generalising and accurately predicting unseen data. We therefore used an
early-stopping mechanism to prevent the model from overfitting.

Figure 5.1: Training and validation accuracies for the character-level
recognition. As shown, the model achieves 0.9875 (98.75%), 0.9777
(97.77%) and 0.9778 (97.78%) training, validation and test accuracy,
respectively.

Similarly, the losses of the training and validation sets are illustrated
in Figure 5.2. The training loss keeps decreasing while the validation
loss starts bending upwards. The reason is that the model begins
overfitting the data after iterating over it for 20 rounds. Since the
validation loss stops improving over time, we invoke the EarlyStopping
callback to stop further training iterations.
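
In Keras this corresponds to the EarlyStopping callback; a minimal sketch
follows, in which the patience value and the fit arguments are
illustrative assumptions.

from tensorflow.keras.callbacks import EarlyStopping

early_stop = EarlyStopping(monitor="val_loss",          # watch validation loss
                           patience=3,                  # tolerate 3 flat epochs
                           restore_best_weights=True)   # roll back to best epoch

# Passed to training, for example:
# model.fit(x_train, y_train, validation_data=(x_val, y_val),
#           batch_size=256, epochs=100, callbacks=[early_stop])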

Figure 5.2: The losses of the training and the validation sets during the
training of the model.

5.2 Analysis of the End-to-end Model


The proposed end-to-end model has been benchmarked against several
networks. First, we built and trained the model without the first few
CNN layers; in this approach, the raw image was fed directly to the
bidirectional RNN network. The second approach used unidirectional LSTM
layers instead of bidirectional LSTM layers. Finally, we used CNN layers
as the first few layers to extract valuable features, bidirectional LSTM
layers to encode the internal representation of the characters, and a
CTC layer to calculate the losses and decode the texts.

As shown in Figures 5.3 and 5.4, the bidirectional LSTM layers perform
better. The bidirectional layers maintain information from both
directions, which makes them well suited to text analysis problems. In
contrast, unidirectional LSTM layers only keep past data and try to
infer from that knowledge during prediction.

The accuracy of the model is shown in Figure 5.5. During training, we
intentionally used shorter line images, padded with white pixels, to
save training resources. The padded white pixels are later collapsed by
the CTC algorithm during transcription. However, we added a few long
lines of text to the training set to obtain a better result, but the
training time and the accuracy were not as expected, owing to the small
number of long line images in the dataset.
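
The collapsing happens at decode time: greedy CTC decoding merges
repeated labels and drops the blank class. The following is a minimal
sketch with the Keras backend helper, where model, line_images and the
idx_to_char lookup table are hypothetical names.

import numpy as np
import tensorflow.keras.backend as K

preds = model.predict(line_images)              # (batch, time_steps, classes)
input_lengths = np.full(preds.shape[0], preds.shape[1])
# Greedy decoding collapses repeated labels and removes CTC blanks
decoded, _ = K.ctc_decode(preds, input_length=input_lengths, greedy=True)
# Decoded sequences are padded with -1, which we skip when mapping back
texts = ["".join(idx_to_char[i] for i in seq if i != -1)
         for seq in decoded[0].numpy()]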

Figure 5.3: Illustration of the training and validation loss for the
model containing bidirectional LSTM layers. Although we had minimal
training resources and a small dataset, the model keeps improving over
time.

Figure 5.4: Illustration of the training and validation loss for the
model containing unidirectional LSTM layers. In this model, there was no
improvement when we introduced new data during training. The likely
reason is that the forward pass alone was not enough to learn the
sequences and predict accurately.

Figure 5.5: Predictions during training on different lines of text from
the dataset. As we can see, the encoded texts are predicted with 99%
accuracy.

Generally, the two models showed outstanding and promising results.
Since the character-level recognition model has an accuracy of 97.78%,
it would have been enough to detect and recognize characters had we had
an accurate text extractor for the documents. Moreover, words in the
documents are separated by a colon, which we could have taken advantage
of. However, due to the lack of such algorithms, we were not able to use
this model to fully convert texts into the desired format.

However, the second model, the end-to-end recognition system, gives an
encouraging result. Due to the lack of a prior dataset and appropriate
hardware, we restricted the model to training on and predicting lines of
text in the documents.
Chapter 6

Conclusion and Future Work

This chapter summarises the goal and the main features of the thesis
work. It then points out the limitations of the project and suggestions
for future work.

6.1 Conclusion
Ge’ez is a largely unstudied language, even though it holds mysteries of
human development in both science and spirituality. In this thesis, we
investigated ways to convert handwritten documents into machine-readable
and editable formats. We successfully generated, from scratch, a dataset
that can be used in further studies. We also implemented character-level
and end-to-end recognition systems. We tested our models, which are
built using current state-of-the-art deep-learning algorithms, and
achieved our goal of showing how they can be used and combined to
recognize patterns in an image and transcribe them into text.

Combining CNN with RNN remains the best approach for problems with
sequence-to-sequence image data. Although the CNN layers require
equal-sized input images, they perform well in extracting the patterns
that are relevant for encoding the texts in the image. Bidirectional
LSTMs have been an excellent choice for learning in both directions
while alleviating the vanishing gradient problem. The connectionist
temporal classification (CTC) algorithm is used to calculate the loss of
the network from the label lengths, the input lengths and the labels
from the input layers, together with the output from the output layer.

Moreover, a better training device and a massive dataset could help to
fully convert the scanned images of ancient documents into the desired
format.

6.2 Limitations
The study is limited to recognizing 200 syllables (characters) out of
the 230 available characters in the Ge’ez language, as mentioned in
section 1.1, owing to several reasons:

1. Inherently, the shapes of the characters (syllables) are not suitable
for the existing segmentation algorithms. This applies mostly to the
contours of the numbers: they are not connected pixels forming a line
or circle; instead, they have separate curved lines on the top and
bottom of each symbol, which makes it challenging to treat them as
one connected region of pixels during segmentation.

2. Hardware limitations: we used personal computers with minimal GPU
and CPU resources to train and test the models, while classifying
230 classes in the output layer of the models requires powerful
hardware.

Furthermore, the size of the dataset that we generated is not sufficient,
as deep-learning algorithms are data-hungry. Although the writing system
in the ancient Ge’ez documents is very similar across books, the number
of books that we used to generate the dataset was limited to 134.
Additionally, the corpus that we had at hand was bounded, since we used
only some parts of the Bible.

6.3 Future Work


This thesis project sets the necessary foundations for converting ancient
Ge’ez handwritten documents into machine-readable formats. Our next goal
is to generate more data from different books and include the remaining
characters, in order to fully convert documents into the desired format.
Bibliography

[1] Marvin Lionel Bender. “The non-Semitic languages of Ethiopia.” In:
(1976).
[2] Yoshua Bengio et al. “Lerec: A NN/HMM hybrid for on-line hand-
writing recognition.” In: Neural computation 7.6 (1995), pp. 1289–
1303.
[3] Théodore Bluche. “Joint line segmentation and transcription for
end-to-end handwritten paragraph recognition.” In: Advances in
Neural Information Processing Systems. 2016, pp. 838–846.
[4] Jake Bouvrie. “Notes on convolutional neural networks.” In: (2006).
[5] Jupyter Notebook Documentation. Jupyter Notebook Documentation.
https://jupyter.org/documentation. [Online; accessed 2020-02-02]. 2020.
[6] Keras Documentation. Keras API Reference. https://keras.io/api/.
[Online; accessed 2020-02-02]. 2020.
[7] OpenCV Documentation. OpenCV Documentation.
https://docs.opencv.org/3.4.10/d1/dfb/intro.html. [Online; accessed
2020-02-02]. 2020.
[8] The National Library of France. Quelques manuscrits.
https://gallica.bnf.fr/html/und/afrique/quelques-manuscrits. [Online;
accessed 2019-08-02]. 2008.
[9] Felix A Gers, Jürgen Schmidhuber, and Fred Cummins. “Learn-
ing to forget: Continual prediction with LSTM.” In: (1999).
[10] Georg-August-Universität Göttingen. Non-European Cultural Studies.
https://www.sub.uni-goettingen.de/geisteswissenschaften-und-theologie/aussereuropaeische-kulturwissenschaften/thematische-suche/.
[Online; accessed 2020-11-02]. 2020.


[11] Alex Graves, Santiago Fernández, and Jürgen Schmidhuber.
“Bidirectional LSTM networks for improved phoneme classification and
recognition.” In: International Conference on Artificial Neural Networks.
Springer. 2005, pp. 799–804.
[12] Alex Graves and Jürgen Schmidhuber. “Offline handwriting recog-
nition with multidimensional recurrent neural networks.” In: Ad-
vances in neural information processing systems. 2009, pp. 545–552.
[13] Sepp Hochreiter. “The vanishing gradient problem during learn-
ing recurrent neural nets and problem solutions.” In: Interna-
tional Journal of Uncertainty, Fuzziness and Knowledge-Based Sys-
tems 6.02 (1998), pp. 107–116.
[14] Takaaki Hori et al. “Advances in joint CTC-attention based end-
to-end speech recognition with a deep CNN encoder and RNN-
LM.” In: arXiv preprint arXiv:1706.02737 (2017).
[15] Saskia D Keesstra et al. “The significance of soils and soil science
towards realization of the United Nations Sustainable Develop-
ment Goals.” In: Soil (2016).
[16] Martín Abadi et al. TensorFlow: Large-Scale Machine Learning on
Heterogeneous Systems. Software available from tensorflow.org. 2015.
URL: https://www.tensorflow.org/.
[17] M. Meshesha and C. Jawahar. “Optical Character Recognition of
Amharic Documents.” In: Afr. J. Inf. Commun. Technol. 3 (2007).
[18] Christopher Olah. Understanding LSTMs.
http://colah.github.io/posts/2015-08-Understanding-LSTMs. [Online;
accessed 2020-02-02]. 2015.
[19] Ahmed El-Sawy, EL-Bakry Hazem, and Mohamed Loey. “CNN
for handwritten arabic digits recognition based on LeNet-5.” In:
International conference on advanced intelligent systems and informat-
ics. Springer. 2016, pp. 566–575.
[20] Baoguang Shi, Xiang Bai, and Cong Yao. “An end-to-end train-
able neural network for image-based sequence recognition and
its application to scene text recognition.” In: IEEE transactions on
pattern analysis and machine intelligence 39.11 (2016), pp. 2298–
2304.
[21] Suphat Sukamolson. “Fundamentals of quantitative research.”
In: Language Institute Chulalongkorn University 1 (2007), pp. 2–3.

[22] University of Toronto. The university is now one of the only places
in the world where students can learn Ge’ez.
https://www.utoronto.ca/news/u-t-launches-class-ancient-ethiopian-language-very-nature-university.
[Online; accessed 2020-10-08]. 2020.
[23] Chunpeng Wu et al. “Handwritten character recognition by al-
ternately trained relaxation convolutional neural network.” In:
2014 14th International Conference on Frontiers in Handwriting Recog-
nition. IEEE. 2014, pp. 291–296.
Appendix A

Unnecessary Appended Material


Figure A.1: The accuracy of the end-to-end model on new data. As shown
in the snippet output, the model makes mistakes on very similar
characters. For example, in the ninth row, one can see how hard it is to
identify the difference between the first character of the word in the
label and in the prediction.

Figure A.2: The overall network of the end-to-end model


TRITA-EECS-EX-2020:834

www.kth.se
