TECHNOLOGY,
SECOND CYCLE, 30 CREDITS
STOCKHOLM, SWEDEN 2020
Abstract
A handwritten recognition system learns patterns from a given image of text. The recognition process usually combines a computer vision task with sequence learning techniques. Transcribing text from scanned images remains a challenging problem, especially when the documents are highly degraded or carry heavy dust noise. Nowadays there are several handwritten recognition systems, both commercial and free, especially for Latin-based languages. However, no prior system has been built for ancient handwritten Ge’ez manuscript documents, even though the language holds many mysteries of the past in the human history of science, architecture, medicine and astronomy.
Key Words
CNN, CTC, Ge’ez, Handwritten Recognition, LSTM, MDRNN
Sammanfattning
A handwritten recognition system learns patterns from a given image of text. The recognition process usually combines a computer vision task with sequence learning techniques. Transcribing text from scanned images remains a challenging problem, especially when the documents are highly degraded or carry heavy dust noise. Nowadays there are several handwritten recognition systems, both commercial and free, especially for Latin-based languages. However, no prior system has been built for ancient handwritten Ge’ez manuscript documents, even though the language holds many mysteries of the past in the human history of science, architecture, medicine and astronomy.
Keywords
CNN, CTC, Ge’ez, Handwritten Recognition, LSTM, MDRNN
Acronyms
UN United Nations
List of Figures
5.1 Training and validation accuracies for the character-level recognition model: 0.9875 (98.75%) training, 0.9777 (97.77%) validation and 0.9778 (97.78%) test accuracy.
5.2 The losses of the training and the validation sets during the training of the model.
5.3 Training and validation loss for the model containing bidirectional LSTM layers; although we have minimal training resources and dataset size, the model keeps improving over time.
Contents
1 Introduction
  1.1 Background
  1.2 Problem Statement
  1.3 Research Question
  1.4 Purpose
  1.5 Goal
  1.6 Research Methodology
  1.7 Research Benefits, Sustainability and Ethical Aspects
    1.7.1 Research Benefits
    1.7.2 Research Sustainability
    1.7.3 Research Ethical Aspects
  1.8 Delimitation
  1.9 Structure of the Thesis
2 Theoretical Study
  2.1 Background
  2.2 Convolutional Neural Networks
  2.3 Recurrent Neural Networks
    2.3.1 Long Short Term Memory
  2.4 Connectionist Temporal Classification
  2.5 Related Works
3 Methodology
  3.1 Research Process
  3.2 Data Collection Techniques
    3.2.1 Dataset Labeling
  3.3 Experimental Setup
    3.3.1 Software and Hardware
    3.3.2 Cloud Configuration
4 Implementation
  4.1 Character Recognition Model
    4.1.1 Character Segmentation
    4.1.2 Building CNN for Character Recognition
  4.2 Dataset Generation
  4.3 The End-to-End Recognition Model
Bibliography
Introduction
1.1 Background
Ge’ez is an ancient South Semitic language of the Ethiopic branch which has survived as a liturgical language in Ethiopia [1], and is also taught at some international institutions that offer Semitic studies as a subject (including Uppsala University).
It remains in liturgical use in the Ethiopian and Eritrean Orthodox Tewahedo churches, the Ethiopian and Eritrean Catholic churches, and the Beta Israel Jewish community.
The closest living languages to Ge’ez are Tigre and Tigrinya, with lexical similarity of 71% and 68%, respectively [1].
The Ge’ez language has 26 distinct consonants, each marked for seven vowels, together known as fidel (26 × 7 = 182), plus 4 × 5 additional letters, 20 numerals, and eight punctuation marks; in full, the script comprises 230 syllables. The Ge’ez syllabary is one of the rare writing systems among the Semitic languages in which vowels are indicated [1]. As shown in figure 1.1, Ge’ez is written from left to right, and in ancient handwritten documents a colon separates every word.
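As a quick sanity check, the counts above add up to the full syllabary (a trivial sketch; the grouping variable names are ours):

```python
# Tally of the Ge'ez syllabary as described in the text.
base_consonants = 26        # distinct consonants (fidel rows)
vowel_orders = 7            # each consonant is marked for seven vowels
fidel = base_consonants * vowel_orders   # 182 core syllables
additional_letters = 4 * 5  # the 4 x 5 additional letters
numerals = 20
punctuation = 8

total = fidel + additional_letters + numerals + punctuation
print(total)  # 230 syllables in the full script
```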
1.4 Purpose
This thesis work has the following purposes:
1.5 Goal
The main goal of this thesis work is to develop a handwritten recognition system that can convert Ge’ez manuscript documents into editable, machine-readable text formats (such as Unicode) using current state-of-the-art machine learning and deep learning techniques.
1.8 Delimitation
We train our model to recognize only 200 syllables (198 characters and two punctuation marks) out of 230 syllables, for several reasons. In particular, Ge’ez numerals are composed of several shapes that are not linked to each other. For instance, the number four (፬) has a zero-like shape in the middle and two curved dashed lines above and below it, which makes it challenging to extract numerals from a given image using character segmentation algorithms.
This thesis includes two separate models. The first model performs character-level recognition, combining image processing for the pre-processing tasks with a CNN model for training and prediction. The second model, on the other hand, is an end-to-end recognition model which does not require an explicit segmentation technique in either the pre-processing or the transcription phase. This model uses CNN layers as a feature extractor, a Bidirectional Long Short Term Memory (BLSTM) network to find the patterns and encode the character sequences, and a CTC layer to transcribe the output.
Theoretical Study
2.1 Background
Handwritten recognition is the process of converting text on images into machine-readable and editable formats. Traditional approaches used both image processing techniques, to segment the required features from the given image, and artificial neural networks to recognize the characters [2]. As indicated in the introduction, handwritten recognition systems can be classified as offline and online. Offline handwritten recognition is inherently more challenging than online recognition, since the information is available only from the given input image, whereas in the online case features can be extracted from both the pen movement and the resulting image [12].
When we read, information from earlier words stays persistently in our brain, allowing us to get the full meaning of the context. The order of words in a sentence (especially in a written document) can have a significant effect on its meaning.
Figure 2.1 shows the overall architecture of RNNs; the left side of the picture represents the folded network, while the right side is the unfolded version of the network. In the network, the inputs are Xt: X0, X1, X2, ..., Xt; the outputs are ht: h0, h1, h2, ..., ht; and the letter A represents the activation function. The network accepts X0 as its first input from the data sequence, together with the previous state, and produces the output h0 through the activation function. The output h0 is then used as an input, together with X1, in the next step. The process continues in the same way, taking a snapshot of the current state and passing its result to the next step.
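The unrolled computation described above can be sketched in a few lines of NumPy; the dimensions and random weights are illustrative, not those of any model in this thesis:

```python
import numpy as np

rng = np.random.default_rng(0)
input_dim, hidden_dim, seq_len = 4, 8, 5

# the same weights are reused at every time step of the unrolled network
W_x = rng.standard_normal((hidden_dim, input_dim)) * 0.1
W_h = rng.standard_normal((hidden_dim, hidden_dim)) * 0.1
b = np.zeros(hidden_dim)

xs = rng.standard_normal((seq_len, input_dim))  # X0 ... Xt
h = np.zeros(hidden_dim)                        # initial state
outputs = []
for x_t in xs:
    # each step combines the current input with the previous state
    h = np.tanh(W_x @ x_t + W_h @ h + b)
    outputs.append(h)                           # h0 ... ht

outputs = np.stack(outputs)
```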
An LSTM's repeating structure has four interacting layers and three gates. As illustrated in Figure 2.2, each line (the cell state) carries a piece of specific information required by the internal computation. The LSTM uses these gates to decide which information to pass forward through the network.
Figure 2.2: LSTM gates and their interacting layers. Image taken from Colah's 2015-08 blog post, Understanding LSTMs. Accessed: November 20, 2020.
1. Forget gate: this gate decides which parts of the existing cell state to keep or discard:
ft = σ(Wf · [ht−1, Xt] + bf)
2. Input gate: this gate accepts a new value and updates the existing information. Two interacting layers decide the update: first, a sigmoid layer validates the input and decides which values to update; then, a tanh layer creates the list of new candidate values C̃t that are going to be added to the cell state:
it = σ(Wi · [ht−1, Xt] + bi)
C̃t = tanh(WC · [ht−1, Xt] + bC)
Ct = ft ∗ Ct−1 + it ∗ C̃t
3. Output gate: the last layer in the LSTM network is the output layer, which combines the input and the forget layers. The output is based on the cell state:
(a) A sigmoid layer decides which parts of the cell state are going to be transferred to the output.
(b) The cell state is pushed through a tanh layer and multiplied by the output of the sigmoid gate.
(c) The result is emitted as the new hidden state, as shown in figure 2.2.
Ot = σ(Wo · [ht−1, Xt] + bo)
ht = Ot ∗ tanh(Ct)
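Following the gate equations in this section (with W · [ht−1, Xt] implemented as one weight matrix applied to the concatenated state and input), a single LSTM step might look like this in NumPy; the dimensions are illustrative:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM step. W maps [h_prev, x_t] to the four gate pre-activations."""
    hx = np.concatenate([h_prev, x_t])
    f, i, o, g = np.split(W @ hx + b, 4)
    f_t = sigmoid(f)                      # forget gate
    i_t = sigmoid(i)                      # input gate
    c_tilde = np.tanh(g)                  # candidate values C~t
    c_t = f_t * c_prev + i_t * c_tilde    # updated cell state Ct
    o_t = sigmoid(o)                      # output gate
    h_t = o_t * np.tanh(c_t)              # ht = Ot * tanh(Ct)
    return h_t, c_t

rng = np.random.default_rng(0)
n_in, n_hidden = 4, 8
W = rng.standard_normal((4 * n_hidden, n_hidden + n_in)) * 0.1
b = np.zeros(4 * n_hidden)
h, c = np.zeros(n_hidden), np.zeros(n_hidden)
h, c = lstm_step(rng.standard_normal(n_in), h, c, W, b)
```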
CTC solves these challenges ”by allowing the network to make label
predictions at any point in the input sequence, so long as the overall
sequence of labels is correct” [12]. For a given set of inputs X, CTC
generates an output distribution for all possible Y ’s. Moreover, the
CTC algorithm does not require the exact alignment of the input and
its corresponding output.
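CTC's alignment-free behaviour rests on its collapsing rule: repeated labels are merged and blanks are removed. A minimal greedy (best-path) decoder illustrating the rule, with label indices chosen arbitrarily:

```python
BLANK = 0  # index reserved for the CTC blank label

def greedy_ctc_decode(frame_labels):
    """Collapse repeats, then drop blanks, as CTC prescribes."""
    decoded, prev = [], None
    for label in frame_labels:
        if label != prev and label != BLANK:
            decoded.append(label)
        prev = label
    return decoded

# per-frame argmax labels over 8 time steps; 0 is blank
print(greedy_ctc_decode([1, 1, 0, 1, 2, 2, 0, 3]))  # [1, 1, 2, 3]
```

Note how the blank between the two 1s keeps them as distinct output labels, while the repeated 2s collapse to one.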
Methodology
rate model that makes the dataset more suitable for machine-learning and deep-learning algorithms. The data formation process also requires finding and using the right data collection mechanism.
Another challenge we faced during data collection was the distribution of the scripts available in the books. The Ge’ez language inherently uses some characters more frequently than others. For example, as shown in Figure 3.2, a few characters occur far more often in the dataset than the rest, which results in poor generalisation of the model.
proach requires human resources for the labelling task. Many engineers still use this technique for problems where automated tools are hard to apply. The problem with this approach is that it takes a lot of time and decreases the quality of the data, since it is error-prone. Nevertheless, we used this technique for our character-level recognition model.
The second, more efficient way is to use auto-labelling tools or algorithms. Especially if the algorithm is designed for the specific problem, as we did in this project, it provides very accurately labelled data. In this thesis, we used both techniques to label and train both the character-level and the end-to-end recognition tasks.
For the end-to-end model labelling, we used a Ge’ez bible corpus and the cropped characters. We did the labelling by picking a word from the corpus, then concatenating cropped character images to form a word image. This process is described in more detail in section 4.2.
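The labelling procedure above can be sketched with NumPy: pick a word from the corpus, look up one cropped image per character, and concatenate the images horizontally. The 32×32 character size and the random glyph table here are stand-ins for illustration, not the thesis's actual data:

```python
import numpy as np

rng = np.random.default_rng(0)
H = W = 32  # assumed size of each cropped character image

# stand-in for the directory of cropped character images:
# each character label maps to a list of candidate glyph images
glyphs = {ch: [rng.integers(0, 256, (H, W), dtype=np.uint8)] for ch in "አበገደ"}

def word_to_image(word):
    """Concatenate one randomly chosen glyph per character into a word image."""
    chosen = [glyphs[ch][rng.integers(len(glyphs[ch]))] for ch in word]
    return np.concatenate(chosen, axis=1)

line = word_to_image("አበገ")  # a 3-character word image, 32 x 96 pixels
```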
Keras
Keras is an open-source neural-network API written in Python. As described by the Keras documentation, "it offers consistent & simple APIs, it minimizes the number of user actions required for common use cases, and it provides clear & actionable error messages. It also has extensive documentation and developer guides." [6] In 2017, Google's TensorFlow team decided to support Keras in TensorFlow's core library.
Tensorflow
TensorFlow is a free and open-source library that can be used in many areas; it is mainly used for machine learning applications such as deep neural networks. It is used to express the mathematical calculations required by machine learning algorithms, and to implement and execute such algorithms [16].
OpenCV
OpenCV (Open Source Computer Vision Library) is an open-source computer vision and machine learning software library [7]. In this thesis we used OpenCV version 3.4.10 for our data preprocessing tasks. Line detection, character segmentation, image resizing and word formation are done using OpenCV algorithms.
Jupyter notebook
The Jupyter Notebook is an open-source web application used to create and share documents that contain live code, equations, visualizations and narrative text [5]. Since it offers an interactive and easy web interface for running code and visualisation, we used it to test and visualise the trained models on our local machine.
Implementation
Figure 4.1: Model 1, character-level recognition with CNN. The model requires an explicit segmentation of the characters at a specific size (32×32).
The process starts by reading the list of input scanned pages from the specified path. While there are more scanned pages in the directory, do the following:
3. Convert the three-channel image (BGR colour) into a single-channel image (grayscale).
4. Apply a Gaussian filter to the grayscale image, using a 1×1 kernel size.
• if the width and height of a contour are below some threshold value (23 pixels), skip that contour
• draw a bounding box on the copied image (for demonstration purposes only)
• save the image that has the higher pixel value
• append the images to the file using the specified file name.
The problem with this approach is that a black dot on the book page is considered a character. As described in section 3.2, scripts that have lines and strokes on top are more challenging when computing the bounding box: the algorithm sometimes considers them standalone characters although they are parts of the script.
# crop the bounding-box region of the detected contour and save it;
# assumes cv2 and os are imported and x, y, w, h come from the contour
cropped = scanned_image[y:y + h, x:x + w]
s = 'fidel_' + str(name_index) + '.png'
if not os.path.exists(output_dir_path):
    os.makedirs(output_dir_path)
cv2.imwrite(os.path.join(output_dir_path, s), cropped)
name_index = name_index + 1
In this process, the CNN model learns the patterns. The whole pipeline involves pre-processing the data as described in section 4.1.1 (converting the image into a suitable size and colour channel), trainable feature extraction, and classification.
The last layer of our character recognition model has 200 neurons. Although the language we are recognising has more than 230 characters, we deal with only the most common ones; the 200 neurons of the last layer mean we classify handwritten Ge’ez characters into 200 classes. For this reason, we used a softmax activation layer to calculate the probability of each character being predicted.
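A minimal Keras sketch of such a classifier; only the 32×32 grayscale input and the 200-way softmax output follow the text, while the intermediate layer sizes are assumptions:

```python
import tensorflow as tf
from tensorflow.keras import layers

n_classes = 200  # the most common Ge'ez syllables

model = tf.keras.Sequential([
    layers.Input(shape=(32, 32, 1)),          # one segmented grayscale character
    layers.Conv2D(32, 3, activation="relu"),
    layers.MaxPooling2D(),
    layers.Conv2D(64, 3, activation="relu"),
    layers.MaxPooling2D(),
    layers.Flatten(),
    layers.Dense(128, activation="relu"),
    layers.Dense(n_classes, activation="softmax"),  # per-class probabilities
])
model.compile(optimizer="adam", loss="categorical_crossentropy",
              metrics=["accuracy"])
```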
Now the setup is ready to create the magical ancient handwritten lines of text on the image. Figures 4.3 and 4.4 are lines of text that are synthetically formed and taken from the ancient Ge’ez books, respectively. The former line of text is formed by our text_to_image() function; the texts on the line are taken from our set of images, which were generated by our character segmenter script. In contrast, the latter image (Figure 4.4) is directly cropped from one of the original books that we used as a data source.
Figure 4.3: A single line of text, formed using the Listing 2 script; the individual characters, including the spaces, are taken randomly from different handwritten documents.
Figure 4.4: A single line of text taken from an original book, presented to show the differences between the synthetic data and the original document.
The script requires the paths of both the cropped images and the corpus. The text-to-image generator works as follows. While there are characters in the corpus:
1. if the character from the corpus is available in the labels list, get the directory of the images named with the current character
if word_image is None:
    word_image = selected_image
else:
    # stitch the next character image onto the right edge of the word image
    word_image = np.concatenate((word_image, selected_image), axis=1)
word_chars += g_character
In the network:
1. The input image is fed to standard CNN layers. The first few CNN layers extract the feature maps from the given image and transfer the output to the next layer. The challenge in this approach is that standard CNN layers only accept images of a specified size (width and height), and it is impractical to find lines of text that have equal length. We therefore took the longest line in the list and padded the rest of the line images with white pixel values.
2. The outputs of the CNN layers, the feature maps, are fed into an RNN layer, specifically a bidirectional long short-term memory (BLSTM) layer. As described in section 2.3.1, LSTM networks are capable of handling sequences, which lets the model identify the relationships between the characters.
3. Finally, the output of the BLSTM layer is fed into a CTC layer, which is a transcription layer. The CTC layer takes the sequence of characters, learns their alignment with the image (including redundant characters), and uses the probability distribution to transcribe the output.
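The three steps above can be sketched as a Keras model; all layer sizes and the line-image dimensions are illustrative assumptions, and training would additionally attach a CTC loss (e.g. tf.nn.ctc_loss) to the per-frame softmax outputs:

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

img_h, img_w = 32, 256      # assumed line-image size
n_classes = 200 + 1         # 200 syllables + the CTC blank label

inp = layers.Input(shape=(img_h, img_w, 1))
# step 1: CNN feature extractor
x = layers.Conv2D(32, 3, padding="same", activation="relu")(inp)
x = layers.MaxPooling2D((2, 2))(x)
x = layers.Conv2D(64, 3, padding="same", activation="relu")(x)
x = layers.MaxPooling2D((2, 2))(x)               # -> (h/4, w/4, 64)
# make image width the time axis and flatten height x channels into features
x = layers.Permute((2, 1, 3))(x)
x = layers.Reshape((img_w // 4, (img_h // 4) * 64))(x)
# step 2: bidirectional LSTM over the frame sequence
x = layers.Bidirectional(layers.LSTM(128, return_sequences=True))(x)
# step 3: per-frame label distribution, consumed by the CTC loss/decoder
out = layers.Dense(n_classes, activation="softmax")(x)
model = Model(inp, out)
```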
Figure 4.7: The first few CNN layers of the network, which extract essential features from the input images. The Reshape and Dense layers are used to reduce dimensions. The Gaussian noise layer, on the other hand, adds noise to the data to protect against overfitting.
Figure 4.8: The middle layers of the model, BLSTM layers built to learn the deeper features of the sequences. The sequences of text encoded here will be decoded later in the CTC layer.
Figure 4.9: The final layer of the model, a CTC layer responsible for transcribing the texts. The CTC layer mainly calculates the loss of the model by taking the input label lengths, the output label lengths, the input labels, the output labels and the softmax layer output.
Result and Analysis
This chapter explains the results collected from both the character-level and end-to-end models, based on the outputs we obtained from training them. It covers the performances of the models and the benchmarking network results of the end-to-end recognition system. The first part discusses the accuracy and loss of the first model, while the second section shows the different outputs of our second model.
Figure 5.1: Training and validation accuracies for the character-level recognition. The model achieves 0.9875 (98.75%) training, 0.9777 (97.77%) validation and 0.9778 (97.78%) test accuracy.
Similarly, the losses of the training set and validation set are illustrated in Figure 5.2. The training loss keeps decreasing while the validation loss starts bending upwards. The reason is that the model begins overfitting the data after iterating over it for 20 rounds. Since the validation loss stops improving over time, we use the EarlyStopping callback object to stop further training iterations.
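The early stopping described here corresponds to Keras's EarlyStopping callback; the patience value below is an assumption for illustration:

```python
from tensorflow.keras.callbacks import EarlyStopping

# stop training once validation loss has not improved for a few epochs,
# and roll back to the weights from the best epoch
early_stopping = EarlyStopping(monitor="val_loss", patience=3,
                               restore_best_weights=True)
# passed to training as: model.fit(..., callbacks=[early_stopping])
```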
Figure 5.2: The losses of the training and the validation sets during the
training of the model.
As shown in graphs 5.3 and 5.4, the bidirectional LSTM layers perform better. The bidirectional layers maintain information from both directions, which makes them suitable for text analysis problems. In contrast, unidirectional LSTM layers only keep past data and try to apply that knowledge during prediction.
The accuracy of the model is shown in Figure 5.5. During training, we intentionally used shorter line images, padded with white pixels, to reduce the training resource requirements. The padded white pixels are collapsed by the CTC algorithm later in the transcription. We also added a few long lines of text to the training set to get a better result, but the training time and the accuracy were not as expected, owing to the small number of long line images in the dataset.
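The padding step can be sketched as follows: every line image is padded on the right with white pixels (255 in 8-bit grayscale) up to the width of the longest line:

```python
import numpy as np

def pad_lines(lines, white=255):
    """Right-pad grayscale line images with white pixels to a common width."""
    max_w = max(img.shape[1] for img in lines)
    return [np.pad(img, ((0, 0), (0, max_w - img.shape[1])),
                   constant_values=white) for img in lines]

# two illustrative line images of unequal width
lines = [np.zeros((32, 100), dtype=np.uint8), np.zeros((32, 60), dtype=np.uint8)]
padded = pad_lines(lines)  # both now 32 x 100; padding is pure white
```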
Figure 5.3: Illustration of the training and validation loss for the model containing bidirectional LSTM layers. Although we have minimal training resources and dataset size, the model keeps improving over time.
Figure 5.4: Illustration of the training and validation loss for the model containing unidirectional LSTM layers. In this model, there was no improvement when we introduced new data during training. The possible reason is that the forward pass alone was not enough to learn the sequences and predict accurately.
Figure 5.5: Predictions during training; the lines of text differ across the dataset. As we can see, the encoded texts are predicted with 99% accuracy.
Conclusion and Future Work
This section summarises the goal and the main features of the thesis work. It then points out the limitations of the project and suggestions for future work.
6.1 Conclusion
Ge’ez is a little-studied language, although it has kept records of human development in both science and spirituality. In this thesis, we investigated ways to convert handwritten documents into machine-readable and editable formats. We successfully generated, from scratch, a dataset that can be used in further studies. We also implemented the character-level and end-to-end recognition systems. We tested our models, built using current state-of-the-art deep-learning algorithms, and achieved our goal of showing how they can be used and combined to recognize patterns in an image and transcribe them into text.
CNN combined with RNN remains the best approach for problems with sequence-to-sequence image data. Although the CNN layers require an equal-sized input image, they perform well in extracting the patterns that are relevant for encoding the texts in the image. Bidirectional LSTMs have been an excellent choice for learning in both directions while alleviating the vanishing gradient problem. The connectionist temporal classification (CTC) algorithm is used to calculate the loss of the network by collecting, from the different layers, the label length and input length from the input layers and the output from the output layer.
6.2 Limitations
The study is limited to recognizing only 200 syllables (characters) out of the 230 available in the Ge’ez language, as mentioned in section 1.1, owing to several reasons.
Bibliography
[22] University of Toronto. The university is now one of the only places in the world where students can learn Ge’ez. https://www.utoronto.ca/news/u-t-launches-class-ancient-ethiopian-language-very-nature-university. [Online; accessed 2020-10-08]. 2020.
[23] Chunpeng Wu et al. "Handwritten character recognition by alternately trained relaxation convolutional neural network." In: 2014 14th International Conference on Frontiers in Handwriting Recognition. IEEE, 2014, pp. 291–296.