
electronics

Article

An Algorithm for Natural Images Text Recognition Using Four Direction Features
Min Zhang 1, Yujin Yan 2, Hai Wang 1,* and Wei Zhao 2

1 School of Aerospace Science and Technology, Xidian University, Xi’an 710071, China
2 Key Laboratory of Electronic Equipment Structure Design, Ministry of Education, Xidian University, Xi’an 710071, China
* Correspondence: wanghai@mail.xidian.edu.cn; Tel.: +86-029-8820-3115

Received: 25 July 2019; Accepted: 27 August 2019; Published: 31 August 2019

Abstract: Irregular text has widespread applications in multiple areas. Different from regular text, irregular text is difficult to recognize because of its various shapes and distorted patterns. In this paper, we develop a multidirectional convolutional neural network (MCN) to extract four direction features to fully describe the textual information. Meanwhile, the character placement possibility is extracted as the weight of the four direction features. Based on these works, we propose the encoder to fuse the four direction features for the generation of the feature code to predict the character sequence. The whole network is end-to-end trainable due to using images and word-level labels. The experiments on standard benchmarks, including the IIIT-5K, SVT, CUTE80, and ICDAR datasets, demonstrate the superiority of the proposed method on both regular and irregular datasets. The developed method shows an increase of 1.2% in the CUTE80 dataset and 1.5% in the SVT dataset, and has fewer parameters than most existing methods.
Keywords: text recognition; convolutional neural network; long short-term memory

1. Introduction
In a natural image, text may appear on various objects, such as advertising boards, road signs, etc. By recognizing the text, people can acquire the context and semantic information for image understanding. Therefore, scene text recognition attracts great interest in the field of computer vision.
Although optical character recognition (OCR) [1], which has been widely studied for decades, can successfully deal with the text in documents, it is still a challenge to identify the text in natural images. In recent years, with the development of artificial intelligence technology, a large number of text recognition methods [2–6] based on neural networks have emerged. These methods can achieve good performance when dealing with regular text. However, in practical applications, the scene text is usually irregular (such as slant, curved, perspective, multiple lines, etc.), as shown in Figure 1, and the performance of the above methods is not satisfying.
Figure 1. Examples of irregular (slant, curved, perspective, multiple lines, etc.) text in natural images.

Electronics 2019, 8, 971; doi:10.3390/electronics8090971    www.mdpi.com/journal/electronics

All of the mentioned methods can be divided into two categories: one for predicting irregular text, and another for predicting regular text. At present, there are many methods [7–12] to predict irregular text, which can be divided into two types. A common method [7–9] is to correct and then identify; for example, Shi et al. [7] propose a robust text recognizer with automatic rectification (RARE) consisting of a spatial transformer network (STN) [13] and a sequence recognition network (SRN). The irregular text is corrected into approximately regular text by using STN, and after that SRN recognizes the corrected text through a sequence recognition technique from left to right. RARE can recognize some natural images, but fails to recognize vertical text. In addition, it is difficult to optimize STN to fit various kinds of irregular text in natural images, such as curved or arbitrarily oriented text. Another method [10,11] predicts final results directly, such as when Yang et al. [10] introduce an auxiliary dense character detection task that uses a fully convolutional network (FCN) [14] to learn text visual representations and an alignment loss to guide the training of an attention model. This method recognizes two-dimensional text images effectively, but the text must be presented from left to right. Moreover, character-level bounding box annotations are required during training for utilization of the attention block.

Usually, most text recognition methods treat the text in natural images as a one-dimensional sequence and recognize it from left to right, such as [6–8]. However, in reality the text can proceed in any direction, often leading to unsatisfactory recognition results. In this paper, we propose a robust end-to-end method that utilizes four direction features to predict scene text. To represent a character in any direction, the method learns the four direction features and combines them together. Moreover, the structure of the four direction features shares the neural network weights, which provides the whole network with high accuracy without an increase in the amount of data.
As shown in Figure 2, first, we use a basic convolutional neural network (BCN) to extract the low-level features. Then, we obtain the four direction feature sequences and the character placement possibility through a multidirectional convolutional neural network (MCN). The four direction feature sequences work in four directions: horizontal vectors l (left → right) and r (right → left), and vertical vectors t (top → bottom) and b (bottom → top). Horizontal and vertical features are extracted by sampling the height and width of the feature map down to 1. Then, characters in the natural images can be represented as a sequence named the feature code through the encoder, which combines the four direction feature sequences and the corresponding placement possibility. Finally, the decoder predicts the result of the text from the feature code. In this paper, our method is not only effective at recognizing regular text, but also succeeds at predicting irregular text in complex natural images.

[Figure 2 diagram: the input image passes through BCN (low-level feature) and MCN (four feature sequences and character placement possibility p), then the encoder (feature code) and the decoder (result, e.g., “WELCOME”).]

Figure 2. Flowchart of our method.
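Read as a data flow, the flowchart in Figure 2 can be summarized by the short sketch below. This is only an illustrative sketch: every function name is a placeholder standing in for the modules described in Section 3, not an actual API, and all shapes are assumptions.

# Minimal sketch (placeholder modules, assumed shapes): the overall data flow of
# Figure 2, from input image to predicted character string.
import numpy as np

def recognize(image, bcn, mcn, encoder, decoder):
    low_level = bcn(image)                      # BCN: low-level feature map
    sequences, placement_p = mcn(low_level)     # MCN: four direction sequences + placement possibility
    feature_code = encoder(sequences, placement_p)
    return decoder(feature_code)                # decoder: character sequence, e.g. "WELCOME"

# dummy stand-ins just to show the call chain end to end
bcn = lambda img: np.zeros((25, 25, 256))
mcn = lambda f: ([np.zeros((25, 256))] * 4, np.full((25, 4), 0.25))
encoder = lambda seqs, p: np.zeros((25, 512))
decoder = lambda code: "WELCOME"
print(recognize(np.zeros((100, 100, 1)), bcn, mcn, encoder, decoder))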

The rest of the paper is organized as follows: In Section 2, theories related to the proposed
algorithm are introduced. The details of our method are described in Section 3, and the experimental
results are given in Section 4. In Section 5, overall conclusions are given.

2. Related Works

2.1. Text Recognition


In recent years, a large number of papers related to text recognition have been published. Ye and
Doermann’s survey [15] analyzes and compares the technical challenges, methods, and performance
of text recognition research in color imagery. Most traditional methods follow a bottom-up pipeline
and the others work in the top-down style. The methods based on bottom-up [1,5,6,16–18] first detect
and recognize single characters, then integrate them into words based on a language model. However,
these methods require character-level labeling, and consume a large amount of resources.
The top-down pipeline approach directly predicts the whole text from the images without
additional character-level labels. The method used in [19] can recognize 90k words, where each word
corresponds to one classification in the CNN output layer. Therefore, the model can only recognize
90k words. Many recent works, such as [2,4,6,7,10,20] make use of a recurrent neural network (RNN).
The authors of [6,20] proposed an end-to-end neural network, and regarded the natural images
recognition as a sequence labeling problem. The network uses CNN for sequence feature extraction,
RNN for per-frame prediction, and CTC [21] loss for final label sequence prediction. The method in [4]
presents a recursive recurrent neural network with attention modeling (R2AM). The model consists of
two blocks: recursive CNN for image feature extraction, and RNN with attention-based mechanism
for recognition. The method in [2] develops a focusing network (FN) to eliminate the attention drift of
the attention network (AN). However, these methods transform the image into a feature sequence
from left to right and cannot recognize irregular text (e.g., curved text) in natural images. In order to
tackle this problem, the authors of [7] proposed a solution named a spatial transformer network (STN),
which can transform the input image into a rectified image for a sequence recognition network (SRN)
to recognize. The method in [10] uses a FCN [14] to learn text visual patterns and recognizes the text
through the attention-based RNN.
In this paper, our method uses a top-down pipeline approach, which does not need to detect
individual characters. Different from existing top-down methods, our method first uses BCN and
MCN to extract the four feature sequences and their character placement possibility, then encodes
them by RNN, and finally predicts the results. At the same time, we can only use word-level labels to
train the whole network end-to-end.

2.2. BiLSTM
The RNN used in the top-down methods is usually a bidirectional long short-term memory
neural network (BiLSTM) [2,6,20]. RNN can distinguish ambiguous characters easily by utilizing the
context within the sequence. In addition, RNN can handle any length of sequence, which is very
helpful in text recognition. Meanwhile, RNN can return the back propagation error to the previous
layer and make it possible to train the whole network end-to-end.
The RNN layer consists of multiple RNN units with the same parameters. These units are arranged
in sequence. For the i th (i = 1, . . . T) unit, its input data contain the i th element of the sequence
[x1 , · · · , xi , · · · , xT ] and the output of the (i-1) th unit hi−1 . Based on the input data, the i th unit outputs
the result hi = f (xi , hi−1 ). In that way, RNN units can obtain past information for prediction. However,
RNN often suffers from the vanishing gradient problem [22], which makes it difficult to transmit the
gradient information consistently over a long time.
LSTM [23,24], as a special RNN, consists of a memory cell and three multiplicative gates: the
input gate, forget gate, and output gate. The memory cell allows the LSTM to store past contexts, while
the input and output gates allow LSTM to store contexts for the future. However, in the prediction
of natural images, both types of context are important. Therefore, we refer to [25] and combine two LSTMs into a BiLSTM. As shown in Figure 3, one of the LSTMs utilizes past context and the other utilizes future context, and after that the final prediction sequence is obtained.
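As a reading aid, the following is a minimal tf.keras sketch (not the authors' implementation) of such a bidirectional LSTM layer; the sequence length, channel depth, and hidden size are assumed values chosen only for the example.

# Minimal sketch (assumed shapes, not the paper's code): a BiLSTM layer that runs
# one LSTM left to right (past context) and a second LSTM right to left (future
# context), then concatenates the two outputs per time step.
import tensorflow as tf

L, D, H = 25, 256, 256                      # assumed sequence length, channels, hidden units
inputs = tf.keras.Input(shape=(L, D))
outputs = tf.keras.layers.Bidirectional(
    tf.keras.layers.LSTM(H, return_sequences=True),   # forward and backward copies share this config
    merge_mode="concat")(inputs)
model = tf.keras.Model(inputs, outputs)
print(model.output_shape)                   # (None, 25, 512): past and future context combined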

Figure 3. (a) The structure of the LSTM unit: a memory cell and three gates, namely the input gate, the forget gate, and the output gate. (b) The structure of BiLSTM used in our method. It combines two LSTMs, one for past context (left to right) and one for future context (right to left).
3. Framework

In order to recognize irregular text, we identify the four direction features separately, and then fuse the four direction features to get the final recognition result. Each direction feature is acquired using the method developed in [6], which is proven to be effective for regular text recognition. As shown in Figure 4, the four direction features can be defined as follows:

    the left-to-right direction feature: l, l′
    the bottom-to-top direction feature: b, b′
    the right-to-left direction feature: r, r′        (1)
    the top-to-bottom direction feature: t, t′,

where the feature sequences l, b, r, t are the outputs of MCN, and the feature vectors l′, b′, r′, t′ are the outputs of the first BiLSTM. Specifically, our method includes four parts: (1) BCN for low-level feature extraction; (2) MCN for extracting the four direction feature sequences and the character placement possibility; (3) encoder for combining the four direction feature sequences with the character placement possibility; (4) decoder for predicting the final character sequence.

[Figure 4 diagram. BCN: Conv,64, S=(2,2); Conv,128, S=(2,2); Conv,256. Rotator: ROTATE ×3. MCN: Conv,256, S=(2,1), P=(1,0); Conv,256, S=(2,1), P=(1,1); Conv,256, S=(2,1), P=(1,0); Conv,256, S=(2,1); Conv,256 and Conv,64, K=(1,1) → p. Encoder: BiLSTM,256; Dense,512. Decoder: BiLSTM,256; softmax.]
Figure 4. The network structure of our method. These parts are shown in blue, orange, purple, and green rectangles, respectively. The convolutional layers are represented as [name, channel, K = kernel size; S = pooling layer strides, P = pooling layer padding]. The dense and BiLSTM layers are represented as [name, channel].
3.1. BCN and MCN

In our model, feature extraction is performed by a basic convolutional neural network (BCN) and a multidirectional convolutional neural network (MCN).

BCN is evolved from the standard CNN model. As shown in Figure 4, the convolutional layers and max-pooling layers are used to extract the low-level features of the image. Before inputting into the network, all natural images should be scaled to 100 × 100 pixels. Since the input image is a square, the feature map obtained after convolution and pooling is also a square. On the one hand, using BCN can obtain the low-level features of the image quickly. On the other hand, by reducing the size of the feature map through pooling, the computational cost can be reduced effectively.

In this work, characters in natural images are represented from multiple perspectives, just like vectors represented in a Cartesian coordinate system. Therefore, we use MCN to extract features from multiple perspectives of the BCN feature map. The MCN network consists of five convolutional layers, and represents the feature map as a sequential feature. We use the rotator to rotate the feature maps of BCN by 90°, 180° and 270°, respectively. MCN and BCN are connected by the rotator, and MCN can get the four direction feature sequences. As shown in Figure 5, the four direction feature sequences represent four directions: left → right, bottom → top, right → left, and top → bottom. Multidirectional feature sequences can be represented as follows:

    l: (l_1, l_2, ..., l_L),  left → right
    b: (b_1, b_2, ..., b_L),  bottom → top
    r: (r_1, r_2, ..., r_L),  right → left        (2)
    t: (t_1, t_2, ..., t_L),  top → bottom.
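To make the rotation step concrete, the following is a minimal NumPy sketch under assumed shapes. It collapses one spatial axis with mean pooling purely for illustration, whereas MCN in Figure 4 reduces that axis with strided convolutions, and the mapping between rotation angles and reading directions is an assumption of the example.

# Minimal sketch (assumed shapes and direction assignments): rotating a square
# feature map and collapsing its height to obtain four direction feature sequences.
import numpy as np

feat = np.random.rand(25, 25, 256)         # square low-level feature map from BCN (H, W, D)

def to_sequence(f):
    # collapse the height axis so every column becomes one step of a length-W sequence;
    # the paper reduces this axis with strided convolutions rather than mean pooling
    return f.mean(axis=0)                   # shape (W, D)

l = to_sequence(feat)                       # left -> right
b = to_sequence(np.rot90(feat, k=1))        # rotated 90 degrees
r = to_sequence(np.rot90(feat, k=2))        # rotated 180 degrees
t = to_sequence(np.rot90(feat, k=3))        # rotated 270 degrees
print(l.shape, b.shape, r.shape, t.shape)   # each (25, 256), i.e., L x D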
The four direction feature sequences l, b, r, t have a size of L × D, where L and D represent the length of the four feature sequences and the channel number, respectively. By using MCN to extract sequential features in multiple directions, the network can not only reduce computational resources, but also achieve high accuracy with a small amount of training data.
Figure 5. The input image can be divided into a series of areas, and these areas are associated with the four direction feature sequences.

3.2. Encoder

The encoder is the process of combining the four direction feature sequences into one. By using BiLSTM, we can encode the output feature sequences l, b, r, t of MCN into the four direction feature vectors l′, b′, r′, t′. By this means, the four direction feature vectors contain the contextual information, which also represents the four directions’ features. Each character in a natural image can be described by l′, b′, r′, t′; therefore, we use MCN to calculate the possibility of the corresponding character in each direction, which is termed the character placement possibility p. For each text image, the character placement possibility is used as the weight of each feature vector, with Σ_{j=1}^{4} p_ij = 1 and the size of p equal to L × 4. As shown in Figure 6, we use the corresponding p to combine the four direction feature vectors as follows:

    h′_i = [l′, b′, r′, t′] p_i        (3)
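A minimal NumPy sketch of this weighted fusion under assumed shapes is given below; the feature vectors and the placement possibility are random stand-ins, the softmax is only there so that each row of p sums to 1 as Equation (3) requires, and the per-position indexing is our reading of the equation rather than the authors' code.

# Minimal sketch (assumed shapes): fusing the four direction feature vectors with the
# character placement possibility p, whose rows sum to 1, as in Equation (3).
import numpy as np

L, D = 25, 512
l_, b_, r_, t_ = (np.random.rand(L, D) for _ in range(4))        # stand-ins for the BiLSTM outputs
logits = np.random.rand(L, 4)
p = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)   # placement possibility, size L x 4

directions = np.stack([l_, b_, r_, t_], axis=1)                  # (L, 4, D)
feature_code = (p[:, :, None] * directions).sum(axis=1)          # (L, D): weighted combination per step
print(feature_code.shape)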

[Figure 6 diagram: the four feature sequences pass through BiLSTM to produce the feature vectors, which are weighted by the character placement possibility p and combined into the feature code.]

Figure 6. The structure of the encoder.

The combination of the four direction feature sequences is the feature code h′ = (h′_1, h′_2, ..., h′_L).

3.3. Decoder

The decoder is the process of converting the feature code into a character sequence. In the decoder block, we use BiLSTM to handle the feature code h′ by taking the contextual information into consideration, and the feature sequence h is obtained. The probability of each area in the text image is calculated by the formula y_i = softmax(h_i), and then the probabilities obtained are input to the connectionist temporal classification (CTC) layer proposed by Graves et al. [21]; finally, the character sequence is achieved.

The probability y has the size of L × C, where L is the length of the prediction sequence and C is the number of character classes. We use the probability y to predict the character sequence, but the length of the text in images is not equal to L. The CTC is specifically designed for tasks in which the lengths of the input sequence and target sequence are unequal and it is difficult to segment the input sequence to match the target sequence. Therefore, the CTC introduces the sequence-to-sequence mapping function B as follows:

    B(argmax_π P(π | p))        (4)

where B removes the repeated labels and the non-character labels. The CTC looks for an optimized path π with maximum probability through the input sequence. For example, B(−hh−e−l−ll−oo−) = hello, where ‘-’ represents the non-character label.
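For illustration, here is a minimal sketch of the mapping B itself (only the label-collapsing rule, not CTC training or beam-search decoding):

# Minimal sketch: the CTC collapse mapping B, which merges repeated labels and then
# removes the non-character (blank) label, e.g. B("-hh-e-l-ll-oo-") = "hello".
from itertools import groupby

def ctc_collapse(path, blank="-"):
    # keep one symbol per run of identical symbols, then drop the blanks
    return "".join(ch for ch, _ in groupby(path) if ch != blank)

print(ctc_collapse("-hh-e-l-ll-oo-"))   # hello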
4. Experiments

4.1. Datasets

The regular and irregular benchmarks are as follows:
CUTE80 (CT80 for short) [26] is collected for evaluating curved text recognition. It contains 288 cropped natural images for testing. No lexicon is associated.
ICDAR 2015 (IC15 for short) [27] contains 2077 cropped images, of which more than 200 are irregular (arbitrarily-oriented, perspective, or curved). No lexicon is associated.
IIIT5K-Words (IIIT5K for short) [28] is collected from the Internet, containing 3000 cropped word images in its test set. Each image specifies a 50-word lexicon and a 1k-word lexicon, both of which contain the ground truth words as well as other randomly picked words.
Street View Text (SVT for short) [17] is collected from Google Street View, and consists of 647 word
images in its test set. Many images are severely corrupted by noise and blur, or have very low
resolutions. Each image is associated with a 50-word lexicon.
ICDAR 2003 (IC03 for short) [29] contains 251 scene images, labeled with text bounding boxes.
Each image is associated with a 50-word lexicon defined by Wang et al. [17]. For fair comparison,
we discard images that contain non-alphanumeric characters or have fewer than three characters [17].
The resulting dataset contains 867 cropped images. The lexicons include the 50-word lexicons and the
full lexicon that combines all lexicon words.
ICDAR 2013 (IC13 for short) [16] is the successor to IC03, from which most of its data are inherited.
It contains 1015 cropped text images. No lexicon is associated.

4.2. Details
Network: In order to ensure that the network can recognize regular and irregular text, we input
100 × 100 pixel images, a size that mostly ensures we retain the character of the images. All convolutional
layers have 3 × 3 kernels, and the pooling layers have 2 × 2 kernels. We use batch
normalization (BN) [30] and ReLU activation after each convolutional layer. In the decoding process,
we design the BiLSTM with 37 output classes (including 26 letters, 10 digits, and an EOS symbol).
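As a reading aid, the following tf.keras sketch mirrors the BCN front end of Figure 4 under stated assumptions (single-channel input, 'same' padding); it is a sketch of the described configuration, not the authors' released code.

# Minimal sketch (assumed details: single-channel input, 'same' padding): the BCN
# front end of Figure 4 with 3x3 convolutions, BN + ReLU, and 2x2 max pooling.
import tensorflow as tf

def conv_block(x, channels, pool_strides=None):
    x = tf.keras.layers.Conv2D(channels, 3, padding="same", use_bias=False)(x)
    x = tf.keras.layers.BatchNormalization()(x)
    x = tf.keras.layers.ReLU()(x)
    if pool_strides:
        x = tf.keras.layers.MaxPooling2D(pool_size=2, strides=pool_strides)(x)
    return x

inputs = tf.keras.Input(shape=(100, 100, 1))       # 100 x 100 input image
x = conv_block(inputs, 64, pool_strides=(2, 2))    # Conv,64;  S=(2,2)
x = conv_block(x, 128, pool_strides=(2, 2))        # Conv,128; S=(2,2)
x = conv_block(x, 256)                             # Conv,256 -> 25 x 25 x 256 feature map
bcn = tf.keras.Model(inputs, x, name="bcn_sketch")
bcn.summary()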
Training: We trained our model using synthetic data [3] and real-world training datasets [17,27–29]
by the ADADELTA [31] optimization method. Meanwhile, we conducted data augmentation by
rotating each image once by a random angle from 0° to 360°. Our method was implemented under the
TensorFlow framework [32], running on a computer with an 8-core CPU, 32 GB of RAM, a Titan Xp GPU, and
Ubuntu 16.04.

4.3. Comparative Evaluation


As shown in Table 1, we compared the recognition results of our method with 17 related methods
in each regular dataset. Our method had the advantage on SVT and achieved a similar performance to
other top algorithms in the IIIT5K, IC03, and IC13 datasets. The proposed method had 0.2% accuracy
improvement over SVT-50 and 1.5% accuracy improvement over SVT-None compared to the most
effective method [2]. Compared with [2] and [19], we achieved a simpler structure. The method in [2]
is composed of two main structures: one is an attention network (AN) for recognizing characters, and
the other is a focusing network (FN) for adjusting attention by evaluating whether AN properly pays
attention to the target areas in the image. Additional position correction can make the network achieve
better results on the basis of recognition, but this step needs to obtain the position information of each
character during training. However, our network does not need additional character-level labeling in
the training stage, and using word-level labeling can train the network end-to-end. Our method is
more accurate than Cheng’s baseline [2], which removes FN that requires additional character-level
labeling. The output of the method in [19] is the whole word, so text beyond the scope of its vocabulary
cannot be recognized. The proposed algorithm, which can handle strings of arbitrary length and
random arrangement, is more suitable for natural scenes.
Table 2 shows that our method outperforms the other methods in the CT80 and IC15 datasets.
For example, our method can achieve 0.5% accuracy improvement in the IC15 dataset and 1.2% accuracy
improvement in the CT80 dataset compared with the most effective method [2]. Figure 7 shows some
pictures of dataset IC15 and CT80, which represent the problems of slant, curved, perspective, multiple
lines, and so on. Since the MCN can effectively obtain multidirectional text information, the proposed
method can well identify slant and perspective text. The encoding algorithm enables the network to
process curved text. Moreover, our network has strong robustness against noise and can recognize a
whole line of text accurately. In conclusion, this work achieves better results on irregular datasets and
higher robustness on regular datasets with less loss of accuracy.
Table 1. Results in regular datasets. ‘50’, ‘1k’ and ‘Full’ in the second row refer to the lexicon used, while ‘None’ means recognition without a lexicon.

Method                   IIIT5K                SVT            IC03                    IC13
                         50     1k     None    50     None    50     Full    None     None
ABBYY [17]               24.3   -      -       35.0   -       56.0   55.0    -        -
Wang et al. [17]         -      -      -       57.0   -       76.0   62.0    -        -
Mishra et al. [25]       64.1   57.5   -       73.0   -       81.8   67.8    -        -
Wang et al. [33]         -      -      -       70.0   -       90.0   84.0    -        -
Goel et al. [34]         -      -      -       77.3   -       89.7   -       -        -
Bissacco et al. [1]      -      -      -       90.4   78.0    -      -       -        87.6
Alsharif [35]            -      -      -       74.3   -       93.1   88.6    -        -
Almazan et al. [36]      91.2   82.1   -       89.2   -       -      -       -        -
Yao et al. [37]          80.2   69.3   -       75.9   -       88.5   80.3    -        -
Jaderberg et al. [16]    -      -      -       86.1   -       96.2   91.5    -        -
Su and Lu [38]           -      -      -       93.0   -       92.0   82.0    -        -
Gordo [39]               93.3   86.6   -       91.8   -       -      -       -        -
Jaderberg et al. [19]    97.1   92.7   -       95.4   80.7    98.7   98.6    93.1     90.8
Jaderberg et al. [16]    95.5   89.6   -       93.2   71.7    97.8   97.0    89.6     81.8
Shi et al. [6]           97.6   94.4   78.2    96.4   80.8    98.7   97.6    89.4     86.7
Shi et al. [7]           96.2   93.8   81.9    95.5   81.9    98.3   96.2    90.1     88.6
Cheng’s baseline [2]     98.9   96.8   83.7    95.7   82.2    98.5   96.7    91.5     89.4
Cheng et al. [2]         99.3   97.5   87.4    97.1   85.9    99.2   97.3    94.2     93.3
Ours                     99.0   96.9   87.2    97.3   87.4    97.8   96.4    91.0     90.1

Table 2. Results in irregular datasets. ‘None’ means recognition without a lexicon.

Method                   CT80 (None)   IC15 (None)
Shi et al. [6]           54.9          -
Shi et al. [7]           59.2          -
Cheng et al. [2]         63.9          63.3
ours                     65.1          63.5

Figure 7. Correct recognition (left); incorrect samples (right).
To better illustrate the advantages of our structure, we provide Table 3, which includes properties named end-to-end, CharGT, dictionary, and model size. End-to-end: this column is to show whether the model can deal with the image directly without any handcrafted features. As can be observed from Table 3, only the models in [6,16,19] and ours can be trained end-to-end. CharGT: this column is to indicate whether the model needs character-level annotations in training. The output of ours is a sequence, and our method only uses word-level labeling. Dictionary: this column is to indicate
whether the model depends on a special dictionary and so is unable to handle out-of-dictionary words
or random sequences. Our algorithm can handle strings of arbitrary length and random arrangement.
Model size: this column is to report the storage space of the model. As shown in Table 3, there are fewer
parameters in our model than in the models in [16,19], which means our model is smaller. Ours has
only 15.4 million parameters, slightly more than the model in [6], which can only recognize regular
text. Table 3 clearly shows that our model is more practical.

Table 3. Comparison of various methods. Attributes for comparison include: (1) End-to-end: the model being end-to-end trainable; (2) CharGT: the model needs character-level annotation for training; (3) dictionary: the model can only recognize words belonging to a special dictionary; (4) model size: the number of model parameters. M stands for millions.

Method                   End-to-End   CharGT      Dictionary   Model Size
Wang et al. [17]         ×            NEEDED      NOT NEED     -
Mishra et al. [25]       ×            NEEDED      NEEDED       -
Wang et al. [33]         ×            NEEDED      NOT NEED     -
Goel et al. [34]         ×            NOT NEED    NEEDED       -
Bissacco et al. [1]      ×            NEEDED      NOT NEED     -
Alsharif [35]            ×            NEEDED      NOT NEED     -
Almazan et al. [36]      ×            NOT NEED    NEEDED       -
Yao et al. [37]          ×            NEEDED      NOT NEED     -
Jaderberg et al. [16]    ×            NEEDED      NOT NEED     -
Su and Lu [38]           ×            NOT NEED    NOT NEED     -
Gordo [39]               ×            NEEDED      NEEDED       -
Jaderberg et al. [19]    √            NOT NEED    NEEDED       490 M
Jaderberg et al. [16]    √            NOT NEED    NOT NEED     304 M
Shi et al. [6]           √            NOT NEED    NOT NEED     8.3 M
ours                     √            NOT NEED    NOT NEED     15.4 M

4.4. Comparison with Baseline Models


To evaluate the efficiency of the proposed method, we also trained two baseline models: Without_4
and Without_P. Without_4 combines the four direction feature maps of the rotator rotation into one
through a convolutional layer. Without_P combines these features through a dense layer. We conducted
experiments on both regular and irregular text recognition datasets.
From Table 4, we can see that the baseline models do not recognize regular or irregular text
well. Unlike the proposed model and the Without_P model, the Without_4 model achieves only
39.8% accuracy in the CT80 dataset containing curved text. These comparison results indicate that the
four direction features are necessary for recognizing curved text. In both regular and irregular text
datasets, the accuracy of the Without_P model is lower than that of the proposed model. These results
demonstrate the necessity of an encoder, which can effectively combine the four direction features with
character placement possibility.

Table 4. Results from regular and irregular datasets.

Method       IIIT5K   SVT    IC03   IC13   CT80   IC15
(IIIT5K, SVT, IC03 and IC13 are regular datasets; CT80 and IC15 are irregular datasets.)
Without_4 73.3 77.4 82.4 78.9 39.8 50.4
Without_P 79.6 81.2 85.7 82.5 54.5 54.6
ours 87.2 87.4 91.0 90.1 65.1 63.5

5. Conclusions
In this paper, we present a novel neural network for scene text recognition by (1) designing BCN
and MCN to extract four direction features and the corresponding character placement possibility,
(2) using an encoder to integrate the four direction features and their placement possibility and thus
gain a feature code, and (3) applying a decoder to recognize the character sequence. Different from
most existing methods, our method can recognize both regular and irregular text from scene images.
The proposed algorithm outperforms other methods in irregular datasets and achieves a similar
performance to the top methods in regular datasets. Our method can achieve 87.4% accuracy in the
SVT dataset, 65.1% in the CT80 dataset, and 63.5% in the IC15 dataset. The experiments on regular and
irregular datasets demonstrate that our method achieves superior or highly competitive performance.
Moreover, we trained two baseline models, and the results prove that the proposed method with four
direction features and character placement possibility is effective. In the future, we plan to make it
more practical for real-world applications.

Author Contributions: Conceptualization, M.Z. and Y.Y.; methodology, H.W. and W.Z.; software, Y.Y. and
W.Z.; validation, M.Z. and H.W.; formal analysis, M.Z.; investigation, Y.Y.; resources, H.W.; data curation, W.Z.;
writing—original draft preparation, Y.Y.; writing—review and editing, W.Z.; visualization, Y.Y.; supervision, M.Z.;
project administration, M.Z.; funding acquisition, H.W.
Funding: This research was funded by the China Postdoctoral Science Foundation (2018M633471), the National
Natural Science Foundation of Shaanxi Province under grant 2019JQ-270, the AeroSpace T.T. and C. Innovation
Program, and the China Scholarship Council (201806965054).
Conflicts of Interest: The authors declare no conflict of interest.

References
1. Bissacco, A.; Cummins, M.; Netzer, Y.; Neven, H. PhotoOCR: Reading Text in Uncontrolled Conditions.
In Proceedings of the 2013 IEEE International Conference on Computer Vision, Sydney, NSW, Australia,
1–8 December 2013.
2. Cheng, Z.; Bai, F.; Xu, Y.; Zheng, G.; Pu, S.; Zhou, S. Focusing Attention: Towards Accurate Text Recognition
in Natural Images. In Proceedings of the 2017 IEEE International Conference on Computer Vision (ICCV),
Venice, Italy, 22–29 October 2017.
3. Jaderberg, M.; Simonyan, K.; Vedaldi, A.; Zisserman, A. Synthetic Data and Artificial Neural Networks for
Natural Scene Text Recognition. arXiv. 2014. Available online: https://arxiv.org/abs/1406.2227 (accessed on
9 December 2014).
4. Osindero, S.; Lee, C.Y. Recursive Recurrent Nets with Attention Modeling for OCR in the Wild. In Proceedings
of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA,
27–30 June 2016.
5. Neumann, L.; Matas, J. Real-time scene text localization and recognition. In Proceedings of the 2012 IEEE
Conference on Computer Vision and Pattern Recognition, Providence, RI, USA, 16–21 June 2012.
6. Shi, B.; Bai, X.; Yao, C. An End-to-End Trainable Neural Network for Image-based Sequence Recognition and
Its Application to Scene Text Recognition. IEEE Trans. Pattern Anal. & Mach. Intell. 2016, 39, 2298–2304.
7. Shi, B.; Lyu, P.; Wang, X.; Yao, C.; Bai, X. Robust Scene Text Recognition with Automatic Rectification.
In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas,
NV, USA, 27–30 June 2016.
8. Shi, B.; Yang, M.; Wang, X.; Lyu, P.; Yao, C.; Bai, X. ASTER: An Attentional Scene Text Recognizer with
Flexible Rectification. IEEE Trans. Pattern Anal. & Mach. Intell. 2019, 41, 2035–2048.
9. Gao, Y.; Chen, Y.; Wang, J.; Lei, Z.; Zhang, X.Y.; Lu, H. Recurrent Calibration Network for Irregular Text
Recognition. arXiv. 2018. Available online: https://arxiv.org/abs/1812.07145 (accessed on 18 December 2018).
10. Yang, X.; He, D.; Zhou, Z.; Kifer, D.; Giles, C.L. Learning to Read Irregular Text with Attention Mechanisms.
In Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence (IJCAI-17),
Melbourne, Australia, 19–25 August 2017.

11. Wang, P.; Yang, L.; Li, H.; Deng, Y.; Shen, C.; Zhang, Y. A Simple and Robust Convolutional-Attention
Network for Irregular Text Recognition. arXiv. 2019. Available online: https://arxiv.org/abs/1904.01375
(accessed on 2 April 2019).
12. Bai, F.; Cheng, Z.; Niu, Y.; Pu, S.; Zhou, S. Edit Probability for Scene Text Recognition. In Proceedings
of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA,
18–22 June 2018.
13. Jaderberg, M.; Simonyan, K.; Zisserman, A.; Kavukcuoglu, K. Spatial transformer networks. In Advances in Neural Information Processing Systems. arXiv. 2016. Available online: https://arxiv.org/abs/1506.02025 (accessed on 4 February 2016).
14. Long, J.; Shelhamer, E.; Darrell, T. Fully convolutional networks for semantic segmentation. In Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, 8–10 June 2015.
15. Ye, Q.; Doermann, D. Text detection and recognition in imagery: A survey. IEEE Trans. Pattern Anal. Mach. Intell. 2014, 37, 1480–1500. [CrossRef] [PubMed]
16. Jaderberg, M.; Simonyan, K.; Vedaldi, A.; Zisserman, A. Deep structured output learning for unconstrained
text recognition. arXiv. 2015. Available online: https://arxiv.org/abs/1412.5903 (accessed on 10 April 2015).
17. Wang, K.; Babenko, B.; Belongie, S. End-to-end scene text recognition. In Proceedings of the 2011 International
Conference on Computer Vision, Barcelona, Spain, 6–13 November 2011.
18. Wang, K.; Belongie, S. Word spotting in the wild. In Proceedings of the European Conference on Computer
Vision, Heraklion, Crete, Greece, 5–11 September 2010.
19. Jaderberg, M.; Simonyan, K.; Vedaldi, A.; Zisserman, A. Reading text in the wild with convolutional neural
networks. Int. J. Comput. Vis. 2016, 116, 1–20. [CrossRef]
20. He, P.; Huang, W.; Qiao, Y.; Loy, C.C.; Tang, X. Reading scene text in deep convolutional sequences.
In Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, Phoenix, AZ, USA,
12–17 February 2016.
21. Graves, A.; Fernández, S.; Gomez, F.; Schmidhuber, J. Connectionist temporal classification: labelling
unsegmented sequence data with recurrent neural networks. In Proceedings of the 23rd international
conference on Machine learning, Pittsburgh, PA, USA, 25–29 June 2006.
22. Bengio, Y.; Simard, P.; Frasconi, P. Learning long-term dependencies with gradient descent is difficult.
IEEE Trans. Neural Netw. 1994, 5, 157–166. [CrossRef] [PubMed]
23. Gers, A.F.; Schraudolph, N.N.; Schmidhuber, J. Learning Precise Timing with LSTM Recurrent Networks.
J. Mach. Learn. Res. 2002, 3, 115–143.
24. Graves, A. Long Short-Term Memory. Neural Comput. 1997, 9, 1735–1780.
25. Graves, A.; Mohamed, A.; Hinton, G. Speech recognition with deep recurrent neural networks. In Proceedings
of the 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, Vancouver, BC,
Canada, 26–31 May 2013.
26. Risnumawan, A.; Shivakumara, P.; Chan, C.S.; Tan, C.L. A robust arbitrary text detection system for natural
scene images. Expert Syst. Appl. 2014, 41, 8027–8048. [CrossRef]
27. Karatzas, D.; Gomez-Bigorda, L.; Nicolaou, A.; Ghosh, S.; Bagdanov, A.; Iwamura, M.; Matas, J.; Neumann, L.;
Chandrasekhar, V.R.; Lu, S.; et al. ICDAR 2015 competition on robust reading. In Proceedings of the 2015 13th
International Conference on Document Analysis and Recognition (ICDAR), Tunis, Tunisia, 23–26 August 2015.
28. Mishra, A.; Alahari, K.; Jawahar, C. Scene text recognition using higher order language priors. In Proceedings
of the BMVC-British Machine Vision Conference, Surrey, UK, 3–7 September 2012.
29. Lucas, S.M.; Panaretos, A.; Sosa, L.; Tang, A.; Wong, S.; Young, R. ICDAR 2003 robust reading competitions.
In Proceedings of the Seventh International Conference on Document Analysis and Recognition, Edinburgh,
UK, 6 August 2003.
30. Ioffe, S.; Szegedy, C. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. arXiv. 2015. Available online: https://arxiv.org/abs/1502.03167 (accessed on 2 March 2015).
31. Zeiler, M.D. ADADELTA: An Adaptive Learning Rate Method. arXiv. 2012. Available online: https:
//arxiv.org/abs/1212.5701 (accessed on 22 December 2012).

32. Abadi, M.; Barham, P.; Chen, J.; Chen, Z.; Davis, A.; Dean, J.; Devin, M.; Ghemawat, S.; Irving, G.; Isard, M.;
et al. TensorFlow: A system for large-scale machine learning. In Proceedings of the 12th USENIX Symposium
on Operating Systems Design and Implementation (OSDI ’16), Savannah, GA, USA, 2–4 November 2016.
33. Wang, T.; Wu, D.J.; Coates, A.; Ng, A.Y. End-to-end text recognition with convolutional neural networks.
In Proceedings of the 21st International Conference on Pattern Recognition (ICPR2012), Tsukuba, Japan,
11–15 November 2012.
34. Goel, V.; Mishra, A.; Alahari, K.; Jawahar, C.V. Whole is greater than sum of parts: Recognizing scene text
words. In Proceedings of the 2013 12th International Conference on Document Analysis and Recognition,
Washington, DC, USA, 25–28 August 2013.
35. Alsharif, O.; Pineau, J. End-to-End Text Recognition with Hybrid HMM Maxout Models. arXiv. 2013.
Available online: https://arxiv.org/abs/1310.1811 (accessed on 7 October 2013).
36. Almazán, J.; Gordo, A.; Fornés, A.; Valveny, E. Word Spotting and Recognition with Embedded Attributes.
IEEE Trans. Pattern Anal. Mach. Intell. 2014, 36, 2552–2566. [CrossRef] [PubMed]
37. Yao, C.; Bai, X.; Shi, B.; Liu, W. Strokelets: A Learned Multi-scale Representation for Scene Text Recognition.
In Proceedings of the 2014 IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH,
USA, 24–27 June 2014.
38. Su, B.; Lu, S. Accurate scene text recognition based on recurrent neural network. In Proceedings of the Asian
Conference on Computer Vision, Singapore, 1–5 November 2014.
39. Gordo, A. Supervised mid-level features for word image representation. In Proceedings of the 2015 IEEE
Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, 8–10 June 2015.

© 2019 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access
article distributed under the terms and conditions of the Creative Commons Attribution
(CC BY) license (http://creativecommons.org/licenses/by/4.0/).
