
Deep Learning Assisted Smart Glasses as Educational Aid for Visually Challenged Students


Hawra AlSaid, Lina AlKhatib, Aqeela AlOraidh, Shoaa AlHaidar, Abul Bashar

College of Computer Engineering and Sciences


Prince Mohammad Bin Fahd University
Al-Khobar, Saudi Arabia 31952
Email: abashar@pmu.edu.sa

Abstract— Computer vision technology has played a significant role in helping visually challenged people carry out their day-to-day activities without much dependency on other people. Smart glasses are one such solution, enabling blind or visually challenged people to “read” images. This paper is an attempt in this direction to build novel smart glasses with the ability to extract and recognize text captured in an image and convert it to speech. The system consists of a Raspberry Pi 3 B+ microcontroller which processes the image captured from a webcam mounted on the glasses of the blind person. Text detection is achieved using the OpenCV software library and the open source Optical Character Recognition (OCR) tools Tesseract and the Efficient and Accurate Scene Text Detector (EAST), which are based on deep learning techniques. The recognized text is further processed by Google’s Text to Speech (gTTS) API to convert it to an audible signal for the user. A second feature of this solution is to provide location-based services to blind people by identifying locations in an academic building using RFID technology. The solution has been extensively tested in a university environment for aiding visually challenged students. The novelty of the implemented solution lies in providing the desired computer vision functionalities of image/text recognition in a form that is economical, small-sized, accurate and built on open source software tools. The solution can potentially be used for both educational and commercial applications.

Keywords: Image Recognition; Speech Processing; Optical Character Recognition; Deep Learning; Raspberry Pi; Python.

I. INTRODUCTION

In our societies, there are many people who suffer from different diseases or handicaps. According to the World Health Organization (WHO), about 8% of the population in the Eastern Mediterranean region has vision difficulties, which include blindness, low vision and other kinds of visual impairment [1]. Such people need to be provided special facilities so that they can live comfortably. In the field of education especially, there are special schools and universities for people with special needs [2]. Most blind people and people with vision difficulties have not been in a position to complete their studies, because special schools for people with special needs are not available everywhere and most of them are private and expensive. Often the only alternative was to study at home, acquiring basic knowledge from their parents; such an education was not technical enough for them to compete with other people. Moreover, there are different levels of need, and not all levels require special places and special schools. For instance, people with vision difficulties can study with other students if they have an appropriate environment. To address this issue, we can use computer vision technology to make special aids with which visually impaired people can live as comfortably as possible.

It is observed that most blind people are intelligent and can study if they are given the chance to attend regular government-administered schools, which exist almost everywhere. It is a common misconception that people who are blind or have vision difficulties cannot live alone and need the help of other people at all times. In fact, they do not need help all the time; they can be independent most of the time and have the chance to live like other people.

One popular solution in this scenario is the use of smart glasses for visually impaired people [3]. These glasses make use of computer vision hardware and software tools (camera, image processing, image classification and speech processing). Such a solution gives visually impaired people the chance to lead a comfortable life with other people and to study in any school or university without needing help from others at every step. It has been observed that the use of smart glasses has increased the percentage of educated people, and most schools, colleges and universities are now accepting students with vision difficulties. It is expected that from the next academic year Prince Mohammad Bin Fahd University (PMU) will accept blind students for admission [4]. The college would like to start using smart glasses for the first time in this setup and help students improve their education level with minimum assistance from the instructor.

This was the motivation behind the design and development of smart glasses to help blind and visually impaired students with their studies. These glasses use computer vision technology to capture an image, extract English text and convert it into an audio signal with the aid of speech synthesis. It was also decided to add a feature for translating text/words from English to Arabic, as the majority of the students at PMU are Arabic speaking.

The main objectives of the proposed system can now be summarized as follows: capturing an image, extracting text from the image, identifying the correct text, converting text to speech, translating the text to another language, integrating the

different hardware and software modules, and testing and troubleshooting the working of the proposed system.

The rest of the paper is organized as follows. Section II provides an overview of the various solutions in the area of deep learning based computer vision techniques for implementing smart glasses for the visually impaired. The details of the proposed system design and implementation are presented in Section III. Experimental results and their implications are discussed in Section IV. Conclusions and future directions for the proposed solution are provided in Section V.

II. RELATED WORK

Text detection and recognition have been challenging problems in different computer vision fields, and many research papers have discussed methods and algorithms for extracting text from images. The main purpose of this literature review is to evaluate some of these methods and their effectiveness in terms of text detection accuracy.

End-to-end text recognition using Convolutional Neural Networks combined with unsupervised feature learning took advantage of a well-known training framework to achieve high accuracy in both the text detection and the character recognition modules. The two models were combined using simple methods to build an end-to-end text recognition system. The datasets used were ICDAR 2003 and SVT; the 62-way character classifier obtained 83.9% accuracy on cropped characters from the first dataset [5].

A novel scene text recognition algorithm depends mainly on machine learning methods. Two types of classifiers were designed to achieve higher accuracy: the first generates candidates, while the second filters out non-textual candidates. A novel technique was also developed to take advantage of multi-channel information. Two datasets were used in this study, ICDAR 2005 and ICDAR 2011, and the method achieved significant results in different evaluation scenarios [6].

PhotoOCR is a system designed to detect and extract text from images using machine learning techniques together with distributed language modeling. The goal of the system is to recognize text in challenging images, such as poor-quality or blurred ones, and it has been used in applications such as Google Translate. The datasets used were ICDAR and SVT, and the results showed that the processing time for text detection and recognition is around 600 ms per image [7].

For text recognition in natural scene images, an accurate and robust solution has been proposed that uses the Maximally Stable Extremal Regions (MSER) algorithm to detect almost all characters in an image. The datasets used were ICDAR 2011 and a multilingual dataset, and the results showed that MSER achieved 88.52% accuracy in character-level recall [8].

An end-to-end real-time text recognition and localization system used an Extremal Regions (ER) detector that covered about 94.8% of the characters; the processing time for an image with 800×600 resolution was 0.3 s on a standard personal computer. The system was evaluated on two datasets, ICDAR 2011 and SVT, achieving 64.7% image recall on ICDAR 2011 and 32.9% on SVT [9].

Text detection and localization using oriented stroke detection takes advantage of two important approaches, connected components and sliding windows. A character or letter is recognized as a region in the image that has strokes in particular directions at particular positions. The dataset used was ICDAR 2011, and the experimental results showed 66% recall, which is better than the previous methods [10].

The Efficient and Accurate Scene Text (EAST) detector is a simple and powerful pipeline that detects text in natural scenes with high accuracy and efficiency. Three datasets were used in that study: ICDAR 2015, COCO-Text and MSRA-TD500. The experiments showed that this method outperforms previous methods in both accuracy and efficiency [11].

Tesseract is an open source Optical Character Recognition (OCR) engine whose development has been sponsored by Google Inc. Version 4 of Tesseract is based on a deep learning based artificial recurrent neural network architecture called Long Short-Term Memory (LSTM). The engine supports up to 116 languages and has reasonable character recognition rates [12].

The idea of smart glasses is to create wearable computer vision based glasses for various purposes (e.g. reading, search, navigation); these uses determine the type of glasses that need to be designed and developed. In the early stages of development, smart glasses were simple, provided features for basic tasks and served as a front-end display for a remote system. Recent smart glasses, however, have become more sophisticated and provide several features for aiding blind and visually impaired people. A comparative summary of practically implemented smart glasses solutions is presented in Table I. It describes each conceptual model, its benefits, drawbacks and the improvements required to make it better. It also identifies the research gap in this area, which is filled by our proposed solution (CCES, PMU).

Based on the literature survey presented above regarding the different text detection techniques, we propose to implement a novel solution with the following features. The conceptual design of the proposed system is shown in Fig. 1.

Fig. 1: Conceptual Design of the Proposed System

Table I: Comparative Summary of Smart Glasses Solutions

eSight 3 (CNET's)
Conceptual Design: High resolution camera for image and video capture for low-vision people.
Benefits: Helps low vision people, avoiding surgery.
Drawbacks: Does not improve vision, as it is just an aid.
Improvements: Waterproof versions are under development.

Oton Glass (Keisuke Shimakage, Japan)
Conceptual Design: Symbols-to-audio conversion in normal looking glasses; supports the English and Japanese languages.
Benefits: Support for dyslexic people; converts images to words and then to audio.
Drawbacks: Only for people with reading difficulty; no support for blind people.
Improvements: Can be improved to support blind people as well, for example by including proximity sensors.

Aira (Suman Kanuganti)
Conceptual Design: Aira uses smart glasses to scan the environment.
Benefits: Aira agents help users interpret their surroundings through the smart glasses.
Drawbacks: Waiting time to connect to an Aira agent before the user can sense anything.
Improvements: Include language translation features.

Eyesynth (Eyesynth, Spain)
Conceptual Design: Consists of 3D cameras which turn the scene into a non-verbal sound signal carrying information about position, size and shape; it converts spatial and visual information into audio and is language independent.
Benefits: Allows blind or limited-sight people to ‘feel the space’ through sounds.
Drawbacks: It is expensive ($575.78) and only recognizes objects and directions.
Improvements: Could use verbal audio for better feel and navigation services.

Google Glasses (Google Inc.)
Conceptual Design: Google Glasses show information without using hand gestures; users can communicate with the Internet via normal voice commands.
Benefits: Can capture images and videos, get directions, send messages, make audio calls and perform real-time translation using the Word Lens app.
Drawbacks: It is expensive ($1,349.99), and the glasses are not very helpful for blind people.
Improvements: Reduce costs to make it more affordable for consumers.

Our proposed solution (CCES, PMU)
Conceptual Design: Helps people who have vision difficulties, especially blind people.
Benefits: Helps blind people avoid obstacles, aids in reading and learning, converts images to text, and searches for information about the words on the Internet.
Drawbacks: Currently supports only the English language, cannot be used while driving, the processing unit is separate from the glasses, and it captures objects only within a specific range of distances.
Improvements: The glasses can support other languages and can be made smaller and easier to wear.

a. Camera and ultrasonic sensor based smart glasses to capture an image having embedded text.
b. Optical character recognition software based on the EAST and Tesseract open source tools.
c. Google text-to-speech conversion of the identified text for the visually challenged person to hear.
d. RFID-based navigation system to enable the visually impaired person to explore the academic building and locate the various lecture and lab rooms.

III. SYSTEM DESIGN AND IMPLEMENTATION

The proposed system consists of two main parts, the hardware and the software. This section describes the details of the hardware and software components used. Fig. 2 shows the process diagram of the proposed system.

Fig. 2: Process Diagram of the Proposed System

A. Hardware Design & Implementation

The various sub-components of the hardware system that have been used to build the smart glasses are described below.

(i) Raspberry Pi 3 Model B+

The Raspberry Pi is a credit card-sized computer. It needs to be connected to a keyboard, mouse, display, power supply and an SD card with an installed operating system [14]. It is a low-cost embedded system that can perform many significant tasks: it can run as a no-frills PC, a pocketable coding computer, a hub for homemade hardware and more. It includes GPIO (General Purpose Input/Output) pins to control various sensors and actuators, and it is used for many purposes such as education, coding and building hardware projects. In our system it serves as the low-cost embedded controller that connects all of the I/O components together. It runs either the Raspbian or the NOOBS operating system, both of which can accomplish the required tasks; for our solution we decided to work with Raspbian.

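To make the role of the GPIO pins concrete, the following is a minimal sketch of how the two push buttons of the prototype could be wired to event handlers, assuming the gpiozero Python library and hypothetical pin assignments (GPIO17 and GPIO27); the paper does not specify the actual wiring.

# Minimal sketch: wire the two push buttons to callbacks with gpiozero.
# Pin numbers GPIO17/GPIO27 are assumptions for illustration only.
from signal import pause
from gpiozero import Button

capture_button = Button(17)    # "Button 1": capture an image and read it aloud
translate_button = Button(27)  # "Button 2": translate the last text to Arabic

def on_capture():
    print("Button 1 pressed: capture image, run OCR, speak the text")

def on_translate():
    print("Button 2 pressed: translate the recognized text to Arabic")

capture_button.when_pressed = on_capture
translate_button.when_pressed = on_translate

pause()  # keep the script alive, waiting for button events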
(ii) Digital Camera

The webcam has a 60° view angle with a fixed focus and can capture images with a maximum resolution of 1280 × 720 pixels. It is compatible with most operating system platforms, such as Linux, Windows and MacOS, and it has a USB port and a built-in mono microphone. In the solution, the webcam serves as the eyes of the person wearing the smart glasses: it captures a picture when a button is pressed (called Button 1, see Fig. 4), so that the text in the image can be detected and recognized.
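A capture routine along these lines can be written with OpenCV's VideoCapture interface; this is a sketch under the assumption that the webcam is the first video device, and the output file name is ours.

# Sketch: grab one 1280x720 frame from the USB webcam when Button 1 fires.
import cv2

def capture_image(device_index=0, path="capture.jpg"):
    cam = cv2.VideoCapture(device_index)      # open the USB webcam
    cam.set(cv2.CAP_PROP_FRAME_WIDTH, 1280)   # request the maximum
    cam.set(cv2.CAP_PROP_FRAME_HEIGHT, 720)   # supported resolution
    ok, frame = cam.read()                    # grab a single frame
    cam.release()
    if not ok:
        raise RuntimeError("Failed to read a frame from the webcam")
    cv2.imwrite(path, frame)                  # save it for the OCR stage
    return frame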
(iii) Ultrasonic Sensor

The purpose of the ultrasonic sensor is to measure distance using ultrasonic waves: the sensor emits an ultrasonic pulse and receives back its echo, and by measuring the time between the two it determines the distance to the object. It can sense distances in the range of 2-400 cm. In the smart glasses, the ultrasonic sensor measures the distance between the camera and the object carrying the text. It was observed through experimentation that the object should be between 40 cm and 150 cm away to capture a clear image (see Fig. 4).
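The paper does not name the sensor model, but the quoted 2-400 cm range matches the common HC-SR04 module; the sketch below shows the usual trigger/echo timing for such a sensor, with assumed TRIG/ECHO pin numbers.

# Sketch of one HC-SR04-style distance measurement (pin numbers assumed).
import time
import RPi.GPIO as GPIO

TRIG, ECHO = 23, 24
GPIO.setmode(GPIO.BCM)
GPIO.setup(TRIG, GPIO.OUT)
GPIO.setup(ECHO, GPIO.IN)

def distance_cm():
    GPIO.output(TRIG, True)        # a 10 us pulse triggers one measurement
    time.sleep(0.00001)
    GPIO.output(TRIG, False)
    start = end = time.time()
    while GPIO.input(ECHO) == 0:   # wait for the echo pulse to begin
        start = time.time()
    while GPIO.input(ECHO) == 1:   # its width is the round-trip time
        end = time.time()
    return (end - start) * 34300 / 2   # speed of sound ~343 m/s, one-way distance

d = distance_cm()
print("OK to capture" if 40 <= d <= 150 else "Move 40-150 cm from the text")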
(iv) RFID Sensor

A Radio Frequency IDentification (RFID) system consists of two main devices, the RFID reader and the RFID tag. The RFID tag holds digital data in an integrated circuit and uses a tiny antenna to send this information to the RFID reader. Signal frequencies are usually between 125-134 kHz and 140-148.5 kHz for low-frequency systems, and 850-950 MHz and 2.4-2.5 GHz for high-frequency systems.

The RFID reader collects information from the RFID tag with the help of electromagnetic fields; the data transfer from tag to reader is carried by radio waves. For this process to succeed, the RFID tag and the RFID reader should be within a range of about 3-300 feet. Any tagged object can be identified quickly when it is scanned. RFID has many applications, such as passports, smart cards and home applications. In our solution, the RFID reader is attached in the hall and the various classrooms so that the blind person can recognize these locations.
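As an illustration of the read-and-lookup logic, the sketch below assumes an MFRC522-style reader driven by the SimpleMFRC522 helper of the Python mfrc522 package; the tag IDs and the tag-to-room mapping are hypothetical, and in the actual deployment the reader/tag roles may be arranged differently, but the logic is the same.

# Hypothetical location lookup: read a tag ID and map it to a room name.
from mfrc522 import SimpleMFRC522

ROOMS = {                      # hypothetical tag-to-room mapping
    123456789: "Electronics Lab",
    987654321: "Embedded Systems Lab",
}

reader = SimpleMFRC522()
tag_id, _ = reader.read()      # blocks until a tag is in range
room = ROOMS.get(tag_id, "Unknown location")
print(f"You are near: {room}") # in the prototype this string is spoken aloud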
(v) Headphones

Wired headphones were used in our solution, since the Raspberry Pi 3 Model B and B+ come with an audio jack, and it is better to take advantage of this feature than to occupy one of the four USB ports, which can be useful for other peripherals. The headphones let the user listen to the text that has been captured by the camera and converted to audio, or to the translation of that text. They are small, lightweight and connected to the glasses, so the user need not worry about losing them or feel uncomfortable wearing them.

B. Software Design & Implementation

Below are the software components that have been used in the proposed system for programming the functionalities of the smart glasses.

(i) OCR Tools: Tesseract and EAST

OCR (Optical Character Recognition) is used to convert typed, printed or handwritten text into machine-encoded text. Several OCR software engines try to recognize text in images, such as Tesseract and EAST. In this project Tesseract version 4 is used because it is one of the best open source OCR engines. The OCR process consists of multiple stages, as shown in Fig. 3.

Fig. 3: Procedural steps during the OCR process

(a) Preprocessing: The main goal of this step is to reduce the noise that results from scanning the document, where broken or smeared characters cause poor recognition rates. Preprocessing is done by smoothing the digitized characters through filling and thinning. Another aim of this step is to normalize the data to get characters of uniform size, rotation and slant. Moreover, significant compression of the amount of information is achieved through thresholding and thinning techniques.

(b) Segmentation: In this process the characters or words are isolated. Most OCR algorithms segment words into isolated characters, which are then recognized individually. Usually segmentation is done by isolating every connected component.

(c) Feature Extraction: This process captures the significant features of the symbols; the two types of algorithms used are pattern recognition and feature extraction/detection.

(d) Classification: OCR systems use pattern recognition techniques to assign an unknown sample to a predefined class. One way to help with character classification is to use an English dictionary.

(e) Post-processing: This process includes grouping and error detection & correction. In grouping, the symbols are related into strings; the result of plain symbol recognition in the text is a
group of individual symbols.
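To illustrate how these stages come together in practice, here is a minimal sketch of running Tesseract 4 through the pytesseract wrapper, with Otsu thresholding standing in for the preprocessing stage; the file name is an assumption.

# Sketch: binarize the captured image and recognize its text with Tesseract 4.
import cv2
import pytesseract

image = cv2.imread("capture.jpg")
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)   # drop color information
_, binary = cv2.threshold(gray, 0, 255,
                          cv2.THRESH_BINARY + cv2.THRESH_OTSU)  # preprocessing
text = pytesseract.image_to_string(binary, lang="eng")  # LSTM-based engine
print(text.strip())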
(ii) OpenCV Libraries
OpenCV is a library of programming functions for real-time computer vision; it is cross-platform and free for use under the open-source BSD license [15]. For the installation of the OpenCV 4 libraries, Raspbian Stretch, the recommended operating system for the Raspberry Pi 3 B+, was installed, and Win32 Disk Imager was used to flash the SD card.
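OpenCV's dnn module is also what loads the pre-trained EAST detector used later in Section IV; a sketch of the forward pass is shown below, assuming the commonly distributed frozen_east_text_detection.pb graph (the input size must be a multiple of 32).

# Sketch: run the pre-trained EAST detector with OpenCV's dnn module.
import cv2

net = cv2.dnn.readNet("frozen_east_text_detection.pb")  # pre-trained graph
image = cv2.imread("capture.jpg")
blob = cv2.dnn.blobFromImage(image, 1.0, (320, 320),
                             (123.68, 116.78, 103.94),  # training-set means
                             swapRB=True, crop=False)
net.setInput(blob)
scores, geometry = net.forward(["feature_fusion/Conv_7/Sigmoid",
                                "feature_fusion/concat_3"])
# scores holds per-cell text confidences; geometry encodes rotated boxes.
# Decoding plus non-maximum suppression yields the regions passed to Tesseract.
print(scores.shape, geometry.shape)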
(iii) Google Text to Speech (gTTS) API
One of the most important functions of the smart glasses is text-to-voice conversion. To implement this task we installed gTTS (Google Text-to-Speech), a Python library that interfaces with the Google Translate API [13]. gTTS has many useful features: it can convert text of unlimited length to voice, correct pronunciation errors using customizable text pre-processors, and it supports many languages. We used gTTS to perform the language translation feature from English to Arabic (triggered by Button 2, see Fig. 4).
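A minimal use of gTTS looks as follows; the output file name and the command-line player are our assumptions (any audio player available on Raspbian would do), and passing lang="ar" selects Arabic speech for the translation feature.

# Sketch: synthesize the recognized text to speech and play it on the jack.
import os
from gtts import gTTS

recognized = "Electronics Lab & Embedded Systems Lab"

speech = gTTS(text=recognized, lang="en")  # lang="ar" for the Arabic output
speech.save("speech.mp3")                  # write the synthesized audio
os.system("mpg321 speech.mp3")             # play through the 3.5 mm jack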

IV. EXPERIMENTAL RESULTS AND DISCUSSION


Fig. 4: Procedural steps for the design and development of the prototype

A. Text Recognition from Captured Image

The main goal of this experiment was to check whether the text detector used in this solution (the pre-trained EAST text detector) and the text recognizer (OCR using Tesseract) work well. The tests showed mostly good results on big, clear text and failed on small text. We found that recognition depends on the clarity of the text in the picture, its font theme, its size and the spacing between the words. One such result is presented in Fig. 5: here the image had the text “Electronics Lab & Embedded Systems Lab”, which was recognized as “Electronics Lab Embedded Svstems Lab”, which can be said to be about 80% accurate.

Fig. 5: OCR result with EAST detector

B. Text to Speech Conversion

The idea of this experiment was to check whether the detected text is converted to audio. We used the gTTS (Google Text To Speech) library after finding that its voice quality was better and clearer than other TTS libraries such as Festival TTS, Espeak TTS and Pico TTS. The voice was clear for correctly detected words. One such result is presented in Fig. 6: here the text captured from the image was “Moonshine”, which was accurately converted into a clearly audible audio signal.

Fig. 6: Google Text to Speech result

C. Final Prototype

Fig. 7 shows images of the final prototype which we developed. As can be seen, the hardware circuit (Raspberry Pi processor board) can be worn on the arm and is mobile with the user. The glasses worn on the face consist of normal sunglasses mounted with the camera and the ultrasonic sensor. We admit that this is not very comfortable, but it has other benefits such as low cost ($330) and open source hardware (Raspberry Pi) and software (OpenCV, Tesseract and Python). Fig. 7 also shows the testing and working of the RFID-based navigation unit, which provides the visually challenged user with information (as audio signals) about the current location in the academic building.

Fig. 7: Final developed prototype

D. Challenges and Limitations

During the implementation of this system, the first challenge was to decide which microcontroller would be appropriate. After some research it was found that the Raspberry Pi has the features required for the system objectives. Initially the Raspberry Pi Zero W was the choice, because it is smaller and lighter than the other versions and could be mounted on the glasses rather than held in the hand. However, we found that the Raspberry Pi 3 has higher processing power, since it has a quad-core processor with a faster processing speed than the Raspberry Pi Zero W, as well as a larger memory and extra I/O ports for connecting peripheral devices. Since we had also decided to use OCR in conjunction with OpenCV and the Efficient and Accurate Scene Text Detector (EAST), the board needed to multitask efficiently, so we finally decided to opt for the Raspberry Pi 3.

We had originally proposed glasses with a built-in camera, but it turned out that this built-in camera was not compatible with Raspbian OS, only with Windows and MacOS. We therefore tried to install Windows IoT (an operating system from Microsoft designed for use in embedded systems) on the Raspberry Pi, but unfortunately it did not work. To solve this problem, we decided to use a webcam which is compatible with the Raspberry Pi.

V. CONCLUSIONS AND FUTURE WORK

This paper has proposed, implemented and tested novel

smart glasses for visually challenged students, which have the features to capture an image, extract and recognize the embedded text, and convert it to speech. Our design and implementation is a practical demonstration of how open source hardware and software tools can be integrated to provide a solution which is low-cost, lightweight, re-configurable and scalable. However, there are some limitations in our proposed solution which can be addressed in future implementations. Hence, we recommend the following features to be incorporated in future versions:

● In order to cater to a wide variety of users, it would be worthwhile to include a multi-lingual feature (e.g. French or Urdu) in the speech translation module.
● To improve the direction and warning messages to the user, we can include a GPS-based navigation and alert system.
● To provide more visibility of the surrounding space, we can include a wide-angle camera (e.g. 270° as compared to the 60° camera currently used).
● Finally, to provide a more real-time experience, we can include video processing instead of still images.

ACKNOWLEDGEMENT

We sincerely thank the College of Computer Engineering and Science (CCES) and the management of Prince Mohammed Bin Fahd University (PMU) for their cooperation and support in accomplishing this BS senior design project.
REFERENCES

[1] World Health Organization, "Global data on visual impairments 2010". Retrieved from: https://www.who.int/blindness/GLOBALDATAFINALforweb.pdf (Last accessed on 30th May, 2019).
[2] Best Colleges, "College guide for students with visual impairments". Retrieved from: https://www.bestcolleges.com/resources/college-planning-with-visual-impairments/ (Last accessed on 30th May, 2019).
[3] Google Inc., "Google Glasses". Retrieved from: https://en.wikipedia.org/wiki/Google_Glass (Last accessed on 30th May, 2019).
[4] Humanitarian projects, "Prince Sultan bin Abdulaziz College for the visually impaired". Retrieved from: http://www.princemohammad.org/en/Initiatives-College-for-the-Visually-Impaired.aspx (Last accessed on 30th May, 2019).
[5] T. Wang, D. J. Wu, A. Coates, A. Y. Ng, "End-to-end text recognition with convolutional neural networks", IEEE 21st International Conference on Pattern Recognition (ICPR), pp. 3304-3308, 2012.
[6] H. I. Koo, D. H. Kim, "Scene text detection via connected component clustering and non-text filtering", IEEE Transactions on Image Processing, 22(6), pp. 2296-2305, 2013.
[7] A. Bissacco, M. Cummins, Y. Netzer, H. Neven, "PhotoOCR: Reading text in uncontrolled conditions", Proceedings of the IEEE International Conference on Computer Vision, pp. 785-792, 2013.
[8] X. C. Yin, X. Yin, K. Huang, H. W. Hao, "Robust text detection in natural scene images", IEEE Transactions on Pattern Analysis and Machine Intelligence, 36(5), pp. 970-980, 2014.
[9] L. Neumann, J. Matas, "Real-time scene text localization and recognition", Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3538-3545, 2012.
[10] L. Neumann, J. Matas, "Scene text localization and recognition with oriented stroke detection", Proceedings of the IEEE International Conference on Computer Vision, pp. 97-104, 2013.
[11] X. Zhou, C. Yao, H. Wen, Y. Wang, S. Zhou, W. He, J. Liang, "EAST: an efficient and accurate scene text detector", Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5551-5560, 2017.
[12] GitHub Inc., "Tesseract OCR". Retrieved from: https://github.com/tesseract-ocr (Last accessed on 30th May, 2019).
[13] Python Software Foundation, "Python". Retrieved from: https://www.python.org/ (Last accessed on 30th May, 2019).
[14] Raspberry Pi Foundation, "Raspberry Pi 3 B+". Retrieved from: https://www.raspberrypi.org/ (Last accessed on 30th May, 2019).
[15] OpenCV. Retrieved from: https://opencv.org/ (Last accessed on 30th May, 2019).

