Is204 - 6
by
A Research Paper Submitted to the Mapúa Senior High School Department in Partial
Fulfillment of the Requirements for
Mapúa University
October 2020
Chapter 1
INTRODUCTION
Digital Image Processing (DIP) is one of the most popular research fields in the study of
machine learning. It is the process of recognizing data and committing it to a form that a
computer can recognize, with applications such as facial recognition and the scanning of
documents. DIP is an important part of modern technology and lifestyle, and many processes
require that the methods they use give reliable results. One such technique is Optical
Character Recognition (OCR), which has been implemented in a variety of services. Tafti et al.
(2016) write that it is a “classic machine learning challenge” and that it has helped develop
the modern world. The researchers aim to perform a comparative analysis of the performance of
Google Documents’ OCR and MATLAB’s built-in OCR, primarily to compare the reliability of a
free, well-known, and easy-to-use service provided by Google with that of the OCR function of
the MATLAB programming language, in terms of their capacity to translate special characters.
In 2015, a performance comparison study done by Vijayanari & Sakila found that while
Google Documents was successful in detecting characters, it was not at all successful in
evaluation done by Tafti et al. (2016) says that Google Documents has promising results with
high accuracy rates in most image types except in blurry, noisy, handwritten, and skewed images.
Jasrotia and Malik (2018) designed a program that uses MATLAB’s OCR function with a webcam,
which proved to be successful, but their work did not include any samples that
Current Optical Character Recognition (OCR) technology has advanced to the point
where it can detect the presence and number of characters in a line. While most OCR software
performs well when translating English alphabetical characters, it is with special
characters that performance is found wanting. Open-source software is being developed to
address this, but examples are limited to the integration of other languages with
English characters, such as Portuguese (Laurie, 2019). There is a gap in information on special
The study aims to find whether MATLAB or Google Documents is better than the other at
translating special characters. The researchers plan to use MATLAB’s built-in OCR software and
compare the results with Google Docs’ automatic OCR feature, which has a high success rate for
Developers and programmers would find this study useful, as the researchers document
the capabilities of MATLAB OCR. Persons who need to know the capabilities of Google
Documents’ automatic OCR software compared to software that needs to be set up, such as
MATLAB, may also find this document useful. There is further value in studying both of these
services, such as integrating them into a software project for plagiarism detection or for
streamlining processes.
The study is limited to the use of OCR. No other machine learning technique will be
used in testing image processing quality. MATLAB has most of the necessary functions to aid in
the project, with some modifications needed to enhance translation quality. The research is
limited to the comparison of the image-to-text translation software of MATLAB and the OCR
feature of Google Documents. MATLAB will be modified through its OCR training feature. No
Chapter 2
REVIEW OF LITERATURE
OCR is a process used to interpret documents and convert them into editable
text. It comprises various processes, from preparing the scanned documents to the actual
recognition and conversion of the text. The researchers will use the OCR method
to scan text and images so that other programs can make use of the output.
Figure 1. OCR Processes
In a book about Optical Character Recognition (OCR) by Chaudhuri et al., it is
defined as “the process of classification of optical patterns contained in a digital image”. Paper
documents and digital images captured by a camera can be turned through OCR into editable text
that a computer can process. Although the field has advanced over three decades
and is applied in many fields of science and various industries, it is said that the technology
has not yet achieved the same level of accuracy as the human eye. The conventional way of
inputting data to a computer is through a keyboard; however, this is not always the most
efficient method when faced with a substantial amount of text. OCR can be used to solve a
problem like this and simplify the process without the need for much human intervention. OCR
stemming from machine
sufficient training data would be able to turn images of text into text files with high accuracy. In
a study on the general state of plagiarism detection technology, there has been a certain
lack of capacity when identifying plagiarized figures, tables, equations, and scanned
handwritten text; the reason for this is that the font in a written paper is different from the
font used by the machine. According to a study by Brown, Fay and Walker, it is undoubtedly one
The purpose of OCR is to scan a document and convert it to editable text. This can be
used by various programs. According to a study about Telugu fonts and OCR by Anuradha
(2012), any OCR system is composed of a preprocessing stage followed by a recognition stage and
then a post-processing stage, although the last one does not occur as often as the first two.
Preprocessing is where the scanned document is translated into editable code recognized by
the computer, and recognition is where the identified characters are converted into editable
text. Preprocessing varies depending on the document and includes improving the
quality of the image. There is also the post-processing stage where accuracy can be increased if
Nicomsoft and how-ocr-works, which are websites dedicated to OCR technology, pre-processing
is where the preparation happens. It consists of various stages that fix the scanned document,
and the number of stages depends on how many issues are encountered. One stage is de-skew,
where the scanned document is rotated if it is not aligned properly. There is also the
despeckle or noise reduction stage, which fixes spots caused by dust, scratches, and other
minor defects. Another stage is binarization, which is always performed when using OCR. This
is where the image is converted to black and white, black being the text and white being the
background, so that the computer can differentiate the text from the background. The
line-removal stage cleans up non-glyph boxes and lines. There is also the layout analysis or
zoning stage, where the text is divided into blocks and the order of the blocks (first, second,
and so on) is determined. The line and word detection stage establishes the baseline of the
words or letters and, if necessary, separates them. In the script recognition stage, the
writing script of a multilingual document may change partway through the text, and this stage
identifies where it changes so that the appropriate OCR can handle that part of the document.
There are also two segmentation stages: the first is word segmentation, which isolates one
word from another, and the other is character segmentation, where characters are separated
because some lines or parts of a character overlap another character. The last stage is to
normalize the scanned document by fixing its aspect ratio, scale, and so on, in order to
prepare for the actual recognition. Next comes the actual recognition, which is important
since it is the main algorithm of OCR. Some characters are very similar, e.g. “I”, “l”, and
“1”, and the algorithm might produce a different output for unclear images. The result depends
on the algorithm used, but text recognition serves to tell characters apart. An example of
such an algorithm is matrix matching, which compares images pixel by pixel. These
studies also gave the researchers a complete overview of how OCR works.
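As a rough illustration of two of the stages described above, binarization and matrix matching, the following Python sketch thresholds a small grayscale image and then compares it pixel by pixel against stored templates. The 3x3 templates, the threshold of 128, and the function names are assumptions made for this example; they are not taken from MATLAB, Google Documents, or the cited sources.

```python
from typing import Dict, List

Image = List[List[int]]  # rows of pixel values


def binarize(gray: Image, threshold: int = 128) -> Image:
    """Map dark pixels (text) to 1 and light pixels (background) to 0."""
    return [[1 if px < threshold else 0 for px in row] for row in gray]


def match_score(glyph: Image, template: Image) -> float:
    """Fraction of pixels on which two equally sized binary images agree."""
    total = len(glyph) * len(glyph[0])
    same = sum(
        1
        for g_row, t_row in zip(glyph, template)
        for g, t in zip(g_row, t_row)
        if g == t
    )
    return same / total


def recognize(glyph: Image, templates: Dict[str, Image]) -> str:
    """Return the label of the template that agrees with the glyph most."""
    return max(templates, key=lambda label: match_score(glyph, templates[label]))


# Toy 3x3 templates (hypothetical, for illustration only).
TEMPLATES: Dict[str, Image] = {
    "I": [[0, 1, 0], [0, 1, 0], [0, 1, 0]],
    "L": [[1, 0, 0], [1, 0, 0], [1, 1, 1]],
    "-": [[0, 0, 0], [1, 1, 1], [0, 0, 0]],
}

# A grayscale scan of the letter "L": dark strokes on a light page.
scan: Image = [
    [30, 220, 240],
    [25, 200, 235],
    [40, 60, 120],
]

glyph = binarize(scan)
print(glyph)
print(recognize(glyph, TEMPLATES))
```

Real OCR engines work on far larger glyphs and typically use more robust features than raw pixel agreement, so the sketch is only meant to show why similar shapes such as “I” and “l” can be confused when a few pixels differ.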
Applications of OCR
OCR, or Optical Character Recognition technology, has a variety of uses in
academics and technology, and can even help people with impairments. Praveena
et al. (2019) created the Pi Book Reader, a device that can read
ebooks out loud in order to help visually impaired people follow the story. The device works
with the assistance of OCR and image processing: image processing is used to capture
and process the image, OCR technology extracts the text, and the extracted text is
then converted from text to speech and read out loud through a speaker or earphones for the
Plagiarism is a major issue in academics today, but it is not a modern issue, since the act of
plagiarizing goes back centuries. The modern era worsens the issue due to the emergence of
the internet, through which people can easily share information with one another.
Any person can access the World Wide Web and find information they can plagiarize. There
are various loopholes in most Plagiarism Detector (PD) tools today; a study called
Turnitoff by Heather (2010) examines the faults of the well-known PD tool Turnitin in the
early 2010s. Plagiarism detector tools work by extracting the text, searching for the extracted
text online or in a large database, checking whether the contents are plagiarized, and showing
the results. The second stage is the most complex; however, the researchers will focus on the
first stage, as this is where attacks usually happen. According to Heather (2010), the first
stage has a loophole where one can use an image disguised as text in a PDF. Most tools today
will not detect plagiarized content if it is in an image, because they only scan the editable
text, not the images. This loophole can be closed by using image processing, specifically
OCR.
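The loophole and its fix can be sketched in Python as an extraction stage that also runs OCR over embedded images instead of reading only the text layer. The `Page` structure, the `extract_text` function, and the stand-in `fake_ocr` engine are hypothetical names invented for this illustration; they do not describe how Turnitin or any other detector is actually implemented.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Page:
    text: str = ""                                      # selectable text layer
    images: List[bytes] = field(default_factory=list)  # embedded images


def extract_text(pages: List[Page], ocr: Callable[[bytes], str]) -> str:
    """Collect the text layer of every page and OCR every embedded image."""
    parts: List[str] = []
    for page in pages:
        if page.text:
            parts.append(page.text)
        # Without this step, text hidden inside images is never checked.
        parts.extend(ocr(image) for image in page.images)
    return "\n".join(part for part in parts if part.strip())


def fake_ocr(image: bytes) -> str:
    """Stand-in OCR engine: here the "image" bytes simply hold the text."""
    return image.decode("utf-8")


document = [
    Page(text="Original introduction."),
    Page(images=[b"Copied paragraph saved as a picture."]),
]
print(extract_text(document, fake_ocr))
```

Passing the OCR engine in as a parameter keeps the sketch independent of any particular tool, so either engine compared in this study could in principle be plugged in at that point.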
In a research paper by Vamvakas, Gatos, Stamatopoulos and Perantonis, OCR is
also used to recognize historical texts, either printed or handwritten, even without any
prior knowledge of the font. This is done by creating a database for the OCR in which the
program stores the data of previous documents and uses it as training data to eventually
recognize the font. With this method, the more documents the program scans, the more knowledge
of various fonts it gains. This is also one of the uses of OCR where people could save time
Mulay and Puri (2015) conducted a study on HawkEye, a mobile system that is
used to detect code cloning. Code cloning is a kind of plagiarism where the code of a program
is presented as one’s own work. HawkEye uses a multi-language OCR system and
can be used by simply taking a picture of the code with a mobile camera, after which HawkEye
detects whether the code is plagiarized work. The OCR converts the
image, extracts the code while removing unnecessary text, and converts it into an editable text
file. That study is very similar to the present study since it uses OCR as a way to
improve current plagiarism detection tools. HawkEye was used to help detect plagiarism
in program code, while the current study is about helping current plagiarism tools as a
whole.
MATLAB
MATLAB is a simple yet powerful programming language, but it has a rather niche use.
It also has powerful features for image detection and
processing, and the researchers find the flexibility of the language advantageous compared to
other popular languages such as Python. MATLAB OCR has various add-ons that can recognize
various languages as well as math equations. The number of OCR applications that can detect
complex mathematical equations and can be modified is limited; this is why MATLAB is the
According to Dalal and Daiya (2018), machine replication of human functions such as
reading is hard to achieve. However, over the past years, technology has advanced, and now
machines can read just as much as humans do. This is achieved by using the technology of OCR
MATLAB is usually used in engineering and intensive computing. MATLAB has various
toolboxes and functions, which is why, according to Goyal (2019), MATLAB is a very convenient
A study on processing images with MATLAB by Abdullah, Palash, Rahman, Islam and
Alim (2016) states that image processing is the processing of images through mathematical
operations using any form of signal processing. Most image processing treats an image as a
two-dimensional signal, with x and y axes, and applies standard signal processing to it. It
usually revolves around the actual processing of the images, but sometimes it also includes the
recognition of characters through optical devices. OCR can therefore be considered an example
of image processing. Like OCR, it consists of step-by-step stages: inputting an image,
analyzing and manipulating the image, and showing the altered image as output. The study
includes the representation of color and image as data, in which the color would not be an RGB
value; instead, it is a fixed color that is mixed together, resulting in an RGB value, but with
different numerical values to easily
A study by Tiwari, Mishra, Bhatia, and Yadav (2013) on the use of OCR
in MATLAB states that a variety of technical computing challenges can be
addressed more easily in MATLAB than in standard programming languages such as C, C++, Java,
and FORTRAN. It is also possible to transform an image or other data, such as sound, into a
matrix and then perform multiple operations on it to get the desired effects and values. A
broad variety of applications are possible, including signal and image analysis, image
acquisition, and neural networks. The accuracy of MATLAB was calculated from the output, using
samples to demonstrate the precision of the MATLAB OCR algorithm on English handwritten and
sample text images. The authors scanned a sample paper at 300 dpi using an HP
Deskjet scanner. Afterwards, the files were screened, binarized, cropped, and resized.
Character segmentation was performed on each line, taking into account the
characteristics of English Verdana font templates. The recognition precision was 85% to 90% due
Even with all these OCR and image processing applications, there is still a very limited
number of programs that can recognize Bangla characters, and even those that do cannot
recognize the whole Bangla character set. A study by Hossain, Ahmed, Sarkar, and Al-Amin (2018)
uses MATLAB to develop a system that can recognize the Bangla alphabet. This is
achieved by using various OCR processes such as binarization, noise removal, segmentation,
feature extraction, and recognition. The proposed system is developed using the following
steps: it starts by placing a printed Bangla script in a scanner, then takes the raw
scanned document and converts the scanned image into a grayscale image. The grayscale
image is then converted into a binary image, and the binary image goes through a segmentation
and feature extraction process. The output then goes through classification, and
post-processing enhances the accuracy, with editable text as the final output. The
said study can benefit the present study as both use the process of OCR in MATLAB. Even
though that study was used to develop a system that can recognize Bangla characters, the
researchers could still use it as a reference to develop a system that is leaning more towards
Google Documents
A study by Benito and Munoz (2013) indicated that making use of Information and
Communication Technology (ICT) is appropriate for improving educational environments. With the
goal of enabling the acquisition of generic ICT competences for working online, students were
given a text to read and review individually. Groups were then formed to work on the document
through Google Docs. After presenting their work, students were given
questionnaires that would provide statistical data regarding their knowledge of and opinion
about Google Docs. Results show that 75% of the class had no knowledge of Google Docs prior to
the activity. However, 92% said that they would continue to use it for other educational and
professional documents. This educational experience was very satisfactory for students and
professors alike.
The Google Docs OCR function has been overlooked; however, it is still a reliable, free, and
accessible function. In research conducted by Tafti et al. (2016), the Google Docs OCR function
produced promising results in evaluating the given dataset without using advanced image
processing procedures such as denoising and image registration. Google Docs OCR achieved
74% accuracy in analyzing colored images and 75% after low-level image
processing was performed on the colored images. Additionally, Google Docs OCR scored 77%
accuracy for gray-scale images and 81% accuracy for low-level processed gray-scale
images. Results showed that Google Docs outperformed other applications that utilize OCR
METHODOLOGY
Flowchart of Methods
Google Documents already has an OCR function that the researchers could use as a
reference for the program developed in MATLAB. The researchers would determine the accuracy
percentage of this function and compare it with other OCR tools. The effectiveness of
this program in character accuracy and special character accuracy (+, =, (, ), /, !, |, ∫, etc.)
will be used as a benchmark to measure how effective the researchers’ prototype is. This study
will use the language’s built-in OCR capabilities in getting its results; the researchers used
the built-
To input a sample image into Google Documents OCR, uploading the image, right-clicking it,
and opening it in a word file is all that is needed. Inputting a sample image into MATLAB is
not as straightforward: it requires some basic code to be set up in which bounding boxes need
to be measured. After entering the code, locating the image file to be translated, and running
the file, the output can then be found in the variable that the output was assigned to in the
Comparative Analysis
The research uses the method of Vijayanari and Sakila (2015). Two performance
measures are taken: conversion accuracy and error rate. Conversion accuracy (CA) counts all
characters, letters, and numbers that were converted, whether correctly or incorrectly, as long
as the character was registered by the program. Error rate is how much of the text was not
converted at all. Special character accuracy (SA) and special character error rate (SER) work
in the same way but
The researchers would count the respective characters for each variable in each sample’s
results and divide that number by the total count for that variable. After CA and SA
have been calculated for all samples, the researchers would calculate the mean of each.
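The calculation described above can be sketched in Python as follows. The counts are made up for illustration, and the `SampleCounts` structure and function names are assumptions introduced here; the sketch only shows the arithmetic of dividing converted characters by total characters per sample and then averaging, not the researchers’ actual tooling.

```python
from dataclasses import dataclass
from statistics import mean
from typing import List, Tuple


@dataclass
class SampleCounts:
    total_chars: int        # characters in the source image
    converted_chars: int    # characters the program registered
    total_special: int      # special characters in the source image
    converted_special: int  # special characters the program registered


def rate(converted: int, total: int) -> float:
    """Share of items converted, as a percentage."""
    if total == 0:
        raise ValueError("sample has no characters of this kind")
    return 100.0 * converted / total


def mean_accuracies(samples: List[SampleCounts]) -> Tuple[float, float]:
    """Return (mean CA, mean SA) over all samples, in percent."""
    ca = [rate(s.converted_chars, s.total_chars) for s in samples]
    sa = [rate(s.converted_special, s.total_special) for s in samples]
    return mean(ca), mean(sa)


# Made-up counts for three samples, for illustration only.
samples = [
    SampleCounts(40, 38, 10, 7),
    SampleCounts(50, 50, 8, 8),
    SampleCounts(20, 15, 5, 2),
]
mean_ca, mean_sa = mean_accuracies(samples)
print(f"mean CA = {mean_ca:.1f}%, mean SA = {mean_sa:.1f}%")
```

Averaging the per-sample percentages, rather than pooling all characters into one ratio, gives every sample equal weight regardless of its length, which matches the description of calculating a mean after each sample has its own CA and SA.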
The following table shows all one hundred samples that the researchers used. A check
mark signifies that the program has translated the image with perfect accuracy. If not, then the
[Table: samples 1 to 100, one row per sample; the check marks and other cell contents were not recoverable.]
Performance Measures
Continuing the method of Vijayanari and Sakila (2015), two strategies are taken for character
accuracy and special character accuracy. In order to find character accuracy (CA) and character
character error rate (SER) is very similar. It measures if the program correctly converted all
The equations for accuracy were borrowed from Vijayanari and Sakila (2015) and also
correspond to the formula for accuracy given by Tafti et al. (2016) in their evaluation of Google
[Table: results for samples 1 to 100, one row per sample; the cell contents were not recoverable.]
Statistical Analysis
With n > 30, the researchers used a z-test to compare the mean accuracy rates in the data. The
null hypothesis is that Google Documents and MATLAB have the same accuracy rates for
special characters, and the alternative hypothesis is that MATLAB has a higher or lower
rate of accuracy. This formula of z-test where x̅ is the sample mean, µ is the population mean, σl
[Table with entries for Google Documents and MATLAB; the values were not recoverable.]
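The z-test named in this section can be sketched in Python as follows. Since the formula itself did not survive in the text, the sketch assumes the textbook one-sample form z = (x̅ − µ) / (σ / √n), with σ read as the population standard deviation and n as the sample size; both readings are assumptions, and the numbers below are made up rather than taken from the study.

```python
import math


def z_statistic(sample_mean: float, pop_mean: float, sigma: float, n: int) -> float:
    """z = (x_bar - mu) / (sigma / sqrt(n))."""
    return (sample_mean - pop_mean) / (sigma / math.sqrt(n))


def two_sided_p_value(z: float) -> float:
    """Probability of a value at least as extreme as z under the standard normal."""
    phi = 0.5 * (1.0 + math.erf(abs(z) / math.sqrt(2.0)))  # normal CDF at |z|
    return 2.0 * (1.0 - phi)


# Made-up numbers for illustration only; they are not results from the study.
z = z_statistic(sample_mean=72.0, pop_mean=70.0, sigma=10.0, n=100)
p = two_sided_p_value(z)
print(f"z = {z:.2f}, p = {p:.4f}")
```

A two-sided p-value is used because the alternative hypothesis allows MATLAB’s accuracy to be either higher or lower; the null hypothesis would be rejected when p falls below the chosen significance level.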
REFERENCES
Abdullah, Palash, Rahman, Islam & Alim. (2016). Digital Image Processing Analysis using
Matlab. http://www.ajer.org/papers/v5(12)/Q05120143147.pdf
Al-Zuhairi, Maher. (2018). Re: What is the minimum sample size required to train a Deep
Learning model - CNN?. Retrieved
from:https://www.researchgate.net/post/What_is_the_minimum_sample_size_required_t
o_train_a_Dep_Learning_model-CNN/5a930d3602d229c2506b5fbf/citation/download.
Basavaprasad B. & Ravi M. (2014). A Study On The Importance Of Image Processing And Its
applications.
https://pdfs.semanticscholar.org/7656/d3db8962a5a75d162842065319155db73af8.pdf
Batomalaque, Camacho, Dalida, & Delmo. (2019). Image to Text Conversion Technique for
Anti-Plagiarism System.
http://ijasc.ascons.org/digital-library/15866?fbclid=IwAR2EagiHS1VTK8S7aFDviocMm
kBZXbySgqmYIGfgCerusoKQ98t4--ObBv8
Benito, Munoz. (2013). Google Docs: an experience in collaborative work in the University.
Enseñanza & Teaching, 30(1), 159 – 180. doi:10.14201
Dalal, J., & Daiya, S. (2019). Image Processing Based Optical Recognition using Matlab.
http://www.ijesrt.com/issues%20pdf%20file/Archive-2018/May-2018/51.pdf
Eisa, T., Salim, N., & Alzahrani, S. (2015). Existing plagiarism detection techniques: A
systematic mapping of the scholarly literature. Online Information Review. 39. (pp.
383-400). https://doi.org/10.1108/OIR-12-2014-0315
Foltynek, T., Meuschke, N., & Gipp, B. (2019, October). Academic Plagiarism Detection: A
Literature Review. ACM Comput. Surv, 52(6). https://doi.org/10.1145/3345317
Ivo V. (2014). How OCR Works, A Close Look at Optical Character Recognition.
https://how-ocr-works.com/
Jasrotia, D., & Malik, A. (2018). Webcam Based Optical Character Recognition Using Matlab.
International Journal of Engineering Sciences & Research Technology 7(8),
https://doi.org/10.5281/zenodo.1336727
Mohammad Litton H., Tafiq A., Sarkar S., Al-Amin Md., (2019). Development of an Alphabetic
Character Recognition System Using Matlab for Bangladesh
http://www.ijsrp.org/research-paper-0119.php?rp=P858129
Praveena, V., Shruthi, S., Narmadha, S., & Menaga, D. (2019). A Developmental Approach of
OCR Based Assistive System For Visually Impaired People. IEEE 5th International
Conference on Science, Technology, Engineering, Mathematics. Retrieved from:
https://www.academia.edu/44332993/A_developmental_approach_of_OCR_based_assist
ive_system_for_visually_impaired_people?fbclid=IwAR2nDLIgODY7WwIA89DRRdvr
ZM3MdKBafs1BZRqwSOr_HPQG-nScQmVtmP4
Sahu, Narendra & Sonkusare, Manoj. (2017). A Study on Optical Character Recognition
Techniques. International Journal of Computational Science, Information Technology
and Control Engineering.
https://www.researchgate.net/publication/313334780_A_Study_on_Optical_Character_R
ecognition_Techniques
Tafti, A., Baghaie, A., Assefi, M., Arabnia, H., Yu, Z., & Peissig, P. (2016). OCR as a Service:
An Experimental Evaluation of Google Docs OCR, Tesseract, ABBYY FineReader, and
Transym. Advances in Visual Computing. ISVC 2016. Lecture Notes in Computer
Science, https://doi.org/10.1007/978-3-319-50835-1_66
Tiwari, S., Mishra, S., Bhatia, P., & Yadav, Km Praveen. (2013). Optical Character Recognition
using MATLAB.
http://ijarece.org/wp-content/uploads/2013/08/IJARECE-VOL-2-ISSUE-5-579-582.pdf
Vijayanari, S., & Sakila, A. (2015). Performance Comparison of Different OCR Tools.
International Journal of Ubicomp (IJU), 6(3). https://doi.org/10.5121/iju.2015.6303