Performance Comparison of Google Documents OCR and MATLAB OCR in terms of Special Characters

by

Vincent Carl V. Salvador


Wifraim San R. Miguel
Rockwel Chester D. Rios
John Gabriel B. Romanes

A Research Paper Submitted to the Mapúa Senior High School Department in Partial
Fulfillment of the Requirements for

Practical Research (RES02)

Mapúa University
October 2020
Chapter 1

INTRODUCTION

Digital Image Processing (DIP) is one of the most popular research fields in the study of
machine learning. It is the process of recognizing data and committing it to a form that a
computer can process, with applications such as facial recognition and the scanning of
documents. DIP is an important part of modern technology and lifestyle, and many processes
require that the methods they use give reliable results. One such method is Optical Character
Recognition (OCR), which has been implemented in a variety of services. Tafti et al. (2016)
write that it is a “classic machine learning challenge” and that it has helped develop the
modern world. The researchers aim to perform a comparative analysis of the performance of
Google Documents’ OCR and MATLAB’s built-in OCR. The primary goal is to compare the
reliability of a free, well-known, and easy-to-use service provided by Google with that of the
OCR function of the MATLAB programming language, in terms of their capacity to translate
special characters.

In 2015, a performance comparison study by Vijayanari and Sakila found that while Google
Documents was successful in detecting characters, it was not at all successful in converting
special characters (Σ, “, ≥, ≤, =, +, *, -, /, ^, %, #, |, etc.). However, an evaluation by
Tafti et al. (2016) reports that Google Documents has promising results, with high accuracy
rates on most image types except blurry, noisy, handwritten, and skewed images. Jasrotia and
Malik (2018) designed a program that uses MATLAB’s OCR function with a webcam, and it proved
successful, but their work did not include any samples that contained special characters.

Current Optical Character Recognition (OCR) technology has advanced to the point where it can
detect the presence and number of characters in a line. While most OCR software has performed
well when translating English alphabetical characters, it is with special characters that
performance is found wanting. Open-source software is being developed to address this, but
examples are limited to the integration of other languages, such as Portuguese, with English
characters (Laurie, 2019). There is a gap in information on special character translation
(Vijayanari & Sakila, 2015).

The study aims to find out whether MATLAB or Google Documents is better at translating special
characters. The researchers plan to use MATLAB’s built-in OCR software and compare the results
with the automatic OCR feature of Google Docs, which has a high success rate for English
characters (Vijayanari & Sakila, 2015).

Developers and programmers would find this study useful, as the researchers document the
capabilities of MATLAB OCR. Persons who need to know how the automatic OCR software of Google
Documents compares with other software that needs to be set up, such as MATLAB, may also find
this document useful. There is further value in studying both of these services, such as
integrating them into a software project like plagiarism detection or streamlining processes.

The study is limited to the use of OCR. No other machine learning technique will be used in
testing image processing quality. MATLAB has most of the necessary functions to aid in the
project, with some modifications needed to enhance translation quality. The research is limited
to the comparison of the image-to-text translation software of MATLAB and the OCR feature of
Google Documents. MATLAB will be modified through its OCR training feature. No other OCR
software will be used in the comparison.

Chapter 2

REVIEW OF LITERATURE

Optical Character Recognition (OCR) Technology

OCR is a process used to interpret documents and convert them into editable text. It comprises
several steps, from preparing the scanned documents to the actual recognition and conversion of
the text. The researchers will use OCR to scan text and images so that the output can be used
by other programs.

Figure 1. OCR Processes

In a book on Optical Character Recognition (OCR) by Chaudhuri et al. (2017), it is defined as
“the process of classification of optical patterns contained in a digital image”. Through OCR,
paper documents and digital images captured by a camera can be turned into editable text that a
computer can process. Although the field has advanced over three decades and has been applied
in many fields of science and various industries, the technology is said not to have achieved
the same level of accuracy as the human eye. The conventional way of inputting data to a
computer is through a keyboard; however, this is not always the most efficient method when
faced with a substantial amount of text. OCR can solve this problem and simplify the process
without the need for much human intervention. Because OCR stems from machine learning
technology, a machine learning algorithm with sufficient training data would be able to turn
images of text into text files with high accuracy. A general study on the state of plagiarism
detection technology reports a certain lack of capacity when identifying “plagiarized figures,
tables, equations and scanned documents or images”. Another problem involving OCR is font
recognition, specifically of handwritten text, because the letterforms in a handwritten paper
differ from the fonts used by the machine. According to a study by Brown, Fay, and Walker
(2003), it is undoubtedly one of the most challenging problems regarding OCR.

The purpose of OCR is to scan a document and convert it to editable text, which can then be
used by various programs. According to a study about Telugu fonts and OCR by Anuradha (2012),
any OCR system is composed of a preprocessing stage, followed by the recognition stage and then
the post-processing stage, although the last stage is applied less often than the first two.
Preprocessing is where the scanned document is translated into a form recognized by the
computer, and recognition is where the identified characters are converted into editable text.
Preprocessing varies depending on the document and includes improving the quality of the
image. In the post-processing stage, accuracy can be increased if the output is constrained.

According to Karthick, Ravindrakumar, Francis, and Ilankannan (2019), supported by Nicomsoft
and how-ocr-works, which are websites dedicated to OCR technology, preprocessing is where the
preparation happens. It consists of various stages that fix the scanned document, and the
stages applied depend on how many issues the document has. In the de-skew stage, a scanned
document that was not aligned properly is rotated until its lines of text are level. The
despeckle or noise reduction stage removes defect spots caused by dust, scratches, and other
minor issues. Binarization is applied in nearly every use of OCR: the image is converted to
black and white, with black being the text and white being the background, which lets the
computer differentiate the text from the background. The line-removal stage cleans up non-glyph
boxes and lines. The layout analysis or zoning stage divides the text into blocks and
determines their order. The line and word detection stage establishes the baseline of the words
and letters and, if necessary, separates them. The script recognition stage applies to
documents in which the script may change partway through the text; it identifies the script so
that the OCR can handle each part appropriately. There are also two segmentation stages. The
first is the word segmentation stage, which isolates one word from another. The other is
character segmentation, where the characters are separated because some lines or parts of one
character overlap another. The last stage normalizes the scanned document by fixing its aspect
ratio, scale, and so on, in order to prepare for the actual recognition.

The recognition part is important since it is the main algorithm of OCR. Some characters are
very similar, e.g. “I”, “l”, and “1”, and the algorithm might produce a different output for
uncertain images. The details depend on the algorithm used, but text recognition serves to tell
characters apart. An example of such an algorithm is matrix matching, which compares images
pixel by pixel. These studies also gave the researchers a complete overview of how OCR works.
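
The binarization and matrix matching steps described above can be illustrated with a minimal Python sketch. It is not the implementation used by MATLAB or Google Documents; the function names, the threshold of 128, and the tiny 5x3 templates are all assumptions made for illustration.

```python
def binarize(gray, threshold=128):
    """Map a grayscale image (rows of 0-255 values) to 1 for ink and 0 for background."""
    return [[1 if pixel < threshold else 0 for pixel in row] for row in gray]


def match_score(glyph, template):
    """Return the fraction of pixels on which two equally sized binary images agree."""
    total = 0
    same = 0
    for glyph_row, template_row in zip(glyph, template):
        for g, t in zip(glyph_row, template_row):
            total += 1
            if g == t:
                same += 1
    return same / total


def recognize(glyph, templates):
    """Pick the template label with the highest pixel-by-pixel agreement."""
    return max(templates, key=lambda label: match_score(glyph, templates[label]))


# Hypothetical 5x3 binary templates for two characters (assumed, not from the study).
TEMPLATES = {
    "1": [[0, 1, 0],
          [1, 1, 0],
          [0, 1, 0],
          [0, 1, 0],
          [1, 1, 1]],
    "+": [[0, 0, 0],
          [0, 1, 0],
          [1, 1, 1],
          [0, 1, 0],
          [0, 0, 0]],
}

# A made-up grayscale scan of a plus sign with one noisy dark pixel (dark = ink).
scan = [[255, 250, 255],
        [240, 20, 255],
        [10, 30, 15],
        [255, 25, 100],
        [255, 255, 255]]

glyph = binarize(scan)
label = recognize(glyph, TEMPLATES)
print(label)
```

Because matching is done pixel by pixel, a single noisy pixel only lowers the score slightly, but visually similar characters such as “I”, “l”, and “1” would produce close scores, which is the ambiguity the paragraph above points out.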

Applications of OCR

OCR technology has a variety of uses in academics and technology, and it can even help people
with impairments. In a study by Praveena et al. (2019), the authors created the Pi Book Reader,
a device that reads ebooks aloud to help visually impaired people follow the story. The device
uses image processing to capture and process the image and OCR to extract the text. The
extracted text is then converted to speech and read aloud through a speaker or earphones for
the visually impaired person.

Plagiarism is a major issue in academics today, but it is not a modern one, since the act of
plagiarising goes back centuries. The modern era worsens the issue because of the internet,
through which people can share information with one another. Anyone can easily access the World
Wide Web and find information they can plagiarize. There are various loopholes in most
Plagiarism Detector (PD) tools today; a study called Turnitoff by Heather (2010) examines the
faults of the well-known PD tool Turnitin in the early 2010s. Plagiarism detector tools work by
extracting the text, searching for the extracted text online or in a large database, checking
whether the contents are plagiarized, and showing the results. The second stage is the most
complex; however, the researchers will focus on the first stage, as this is where attacks
usually happen. According to Heather (2010), the first stage has a loophole where one can use
an image disguised as text in a PDF. Most tools today will not detect plagiarized content if it
is in an image, because they scan only the editable text and not the images. This loophole can
be closed by applying image processing, specifically OCR.

According to a research paper by Vamvakas, Gatos, Stamatopoulos, and Perantonis, OCR can also
be used to recognize historical texts, either printed or handwritten, even without any prior
knowledge of the font. This can be done by creating a database in which the program stores data
from previous documents and uses it as training data to eventually recognize the font. With
this method, the more documents the program scans, the more knowledge of various fonts it
gains. This is another use of OCR where people can save time by converting written paper into
digital text.

A study by Mulay and Puri (2015) presents HawkEye, a mobile system used to detect code cloning.
Code cloning is a kind of plagiarism in which the code of a program is passed off as one’s own
work. HawkEye uses a multi-language OCR system: the user simply snaps a picture of the code
with a mobile camera, and HawkEye detects whether the code is plagiarized. The OCR converts the
image, extracts the code while removing unnecessary text, and converts it into an editable text
file. The study is very similar to the present study since it uses OCR as a way to improve
current plagiarism detection tools. HawkEye was used to detect plagiarism in program code,
while the current study is about helping plagiarism detection tools as a whole.

MATLAB

MATLAB is a simple yet powerful programming language, but it has a rather niche use. The
language also has powerful features for image detection and processing, and the researchers
find its flexibility advantageous compared to other popular languages such as Python. MATLAB
OCR has various add-ons that can recognize various languages as well as math equations. The
number of OCR applications that can detect complex mathematical equations and can be modified
is limited, which is why MATLAB is the ideal language for this study.

According to Dalal and Daiya (2018), machine replication of human functions such as reading is
hard to achieve. However, technology has advanced over the past years, and machines can now
read much as humans do. This is achieved by using OCR and pattern recognition.

MATLAB is usually used in engineering and intensive computing. MATLAB has various toolboxes and
functions, which is why, according to Goyal (2019), it is a very convenient tool for processing
images and blocks.

A study on image processing with MATLAB by Abdullah, Palash, Rahman, Islam, and Alim (2016)
states that image processing is the processing of images using mathematical operations by means
of any form of signal processing. Most image processing treats an image as a two-dimensional
signal along the x and y axes and applies standard signal processing techniques to it. It
usually revolves around the processing of the images themselves, but it sometimes includes the
recognition of characters through optical devices, so OCR can be considered an example of image
processing. Like OCR, it consists of step-by-step stages: inputting an image, analyzing and
manipulating the image, and showing the altered image as output. The study also covers the
representation of color and image as data, where each color is stored as a mix of red, green,
and blue components with numerical values so that it is easy to represent in code.

A study by Tiwari, Mishra, Bhatia, and Yadav (2013) on the use of OCR in MATLAB states that a
range of technical computing problems can be addressed more easily in MATLAB than in standard
programming languages such as C, C++, Java, and Fortran. It is also possible to transform an
image or other data, such as sound, into a matrix and then perform multiple operations on it to
get the desired effects and values. A broad variety of technologies are available, including
signal and image analysis, image acquisition, and neural networks. The accuracy of the MATLAB
OCR algorithm was calculated from its output on samples of English handwritten text and sample
text pictures. The authors scanned a sample paper at 300 dpi with an HP Deskjet scanner.
Afterwards, the files were screened, binarized, cropped, and resized. Character segmentation
was performed on each line, taking into account the characteristics of English Verdana font
templates. The recognition precision was 85% to 90%, with the errors attributed to poorly drawn
handwritten characters.

Even with all these OCR and image processing applications, there is still a very limited number
of programs that can recognize Bangla characters, and those that do still cannot recognize the
whole Bangla character set. A study by Hossain, Ahmed, Sarkar, and Al-Amin (2018) uses MATLAB
to develop a system that can recognize the Bangla alphabet. This is achieved by using various
OCR processes such as binarization, noise removal, segmentation, feature extraction, and
recognition. The proposed system works in the following steps. A printed Bangla script is first
placed in a scanner, and the raw scanned image is converted into a grayscale image. The
grayscale image is then converted into a binary image, which goes through segmentation and
feature extraction. The result then goes through classification, and post-processing enhances
the accuracy, with editable text as the final output. That study can benefit the present study,
as both use the OCR process in MATLAB. Even though it was used to develop a system that
recognizes Bangla characters, the researchers can still use it as a reference for developing a
system that leans more towards English and special characters.
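
The first steps of the pipeline described above (grayscale conversion, binarization, and segmentation) can be sketched in Python as follows. This is a simplified illustration and not the system from the cited study: the luminance weights are a common convention, the threshold of 128 is an assumption, and segmentation is reduced to splitting a single text line at empty pixel columns.

```python
def to_grayscale(rgb_image):
    """Convert rows of (r, g, b) tuples to 0-255 gray levels using common luminance weights."""
    return [[round(0.299 * r + 0.587 * g + 0.114 * b) for (r, g, b) in row]
            for row in rgb_image]


def to_binary(gray_image, threshold=128):
    """Mark dark pixels as ink (1) and light pixels as background (0)."""
    return [[1 if value < threshold else 0 for value in row] for row in gray_image]


def segment_columns(binary_image):
    """Split a binary text line into (start, end) column ranges separated by empty columns."""
    width = len(binary_image[0])
    ink_per_column = [sum(row[x] for row in binary_image) for x in range(width)]
    segments = []
    start = None
    for x, ink in enumerate(ink_per_column):
        if ink > 0 and start is None:
            start = x                      # a new character begins here
        elif ink == 0 and start is not None:
            segments.append((start, x))    # the character ended at the previous column
            start = None
    if start is not None:
        segments.append((start, width))
    return segments


# A made-up 3x5 color image containing two "characters" separated by blank columns.
BLACK, WHITE = (0, 0, 0), (255, 255, 255)
line = [
    [BLACK, WHITE, WHITE, BLACK, BLACK],
    [BLACK, WHITE, WHITE, WHITE, BLACK],
    [BLACK, WHITE, WHITE, BLACK, BLACK],
]

gray = to_grayscale(line)
binary = to_binary(gray)
segments = segment_columns(binary)
print(segments)
```

Each resulting column range would then be passed on to feature extraction and classification, which are outside the scope of this sketch.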

Google Documents

A study by Benito and Munoz (2013) indicated that making use of Information and Communication
Technology (ICT) is an appropriate way to improve conditions in the educational environment.
With the goal of enabling the acquisition of generic ICT competences for working online,
students were given a text to read and review individually. Groups were then formed to work on
the document through Google Docs. After presenting their work, students were given
questionnaires that produced statistical data on their knowledge of and opinion about Google
Docs. Results show that 75% of the class had no knowledge of Google Docs prior to the activity.
However, 92% said that they would continue to use it for other educational and professional
documents. The educational experience was very satisfactory for students and professors alike.

The Google Docs OCR function has been overlooked; however, it is still a reliable, free, and
accessible function. In research conducted by Tafti et al. (2016), the Google Docs OCR function
produced promising results on the given dataset without using advanced image processing
procedures such as denoising and image registration. Google Docs OCR achieved 74% accuracy on
colored images and 75% after low-level image processing was performed on the colored images.
Additionally, Google Docs OCR scored 77% accuracy on gray-scale images and 81% on low-level
processed gray-scale images. Results showed that Google Docs outperformed other applications
that use OCR technology, namely Tesseract, ABBYY FineReader, and Transym.


Chapter 3

METHODOLOGY

Flowchart of Methods

Figure 1. Flowchart of Methods


Procedure

Google Documents already has an OCR function that the researchers can use as a reference for
the program developed in MATLAB. The researchers will determine the accuracy percentage of this
function and compare it with that of MATLAB OCR. The effectiveness of Google Documents in
character accuracy and special character accuracy (+, =, (, ), /, !, |, ∫, etc.) will be used
as a benchmark to measure how effective the researchers’ prototype is. This study uses MATLAB’s
built-in OCR capabilities to get its results, and the researchers used the built-in OCR trainer
to test the capabilities of the function.

To input a sample image in Google Documents OCR, all that is needed is to upload the image,
right-click it, and open it as a word-processing document. Inputting a sample image in MATLAB
is not as straightforward: it requires some basic code to be set up, in which bounding boxes
need to be measured. After entering the code, locating the image file to be translated, and
running the file, the output can be found in the Workspace tab, in the ‘Words’ property of the
variable that the output was assigned to.

Figure 2. Simplified Application Process

Figure 3. Research Flow Diagram

Comparative Analysis

The research uses the method of Vijayanari and Sakila (2015). Two performance measures are
taken: conversion accuracy and error rate. Conversion accuracy (CA) counts all characters,
letters, and numbers that the program registered, whether they were converted correctly or
incorrectly. Error rate is how much of the text was not converted at all. Special character
accuracy (SA) and special character error rate (SER) work in the same way, but they consider
only non-alphanumeric or special characters (Σ, “, ≥, ≤, =, +, *, -, /, ^, %, #, |, etc.) and
count only those that are converted correctly.

The researchers will count the characters for each measure in each sample’s results and divide
that number by the total number of corresponding characters in the sample. After CA and SA have
been calculated for all samples, the researchers will calculate the mean of each.

The following tables show all one hundred samples used in the research. A check mark signifies
that the program translated the image with perfect accuracy. If not, the characters that
resulted from the translation are entered instead.


Table 1. Translation Results

Columns: Sample Number | Original Text | Google Documents | MATLAB OCR
Rows: samples 1 to 20

Table 2. Translation Results

Columns: Sample Number | Original Text | Google Documents | MATLAB OCR
Rows: samples 21 to 40

Table 3. Translation Results

Columns: Sample Number | Original Text | Google Documents | MATLAB OCR
Rows: samples 41 to 60

Table 4. Translation Results

Columns: Sample Number | Original Text | Google Documents | MATLAB OCR
Rows: samples 61 to 80

Table 5. Translation Results

Columns: Sample Number | Original Text | Google Documents | MATLAB OCR
Rows: samples 81 to 100


Performance Measures

Continuing the method of Vijayanari and Sakila (2015), two measures are taken for character
accuracy and for special character accuracy. Character accuracy (CA) and character error rate
(CER) are computed from the output using the following formulas.

Character Accuracy (CA) = (a/n)*100

Character Error Rate (CER) = 100-CA.

(where a=Total number of characters in the resultant document

n=Total number of characters in the input document)


The formulas for special character accuracy (SA) and special character error rate (SER) are
very similar. They measure whether the program correctly converted the special characters
(+, =, (, ), /, !, |, etc.).

Special Character Accuracy (SA) = (b/m)*100

Special Character Error Rate (SER) = 100-SA

(where b=Total number of correct special characters in the resultant document

m=Total number of special characters in the input image)

The equations for accuracy were borrowed from Vijayanari and Sakila (2015) and also correspond
to the formula for accuracy given by Tafti et al. (2016) in their evaluation of Google Docs
among other programs.
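
As a worked illustration of the formulas above, the following Python sketch computes CA, CER, SA, and SER from character counts and then averages them over samples. The counts are invented for illustration only and are not results from this study; the function name and the dictionary layout are likewise assumptions.

```python
from statistics import mean


def accuracy_and_error(counted, total):
    """Return (accuracy %, error rate %) using accuracy = (counted / total) * 100."""
    if total <= 0:
        raise ValueError("total must be positive")
    accuracy = 100.0 * counted / total
    return accuracy, 100.0 - accuracy


# Hypothetical per-sample counts, using the symbols from the formulas:
# a = characters in the resultant document, n = characters in the input document,
# b = correct special characters in the result, m = special characters in the input.
samples = [
    {"a": 18, "n": 20, "b": 3, "m": 4},
    {"a": 20, "n": 20, "b": 1, "m": 4},
]

ca_values = [accuracy_and_error(s["a"], s["n"])[0] for s in samples]
sa_values = [accuracy_and_error(s["b"], s["m"])[0] for s in samples]

ca_first, cer_first = accuracy_and_error(samples[0]["a"], samples[0]["n"])
sa_first, ser_first = accuracy_and_error(samples[0]["b"], samples[0]["m"])

mean_ca = mean(ca_values)
mean_sa = mean(sa_values)
print(ca_first, cer_first, sa_first, ser_first, mean_ca, mean_sa)
```

For the first hypothetical sample, 18 of 20 characters gives CA = 90% and CER = 10%, and 3 of 4 special characters gives SA = 75% and SER = 25%.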

Table 6. Translation Accuracy

Columns: Sample Number | Google Documents | MATLAB OCR
Rows: samples 1 to 20

Table 7. Translation Accuracy

Columns: Sample Number | Google Documents | MATLAB OCR
Rows: samples 21 to 40

Table 8. Translation Accuracy

Columns: Sample Number | Google Documents | MATLAB OCR
Rows: samples 41 to 60

Table 9. Translation Accuracy

Columns: Sample Number | Google Documents | MATLAB OCR
Rows: samples 61 to 80

Table 10. Translation Accuracy

Columns: Sample Number | Google Documents | MATLAB OCR
Rows: samples 81 to 100


Statistical Analysis

With n > 30, the researchers used a z-test to test the difference between the mean accuracies.
The null hypothesis is that Google Documents and MATLAB have the same accuracy rate for special
characters. The alternative hypothesis is that MATLAB has a higher or lower rate of accuracy.
In the z-test formula, x̄ is the sample mean, µ is the population mean, σ is the standard
deviation of the population, and n is the sample size.

Figure 4. Z-test formula
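
With the symbols defined above, the standard one-sample z statistic is z = (x̄ − µ) / (σ / √n). The Python sketch below evaluates it and a two-sided p-value. The numbers are hypothetical and are not results from this study, and the function names are assumptions; which mean plays the role of µ and which σ is used are choices the analysis would have to fix.

```python
import math


def z_statistic(sample_mean, population_mean, population_sd, n):
    """Compute z = (x_bar - mu) / (sigma / sqrt(n))."""
    return (sample_mean - population_mean) / (population_sd / math.sqrt(n))


def two_sided_p_value(z):
    """Probability of a standard normal value at least as extreme as z in either direction."""
    return math.erfc(abs(z) / math.sqrt(2))


# Hypothetical values: mean special character accuracy of 80% for one tool, tested
# against a reference mean of 75% with sigma = 10 over n = 100 samples.
z = z_statistic(sample_mean=80.0, population_mean=75.0, population_sd=10.0, n=100)
p = two_sided_p_value(z)
print(z, p)
```

A two-sided p-value matches the alternative hypothesis stated above, which allows MATLAB’s accuracy to be either higher or lower than that of Google Documents.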

Table 11. Performance Measures

Columns: OCR Tools | Character Accuracy (CA) % | Character Error Rate (CER) % | Special Character Accuracy (SA) % | Special Character Error Rate (SER) % | Standard Deviation of Special Character Accuracy σ
Rows: Google Documents, MATLAB

REFERENCES

Abdullah, Palash, Rahman, Islam & Alim. (2016). Digital Image Processing Analysis using
Matlab. http://www.ajer.org/papers/v5(12)/Q05120143147.pdf

Al-Zuhairi, Maher. (2018). Re: What is the minimum sample size required to train a Deep
Learning model - CNN?. Retrieved from:
https://www.researchgate.net/post/What_is_the_minimum_sample_size_required_to_train_a_Dep_Learning_model-CNN/5a930d3602d229c2506b5fbf/citation/download.

Basavaprasad B. & Ravi M. (2014). A Study On The Importance Of Image Processing And Its
applications.
https://pdfs.semanticscholar.org/7656/d3db8962a5a75d162842065319155db73af8.pdf

Batomalaque, Camacho, Dalida, & Delmo. (2019). Image to Text Conversion Technique for
Anti-Plagiarism System.
http://ijasc.ascons.org/digital-library/15866?fbclid=IwAR2EagiHS1VTK8S7aFDviocMmkBZXbySgqmYIGfgCerusoKQ98t4--ObBv8

Benito, Munoz. (2013). Google Docs: an experience in collaborative work in the University.
Enseñanza & Teaching, 30(1), 159 – 180. doi:10.14201

Brown, Fay, & Walker. (2003). Handprinted symbol recognition system.
https://www.sciencedirect.com/science/article/abs/pii/0031320388900179

Chaudhuri, A. (2017). Optical Character Recognition Systems for English Language. In
Kacprzyk, J (Ed.), Optical Character Recognition Systems for English Language (pp.
85-107). Springer International Pub. https://doi.org/10.1007/978-3-319-50252-6

Dalal, J., & Daiya, S. (2019). Image Processing Based Optical Recognition using Matlab.
http://www.ijesrt.com/issues%20pdf%20file/Archive-2018/May-2018/51.pdf

Eisa, T., Salim, N., & Alzahrani, S. (2015). Existing plagiarism detection techniques: A
systematic mapping of the scholarly literature. Online Information Review. 39. (pp.
383-400). https://doi.org/10.1108/OIR-12-2014-0315

Foltynek, T., Meuschke, N., & Gipp, B. (2019, October). Academic Plagiarism Detection: A
Literature Review. ACM Comput. Surv, 52(6). https://doi.org/10.1145/3345317

Ghadiyaram, A. (2009). An investigation into Telugu font and character recognition.
http://hdl.handle.net/10603/4166

Ivo V. (2014). How OCR Works, A Close Look at Optical Character Recognition.
https://how-ocr-works.com/

Jasrotia, D.,& Malik, A. (2018). Webcam Based Optical Character Recognition Using Matlab.
International Journal of Engineering Sciences & Research Technology 7(8),
https://doi.org/10.5281/zenodo.1336727

Mohammad Litton H., Tafiq A., Sarkar S., Al-Amin Md., (2019). Development of an Alphabetic
Character Recognition System Using Matlab for Bangladesh
http://www.ijsrp.org/research-paper-0119.php?rp=P858129

Nicomsoft Contributors. (2012). Optical Character Recognition (OCR) – How it works.
https://www.nicomsoft.com/optical-character-recognition-ocr-how-it-works/

Praveena, V., Shruthi, S., Narmadha, S., & Menaga, D. (2019). A Developmental Approach of
OCR Based Assistive System For Visually Impaired People. IEEE 5th International
Conference on Science, Technology, Engineering, Mathematics. Retrieved from:
https://www.academia.edu/44332993/A_developmental_approach_of_OCR_based_assistive_system_for_visually_impaired_people?fbclid=IwAR2nDLIgODY7WwIA89DRRdvrZM3MdKBafs1BZRqwSOr_HPQG-nScQmVtmP4

Sahu, Narendra & Sonkusare, Manoj. (2017). A Study on Optical Character Recognition
Techniques. International Journal of Computational Science, Information Technology
and Control Engineering.
https://www.researchgate.net/publication/313334780_A_Study_on_Optical_Character_Recognition_Techniques

Tafti, A., Baghaie, A., Assefi, M., Arabnia, H., Yu, Z., & Peissig, P. (2016). OCR as a Service:
An Experimental Evaluation of Google Docs OCR, Tesseract, ABBYY FineReader, and
Transym. Advances in Visual Computing. ISVC 2016. Lecture Notes in Computer
Science, https://doi.org/10.1007/978-3-319-50835-1_66

Tiwari, S., Mishra, S., Bhatia, P., & Yadav, Km Praveen. (2013). Optical Character Recognition
using MATLAB.
http://ijarece.org/wp-content/uploads/2013/08/IJARECE-VOL-2-ISSUE-5-579-582.pdf

Vijayanari, S., & Sakila, A. (2015). Performance Comparison of Different OCR Tools.
International Journal of Ubicomp (IJU), 6(3). https://doi.org/10.5121/iju.2015.6303