Optical Character Recognition
(OCR)
What is OCR?
Optical Character Recognition (OCR) is the mechanical or electronic conversion of
images of typewritten or printed text into machine-encoded text.
Techniques include:
• Feature extraction decomposes glyphs into “features” such as lines, closed loops, line
direction, and line intersections. It serves two purposes: first, to extract properties
that identify a character uniquely; second, to extract properties that differentiate
between similar characters.
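To make the idea of glyph features concrete, here is a minimal Python sketch. It is an illustration only, not the method of any particular OCR engine. It assumes a glyph is already binarized into a small grid, and the helper names (parse_glyph, count_closed_loops, count_endpoints, extract_features) are hypothetical. It computes two features: the number of closed loops and the number of stroke endpoints.

```python
from collections import deque

def parse_glyph(rows):
    """Turn strings of '#' (ink) and '.' (background) into a 0/1 grid."""
    return [[1 if ch == "#" else 0 for ch in row] for row in rows]

def count_closed_loops(glyph):
    """Count background regions that are fully enclosed by ink."""
    height, width = len(glyph), len(glyph[0])
    # Pad with one background pixel on every side so that all background
    # outside the glyph forms a single connected region.
    padded = [[0] * (width + 2)]
    padded += [[0] + row + [0] for row in glyph]
    padded += [[0] * (width + 2)]
    seen = [[False] * (width + 2) for _ in range(height + 2)]
    regions = 0
    for r in range(height + 2):
        for c in range(width + 2):
            if padded[r][c] == 1 or seen[r][c]:
                continue
            # New background region: flood-fill it (4-connected).
            regions += 1
            seen[r][c] = True
            queue = deque([(r, c)])
            while queue:
                y, x = queue.popleft()
                for ny, nx in ((y + 1, x), (y - 1, x), (y, x + 1), (y, x - 1)):
                    if (0 <= ny < height + 2 and 0 <= nx < width + 2
                            and padded[ny][nx] == 0 and not seen[ny][nx]):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
    # One of the regions is the outside background, the rest are loops.
    return regions - 1

def count_endpoints(glyph):
    """Count ink pixels that have exactly one ink neighbour (8-connected)."""
    height, width = len(glyph), len(glyph[0])
    endpoints = 0
    for r in range(height):
        for c in range(width):
            if glyph[r][c] == 0:
                continue
            neighbours = sum(
                glyph[ny][nx]
                for ny in range(max(r - 1, 0), min(r + 2, height))
                for nx in range(max(c - 1, 0), min(c + 2, width))
                if (ny, nx) != (r, c)
            )
            if neighbours == 1:
                endpoints += 1
    return endpoints

def extract_features(glyph):
    """Return a (closed_loops, endpoints) feature tuple for one glyph."""
    return (count_closed_loops(glyph), count_endpoints(glyph))

# Toy glyphs, hand-drawn for this example.
LETTER_O = parse_glyph(["###", "#.#", "###"])
LETTER_C = parse_glyph(["###", "#..", "###"])
DIGIT_8 = parse_glyph(["###", "#.#", "###", "#.#", "###"])
```

On these toy glyphs, extract_features gives (1, 0) for O, (0, 2) for C, and (2, 0) for 8. The loop count separates O from 8, and the endpoint count separates the otherwise similar O and C.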
Pattern Classification Process
[Diagram: training phase and testing phase]
Post-processing
OCR accuracy can be increased if the output is constrained by a lexicon – a list of
words that are allowed to occur in a document. This might be, for example, all the
words in the English language, or a more technical lexicon for a specific field. The
technique can be problematic if the document contains words not in the lexicon, such
as proper nouns. Tesseract uses its dictionary to influence the character segmentation
step for improved accuracy.
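A rough Python sketch of lexicon-constrained post-processing follows. It is a simplified whole-word correction applied after recognition, not the way Tesseract uses its dictionary during segmentation. The function name correct_with_lexicon, the sample lexicon, and the 0.75 similarity cutoff are assumptions chosen for illustration. The matching itself uses difflib.get_close_matches from the Python standard library.

```python
import difflib

def correct_with_lexicon(words, lexicon, cutoff=0.75):
    """Replace each OCR word with its closest lexicon entry, if one is close enough.

    Words already in the lexicon, and words with no sufficiently similar
    entry (for example proper nouns), are left unchanged.
    """
    corrected = []
    for word in words:
        lowered = word.lower()
        if lowered in lexicon:
            corrected.append(word)
            continue
        matches = difflib.get_close_matches(lowered, lexicon, n=1, cutoff=cutoff)
        corrected.append(matches[0] if matches else word)
    return corrected

# Hypothetical lexicon and OCR output with typical character confusions.
SAMPLE_LEXICON = ["character", "optical", "recognition"]
SAMPLE_OCR_WORDS = ["0ptical", "charactcr", "Tesseract"]
SAMPLE_RESULT = correct_with_lexicon(SAMPLE_OCR_WORDS, SAMPLE_LEXICON)
```

Here "0ptical" and "charactcr" are pulled to their lexicon entries, while the proper noun "Tesseract" has no close match and passes through unchanged. Lowering the cutoff would correct more misreadings but would also risk rewriting out-of-lexicon words, which is the problem described above.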
Handwritten text recognition using k-nearest neighbors (kNN)
[Diagram: training set and testing image → model → output image]
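The kNN pipeline can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not the implementation behind these slides. Images are assumed to be tiny 3x3 binary grids flattened into 9-element vectors, the training data is made up, and the function name knn_predict is hypothetical. A real system would use much larger images and a proper dataset.

```python
import math
from collections import Counter

def knn_predict(train_vectors, train_labels, query, k=3):
    """Label a query vector by majority vote among its k nearest training vectors."""
    # Euclidean distance from the query to every training example.
    distances = sorted(
        (math.dist(vector, query), label)
        for vector, label in zip(train_vectors, train_labels)
    )
    # Vote among the k closest examples.
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]

# Made-up training set: flattened 3x3 images of the digits "1" and "0".
TRAIN_VECTORS = [
    [0, 1, 0, 0, 1, 0, 0, 1, 0],  # "1": vertical bar
    [0, 1, 0, 0, 1, 0, 0, 1, 1],  # "1": bar with a small foot
    [1, 1, 0, 0, 1, 0, 0, 1, 0],  # "1": bar with a small flag
    [1, 1, 1, 1, 0, 1, 1, 1, 1],  # "0": full ring
    [1, 1, 1, 1, 0, 1, 1, 1, 0],  # "0": ring with one corner missing
    [0, 1, 1, 1, 0, 1, 1, 1, 1],  # "0": ring with another corner missing
]
TRAIN_LABELS = ["1", "1", "1", "0", "0", "0"]

# Test images that do not appear in the training set.
TEST_ONE = [0, 1, 0, 0, 1, 0, 1, 1, 0]
TEST_ZERO = [1, 1, 1, 1, 0, 1, 0, 1, 1]
```

With k=3, knn_predict(TRAIN_VECTORS, TRAIN_LABELS, TEST_ONE) returns "1" and the same call with TEST_ZERO returns "0". kNN has no separate training step: the "model" is simply the stored training set, and all the work happens when a test image is compared against it.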
Pros and Cons
● OCR reduces the time needed to process data from a large number of forms
● Manual data entry is prone to human error and takes up much more time
● Even after rough handling of a document, OCR can read the information with a high
degree of accuracy
● Higher rates of recognition of general cursive script will likely not be possible
without the use of contextual or grammatical information
● OCR systems are expensive
● All documents need to be checked carefully and corrected manually
Milestones
2018, 2019, 2020
● Revolutionizing the document management process
● Electronic health record
● Google Lens