Indexing and Retrieval of Document Images

Proposal Synopsis
Indexing and Retrieval of Official Document Images
Project Proposal
Session 2019-2023
Submitted by:
Dua Nazakat 2019-CS-689

Ubaidullah 2019-CS-671
Usman Mumtaz 2019-CS-685
Supervised by:
Dr. Qurat ul Ain Akram
Department of Computer Science,

University of Engineering and Technology, Lahore, New-Campus.
Contents
List of Figures
List of Tables
Proposal Synopsis
1.1 Abstract
1.2 Introduction
1.3 Problem Statement
1.4 Objectives
1.5 Features/Scope
1.6 Related Work
1.7 Proposed Methodology/System
1.8 Tools and Techniques
1.9 Team Member Individual Tasks/Work Division
1.10 Data Gathering Approach
1.11 Timeline/Gantt chart
1.12 References
List of Figures
Figure 1.1: Flow Chart for Development of Application for Searching Document
List of Tables
Table 1: Related Work Activities
Table 2: Work Division
Chapter 1
Proposal Synopsis
1.1 Abstract
As new technologies emerge, document images have created a strong demand for accessing and
manipulating information through image processing. Large quantities of printed documents are now
scanned and archived as images instead of being kept on paper. Almost all notifications from
universities, offices, and government sectors are issued as scanned documents, and offices
increasingly communicate official notices paperlessly by transferring images. Many indexing and
retrieval systems are being developed so that a document can be retrieved quickly, without
manually searching through thousands of documents. Text search over scanned document images
underlies many applications and saves time by quickly retrieving the documents people need. This
project is needed because many organizations hold large numbers of documents in hard copy, and
accessing the information in them is time consuming. The scope of this project covers official
documents and document images of UET, Pakistan. Because many emails arrive from the university
each day, finding the relevant document among them is hard and time consuming; users sometimes
cannot find it in time and may miss deadlines or important notifications from the university.
This problem forms the basis of the project. A web-based application will be developed that
addresses it using image processing, deep learning, indexing, and retrieval techniques. The
project aims to bring the benefits of paperless scanned documents to UET by fetching the relevant
document quickly in response to a user query. The system is intended to provide timely
information retrieval, prevent loss of documents, and return up-to-date documents.
1.2 Introduction
In the digital era there is great demand for image and video data applications. Paperless work is
spreading across many sectors, and many document images are being scanned and made available over
the Internet. Offices increasingly communicate official notices paperlessly by transferring
images, and almost all notifications from universities, offices, and government sectors take the
form of scanned documents. This project proposes a solution for searching in scanned document
images. First, the dataset will be acquired from UET. Because the information must be searched
within images, OCR will be used to extract text from the images so that text search methods can
be applied. From the extracted text, fields such as title, subject, body, and signing body are
obtained. The documents are indexed on the extracted text. After indexing, retrieval takes place:
the relevant document is fetched based on the user query. The system is intended to provide
timely information retrieval, prevent loss of documents, and return up-to-date documents. A
repository will keep track of all documents. Data security can also be improved, since scanned
documents can be encrypted, password protected, and stored securely. Although the scope of this
project covers only UET scanned document images, the application can later be extended to other
sectors where most communication is done by transferring images.
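The index-then-retrieve flow described above can be illustrated with a minimal sketch. It assumes
the OCR step has already produced plain text for each scanned image; the file names, the sample
text, and the function names (tokenize, build_index, search) are hypothetical and are not part of
the proposed system. The sketch uses a simple inverted index in which a document matches only if
it contains every query term.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(documents):
    """Map each term to the set of document ids whose extracted text contains it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index


def search(index, query):
    """Return the sorted ids of documents that contain every term of the query."""
    terms = tokenize(query)
    if not terms:
        return []
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return sorted(result)


# Hypothetical OCR output for three scanned notices (assumed, not real data).
ocr_text = {
    "notice_001.png": "Subject: Mid term examination schedule for Fall semester",
    "notice_002.png": "Subject: Fee submission deadline extended",
    "notice_003.png": "Subject: Revised examination schedule for final term",
}

index = build_index(ocr_text)
matches = search(index, "examination schedule")
print(matches)
```

Requiring every query term keeps the example short; a real system would more likely rank partial
matches rather than discard them.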
1.3 Problem Statement
The focus of this project is to develop a working application for the official documents of UET
that saves time by searching document images quickly and retrieving documents efficiently, using
image processing, deep learning, machine learning, indexing, and retrieval techniques.
1.4 Objectives
The following objectives will be achieved during this project:

• A learning phase to grasp image processing and model building techniques.
• Acquisition of data from the university as the dataset for this application.
• A recognizer that recognizes the text in a scanned image and identifies the terms used in it.
• Document layout analysis of official UET documents using image processing, with segmentation
into header, subject, body, signing body, etc.
• Model building for the classification and detection of annotated images.
• Indexing of all the layout segments mentioned above.
• Retrieval of documents based on the user’s query.
• Finally, development of a web system that combines indexing and retrieval. A search over
official documents will return each matching document with its date and time, sender and
recipient, file size and file type, and a preview, and the system will maintain a search history.
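As an illustration of the last objective, the information returned with each search result and
the search history could be modelled as below. This is only a sketch: the class names, field
names, and sample values are assumptions made for the example, not a design fixed by the
proposal.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import List


@dataclass
class DocumentResult:
    """One retrieved document with the metadata listed in the objectives."""
    file_name: str
    issued_at: datetime      # date and time of the document
    sender: str              # the "from" field
    recipient: str           # the "to" field
    file_size_bytes: int
    file_type: str
    preview: str             # short text preview of the document


@dataclass
class SearchHistory:
    """Keeps the queries a user has issued, oldest first."""
    queries: List[str] = field(default_factory=list)

    def record(self, query: str) -> None:
        self.queries.append(query)

    def recent(self, n: int = 5) -> List[str]:
        """Return up to n queries, most recent first."""
        return list(reversed(self.queries))[:n]


# Hypothetical values, for illustration only.
result = DocumentResult(
    file_name="notice_001.png",
    issued_at=datetime(2022, 5, 16, 9, 30),
    sender="Registrar Office",
    recipient="All Students",
    file_size_bytes=245_760,
    file_type="png",
    preview="Subject: Mid term examination schedule",
)

history = SearchHistory()
history.record("examination schedule")
history.record("fee deadline")
print(history.recent(1))
```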
1.5 Features/Scope
• The project covers a working application for retrieving documents by search. Its scope is the
UET document fetcher application.
• A working application will be deployed that fetches documents based on user queries.
• The system will handle font size variations in the documents.
• The system will handle the English language.
1.6 Related Work
Several studies have been performed on the indexing and retrieval of scanned documents.
Some of their findings are listed below:
1. Balasubramanian et al. [1] proposed a search system for retrieving relevant documents from a
large collection of document images. Their search method targets Indian languages specifically.
The system computes information retrieval measures from word images without explicitly
recognizing them, and it can search across languages to retrieve relevant documents from a
multilingual document image database. At the time of writing, the authors were working on a
comprehensive test on a large collection of document images.
2. Shaheera Saba Mohd Naseem Akhter and Priti P. Rege [2] proposed semantic text segmentation of
Marathi document images using deep learning methods. They used the U-Net and Residual U-Net
(ResU-Net) architectures, both of which had given state-of-the-art performance on medical image
segmentation. The experimental results show better performance for ResU-Net than for U-Net, owing
to the skip connections in the model: these avoid the vanishing gradient problem, and the feature
accumulation in the model generalizes well on the segmentation task. U-Net and ResU-Net give 95%
and 98% accuracy respectively on the dataset.
3. Kumar et al. [3] presented an efficient indexing and retrieval scheme for searching large
document image databases. Efficiency and scalability, along with high precision and recall, are
achieved by content-sensitive hashing. Retrieval is orders of magnitude faster: the technique can
search 20,000 word images in milliseconds. The authors demonstrated that the technique is
practical for searching printed documents rapidly.
4. Nabin Sharma et al. [4] observe that signatures and logos are important queries for
content-based document image retrieval from a scanned document repository. Their work detects
signatures and logos in a repository of scanned documents so that documents can be retrieved
using signature or logo information.
5. David Doermann [5] provides background and past research on both indexing and retrieval of
document images. The main purpose of the work covered is to retrieve the information in
documents, index them, and abstract them. The approach used for this purpose is OCR. The features
retrieved are the texture, shape, and structure of characters and words. Ultimately, both
indexing and retrieval make use of the features offered by the visual representation of the page
and by the underlying content of text, graphics, and images. Such systems need to address complex
trade-offs between algorithm speed, image quality, and retrieval recall and precision.
6. DocFetcher [6] is an open source desktop search application that lets users search the
contents of files on their computer. It can be thought of as Google for local files. Queries are
entered in a text field, and the search results are displayed in a result pane. A preview pane
shows a text-only preview of the file currently selected in the result pane. Results can be
filtered by minimum and/or maximum file size, by file type, and by location. Buttons are provided
for opening the manual, opening the preferences, and minimizing the program to the system tray.
Its documentation discusses the naive approach to file search, index-based search, and a
telephone book analogy for indexing.
7. Fernando Vegas Fernandez [7] argues that a systematic search, unlike a narrative search that
could yield a haphazard and biased subset of documents, achieves a neutral collection of
documents for intelligent information extraction from document databases (Google Scholar). Almost
any document found in the other sources can also be found in Google Scholar. Concepts such as
natural language processing, semantics, and ontologies appear frequently in the documents
reviewed. Semantic-based tag extraction is done using the DOMINUS system. The tags extracted are
title, author, abstract, and references, and these are now easier to retrieve with Google Scholar
and tools such as EndNote and Mendeley. The work focuses on information retrieval (document
retrieval) based on word concepts and text clustering. The documents analyzed propose
algorithm-based systems and agents with rules to query document databases. The review has three
stages: stage 1 includes review planning and searching for relevant articles using electronic
databases; stage 2 involves deleting all duplicates according to title and author and excluding
irrelevant papers by reading their titles, abstracts, and keywords; and stage 3 is content
analysis.
8. Wagenpfeil et al. [8] look for the best solution for indexing and retrieving images on a
phone. The options considered for this task are semantic analysis and representation, graph
representation processing, and AI pattern matching. Using a multimedia feature vector graph, the
authors build a database of encoded images by assigning each object in an image a unique color
and adding each encoded image into a single image. For retrieval, the same coding scheme
generates an encoded image from the user query, and searching for this image in the database
yields the related images. The remaining issue is the semantic gap, that is, the difference
between the data required and the data obtained.
9. Nawei Chen [9] surveys multimodal IR systems that combine the text and image modalities. The
paper addresses the following issues in multimodal IR: techniques to combine text and images;
techniques to find relationships between text and images; noise and uncertainties in IR systems;
and techniques to improve the effectiveness of IR, such as Latent Semantic Indexing, user
relevance feedback, semantic networks, and document clustering and classification.
10. Garg et al. [10] present line segmentation based on header line and base line detection,
which handles lines with variable skew and fluctuating lines in handwritten Hindi text.
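To make the hashing idea in item 3 more concrete, the sketch below shows one common form of
locality sensitive hashing (random hyperplanes). It is not taken from [3]: the feature vectors,
the number of hyperplanes, and all function names are assumptions chosen for illustration, and
the paper's own features and hash functions may differ. Each vector is reduced to a short bit
string, so a query only needs to be compared with the entries in its own bucket rather than with
the whole collection.

```python
import random


def make_hyperplanes(num_planes, dim, seed=0):
    """Draw random hyperplane normals; the seed makes the example repeatable."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_planes)]


def lsh_key(vector, hyperplanes):
    """Build a bit string with one bit per hyperplane: the sign of the dot product."""
    bits = []
    for plane in hyperplanes:
        dot = sum(v * p for v, p in zip(vector, plane))
        bits.append("1" if dot >= 0 else "0")
    return "".join(bits)


def build_buckets(features, hyperplanes):
    """Group item names by their hash key."""
    buckets = {}
    for name, vector in features.items():
        buckets.setdefault(lsh_key(vector, hyperplanes), []).append(name)
    return buckets


def candidates(query_vector, buckets, hyperplanes):
    """Return only the items that share the query's bucket."""
    return buckets.get(lsh_key(query_vector, hyperplanes), [])


# Hypothetical feature vectors for three word images (assumed values).
features = {
    "word_a": [1.0, 2.0, 3.0],
    "word_b": [2.0, 4.0, 6.0],     # same direction as word_a
    "word_c": [-1.0, -2.0, -3.0],  # opposite direction
}

planes = make_hyperplanes(num_planes=8, dim=3)
buckets = build_buckets(features, planes)
print(candidates([1.0, 2.0, 3.0], buckets, planes))
```

Vectors pointing in the same direction always share a key, while vectors that are only similar
share a key with high probability; practical systems therefore use several hash tables to reduce
missed matches.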
Approaches:
After reading the research papers, the following approach is summarized for our project:
A deep learning based solution: acquire document images from the university; perform image-level
segmentation, including data annotation and tagging; build a model for classification and
recognition; test and evaluate to increase the model accuracy; and finally deliver a working
application that performs searching and indexing of documents.
Table 1: Related Work Activities
| Related Paper | Data Set | Model | Accuracy | Weakness | Proposed Solution |
|---|---|---|---|---|---|
| Retrieval from Document Image Collections | English 2507, Hindi 3354 | “Greenstone search engine” for digital libraries; DWT for clustering | 95% English, 92% Urdu | Controls morphological word variants only | Deep learning based solution for image segmentation |
| Semantic Segmentation of Printed Text from Marathi Document Images using Deep Learning Methods | Scanned images taken from various Marathi books | U-Net and ResUNet | 95% - 98% | Marathi language document images are used | Work in different languages can be performed |
| Efficient Search in Document Image Collection | Collection of 7 Kalidasa books | Indexing with locality sensitive hashing (LSH) | 90% | Feature selection using machine learning techniques to include multiple fonts and styles | Feature selection using machine learning techniques |
| Signature and Logo Detection using Deep CNN for Document Image Retrieval | Tobacco-800 | ZF, VGG16, VGG M, and YOLOv2 | Promising and at par with the existing methods | Only for logo and signature | Work can be done for title and body |
| The Indexing and Retrieval of Document Images: A Survey | Dataset of 1,320 random 16 point font Arial characters | On a database of 800 images, the system was over 95% effective | 83.5 | - | - |
| Intelligent information extraction from scholarly document databases | Scholarly documents | Bayes decision function (classification method) | From 93% up to 98% | System does not work for departmental documents | Departmental document work will be provided |
| AI based semantic multimedia indexing and retrieval for social media on smartphone | MIRFlickr 25000, Open Images | GMMAF, CNN, AI4MMRA | Better performance and better accuracy | Only applies to pictures with entities | - |
| Survey of multimodal IR systems that combine the text and image modalities | - | Region-based document image classification algorithm developed by Li and Gray; Support Vector Machine (SVM); Canny edge detection | - | Multi-class classification needs to be improved | - |
| A New Method for Line Segmentation of Handwritten Hindi Text | Sanskrit and Hindi text = 100 | Run length smearing algorithm | 97% | Limited to Hindi language | The method can be applied to many other languages like Bangla, Telugu, etc. |
1.7 Proposed Methodology/System
Phase 1: Background Knowledge.

Phase 2: Literature Survey. The project begins with a brief literature review. Because it is a
research-based project, it demands reading many articles, web pages, journals, and research
papers.
Phase 3: Data Acquisition. After studying the research work related to our project, we will have
a clear understanding of it, so we will start working on the project itself, beginning by
acquiring data from the sources.
• Acquisition of document data: official UET document data is acquired.
Phase 4: Document Layout Analysis. After acquiring all the data, we will start with image-level
segmentation using image processing techniques.
• Image annotation: images are annotated by labelling their features and drawing the boundaries
of each feature, i.e., header, body, subject, signing part, assigned body, etc.
• Segmentation: into header, subject, body, signing body, and "from" field.
• Tagging: preparing images for image classification.
Phase 5: Model Building. Training of the model using deep learning.
Phase 6: Evaluation. Testing the model and improving its accuracy.
Phase 7: Development of the System. After all the research-based work, we will start
implementing the system.
• Indexing: a database is used for indexing the documents. For the indexing process, we propose
to identify the word set by clustering words into different groups based on their similarities.
• Retrieval: the relevant document is fetched based on the user's search.
This includes UI design and integration of all components into one application.
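The word clustering proposed for the indexing step could look roughly like the sketch below. The
proposal does not specify a similarity measure or a clustering method, so the choices here are
assumptions: similarity is the character-level ratio from Python's difflib, the 0.8 threshold is
arbitrary, and each word joins the first cluster whose first member is similar enough. The
function names and the sample words are hypothetical.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Character-level similarity between two words, in the range 0.0 to 1.0."""
    return SequenceMatcher(None, a, b).ratio()


def cluster_words(words, threshold=0.8):
    """Greedy clustering: compare each word with the first member of every cluster."""
    clusters = []
    for word in words:
        for cluster in clusters:
            if similarity(cluster[0], word) >= threshold:
                cluster.append(word)
                break
        else:
            # No existing cluster is similar enough, so start a new one.
            clusters.append([word])
    return clusters


# Hypothetical terms extracted from notices.
terms = ["examination", "examinations", "notice", "notices", "notification"]
groups = cluster_words(terms)
print(groups)
```

Grouping word variants this way lets a query for one form, such as "notice", also reach documents
indexed under a close variant, such as "notices".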
Figure 1.1: Flow Chart for Development of Application for Searching Document
Stages shown in the flow chart, in order: Background Knowledge, Literature Survey, Image Data
Acquisition, Document Image Preprocessing, Classification and Recognition, Testing and
Evaluation, Development, Documentation, Research Paper.
1.8 Tools and Techniques
Languages: Python, HTML, CSS, JavaScript (React)

IDE: PyCharm/Spyder, Visual Studio Code
Colab for model building
Database: SQLite
Packages: PyTorch (a Python machine learning package based on Torch)/TensorFlow/Keras
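Since SQLite is the proposed database, the sketch below shows one way the index could be stored
in it using Python's built-in sqlite3 module. The schema is an assumption made for illustration:
the table names (documents, postings), the columns, and the helper functions are hypothetical,
and the terms are taken from a subject line that is assumed to come from the OCR step.

```python
import sqlite3

# An in-memory database keeps the example self-contained.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE documents (
        id        INTEGER PRIMARY KEY,
        file_name TEXT NOT NULL,
        subject   TEXT NOT NULL
    );
    CREATE TABLE postings (
        term   TEXT NOT NULL,
        doc_id INTEGER NOT NULL REFERENCES documents(id),
        PRIMARY KEY (term, doc_id)
    );
    """
)


def add_document(conn, file_name, subject):
    """Store a document and one posting per distinct term in its subject."""
    cur = conn.execute(
        "INSERT INTO documents (file_name, subject) VALUES (?, ?)",
        (file_name, subject),
    )
    doc_id = cur.lastrowid
    terms = set(subject.lower().split())
    conn.executemany(
        "INSERT INTO postings (term, doc_id) VALUES (?, ?)",
        [(term, doc_id) for term in terms],
    )
    return doc_id


def find_by_term(conn, term):
    """Return the file names of documents whose subject contains the term."""
    rows = conn.execute(
        "SELECT d.file_name FROM postings p "
        "JOIN documents d ON d.id = p.doc_id "
        "WHERE p.term = ? ORDER BY d.file_name",
        (term.lower(),),
    ).fetchall()
    return [row[0] for row in rows]


# Hypothetical documents.
add_document(conn, "notice_001.png", "Mid term examination schedule")
add_document(conn, "notice_002.png", "Fee submission deadline extended")
add_document(conn, "notice_003.png", "Revised examination schedule")
print(find_by_term(conn, "examination"))
```

The composite primary key on postings also serves as the lookup index for a term, so no separate
index is needed in this small example.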
1.9 Team Member Individual Tasks/Work Division
The following work division will be carried out during the project.

Table 2: Work Division
| Team Member | Tasks |
|---|---|
| Usman Mumtaz, Dua Nazakat, Ubaidullah | Learning phase of NLP, deep learning course |
| Usman Mumtaz, Dua Nazakat, Ubaidullah | Literature Review |
| Usman Mumtaz, Dua Nazakat, Ubaidullah | Technique Finalization |
| Usman Mumtaz, Dua Nazakat, Ubaidullah | Image processing |
| Usman Mumtaz, Dua Nazakat | Model building |
| Ubaidullah, Dua Nazakat | Frontend |
| Usman Mumtaz, Dua Nazakat, Ubaidullah | Testing and Evaluation |
| Usman Mumtaz, Dua Nazakat, Ubaidullah | Implementation of Application |
| Usman Mumtaz, Dua Nazakat, Ubaidullah | Research paper writing |
1.10 Data Gathering Approach
For this project we will collect the data from UET, since the project scope covers the indexing
and retrieval of official UET documents.
1.11 Timeline/Gantt chart

GANTT CHART
| Task | Duration (Days) |
|---|---|
| Literature Survey | 14 |
| Data Acquiring | 16 |
| Document Layout Analysis: Image Annotation | 29 |
| Document Layout Analysis: Tagging | 30 |
| Model Building | 60 |
| Evaluation & Testing | 60 |
| Development of System | 90 |
| Research Paper Writing | 88 |
Timeline axis: May-22 to Mar-23, in monthly steps.
1.12 References
[1] A. Balasubramanian, Million Meshesha, and C.V. Jawahar, “Retrieval from Document Image
Collections”, Proc. of the 4th Indian Conference on Computer Vision, Graphics and Image
Processing (ICVGIP), Hyderabad, India, pp. 622–627, 2004.
Available: https://www.researchgate.net/publication/221551741_Searching_in_Document_Images
[2] Shaheera Saba Mohd Naseem Akhter and Priti P. Rege, “Semantic Segmentation of Printed Text
from Marathi Document Images using Deep Learning Methods”, 2019 IEEE 16th India Council
International Conference (INDICON), Rajkot, India, 13-15 Dec. 2019.
Available: https://ieeexplore.ieee.org/abstract/document/9030360/references#references
[3] Anand Kumar, C.V. Jawahar, and R. Manmatha, “Efficient Search in Document Image
Collections”, ACCV 2007: Computer Vision – ACCV, Department of Computer Science, University of
Massachusetts, Amherst, MA 01003, USA, vol. 4843, pp 586–59, 2007.
Available: https://link.springer.com/chapter/10.1007/978-3-540-76386-4_55
[4] Nabin Sharma, Ranju Mandal, Rabi Sharma, Umapada Pal, and Michael Blumenstein, “Signature
and Logo Detection using Deep CNN for Document Image Retrieval”, 2018 16th International
Conference on Frontiers in Handwriting Recognition (ICFHR), Niagara Falls, NY, USA, Aug. 2018.
[5] David Doermann, “The Indexing and Retrieval of Document Images: A Survey”, Ctr. Automation
Research, University of Maryland, College Park, Maryland, 1998.
Available: https://www.sciencedirect.com/science/article/abs/pii/S1077314298906920
[6] Tran Nam Quang, “DocFetcher”, DocFetcher Team, February 14, 2022 [online].
Available: http://docfetcher.sourceforge.net/en/index.html
[7] F. Vegas Fernandez, “Intelligent information extraction from scholarly document databases”,
Journal of Intelligence Studies in Business, Departamento de Ingeniería Civil: Construcción,
Universidad Politécnica de Madrid, Spain, vol. 11, no. 3, 2020.
Available: https://ojs.hh.se/index.php/JISIB/article/view/834
[8] S. Wagenpfeil et al., “AI-based semantic multimedia indexing and retrieval for social media
on smartphones”, Faculty of Mathematics and Computer Science, University of Hagen,
Universitätsstrasse 1, D-58097 Hagen, Germany; Academy for International Science & Research
(AISR), Derry BT48 7TG, UK, 12(1): 43, 2021.
Available: https://www.mdpi.com/2078-2489/12/1/43
[9] Nawei Chen, “A Survey of Indexing and Retrieval of Multimodal Documents: Text and Image”,
School of Computing, Queen’s University, Kingston, Ontario, Canada, 2007.
Available: https://link.springer.com/journal/10032
[10] N. K. Garg, L. Kaur, and M. K. Jindal, “A New Method for Line Segmentation of Handwritten
Hindi Text”, 2010 Seventh International Conference on Information Technology: New Generations,
2010, pp. 392-397, doi: 10.1109/ITNG.2010.89.
Available: https://ieeexplore.ieee.org/document/5501694
