Indexing and Retrieval of Document Images
Proposal Synopsis
Session 2019-2023
Submitted by:
Supervised by:
Dr. Qurat ul Ain Akram
Contents
List of Figures
List of Tables
Proposal Synopsis
1.1 Abstract
1.2 Introduction
1.4 Objectives
1.5 Features/Scope
1.6 Related Work
1.7 Proposed Methodology/System
1.8 Tools and Techniques
1.9 Team Member Individual Tasks/Work Division
1.10 Data Gathering Approach
1.11 Timeline/Gantt chart
1.12 References
List of Figures
Figure 1.1: Flow Chart for Development of Application for Searching Document
List of Tables
Table 1: Related Work Activities
Table 2: Work Division
Chapter 1
Proposal Synopsis
1.1 Abstract
With the advent of new technologies, document images have created a tremendous demand for
accessing and manipulating information through image processing. Instead of being kept on paper,
large quantities of printed documents are now scanned and archived as images. Almost all
notifications from universities, offices, and government sectors are issued as scanned documents,
and offices increasingly communicate official notices by transferring images. Many indexing and
retrieval systems are being developed so that a document can be retrieved quickly and efficiently,
without manually searching through thousands of documents. Text search over scanned document
images has become the foundation of many applications and saves time by quickly retrieving the
relevant documents. The need for this project arose because many organizations hold a large number
of hard-copy documents, and accessing the information in them is time consuming. The scope of this
project covers the official documents and images of UET, Pakistan. The university sends a large
number of emails every day, so finding the relevant document among them is difficult and time
consuming; users who cannot find a document in time may miss deadlines or important notifications.
This problem forms the basis of the project. A web-based application will be developed to address
it using image processing, deep learning, indexing, and retrieval techniques. The project aims to
bring the benefits of paperless scanned documents to UET by fetching the relevant document quickly
based on a user query. The system is intended to ensure timely information retrieval, prevent loss
of documents, and return up-to-date documents.
1.2 Introduction
In the digital era, there is great demand for image and video data applications. Paperless work is
spreading across many sectors, and many document images are being scanned and made available over
the Internet. Offices now communicate official notices by transferring images, and almost all
notifications from universities, offices, and government sectors take the form of scanned
documents. This project proposes a solution for searching scanned document images. First, the
dataset will be acquired from UET. Because the information must be searched within images, OCR will
be used to extract text from the images so that text search methods can be applied. From the
extracted text, fields such as title, subject, body, and signing body are obtained. The documents
are indexed on the extracted text. After indexing, retrieval takes place: the relevant documents
are fetched based on the user query. The system is intended to ensure timely information retrieval,
prevent loss of documents, and return up-to-date documents. A repository will be maintained to keep
track of all documents. Data security will also improve, because scanned documents can be
encrypted, password protected, and stored securely. Although the scope of this project covers only
UET scanned document images, the application can later be extended to other sectors where most
communication is done by transferring images.
The focus of this project is to develop a working application for the official documents of UET
that saves time and retrieves documents quickly and efficiently by searching document images using
image processing, deep learning, machine learning, indexing, and retrieval techniques.
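The indexing and retrieval steps described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the project's implementation: it assumes the OCR step has already produced plain text for each document, and the names tokenize, InvertedIndex, and the sample notices are hypothetical.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


class InvertedIndex:
    """Maps each term to the set of document ids whose text contains it."""

    def __init__(self):
        self.postings = defaultdict(set)

    def add(self, doc_id, text):
        # Record the document under every term that occurs in its text.
        for term in tokenize(text):
            self.postings[term].add(doc_id)

    def search(self, query):
        """Return ids of documents that contain every query term."""
        terms = tokenize(query)
        if not terms:
            return set()
        result = set(self.postings.get(terms[0], set()))
        for term in terms[1:]:
            result &= self.postings.get(term, set())
        return result


# Hypothetical OCR output; in the proposed system this text would come
# from running OCR on the scanned notices.
ocr_texts = {
    "notice_001": "Subject: Fee submission deadline. All students must pay the fee by Friday.",
    "notice_002": "Subject: Examination schedule. Final examination starts in June.",
    "notice_003": "Subject: Fee refund policy for students.",
}

index = InvertedIndex()
for doc_id, text in ocr_texts.items():
    index.add(doc_id, text)

print(sorted(index.search("fee students")))
```

The query uses AND semantics, so only notices containing both terms are returned. Ranking, field-specific search (title, subject, signing body), and OCR error tolerance are left out of this sketch.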
1.4 Objectives
1.5 Features/Scope
• The project covers a working application for search-based document retrieval. The
scope of this project covers the UET document fetcher application.
• A working application will be deployed that fetches documents based on user queries.
• The system will handle font size variations in the documents.
• The system will handle documents in the English language.
1.6 Related Work
Several studies have been carried out on the indexing and retrieval of scanned documents.
Some of their findings are listed below:
1. Balasubramanian et al. [1] proposed a search system for retrieving relevant documents
from a large collection of document images. Their search method is designed specifically for
Indian languages. The system computes information retrieval measures from word images
without explicitly recognizing them, and it can search across languages to retrieve relevant
documents from a multilingual document image database. At the time of publication, the
authors were working on a comprehensive test on a large collection of document images.
2. Shaheera Saba Mohd Naseem Akhter and Priti P. Rege [2] proposed semantic text
segmentation of Marathi document images using deep learning methods. The U-Net and
Residual U-Net (ResU-Net) architectures, both of which have given state-of-the-art
performance on medical image segmentation, are used for semantic segmentation of Marathi
documents. The experimental results show better performance for ResU-Net than for U-Net,
which the authors attribute to the skip connections in the model. A model with skip
connections avoids the vanishing gradient problem, and its feature accumulation generalizes
well on the segmentation task. U-Net and ResU-Net achieve 95% and 98% accuracy,
respectively, on the dataset.
3. Kumar et al. [3] presented an efficient indexing and retrieval scheme for searching large
document image databases. Content-sensitive hashing provides efficiency and scalability along
with high precision and recall. The reported retrieval speed is orders of magnitude better: the
technique can search 20,000 word images in milliseconds. The authors demonstrated that the
technique is practical for rapidly searching printed documents.
4. Nabin Sharma et al. [4] note that signatures and logos are important queries for
content-based document image retrieval from a scanned document repository. The paper addresses
signature and logo detection in a repository of scanned documents, so that documents can be
retrieved using signature or logo information.
5. David Doermann [5] surveyed background and past research on both the indexing and the
retrieval of document images. The systems covered aim to retrieve the information in documents,
index it, and abstract it, with OCR as the main approach. The features used include the texture,
shape, and structure of characters and words. Ultimately, both indexing and retrieval make use of
the features offered by the visual representation of the page and by the underlying content of
text, graphics, and images. Such systems need to address complex trade-offs between algorithm
speed, image quality, and retrieval recall and precision.
6. DocFetcher [6] is an open source desktop search application that lets users search the
contents of files on their computer. It can be thought of as Google for local files. Queries are
entered in a text field, and the search results are displayed in a result pane. A preview panel
shows a text-only preview of the file currently selected in the result pane. Results can be
filtered by minimum and/or maximum file size, by file type, and by location. Buttons are provided
for opening the manual, opening the preferences, and minimizing the program to the system tray.
The indexing concepts described for it are the naive approach to file search, index-based search,
and the telephone book analogy.
7. Fernando Vegas Fernandez [7] describes how a systematic search, unlike a narrative search
that could yield a haphazard and biased subset of documents, produces a neutral collection of
documents for intelligent information extraction from document databases such as Google
Scholar. Almost any document found in the other sources can also be found in Google Scholar.
Concepts such as natural language processing, semantics, and ontologies appear frequently in
the documents reviewed. Semantic-based tag extraction is performed with the DOMINUS system.
The tags are title, author, abstract, and references, and they are now easier to retrieve with
Google Scholar and tools such as EndNote and Mendeley. The focus is on information retrieval
(document retrieval) based on word concepts and text clustering. The documents analyzed
propose algorithm-based systems and agents with rules to query document databases. The review
has three stages: stage 1 covers review planning and searching for relevant articles in
electronic databases; stage 2 deletes duplicates by title and author and excludes irrelevant
papers by reading their titles, abstracts, and keywords; and stage 3 is content analysis.
8. Wagenpfeil et al. [8] look for the best solution for indexing and retrieving images on
smartphones. The options considered for this task are semantic analysis and representation,
graph representation processing, and AI pattern matching. Using a multimedia feature vector
graph, the authors build a database of encoded images by assigning each object in an image a
unique color and adding each encoded image to a single image. For retrieval, the same coding
scheme generates an encoded image from the user query, and searching for this image in the
database returns the related images. The remaining issue is the semantic gap, which is the
difference between the required data and the obtained data.
9. Nawei Chen [9] surveys multimodal IR systems that combine the text and image modalities.
The paper addresses the following issues in multimodal IR: techniques for combining text and
images; techniques for finding relationships between text and images; noise and uncertainties
in IR systems; and techniques for improving IR effectiveness, such as Latent Semantic
Indexing, user relevance feedback, semantic networks, and document clustering and
classification.
10. Garg et al. [10] present a line segmentation method based on header line and base line
detection. Their results show segmentation of variable-skew or fluctuating lines of handwritten
Hindi text.
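Several of the works above report retrieval quality as precision and recall. As a minimal sketch of how these two measures are computed for a single query, assuming a hand-labelled set of relevant documents (the function name and the document ids are hypothetical):

```python
def precision_recall(retrieved, relevant):
    """Compute precision and recall for one query.

    precision = |retrieved AND relevant| / |retrieved|
    recall    = |retrieved AND relevant| / |relevant|
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical example: the system returned four documents, and three
# documents in the collection are actually relevant to the query.
p, r = precision_recall(["d1", "d2", "d3", "d4"], ["d1", "d3", "d5"])
print(p, r)
```

Two of the four returned documents are relevant, so precision is 2/4, and two of the three relevant documents were found, so recall is 2/3.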
Approaches:
After reading the research papers, the following approach is summarized for our project:
A deep learning based solution: acquire document images from the university; perform
image-level segmentation, including data annotation and tagging; build a model for
classification and recognition; test and evaluate the model to increase its accuracy; and
finally deliver a working application that performs indexing and searching of documents.
Table 1: Related Work Activities
Efficient Search in Document Image Collection | Collection of 7 Kalidasa books | Indexing, locality sensitive hashing (LSH) | 90% | Feature selection using machine learning techniques to include multiple fonts and styles | For feature selection using machine learning techniques
Signature and Logo Detection using Deep CNN for Document Image Retrieval | Tobacco-800 | ZF, VGG16, VGG M, and YOLOv2 | Promising and at par with the existing methods | Only for logo and signature | Work can be done for title and body
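The locality sensitive hashing (LSH) listed in Table 1 can be illustrated with a small sketch. It uses random-hyperplane hashing, a common generic LSH variant chosen here only for illustration; it is not claimed to be the content-sensitive hashing scheme of [3]. The class name, parameters, and the four-dimensional word-image feature vectors are hypothetical.

```python
import random
from collections import defaultdict


def make_hyperplanes(num_bits, dim, seed=0):
    """Draw num_bits random hyperplane normals of dimension dim."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_bits)]


def lsh_key(vector, hyperplanes):
    """One bit per hyperplane: 1 if the vector lies on its positive side."""
    return tuple(
        1 if sum(v * h for v, h in zip(vector, plane)) >= 0 else 0
        for plane in hyperplanes
    )


class LSHIndex:
    """Buckets items by hash key so that similar vectors tend to collide."""

    def __init__(self, num_bits, dim, seed=0):
        self.hyperplanes = make_hyperplanes(num_bits, dim, seed)
        self.buckets = defaultdict(list)

    def add(self, item_id, vector):
        self.buckets[lsh_key(vector, self.hyperplanes)].append(item_id)

    def query(self, vector):
        # Only the matching bucket is inspected, not the whole collection.
        return list(self.buckets.get(lsh_key(vector, self.hyperplanes), []))


# Hypothetical feature vectors for two word images.
features = {
    "word_a": [0.9, 0.1, 0.3, 0.7],
    "word_b": [-0.9, -0.1, -0.3, -0.7],
}

lsh = LSHIndex(num_bits=8, dim=4)
for word_id, vec in features.items():
    lsh.add(word_id, vec)

# A scaled copy of word_a points in the same direction, so it hashes
# to the same bucket as word_a.
print(lsh.query([1.8, 0.2, 0.6, 1.4]))
```

Because a query touches a single bucket, lookup cost does not grow with the size of the collection in the way a linear scan does. A practical system would typically use several hash tables to reduce missed neighbours.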
[Flow chart stages: Background Knowledge, Literature Survey, Image Data Acquisition, Document Image Preprocessing, Classification and Recognition, Testing and Evaluation, Development, Documentation, Research Paper]
Figure 1.1: Flow Chart for Development of Application for Searching Document
Ubaidullah
1.10 Data Gathering Approach
For this project, the data will be collected from UET, because the project scope covers
indexing and retrieval of official UET documents.
1.11 Timeline/Gantt chart
[Gantt chart: Literature Survey (14), Data Acquiring (16), Model Building (60), Development of System (90); time axis from May-22 to Mar-23]
1.12 References
[1] A. Balasubramanian, Million Meshesha, and C.V. Jawahar, “Retrieval from Document
Image Collections”, Proc. of the 4th Indian Conference on Computer Vision, Graphics and
Image Processing (ICVGIP), Hyderabad, India, pages 622–627, 2004
Available: https://www.researchgate.net/publication/221551741_Searching_in_Document_Images
[2] Shaheera Saba Mohd Naseem Akhter and Priti P. Rege, “Semantic Segmentation of Printed
Text from Marathi Document Images using Deep Learning Methods”, 2019 IEEE 16th India
Council International Conference (INDICON), Rajkot, India, 13-15 Dec. 2019
Available: https://ieeexplore.ieee.org/abstract/document/9030360/references#references
[3] Anand Kumar, C.V. Jawahar, and R. Manmatha, “Efficient Search in Document Image
Collections”, ACCV 2007: Computer Vision – ACCV, Department of Computer Science,
University of Massachusetts, Amherst, MA 01003, USA, vol 4843, pp 586–59, 2007
Available: https://link.springer.com/chapter/10.1007/978-3-540-76386-4_55
[4] Nabin Sharma, Ranju Mandal, Rabi Sharma, Umapada Pal, and Michael Blumenstein, “Signature
and Logo Detection using Deep CNN for Document Image Retrieval”, 2018 16th International
Conference on Frontiers in Handwriting Recognition (ICFHR), Niagara Falls, NY, USA,
Aug. 2018
[5] David Doermann, “The Indexing and Retrieval of Document Images: A Survey”, Center for
Automation Research, University of Maryland, College Park, Maryland, 1998
Available: https://www.sciencedirect.com/science/article/abs/pii/S1077314298906920
[6] Tran Nam Quang, “DocFetcher”, DocFetcher Team, February 14, 2022 [online]
Available: http://docfetcher.sourceforge.net/en/index.html
[7] F. Vegas Fernandez, “Intelligent information extraction from scholarly document databases”,
Journal of Intelligence Studies in Business, Departamento de Ingeniería Civil: Construcción,
Universidad Politécnica de Madrid, Spain, vol 11, no 3, 2020
Available: https://ojs.hh.se/index.php/JISIB/article/view/834
[8] S. Wagenpfeil et al., “AI-based semantic multimedia indexing and retrieval for social
media on smartphones”, Faculty of Mathematics and Computer Science, University of Hagen,
Universitätsstrasse 1, D-58097 Hagen, Germany; Academy for International Science & Research
(AISR), Derry BT48 7TG, UK, 12(1): 43, 2021
Available: https://www.mdpi.com/2078-2489/12/1/43
[9] Nawei Chen, “A Survey of Indexing and Retrieval of Multimodal Documents: Text and
Image”, School of Computing, Queen’s University, Kingston, Ontario, Canada, 2007
Available: https://link.springer.com/journal/10032
[10] N. K. Garg, L. Kaur, and M. K. Jindal, “A New Method for Line Segmentation of Handwritten
Hindi Text”, 2010 Seventh International Conference on Information Technology: New
Generations, 2010, pp. 392-397, doi: 10.1109/ITNG.2010.89.
Available: https://ieeexplore.ieee.org/document/5501694