

Department of Informatics
University of Leicester
CO7201 Individual Project

Preliminary Report

Indexing and Querying Documents using MapReduce

Ashwath Vijayan
asv9@student.le.ac.uk
asv9

Project Supervisor: Dr Thomas Erlebach


Second Marker: Dr Nicole Yap

Word Count: 1440


Thursday, 4th March, 2021

DECLARATION
All sentences or passages quoted in this report, or computer code of any form whatsoever
used and/or submitted at any stages, which are taken from other people’s work have been
specifically acknowledged by clear citation of the source, specifying author, work, date
and page(s). Any part of my own written work, or software coding, which is substantially
based upon other people’s work, is duly accompanied by clear citation of the source,
specifying author, work, date and page(s). I understand that failure to do this amounts to
plagiarism and will be considered grounds for failure in this module and the degree
examination as a whole.

Name: Ashwath Vijayan


Date: Thursday, 4th March, 2021

Contents

1. Aims and Objectives
2. Requirements
3. Technical Specification
4. Requirements Evaluation Plan
5. Background Research and Reading List
6. Time-plan and Risk Plan
7. References

1. Aims and Objectives


The aim of this project is to build a MapReduce application that can index reasonably large documents and facilitate search on top of those indexes. Searching a handful of documents for specific keyword(s) may be straightforward and fast, but imagine having to search across thousands of documents. To perform efficient searches across multiple documents, building inverted indexes or Lucene indexes is crucial. Building an index involves going through all the documents and recording the position of every word found in each document. Since this process can be done in parallel, this project will use MapReduce for indexing.
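
To make the indexing job concrete, below is a minimal sketch of an inverted-index mapper and reducer using the Hadoop mapreduce API. The class names, the document@position posting format, and the rough byte-offset position arithmetic are illustrative assumptions, not the project's settled design.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

public class InvertedIndex {

    // Mapper: for every token, emit (word, "document@position").
    public static class IndexMapper extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            String doc = ((FileSplit) context.getInputSplit()).getPath().getName();
            long position = offset.get(); // byte offset of this line in the file
            StringTokenizer tokens = new StringTokenizer(line.toString());
            while (tokens.hasMoreTokens()) {
                String word = tokens.nextToken().toLowerCase();
                context.write(new Text(word), new Text(doc + "@" + position));
                position += word.length() + 1; // rough: assumes single-space separators
            }
        }
    }

    // Reducer: merge every posting for a word into one line of the index.
    public static class IndexReducer extends Reducer<Text, Text, Text, Text> {
        @Override
        protected void reduce(Text word, Iterable<Text> postings, Context context)
                throws IOException, InterruptedException {
            StringBuilder merged = new StringBuilder();
            for (Text posting : postings) {
                if (merged.length() > 0) merged.append(';');
                merged.append(posting.toString());
            }
            context.write(word, new Text(merged.toString()));
        }
    }
}

Here the mapper emits one (word, posting) pair per token, and the reducer merges all postings for a word into a single line of the index file.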
After indexing the files, a consistent format has to be maintained for storing the indexes, and it is important to split the indexed file into chunks so that searches do not have to scan a single huge file. To facilitate fast search on these indexes, it is essential to maintain a lookup table that maps any given word to the file holding its indexing information (since there will be multiple index files). This raises the challenges of persisting the lookup locally, updating it, and so on.
Finally, the application will be compared with other indexing frameworks such as Elasticsearch and Apache Lucene.

2. Requirements
High Level Requirements:
The user will be able to simply point the application at the root directory of the document set and initiate the indexing process. Indexing will be carried out using MapReduce and will store the indexes in a custom format.
After this, the user will be able to search across these indexes through CLI- or REST-based mechanisms.
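
As a rough illustration of the REST-based mechanism, a Spring Boot (2.4) endpoint might look like the sketch below; the /search path, the "q" parameter, and the stubbed index lookup are hypothetical, not the project's actual API.

import java.util.Collections;
import java.util.List;

import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.RequestParam;
import org.springframework.web.bind.annotation.RestController;

@RestController
public class SearchController {

    @GetMapping("/search")
    public List<String> search(@RequestParam("q") String query) {
        // In the real application this would consult the lookup tree and the
        // sharded index files; an empty result stands in here.
        return Collections.emptyList();
    }
}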

Essential Requirements:
1. Provision to work with any folder structure.
2. Simple setup of the application with minimal dependencies.
3. Sharding the indexed file so as to improve performance.
4. Storing the word-to-index-file mapping as a self-balancing tree, with support for saving the tree to a local file and loading it back (a sketch follows this list).
5. CLI-based search capabilities.
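
A minimal sketch of requirement 4, assuming java.util.TreeMap (a red-black tree) as the self-balancing structure and plain Java serialization for persistence; the class and method names are illustrative.

import java.io.*;
import java.util.TreeMap;

public class LookupTree {
    // Self-balancing (red-black) tree mapping each word to its index file.
    private TreeMap<String, String> wordToIndexFile = new TreeMap<>();

    public void put(String word, String indexFileName) {
        wordToIndexFile.put(word, indexFileName);
    }

    public String indexFileFor(String word) {
        return wordToIndexFile.get(word);
    }

    // Persist the tree to a local file via Java serialization.
    public void save(File file) throws IOException {
        try (ObjectOutputStream out = new ObjectOutputStream(new FileOutputStream(file))) {
            out.writeObject(wordToIndexFile);
        }
    }

    // Rebuild the tree from a previously saved file.
    @SuppressWarnings("unchecked")
    public void load(File file) throws IOException, ClassNotFoundException {
        try (ObjectInputStream in = new ObjectInputStream(new FileInputStream(file))) {
            wordToIndexFile = (TreeMap<String, String>) in.readObject();
        }
    }
}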

Recommended Requirements:
1. Ability to update indexes with new or updated files.
2. REST-based search capabilities.
3. Caching indexed files using an LRU policy so that search is fast (a sketch follows this list).
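
A minimal sketch of the LRU cache in recommended requirement 3, built on LinkedHashMap's access-order mode; the capacity handling and the String value type standing in for a file chunk are assumptions.

import java.util.LinkedHashMap;
import java.util.Map;

public class IndexChunkCache extends LinkedHashMap<String, String> {
    private final int capacity;

    public IndexChunkCache(int capacity) {
        super(16, 0.75f, true); // accessOrder = true gives LRU behaviour
        this.capacity = capacity;
    }

    // Evict the least recently used chunk once the cache is full.
    @Override
    protected boolean removeEldestEntry(Map.Entry<String, String> eldest) {
        return size() > capacity;
    }
}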

Optional Requirements:
1. Support for custom ranking based on the dataset. Since this project will use a news dataset, ranking will be based on relevance and recency.
2. Implementing indexing using existing frameworks and comparing the performance.

3. Technical Specification
Programming Languages:
1. Java - 8
2. JavaScript - ES6
3. HTML - 5
4. CSS - 3

Build Tool:
1. Apache Maven - 3.6.3

Frameworks:
1. Apache Hadoop - 1.2.1 (Maven Based)
2. Spring Boot - 2.4 (Maven Based)
3. Apache Lucene - 8.x.x (Maven Based)
4. Elasticsearch Client - 7.x.x (Maven Based)

Operating System:
1. Ubuntu Linux - 20.04

Cloud Technologies:
1. AWS - For testing performance on multiple nodes on cloud.

IDE:
1. IntelliJ IDEA
2. VS Code

Version Control:
1. Git
2. SVN (University hosted)

4. Requirements Evaluation Plan


Efficiency:
The application's performance will be compared with a plain textual search and with other existing indexing technologies. The application is expected to come within a reasonable margin of the established frameworks.

Mappers and Reducers:
The number of nodes in the Hadoop cluster will also determine performance, so increasing the node count and measuring the results will give a proper idea of how to fine-tune the application. The number of emit operations will also affect performance; one possible mitigation is sketched below.
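
One common way to reduce the number of records emitted into the shuffle is to register a combiner that pre-aggregates postings on each node before the shuffle phase. Whether this project adopts one is an open question; the configuration below is only a sketch, reusing the illustrative IndexMapper/IndexReducer classes from Section 1.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;

public class JobSetup {
    public static Job configure() throws Exception {
        Job job = new Job(new Configuration(), "inverted-index"); // Hadoop 1.x style
        job.setJarByClass(JobSetup.class);
        job.setMapperClass(InvertedIndex.IndexMapper.class);
        // The reducer doubles as a combiner because joining postings with
        // ';' is associative, so partial merges before the shuffle are safe.
        job.setCombinerClass(InvertedIndex.IndexReducer.class);
        job.setReducerClass(InvertedIndex.IndexReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        return job;
    }
}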

Testing:
The application can be tested by running the operations on datasets and comparing the search responses with pre-computed results. The application must be able to handle any type of folder structure, so testing with nested folders, plain files, etc. is needed; an illustrative comparison sketch follows.
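
As an illustration of the comparison step, a smoke test might look like the sketch below; the SearchEngine call shown in the comment and the document names are hypothetical placeholders.

import java.util.Arrays;
import java.util.List;

public class SearchSmokeTest {
    public static void main(String[] args) {
        // Hypothetical call into the application's search layer:
        // List<String> actual = SearchEngine.search("election");
        List<String> actual = Arrays.asList("news-001.txt", "news-042.txt");
        // Pre-computed expected result set for the same query:
        List<String> expected = Arrays.asList("news-001.txt", "news-042.txt");
        if (!actual.containsAll(expected)) {
            throw new AssertionError("expected documents missing from search results");
        }
        System.out.println("search results match the pre-computed set");
    }
}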

5. Background Research and Reading List


Background Research:
Indexing is widely used to perform efficient searches across websites, articles, e-books, etc. Without proper indexes, an application has to scan all the documents to answer a search query. Among index structures, inverted indexes are popular, and the underlying idea is practical and straightforward.
Inverted indexes play a major role in answering advanced search queries, since the recorded locations of occurrences are used to filter the required documents. Apache Lucene and Elasticsearch are popular search frameworks that use inverted indexing internally.
Since indexing parallelises naturally, MapReduce is a suitable option, so this application will use MapReduce to index documents and lookups to perform search operations. For lookups, a key use case is applying updates and persisting the in-memory lookup to a local file. In addition, there exist retrieval models such as the vector space model. At this point the relationship between vector space models and lookups is unclear, so more research is needed.
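
For reference, the core scoring idea in the vector space model is cosine similarity between a query vector q and a document vector d:

\mathrm{sim}(q, d) = \frac{q \cdot d}{\lVert q \rVert \, \lVert d \rVert} = \frac{\sum_i q_i d_i}{\sqrt{\sum_i q_i^2}\,\sqrt{\sum_i d_i^2}}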

Reading List:
1. How Hadoop works: https://www.udemy.com/course/master-apache-hadoop/
2. Indexing and Lucene indexing: Lucene in Action by Erik Hatcher et al.
3. Elasticsearch: https://www.udemy.com/course/elasticsearch-complete-guide/
4. McCreadie, R., Macdonald, C. & Ounis, I. (2009). Comparing Distributed Indexing: To MapReduce or Not? CEUR Workshop Proceedings.
5. Vector space models: https://www.geeksforgeeks.org/web-information-retrieval-vector-space-model/

6. Time-plan and Risk Plan



Breakdown of timeline:

Task | Planned Time | Explanation
Requirements and background research | 25-Feb-21 to 4-Mar-21 | Setting the requirements of the application and researching existing frameworks
Learning Hadoop and experimenting with it | 27-Feb-21 to 5-Mar-21 | Exploring the Hadoop environment and understanding HDFS
Implementing code for indexing the documents (Milestone 1) | 5-Mar-21 to 9-Mar-21 | Code for indexing documents and testing with real news datasets
Sharding the indexes and building a lookup tree | 9-Mar-21 to 16-Mar-21 | Shard the index so that searching is easier; a lookup tree becomes essential once the file is sharded into chunks
Saving the lookup tree locally and rebuilding it from a file | 16-Mar-21 to 21-Mar-21 | Saving/loading lookup trees to and from a local file
LRU cache for keeping indexed documents in memory based on word usage (Milestone 2) | 21-Mar-21 to 26-Mar-21 | Caching file chunks based on usage via an LRU caching mechanism
Implementing basic search functionalities | 27-Mar-21 to 3-Apr-21 | Basic search with ranking capabilities and testing with the news dataset
CLI and REST based search options (Milestone 3) | 4-Apr-21 to 9-Apr-21 | Building CLI and REST services on top of the implemented search code
Running comparisons on AWS systems | 9-Apr-21 to 14-Apr-21 | Running the Hadoop programs on AWS instances and checking how performance improves across multiple nodes
Running comparisons with Apache Lucene/Elasticsearch | 14-Apr-21 to 21-Apr-21 | Comparisons with existing frameworks and noting down improvements
Final product and report | 21-Apr-21 to 30-Apr-21 | Code refactoring and finishing up the report

Risk Plan:
Lack of experience: The author has no experience working with MapReduce, so a sufficient amount of time must be devoted to reading articles and understanding best practices.

Unknown AWS costs: Testing the application's efficiency involves running MapReduce across datasets in a multi-node environment to gather metrics. Since some resources are costly, the experiments may have to be scaled back, making exact metrics harder to obtain.

Multi-Environment Testing: Running MapReduce code on a Windows system produced permission errors whose resolution may require modifying source code and repackaging the Hadoop jar file. As a result, multi-environment testing might be skipped and all testing carried out on a Linux system.

Ranking issues: Ranking the search results is important and depends largely on the type of data. Providing an interface through which end users can write custom ranking code may not be feasible, so ranking will be restricted to a specific dataset.

7. References
References that I have found so far have already been included in the reading list.
