Department of Informatics
University of Leicester
CO7201 Individual Project
Preliminary Report
Ashwath Vijayan
asv9@student.le.ac.uk
asv9
DECLARATION
All sentences or passages quoted in this report, or computer code of any form whatsoever
used and/or submitted at any stages, which are taken from other people’s work have been
specifically acknowledged by clear citation of the source, specifying author, work, date
and page(s). Any part of my own written work, or software coding, which is substantially
based upon other people’s work, is duly accompanied by clear citation of the source,
specifying author, work, date and page(s). I understand that failure to do this amounts to
plagiarism and will be considered grounds for failure in this module and the degree
examination as a whole.
2. Requirements
High Level Requirements:
The user will be able to point the application at the root directory of a document set and initiate the indexing process. Indexing will be carried out using MapReduce, and the documents will be indexed in a custom format.
The user will then be able to search across these indexes through CLI- or REST-based
mechanisms.
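As a rough illustration of what the MapReduce mappers and reducers would together produce, the core word-to-documents mapping can be sketched in plain Java. This is a minimal sketch, not the actual Hadoop job; the class and method names are illustrative only.

```java
import java.util.*;

// Minimal sketch of an inverted index: map each word to the set of
// document names that contain it. In the real project this mapping would
// be produced by a distributed MapReduce job over HDFS.
public class InvertedIndexSketch {

    // Build an inverted index from document name -> document text.
    public static Map<String, Set<String>> buildIndex(Map<String, String> docs) {
        Map<String, Set<String>> index = new TreeMap<>();
        for (Map.Entry<String, String> doc : docs.entrySet()) {
            // Lowercase and split on non-letters, mirroring a simple mapper.
            for (String word : doc.getValue().toLowerCase().split("[^a-z]+")) {
                if (word.isEmpty()) continue;
                index.computeIfAbsent(word, w -> new TreeSet<>()).add(doc.getKey());
            }
        }
        return index;
    }

    public static void main(String[] args) {
        Map<String, String> docs = new LinkedHashMap<>();
        docs.put("a.txt", "Hadoop indexes documents");
        docs.put("b.txt", "Search the documents");
        System.out.println(buildIndex(docs).get("documents")); // → [a.txt, b.txt]
    }
}
```

A search for a word then reduces to one lookup in this mapping, which is what the custom index format needs to make fast on disk.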
Essential Requirements:
1. Ability to work with any folder structure.
2. Simple application setup with minimal dependencies.
3. Sharding of the index file to improve performance.
4. Storing the word-to-index-file-name mapping as a self-balancing tree, with the ability
to save the tree to a local file and load it back from that file.
5. CLI-based search capabilities.
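Requirement 4 can be sketched with `java.util.TreeMap`, which is a red-black tree and therefore self-balancing. The tab-separated file format and the class name below are illustrative assumptions, not the project's final design.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.*;
import java.util.*;

// Sketch of the word -> index-file-name tree with save/load support,
// persisted as one "word<TAB>file" line per entry.
public class IndexTreeStore {

    // Save the tree to a local file, one tab-separated entry per line.
    public static void save(TreeMap<String, String> tree, Path file) {
        List<String> lines = new ArrayList<>();
        for (Map.Entry<String, String> e : tree.entrySet()) {
            lines.add(e.getKey() + "\t" + e.getValue());
        }
        try {
            Files.write(file, lines);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Load the tree back; TreeMap rebalances itself as entries are inserted.
    public static TreeMap<String, String> load(Path file) {
        TreeMap<String, String> tree = new TreeMap<>();
        try {
            for (String line : Files.readAllLines(file)) {
                String[] parts = line.split("\t", 2);
                tree.put(parts[0], parts[1]);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return tree;
    }
}
```

Sorted iteration over `TreeMap` also makes the on-disk file deterministic, which helps when comparing runs during testing.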
Recommended Requirements:
1. Ability to update indexes when files are added or changed.
2. REST-based search capabilities.
3. LRU caching of indexed files so that searches are fast.
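The LRU caching in recommended requirement 3 can be sketched with `LinkedHashMap` in access-order mode, which evicts the least recently used entry automatically. The capacity and the `String` value standing in for a loaded shard are placeholder assumptions.

```java
import java.util.*;

// Sketch of an LRU cache for loaded index shards. LinkedHashMap with
// accessOrder = true keeps iteration order from least to most recently
// used, so evicting the eldest entry implements LRU.
public class ShardCache extends LinkedHashMap<String, String> {
    private final int capacity;

    public ShardCache(int capacity) {
        super(16, 0.75f, true); // accessOrder = true
        this.capacity = capacity;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<String, String> eldest) {
        return size() > capacity; // evict when over capacity
    }

    public static void main(String[] args) {
        ShardCache cache = new ShardCache(2);
        cache.put("shard-1", "...");
        cache.put("shard-2", "...");
        cache.get("shard-1");        // touch shard-1, so shard-2 becomes eldest
        cache.put("shard-3", "..."); // evicts shard-2
        System.out.println(cache.keySet()); // → [shard-1, shard-3]
    }
}
```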
Optional Requirements:
1. Support for custom ranking based on the dataset. Since this project will use a news
dataset, ranking will be based on relevance and recency.
2. Implementing indexing using existing frameworks and comparing the performance.
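One way the relevance-plus-recency ranking for the news dataset could look is sketched below. The 0.7/0.3 weights, the linear decay, and the one-year horizon are all illustrative assumptions, not a committed design.

```java
import java.time.LocalDate;
import java.time.temporal.ChronoUnit;

// Hedged sketch of optional requirement 1: combine a text-relevance score
// with article recency for a news dataset.
public class NewsRanker {

    // relevance: e.g. a term-frequency score in [0, 1] from the index lookup.
    public static double score(double relevance, LocalDate published, LocalDate today) {
        long ageDays = ChronoUnit.DAYS.between(published, today);
        // Linear decay to 0 over 365 days; older articles keep only relevance.
        double recency = Math.max(0.0, 1.0 - ageDays / 365.0);
        return 0.7 * relevance + 0.3 * recency;
    }

    public static void main(String[] args) {
        LocalDate today = LocalDate.of(2021, 3, 1);
        double fresh = score(0.5, LocalDate.of(2021, 2, 28), today);
        double stale = score(0.5, LocalDate.of(2019, 2, 28), today);
        System.out.println(fresh > stale); // newer article ranks higher → true
    }
}
```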
3. Technical Specification
Programming Languages:
1. Java - 8
2. JavaScript - ES6
3. HTML - 5
4. CSS - 3
Build Tool:
1. Apache Maven - 3.6.3
Frameworks:
1. Apache Hadoop - 1.2.1 (Maven Based)
2. Spring Boot - 2.4 (Maven Based)
3. Apache Lucene - 8.x.x (Maven Based)
4. Elasticsearch Client - 7.x.x (Maven Based)
Operating System:
1. Ubuntu Linux - 20.04
Cloud Technologies:
1. AWS - for testing performance on multiple cloud nodes.
IDE:
1. IntelliJ IDEA
2. VS Code
Version Control:
1. Git
2. SVN (University hosted)
Testing:
The application can be tested by running its operations on datasets and comparing the search
responses with pre-computed results. Since the application must handle any type of folder
structure, testing with nested folders, plain files, etc. is also needed.
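The comparison against pre-computed results can be sketched as a plain Java check; `searchIndex` below is a hypothetical stand-in for the project's real search entry point (the actual tests would likely use JUnit).

```java
import java.util.*;

// Sketch of the regression-testing approach: run a search on a known
// dataset and compare the response with a pre-computed expected result.
public class SearchRegressionTest {

    // Stand-in search returning canned results for a known dataset.
    static List<String> searchIndex(String query) {
        Map<String, List<String>> fake = new HashMap<>();
        fake.put("hadoop", Arrays.asList("doc1.txt", "doc3.txt"));
        return fake.getOrDefault(query, Collections.emptyList());
    }

    public static void main(String[] args) {
        // Pre-computed expected results for this dataset and query.
        List<String> expected = Arrays.asList("doc1.txt", "doc3.txt");
        List<String> actual = searchIndex("hadoop");
        if (!expected.equals(actual)) {
            throw new AssertionError("search regression: got " + actual);
        }
        System.out.println("search results match pre-computed output");
    }
}
```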
Reading List:
1. Learn how Hadoop works -
https://www.udemy.com/course/master-apache-hadoop/
2. Learn about indexing and Lucene indexing - Lucene in Action by Erik Hatcher and
Otis Gospodnetić
3. Elasticsearch - https://www.udemy.com/course/elasticsearch-complete-guide/
4. McCreadie, R., Macdonald, C., & Ounis, I. (2009). Comparing Distributed Indexing:
To MapReduce or Not? CEUR Workshop Proceedings.
5. Vector space models -
https://www.geeksforgeeks.org/web-information-retrieval-vector-space-model/
Breakdown of timeline:
Exploring Hadoop (27-Feb-21 to 5-Mar-21): learning the Hadoop environment,
experimenting with it, and understanding HDFS.
Risk Plan:
Lack of experience: The author has no experience working with MapReduce, so a
sufficient amount of time must be devoted to reading articles and understanding best
practices.
Unknown AWS costs: To test the application’s efficiency, metrics will be gathered by
running MapReduce over datasets in a multi-node environment. Since some resources are
costly, exact metrics may have to be sacrificed to keep costs down.
Multi-Environment Testing: Attempts to run MapReduce code on a Windows system
produced permission errors, and fixing them may require modifying the Hadoop source
code and repackaging its jar file. Multi-environment testing might therefore be skipped,
with all testing carried out on a Linux system.
Ranking issues: Ranking the search results is important and depends largely on the type of
data. Providing an interface through which end users can write custom ranking code may
not be feasible, so ranking will be restricted to a specific dataset.
7. References
The references found so far are already included in the reading list above.