
STOP LOOKING -START SEARCHING (WEB SEARCH ENGINE)

60-654 ADVANCED COMPUTING CONCEPTS

PROJECT REPORT

Group Members:

Vamsi Krishna Kathi-104348224


Rathilesh Reddy Panyala-104348213

Guided by

Dr. Luis Rueda


Table of Contents

Objectives

Introduction

How SLSS works

Concepts used in SLSS

Implementation of concepts

I. Jsoup
II. Edit Distance
III. HashMap
IV. Sorting HashMap

A glimpse of SLSS

Further Scope of SLSS

Conclusion

References


Objectives:

The main aim of the project is to implement the concepts learned in
class in a real system, in this case a web search engine, and thereby
gain hands-on experience in developing a real system with the data
structures and algorithms studied and in evaluating that system using
the analysis methods covered.

The objectives of Stop Looking Start Searching (SLSS) are:

• Provide users with the most relevant information for their query
• Optimize the speed of the query
• Provide users with a number of search results based on the ranking
of the web pages

Introduction:

The rapid growth of the World Wide Web (WWW) has made a large quantity
of data accessible, both structured and unstructured. [1] From the
user's perspective, the job of a search engine is to retrieve the most
relevant information based on the keywords entered. To retrieve results
from both structured and unstructured data, a search engine needs a
series of programs that determine how the search is carried out.

The three main components that make a good search engine are:
1. A database of web contents.
2. A search engine operating on the database.
3. A series of programs and algorithms that determine how accurately
and how fast the search results are displayed. [2]

This report covers the following:

• Defining the concept of a search engine
• Discussing the searching strategies
• Transferring the learning outcomes to real-world practice
• Choosing algorithms based on need

How SLSS works:

SLSS works in the following steps:

1. When the program starts, the web pages are converted into text.
2. On the user interface, the user enters the keyword to search for
and the number of files to be searched.
3. Once the input is given, the keyword is matched against the
content that was converted into text.
4. The matches are sorted by their occurrence count and displayed in
the output.
5. If the keyword has no match, the word closest to the entered
keyword is found.
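
Steps 2 to 4 above can be sketched as follows. This is a minimal
illustration, not the SLSS source: the class SearchPipeline, its
method names, and the choice of whole-word, case-insensitive matching
are all assumptions made for the example.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of steps 2-4; names are illustrative, not from SLSS.
class SearchPipeline {

    // Counts whole-word, case-insensitive occurrences of keyword in text.
    static int countOccurrences(String text, String keyword) {
        String target = keyword.toLowerCase();
        int count = 0;
        for (String word : text.toLowerCase().split("\\W+")) {
            if (word.equals(target)) {
                count++;
            }
        }
        return count;
    }

    // pages maps a file name to its extracted text. Only the first
    // maxFiles entries (in the map's iteration order) are searched.
    static List<Map.Entry<String, Integer>> search(
            Map<String, String> pages, String keyword, int maxFiles) {
        List<Map.Entry<String, Integer>> hits = new ArrayList<>();
        int searched = 0;
        for (Map.Entry<String, String> page : pages.entrySet()) {
            if (searched++ >= maxFiles) {
                break;
            }
            int count = countOccurrences(page.getValue(), keyword);
            if (count > 0) {
                hits.add(new AbstractMap.SimpleEntry<>(page.getKey(), count));
            }
        }
        // Files with the highest occurrence count come first.
        hits.sort((a, b) -> Integer.compare(b.getValue(), a.getValue()));
        return hits;
    }
}
```

An empty result list corresponds to step 5, where the closest
dictionary word would be looked up instead.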

Concepts used in SLSS:

1. Jsoup.
2. HashMap.
3. Sorting of HashMap.
4. Edit distance.
5. Multiway Merge sort (to be used in further iterations).


Implementation:
1. Conversion of HTML to Text

Concepts or algorithms used:

Jsoup: Jsoup is an open-source Java library for working with
real-world HTML. It provides a very convenient API for extracting and
manipulating data. Jsoup implements the WHATWG HTML5 specification and
parses HTML to the same DOM as modern browsers do. [3]

How it works?

Parsing HTML with Jsoup is simple: call the static method
Jsoup.parse() and pass it the HTML string. Jsoup provides several
overloaded parse() methods that read HTML from a String, a File, a URL,
or an InputStream, optionally with a base URI for resolving relative
links. When a file is not encoded in UTF-8, its character encoding
should be specified so that it is read correctly. The parse(String
html) method parses the input HTML into a new Document. In Jsoup,
Document extends Element, which extends Node; TextNode also extends
Node. As long as you pass in a non-null string, you are guaranteed a
successful, sensible parse, with a Document containing at least a head
and a body element. Once you have a Document, you can get the data you
want by calling the appropriate methods in Document and its parent
classes Element and Node.
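
The conversion described above might look like the following sketch.
It assumes the Jsoup library is on the classpath; the class HtmlToText
and its method names are hypothetical and not taken from the SLSS
source. Only Jsoup.parse(String), Jsoup.parse(File, String),
Document.body() and Element.text() are used.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

import java.io.File;
import java.io.IOException;

// Hypothetical helper; requires the Jsoup jar on the classpath.
class HtmlToText {

    // Parses an HTML string and returns the visible text of its body.
    static String toText(String html) {
        Document doc = Jsoup.parse(html);
        return doc.body().text();
    }

    // Parses an HTML file using the given character encoding,
    // for example "ISO-8859-1" for a file that is not UTF-8.
    static String toText(File file, String charsetName) throws IOException {
        Document doc = Jsoup.parse(file, charsetName);
        return doc.body().text();
    }
}
```

Calling text() on the body drops all tags and returns the combined,
whitespace-normalized text of the page, which is the form the later
keyword matching works on.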

2. Searching with words that are not in the dictionary

Concepts or algorithms used:

Edit distance: Edit distance is applied to find the word closest to
the keyword. When no match between the dictionary words and the
keyword is found, the system automatically compares the keyword with
the dictionary words, computes the edit distance to each word, and
lists the closest ones.
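
As an illustration, the sketch below assumes the Levenshtein variant of
edit distance, in which an insertion, a deletion, or a substitution
each costs 1. The report does not say which variant SLSS uses, and the
class and method names here are hypothetical.

```java
import java.util.List;

// Hypothetical sketch; assumes Levenshtein distance with unit costs.
class EditDistance {

    // Classic dynamic-programming table: d[i][j] is the distance between
    // the first i characters of a and the first j characters of b.
    static int distance(String a, String b) {
        int[][] d = new int[a.length() + 1][b.length() + 1];
        for (int i = 0; i <= a.length(); i++) {
            d[i][0] = i; // delete all i characters
        }
        for (int j = 0; j <= b.length(); j++) {
            d[0][j] = j; // insert all j characters
        }
        for (int i = 1; i <= a.length(); i++) {
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                int deletion = d[i - 1][j] + 1;
                int insertion = d[i][j - 1] + 1;
                int substitution = d[i - 1][j - 1] + cost;
                d[i][j] = Math.min(Math.min(deletion, insertion), substitution);
            }
        }
        return d[a.length()][b.length()];
    }

    // Returns the dictionary word with the smallest distance to keyword,
    // or null if the dictionary is empty. Ties keep the earliest word.
    static String closest(String keyword, List<String> dictionary) {
        String best = null;
        int bestDistance = Integer.MAX_VALUE;
        for (String word : dictionary) {
            int dist = distance(keyword, word);
            if (dist < bestDistance) {
                bestDistance = dist;
                best = word;
            }
        }
        return best;
    }
}
```

For example, the misspelled keyword "serch" is one insertion away from
"search", so closest() would suggest "search" from a dictionary that
contains it.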

3. Searching for the keyword given as input

Concepts or algorithms used:

HashMap: A HashMap is a data structure that implements an associative
array, a structure that maps keys to values. A hash table uses a hash
function to compute an index into an array of buckets or slots, from
which the desired value can be found. [5]
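
A typical use of this key-to-value mapping in a search setting is a
word index that maps each word of a page to the number of times it
occurs. The sketch below is an assumption about how such an index could
be built, not the SLSS code; WordIndex and countWords are hypothetical
names.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: maps each lower-cased word to its occurrence count.
class WordIndex {

    static Map<String, Integer> countWords(String text) {
        Map<String, Integer> counts = new HashMap<>();
        for (String word : text.toLowerCase().split("\\W+")) {
            if (!word.isEmpty()) {
                // Insert 1 for a new word, otherwise add 1 to the old count.
                counts.merge(word, 1, Integer::sum);
            }
        }
        return counts;
    }
}
```

Looking up a keyword is then a single get() call on the map, and a
result of null means the word does not occur in the page.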

How it works?

1. HashMap has an inner class called Entry (named Node since Java 8),
which stores a key-value pair.
2. Entry objects are stored in an array of type Entry[] called table.
3. Each index of table is logically known as a bucket, and it stores
the first element of a linked list.
4. The key object's hashCode() is used to find the bucket for that
Entry object.
5. If two key objects have the same hash code, they go into the same
bucket of the table array.
6. The key object's equals() method is used to ensure the uniqueness of
keys.
7. The value object's equals() and hashCode() methods are not used when
storing or looking up entries.
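
The points above can be made concrete with a deliberately simplified
hash table. This is a teaching sketch, not the java.util.HashMap
source: ToyHashMap is a hypothetical class with a fixed number of
buckets, no resizing, and no support for null keys.

```java
// Hypothetical, simplified hash table with separate chaining.
class ToyHashMap<K, V> {

    // One key-value pair; next links entries that share a bucket.
    private static final class Entry<K, V> {
        final K key;
        V value;
        Entry<K, V> next;

        Entry(K key, V value, Entry<K, V> next) {
            this.key = key;
            this.value = value;
            this.next = next;
        }
    }

    // Each slot holds the first entry of that bucket's linked list.
    private final Entry<K, V>[] table;

    @SuppressWarnings("unchecked")
    ToyHashMap(int buckets) {
        table = (Entry<K, V>[]) new Entry[buckets];
    }

    // The key's hashCode() alone decides the bucket.
    private int bucketIndex(K key) {
        return Math.floorMod(key.hashCode(), table.length);
    }

    void put(K key, V value) {
        int i = bucketIndex(key);
        for (Entry<K, V> e = table[i]; e != null; e = e.next) {
            if (e.key.equals(key)) { // equals() keeps keys unique
                e.value = value;
                return;
            }
        }
        table[i] = new Entry<>(key, value, table[i]); // new head of the list
    }

    V get(K key) {
        for (Entry<K, V> e = table[bucketIndex(key)]; e != null; e = e.next) {
            if (e.key.equals(key)) {
                return e.value;
            }
        }
        return null;
    }
}
```

The strings "Aa" and "BB" have the same hash code in Java, so they
always land in the same bucket here, and only equals() tells them
apart. The value is never hashed or compared.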

4. Sorting the HashMap:

How it works?

Sorting the HashMap is done first by keys and then by values. The
program is divided into two parts: the first part sorts the HashMap by
keys and the second part sorts it by values. To sort a HashMap by
values, we use a Comparator implementation that compares entries by
their values to arrange them in a particular order. The Comparator
implements the compare() method, which accepts two entries, retrieves
their values, compares them, and returns the result. [4] Since the Java
Collections API has no method for sorting a Map directly, we used the
Collections.sort() method, which accepts a List. This involves creating
a temporary ArrayList of the entries for sorting and then copying the
entries from the sorted ArrayList into a new LinkedHashMap, which keeps
them in sorted order. Because LinkedHashMap is a subclass of HashMap,
it can be used wherever a HashMap is expected and gives us the sorted
map we needed; copying it into a plain HashMap would lose the order.
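
The value-sorting part can be sketched as below. The descending order
(highest count first) and the names SortByValue and
sortByValueDescending are assumptions for the example; the report only
says that entries are arranged "in a particular order".

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of sorting a map by its values.
class SortByValue {

    static Map<String, Integer> sortByValueDescending(Map<String, Integer> counts) {
        // Copy the entries into a temporary list, since a Map cannot be sorted directly.
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(counts.entrySet());

        Collections.sort(entries, new Comparator<Map.Entry<String, Integer>>() {
            @Override
            public int compare(Map.Entry<String, Integer> a, Map.Entry<String, Integer> b) {
                return b.getValue().compareTo(a.getValue()); // larger values first
            }
        });

        // LinkedHashMap iterates in insertion order, so the sorted order is kept.
        Map<String, Integer> sorted = new LinkedHashMap<>();
        for (Map.Entry<String, Integer> entry : entries) {
            sorted.put(entry.getKey(), entry.getValue());
        }
        return sorted;
    }
}
```

Collections.sort() is stable, so entries with equal values keep the
relative order they had in the input map's iteration.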

A Glimpse of SLSS:

Fig: The GUI of the SLSS


Fig: Keyword (web) with 17 files to be searched is entered.

Fig: The output or the retrieved results of the search being displayed.


Further Scope:

• Can be upgraded to support multi-keyword search
• Support for large dictionaries or data sets using multiway merge sort
• A more user-friendly front end for the search engine
• Redirection to the web pages from the results obtained

Conclusion:

A large number of algorithms represents a large number of possible
solutions to a given problem. By developing a good understanding of
algorithms, we are able to choose the algorithm best suited to a
particular problem.
Support for large dictionary sets remains to be implemented, and
further reading is needed to bring out the best in SLSS for the
practical scenarios it faces.

References:

1. http://www.ntoulas.net/pubs/ntoulas_understanding_SE.pdf
2. http://www.lib.berkeley.edu/TeachingLib/Guides/Internet/SrchEngCriteria.pdf
3. http://javarevisited.blogspot.ca/2014/09/how-to-parse-html-file-in-java-jsoup-example.html
4. http://java67.blogspot.ca/2015/01/how-to-sort-hashmap-in-java-based-on.html
5. http://www.javacodegeeks.com/2014/03/how-hashmap-works-in-java.html
