Unit-3 DWDM 7TH Sem Cse


A cluster is a group of objects that belong to the same class. In other words, similar objects
are grouped in one cluster and dissimilar objects are grouped in another cluster.

What is Clustering?
Clustering is the process of partitioning a set of abstract objects into classes of similar objects.
Points to Remember
 A cluster of data objects can be treated as one group.
 While doing cluster analysis, we first partition the set of data into groups based on data
similarity and then assign labels to the groups.
 The main advantage of clustering over classification is that it is adaptable to changes
and helps single out useful features that distinguish different groups.

Applications of Cluster Analysis


 Clustering analysis is broadly used in many applications such as market research, pattern
recognition, data analysis, and image processing.
 Clustering can also help marketers discover distinct groups in their customer base and
characterize those groups based on purchasing patterns.
 In the field of biology, it can be used to derive plant and animal taxonomies, categorize
genes with similar functionalities and gain insight into structures inherent to
populations.
 Clustering also helps in identification of areas of similar land use in an earth observation
database. It also helps in the identification of groups of houses in a city according to
house type, value, and geographic location.
 Clustering also helps in classifying documents on the web for information discovery.
 Clustering is also used in outlier detection applications such as detection of credit card
fraud.
 As a data mining function, cluster analysis serves as a tool to gain insight into the
distribution of data to observe characteristics of each cluster.

Requirements of Clustering in Data Mining


The following points throw light on why clustering is required in data mining −
 Scalability − We need highly scalable clustering algorithms to deal with large databases.
 Ability to deal with different kinds of attributes − Algorithms should be capable of
being applied to any kind of data, such as interval-based (numerical), categorical, and
binary data.
 Discovery of clusters with arbitrary shape − The clustering algorithm should be
capable of detecting clusters of arbitrary shape. It should not be bounded to only
distance measures that tend to find spherical clusters of small size.
 High dimensionality − The clustering algorithm should be able to handle not only low-
dimensional data but also high-dimensional data.
 Ability to deal with noisy data − Databases contain noisy, missing or erroneous data.
Some algorithms are sensitive to such data and may lead to poor quality clusters.
 Interpretability − The clustering results should be interpretable, comprehensible, and
usable.

Clustering Methods
Clustering methods can be classified into the following categories −

 Partitioning Method
 Hierarchical Method
 Density-based Method
 Grid-Based Method
 Model-Based Method
 Constraint-based Method

Partitioning Method
Suppose we are given a database of ‘n’ objects and the partitioning method constructs ‘k’
partitions of the data. Each partition will represent a cluster and k ≤ n. It means that it will
classify the data into k groups, which satisfy the following requirements −
 Each group contains at least one object.
 Each object must belong to exactly one group.
Points to remember −
 For a given number of partitions (say k), the partitioning method will create an initial
partitioning.
 Then it uses the iterative relocation technique to improve the partitioning by moving
objects from one group to another.

Hierarchical Methods
This method creates a hierarchical decomposition of the given set of data objects. We can
classify hierarchical methods on the basis of how the hierarchical decomposition is formed.
There are two approaches here −
 Agglomerative Approach
 Divisive Approach

Agglomerative Approach
This approach is also known as the bottom-up approach. In this, we start with each object
forming a separate group. It keeps on merging the objects or groups that are close to one
another. It keeps doing so until all of the groups are merged into one or until the termination
condition holds.

Divisive Approach
This approach is also known as the top-down approach. In this, we start with all of the objects
in the same cluster. In each successive iteration, a cluster is split up into smaller clusters. This
continues until each object is in its own cluster or until the termination condition holds. This
method is rigid, i.e., once a merging or splitting is done, it can never be undone.

Approaches to Improve Quality of Hierarchical Clustering


Here are the two approaches that are used to improve the quality of hierarchical clustering −
 Perform careful analysis of object linkages at each hierarchical partitioning.
 Integrate hierarchical agglomeration by first using a hierarchical agglomerative
algorithm to group objects into micro-clusters, and then performing macro-clustering on
the micro-clusters.

Density-based Method
This method is based on the notion of density. The basic idea is to continue growing the given
cluster as long as the density in the neighborhood exceeds some threshold, i.e., for each data
point within a given cluster, the neighborhood of a given radius has to contain at least a
minimum number of points.

Grid-based Method
In this, the objects together form a grid. The object space is quantized into a finite number of cells
that form a grid structure.
Advantages
 The major advantage of this method is fast processing time.
 It is dependent only on the number of cells in each dimension in the quantized space.

Model-based methods
In this method, a model is hypothesized for each cluster to find the best fit of the data for a
given model. This method locates the clusters by constructing a density function that reflects
the spatial distribution of the data points.
This method also provides a way to automatically determine the number of clusters based on
standard statistics, taking outliers or noise into account. It therefore yields robust clustering
methods.

Constraint-based Method
In this method, the clustering is performed by the incorporation of user or application-oriented
constraints. A constraint refers to the user expectation or the properties of desired clustering
results. Constraints provide us with an interactive way of communication with the clustering
process. Constraints can be specified by the user or the application requirement.
Text databases consist of huge collections of documents. They collect this information from
several sources such as news articles, books, digital libraries, e-mail messages, web pages, etc.
Due to the increase in the amount of information, text databases are growing rapidly. In many of
the text databases, the data is semi-structured.
For example, a document may contain a few structured fields, such as title, author,
publishing_date, etc. But along with the structured data, the document also contains unstructured
text components, such as abstract and contents. Without knowing what could be in the
documents, it is difficult to formulate effective queries for analyzing and extracting useful
information from the data. Users require tools to compare the documents and rank their
importance and relevance. Therefore, text mining has become popular and an essential theme in
data mining.

Information Retrieval
Information retrieval deals with the retrieval of information from a large number of text-based
documents. Information retrieval systems are usually studied separately from database systems
because the two handle different kinds of data. Examples of information retrieval systems
include −

 Online Library catalogue system
 Online Document Management Systems
 Web Search Systems etc.
Note − The main problem in an information retrieval system is to locate relevant documents in
a document collection based on a user's query. This kind of user's query consists of some
keywords describing an information need.
In such search problems, the user takes the initiative to pull relevant information out of a
collection. This is appropriate when the user has an ad hoc information need, i.e., a short-term
need. But if the user has a long-term information need, then the retrieval system can also take
the initiative to push any newly arrived information item to the user.
This kind of access to information is called Information Filtering. And the corresponding
systems are known as Filtering Systems or Recommender Systems.

Basic Measures for Text Retrieval


We need to check the accuracy of a system when it retrieves a number of documents on the
basis of the user's input. Let the set of documents relevant to a query be denoted as {Relevant}
and the set of retrieved documents as {Retrieved}. The set of documents that are both relevant
and retrieved can be denoted as {Relevant} ∩ {Retrieved}, which can be shown as the
intersection region of a Venn diagram.

There are three fundamental measures for assessing the quality of text retrieval −

 Precision
 Recall
 F-score

Precision
Precision is the percentage of retrieved documents that are in fact relevant to the query.
Precision can be defined as −
Precision= |{Relevant} ∩ {Retrieved}| / |{Retrieved}|

Recall
Recall is the percentage of documents that are relevant to the query and were in fact retrieved.
Recall is defined as −
Recall = |{Relevant} ∩ {Retrieved}| / |{Relevant}|

F-score
F-score is the commonly used trade-off. An information retrieval system often needs to trade
off recall for precision or vice versa. F-score is defined as the harmonic mean of recall and
precision as follows −
F-score = 2 × (recall × precision) / (recall + precision)
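To make these three measures concrete, here is a minimal Python sketch (the document ID sets
below are hypothetical, chosen only for illustration) −

# Computing precision, recall, and F-score for hypothetical document ID sets.
relevant = {1, 2, 3, 4, 5}        # documents truly relevant to the query
retrieved = {3, 4, 5, 6, 7, 8}    # documents the system returned

hits = relevant & retrieved       # {Relevant} ∩ {Retrieved}
precision = len(hits) / len(retrieved)                    # 3/6 = 0.50
recall = len(hits) / len(relevant)                        # 3/5 = 0.60
f_score = 2 * precision * recall / (precision + recall)   # harmonic mean ≈ 0.55

print(precision, recall, f_score)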
Partitioning Method (K-Mean) in Data Mining
Partitioning Method:
This clustering method classifies the information into multiple groups based on the
characteristics and similarity of the data. It is up to the data analyst to specify the number of
clusters that have to be generated for the clustering method.
In the partitioning method, when a database D contains N objects, the partitioning method
constructs K user-specified partitions of the data in which each partition represents a cluster
and a particular region. Many algorithms come under the partitioning method; some of the
popular ones are K-Means, PAM (K-Medoids), and CLARA (Clustering Large Applications).

In this article, we will be seeing the working of the K-Means algorithm in detail.

K-Means (A Centroid-Based Technique):


The K-Means algorithm takes the input parameter K from the user and partitions the
dataset containing N objects into K clusters so that the resulting similarity among the data
objects inside a group (intracluster) is high while the similarity of data objects with data
objects outside the cluster is low (intercluster). The similarity of a cluster is
determined with respect to the mean value of the cluster.
It is a type of squared-error algorithm. At the start, K objects are randomly chosen from the
dataset, each of which initially represents a cluster mean (centre). Each of the remaining
data objects is assigned to the nearest cluster based on its distance from the
cluster mean. The new mean of each cluster is then calculated with the added data
objects.
Algorithm: K-Means:
Input:
K: The number of clusters in which the dataset has to be divided
D: A dataset containing N objects

Output:
A dataset of K clusters
Method:
1. Randomly pick K objects from the dataset (D) as the initial cluster centres (C).
2. (Re)assign each object to the cluster to whose mean it is most similar.
3. Update the cluster means, i.e., recalculate the mean of each cluster with the updated
assignments.
4. Repeat Steps 2 and 3 until no change occurs.
Figure – K-Means clustering flowchart

Example: Suppose we want to group the visitors to a website using just their age as
follows:
16, 16, 17, 20, 20, 21, 21, 22, 23, 29, 36, 41, 42, 43, 44, 45, 61, 62, 66
Initial Cluster:
K=2
Centroid(C1) = 16 [16]
Centroid(C2) = 22 [22]
Note: These two points are chosen randomly from the dataset.
Iteration-1:
C1 = 16.33 [16, 16, 17]
C2 = 37.25 [20, 20, 21, 21, 22, 23, 29, 36, 41, 42, 43, 44, 45, 61, 62, 66]
Iteration-2:
C1 = 19.55 [16, 16, 17, 20, 20, 21, 21, 22, 23]
C2 = 46.90 [29, 36, 41, 42, 43, 44, 45, 61, 62, 66]
Iteration-3:
C1 = 20.50 [16, 16, 17, 20, 20, 21, 21, 22, 23, 29]
C2 = 48.89 [36, 41, 42, 43, 44, 45, 61, 62, 66]
Iteration-4:
C1 = 20.50 [16, 16, 17, 20, 20, 21, 21, 22, 23, 29]
C2 = 48.89 [36, 41, 42, 43, 44, 45, 61, 62, 66]
No change between Iterations 3 and 4, so we stop. Therefore, we get the two clusters
(16–29) and (36–66) using the K-Means algorithm.
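The same grouping can be reproduced with a short Python sketch. This is a minimal 1-D
illustration (not part of the original example) that assumes the same two initial centroids −

# Minimal 1-D K-Means sketch reproducing the age example above.
ages = [16, 16, 17, 20, 20, 21, 21, 22, 23, 29, 36, 41, 42, 43, 44, 45, 61, 62, 66]
centroids = [16.0, 22.0]          # the two randomly chosen initial centres

while True:
    # Assignment step: each object goes to the cluster with the nearest mean.
    clusters = [[], []]
    for age in ages:
        nearest = min(range(2), key=lambda i: abs(age - centroids[i]))
        clusters[nearest].append(age)
    # Update step: recalculate the mean of each cluster.
    new_centroids = [sum(c) / len(c) for c in clusters]
    if new_centroids == centroids:    # Step 4: stop when no change occurs
        break
    centroids = new_centroids

print(clusters)                          # [[16, ..., 29], [36, ..., 66]]
print([round(c, 2) for c in centroids])  # [20.5, 48.89]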

Hierarchical Clustering in Data Mining

A hierarchical clustering method works by grouping data into a tree of clusters.
Hierarchical clustering begins by treating every data point as a separate cluster. Then, it
repeatedly executes the following two steps:
1. Identify the two clusters that are closest together, and
2. Merge the two most comparable clusters. We continue these steps until all
the clusters are merged together.
In hierarchical clustering, the aim is to produce a hierarchical series of nested clusters. A
diagram called a dendrogram (a tree-like diagram that records the
sequences of merges or splits) graphically represents this hierarchy; it is an inverted tree
that describes the order in which points are merged (bottom-up view) or clusters are split
(top-down view).

The basic methods to generate hierarchical clustering are:

1. Agglomerative:
Initially consider every data point as an individual cluster and, at every step, merge the
nearest pairs of clusters (it is a bottom-up method). At first, every data point is
considered an individual entity or cluster. At every iteration, the clusters merge with
other clusters until one cluster is formed.
The algorithm for agglomerative hierarchical clustering is:
 Consider every data point as an individual cluster.
 Calculate the similarity of each cluster with all the other clusters (compute the
proximity matrix).
 Merge the clusters which are highly similar or close to each other.
 Recalculate the proximity matrix for each cluster.
 Repeat Steps 3 and 4 until only a single cluster remains.
Let’s see a graphical representation of this algorithm using a dendrogram.
Note:
This is just a demonstration of how the actual algorithm works; no calculation has been
performed below, and all the proximities among the clusters are assumed.
Let’s say we have six data points A, B, C, D, E, F.
Figure – Agglomerative Hierarchical clustering
 Step-1:
Consider each alphabet as a single cluster and calculate the distance of one cluster
from all the other clusters.
 Step-2:
In the second step, comparable clusters are merged together to form a single cluster.
Let’s say cluster (B) and cluster (C) are very similar to each other, so we merge
them in the second step; similarly with clusters (D) and (E). At last, we get the
clusters
[(A), (BC), (DE), (F)]
 Step-3:
We recalculate the proximity according to the algorithm and merge the two nearest
clusters([(DE), (F)]) together to form new clusters as [(A), (BC), (DEF)]
 Step-4:
Repeating the same process, the clusters DEF and BC are comparable and are merged
together to form a new cluster. We’re now left with clusters [(A), (BCDEF)].
 Step-5:
At last, the two remaining clusters are merged together to form a single cluster
[(ABCDEF)]. A short SciPy sketch of this bottom-up procedure follows.
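As referenced above, here is a compact sketch of the bottom-up procedure using SciPy's
hierarchical clustering tools (assuming scipy and matplotlib are installed; the six 2-D points
are invented purely for illustration) −

# Agglomerative clustering and dendrogram with SciPy.
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

points = np.array([[1.0, 1.0],    # A
                   [2.0, 1.0],    # B
                   [2.2, 1.1],    # C
                   [5.0, 5.0],    # D
                   [5.1, 5.2],    # E
                   [8.0, 8.0]])   # F

# 'single' linkage merges the two closest clusters at every step,
# mirroring the bottom-up procedure described above.
merges = linkage(points, method='single')
print(merges)    # each row: cluster i, cluster j, merge distance, new size

dendrogram(merges, labels=list("ABCDEF"))   # records the merge sequence
plt.show()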
2. Divisive:
We can say that divisive hierarchical clustering is precisely the opposite of
agglomerative hierarchical clustering. In divisive hierarchical clustering, we take into
account all of the data points as a single cluster and, in every iteration, we separate from
the cluster those data points which aren’t comparable. In the end, we are left with N
clusters.

Figure – Divisive Hierarchical clustering


Search Engine

A search engine is an online answering machine, which is used to search, understand, and
organize results in its database based on the search query (keywords) entered by
end-users (internet users). To display search results, all search engines first find the valuable
results in their database, sort them into an ordered list based on the search algorithm, and
display them to end-users. The page that presents this ordered list of results is commonly
known as a Search Engine Results Page (SERP).

Google, Yahoo!, Bing, YouTube, and DuckDuckGo are some popular examples of search
engines.
In our search engine tutorial, we are going to discuss the following topics -

o Advantages of Search Engine
o Disadvantages of Search Engine
o Components of Search Engine
o How do search engines work
o Search Engine Processing
o Search Engine (Google) algorithm updates
o Most Popular Search Engines in the world

Advantages of Search Engine


Searching for content on the Internet has become one of the most popular activities all over the
world. In the current era, the search engine is an essential part of everyone's life because the
search engine offers various popular ways to find valuable, relevant, and informative content
on the Internet.

A list of advantages of search engines is given below -

1. Time-Saving
Search engines help us save time in the following two ways -

o Eliminate the need to find information manually.
o Perform search operations at a very high speed.

2. Variety of information
The search engine offers a wide variety of resources to obtain relevant and valuable information
from the Internet. By using a search engine, we can get information in various fields such as
education, entertainment, games, etc. The information we get from a search engine comes in
the form of blogs, PDFs, presentations, text, images, videos, and audio.

3. Precision
All search engines have the ability to provide more precise results.

4. Free Access
Most search engines, such as Google, Bing, and Yahoo, allow end-users to search their content
for free. In search engines, there is no restriction on the number of searches, so all end-users
(students, job seekers, IT employees, and others) can spend a lot of time searching for valuable
content to fulfill their requirements.

5. Advanced Search
Search engines allow us to use advanced search options to get relevant, valuable, and informative
results. Advanced search results make our searches more flexible as well as sophisticated. For
example, when you want to search for a specific site, type "site:" without quotes followed by the
site's web address.

Suppose we want to search for java tutorial on javaTpoint then type "java
site:www.javatpoint.com" to get the advanced result quickly.

To search about education institution sites (colleges and universities) for B.Tech in computer
science engineering, then use "computer science engineering site:.edu." to get the advanced
result.

6. Relevance
Search engines allow us to search for relevant content based on a particular keyword. For
example, the site "javatpoint" scores higher for the term "java tutorial" because a
search engine sorts its result pages by the relevance of the content; that is why we see the
highest-scoring results at the top of the SERP.

Disadvantages of Search Engine


There are the following disadvantages of Search Engines -

o Sometimes the search engine takes too much time to display relevant, valuable, and
informative content.
o Search engines, especially Google, frequently update their algorithms, and it is very
difficult to know which algorithm Google is currently running.
o It can make end-users over-reliant, as they turn to search engines all the time, even for
small queries.

Components of Search Engine


There are the following four basic components of a search engine -

1. Web Crawler
Web Crawler is also known as a search engine bot, web robot, or web spider. It plays an
essential role in search engine optimization (SEO) strategy. It is mainly a software component
that traverses the web, then downloads and collects information from across the Internet; a toy
crawler sketch follows the feature list below.

There are the following web crawler features that can affect the search results –

o Included Pages
o Excluded Pages
o Document Types
o Frequency of Crawling
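As a rough illustration of what a crawler does (fetch a page, extract its links, and queue them
for later visits), here is a toy standard-library Python sketch. The seed URL is a placeholder,
and a real crawler additionally honours robots.txt, politeness delays, and much more −

# Toy web-crawler sketch using only the Python standard library.
from html.parser import HTMLParser
from urllib.request import urlopen
from urllib.parse import urljoin

class LinkParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.links = []
    def handle_starttag(self, tag, attrs):
        if tag == "a":                        # collect href targets of anchors
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(seed, limit=10):
    queue, seen = [seed], set()
    while queue and len(seen) < limit:
        url = queue.pop(0)
        if url in seen:
            continue
        seen.add(url)
        try:
            html = urlopen(url, timeout=5).read().decode("utf-8", "ignore")
        except Exception:
            continue                          # skip unreachable pages
        parser = LinkParser()
        parser.feed(html)
        queue.extend(urljoin(url, link) for link in parser.links)
    return seen

print(crawl("https://example.com/"))          # placeholder seed URL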

2. Database
The search engine database is a type of non-relational database. It is the place where all the
web information is stored. It holds a large number of web resources. Some of the most popular
search engine databases are Amazon Elasticsearch Service and Splunk.

There are the following two database variable features that can affect the search results:

o Size of the database
o The freshness of the database

3. Search Interfaces
The search interface is one of the most important components of a search engine. It is an
interface between the user and the database. It basically helps users to run search queries
against the database.

The following search interface features can affect the search results -

o Operators
o Phrase Searching
o Truncation

4. Ranking Algorithms
The ranking algorithm is used by a search engine to order web pages; Google, for example,
ranks pages according to its own search algorithm.

There are the following ranking features that affect the search results -

o Location and frequency
o Link Analysis
o Clickthrough measurement

How do search engines work


Every search engine performs the following tasks -

1. Crawling
Crawling is the first stage in which a search engine uses web crawlers to find, visit, and
download the web pages on the WWW (World Wide Web). Crawling is performed by software
robots, known as "spiders" or "crawlers." These robots are used to review the website content.

2. Indexing
Indexing builds an online library of websites; it is used to sort, store, and organize the content
found during crawling. Once a page is indexed, it can appear in the results for the queries to
which it is most valuable and most relevant.

3. Ranking and Retrieval


Ranking is the last stage of the search engine. It is used to provide the piece of content that
will be the best answer to the user's query. It displays the best content at the top of the search
results.


Search Engine Processing


There are the following two major search engine processing functions -
1. Indexing process
Indexing is the process of building a structure that enables searching.

Indexing process contains the following three blocks -

i. Text acquisition

It is used to identify and store documents for indexing.

ii. Text transformation

It is the process of transforming documents into index terms or features.

iii. Index creation

Index creation takes the output from text transformation and creates the indexes or data
structures that enable fast searching.

2. Query process
The query process produces the list of documents based on a user's search query.

There are the following three tasks of the Query process -

i. User interaction
User interaction provides an interface between the users who search the content and the search
engine.

ii. Ranking

The ranking is the core component of the search engine. It takes query data from the user
interaction and generates a ranked list of data based on the retrieval model.

iii. Evaluation

Evaluation is used to measure and monitor the search engine's effectiveness and efficiency.
The evaluation results help us to improve the ranking of the search engine.
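The two processing functions can be sketched together: the indexing process builds an
inverted index (term → documents), and the query process ranks documents by how many
query terms they contain. This is a toy Python illustration with made-up documents, not how
any real engine is implemented −

# Toy sketch of the two search engine processing functions.
from collections import defaultdict

docs = {  # hypothetical crawled documents
    1: "data mining and cluster analysis",
    2: "search engine indexing and ranking",
    3: "cluster analysis in data mining tasks",
}

# Indexing process: text acquisition + transformation + index creation.
index = defaultdict(set)
for doc_id, text in docs.items():
    for term in text.lower().split():         # crude text transformation
        index[term].add(doc_id)

# Query process: rank documents by the number of matching query terms.
def query(q):
    scores = defaultdict(int)
    for term in q.lower().split():
        for doc_id in index.get(term, set()):
            scores[doc_id] += 1
    return sorted(scores, key=scores.get, reverse=True)

print(query("cluster mining"))                # -> [1, 3]; both contain both terms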

Search Engine (Google) algorithm updates


Before learning about search engine algorithms, let us first understand −

What is an algorithm?
An algorithm is a set of instructions that can be used to solve a whole class of problems.
Google's algorithm likewise follows a set of rules to solve its problem. Google's algorithm is
very complex to understand and track because Google changes its algorithm very frequently,
which makes it very tough for users to identify which algorithm Google is currently running.

The search engine uses a combination of algorithms to deliver webpages based on the relevance
rank of a webpage on its search engine results pages (SERPs).

The following major updates have occurred in Google's algorithm -

1. Google Panda
The Google Panda update was a major change to Google's search results. It is a search filter
introduced on 23 February 2011. The name "Panda" derives from Google engineer
Navneet Panda, who made it possible for Google to create and implement the Panda
update. The aim of the Google Panda update is to reduce the occurrence of low-quality,
duplicate, and thin content in the search results and to keep unique as well as valuable
results at the top of the search engine page rankings.

2. Google Penguin
In April 2012, Google launched the "webspam algorithm update." This webspam algorithm was
later called the Penguin algorithm. Currently, Penguin is a part of the core Google search engine
algorithm. It is mainly designed to target link spam and manipulative link-building practices,
which Google analyzes as webpages are crawled and indexed.
3. Google Hummingbird
Google Hummingbird was introduced on August 20, 2013. Hummingbird pays more attention
to each word in a search query to bring better results. It is able to understand users' intent and
find the content that matches it best. The advantage of the Hummingbird update is that it
provides fast, accurate, and semantic results.

4. Google Payday
Google Payday was introduced on June 11, 2013. It mainly impacted about 0.3 percent of
queries in the U.S. The Google Payday update is used to identify and penalize lower-quality
websites that use various heavy spam techniques (spammy queries) to increase their rank and
traffic. The advantage of Payday is that it improves the ranking (quality) of search queries.

5. Google Pigeon
Google Pigeon is one of the biggest updates in Google's algorithm. Pigeon update launched on
July 24, 2014. This update is designed to provide better local search results by rewarding local
searches that have a strong organic presence with better visibility. It also improves the ranking
of search results based on distance and location.

6. Google RankBrain
Google's RankBrain is a machine learning and artificial intelligence system. It was revealed in
2015 via a Bloomberg news story. It is Google's third most important ranking signal. It has the
ability to sort content based on accuracy and determine the most relevant results based on the
search query entered by the end-users.

7. Google EMD (Exact Match Domain)


Google EMD was launched on September 27, 2012, to improve the quality of content. Exact-
match domains had worked well in the industry for a long time. As the name suggests, the
update looks for an exact keyword match in a website's domain and demotes low-quality
content to a lower rank in Google's search results. According to Google, the EMD update
affected 0.6% of English searches.
8. Google Page Layout Algorithm
Google page layout algorithm was introduced on January 19, 2012. It helps us to find high-
quality results that are easily accessible and visible at the top of the search engine. It mainly
affected about 1% of worldwide search requests. This update mainly focuses on the user experience on a
website.

Most Popular Search Engines in the world


1. Google
Google is one of the most popular and trusted search engines in the world. It was created
by Sergey Brin and Larry Page as their research project in 1996. Many web browsers, like
Chrome, Safari, Edge, and Firefox, come with Google as the default search engine, set as the
home page or starting page of the browser.

Google uses Machine Learning (ML), Artificial Intelligence (AI), and other algorithms to
identify users' behavior and the kind of results they are interested in. Google regularly
improves its search engine algorithm to produce the best results for end-users.

There are the following features of Google search engine -

o HTML Improvements
HTML Improvements help to improve the display of the Search Engine Results Page (SERP). It
also helps us to identify issues related to Search Engine Optimization (SEO), such as Missing
Metadata, Duplicated content, and more.
o Search Analytics
Search Analytics is one of the most popular features of Google search engine. It filters data in
multiple ways like pages, queries, and more as well as tells how to get organic traffic from
Google.
o Crawl Errors
Crawl Errors help us to solve the problem related to the crawl section. In crawling pages, all
errors related to Googlebot are shown.
o Instantly matching our search
Google's search engine algorithms help to sort billions of webpages according to end-users'
requirements and present the most relevant, valuable, and useful results in front of end-
users.
o Calculation
Google allows us to use its platform to perform calculations rather than using the computer's
calculator. To perform a calculation in Google, simply type "2345+234" in Google's
search box and press "Enter." Google then displays the result at the top of the search results.

Note: More than 70% of worldwide internet users use Google to do their searches.



2. Bing
Bing search engine was introduced by Microsoft in 2009. It is the second most visited search
engine in the world. It allows us to filter results in different tabs such as images, ads, videos, and
more. The disadvantage of using Bing is that it produces slower results compared to Google.

There are the following features offered by Bing -

o Image Search
Bing offers a more advanced way to organize image files than Google. It provides more
information to users and various advanced options to filter the photos.
o Video Search
Bing is one of the best search engine platforms for video content. It mostly brings videos from
YouTube to display. Bing allows us to hover the cursor over any video to display a short clip of
the selected video.
o Computations
Most of the users like the Bing search engine platform for computation tasks. We can simply type
our math query in the search box to get instant results. Using Bing, we find top-rated websites to
solve equations and mathematics tasks.

o Homescreen
Bing attracts users by frequently changing its background images, making the home screen
more attractive for end-users. It also shows smaller pictures at the bottom of the screen with
trending headlines.


3. DuckDuckGo
DuckDuckGo is an internet-based search engine which was founded in 2008. It does not track,
collect, or store our personal information. It is the best platform for those who want to keep
their browsing information secure and private. According to research, DuckDuckGo is the
3rd most popular search engine in Australia, and approximately 35 million users use it to
search their queries.

Note: In January 2018, DuckDuckGo private browser was launched on iOS and Android.

There are the following unique features of DuckDuckGo -

o Quick Stopwatch
DuckDuckGo offers both a quick timer and a stopwatch to measure time. To start a quick
stopwatch, we simply type "stopwatch" in the search box.
o Check number of characters
Checking the number of characters is one of the most interesting features of DuckDuckGo.
Using this, we can quickly check the number of characters we are inserting in the search query.
To check the characters in a search query, simply type "chars" before and after the query.
o Check whether websites are down or not
In DuckDuckGo, we use the keyword "Is website's-name.com down for me" (for example, "Is
javatpoint.com down for me") to check whether a particular website is down or up for us.
o Calendar
DuckDuckGo helps us to find instant answers related to calendars. If you want to see the calendar
of your Date of Birth, then use the keyword "calendar month year" (calendar December 1997)
in the search box.
o Features for Developers
DuckDuckGo also provides features for developers. Some cool developer features of
DuckDuckGo are given below -
o It generates lorem ipsum text and encodes URLs to machine-readable text.
o It helps the developer to convert input stream binary to decimal.
o It shows a list of special characters and their HTML values.



4. YouTube
YouTube was founded in 2005. It is the best video-content search engine. YouTube
offers the easiest way to learn new technologies. Currently, more than 1.5 billion users per
month use YouTube to find and share information.

Based on the Alexa traffic ranking, YouTube is the 2nd largest search engine and the 3rd most
visited website in the world.

There are the following features of YouTube, which are used to improve User Experience -

o Better Live Streams
YouTube has offered a better live-streaming feature since 2013. As we know, a good video
helps to grow a business, so when it comes to live streaming, you must be prepared with
reliable lighting, audio, and recording equipment, and a decent, well-prepared script.
o YouTube Cards
YouTube includes various links such as Link Cards, Fan Funding Cards, Donation Cards, and
Playlist Cards. These all cards help us to drive traffic on our website, promote our content, and
build social lists.

o Captioning
Captioning helps search engines to find the videos. It adds more text metadata to the videos to
make them rank higher on query searches.

5. Baidu
Baidu is a search engine introduced in 2000. It is the dominant search engine in China. It
is a free web browser that is used for both Windows as well as Android. It cooperates with
companies like Microsoft, Intel, and Qualcomm. It offers services such as cloud services, social
networking, maps, videos, image search, and many more. The main disadvantage of using
Baidu is that it can transmit the user's personal data to the Baidu server without using any
encryption algorithm.

A list of features of Baidu is given below -

o Ads
Ads are one of the most important features of Baidu. Ads help us to generate more leads,
optimize our site in a global way, and use various optimization techniques to bring our content up
to date for end users.
o Automatic Transcoding
Baidu transcodes non-friendly mobile websites without any approval by the site owner.
o Click Behaviour - Opens in a new tab
In Baidu, search results always open in a new tab when we click on a selected website.
o Structured Data Implementation
To accumulate structured data, Baidu uses its own properties, such as Baidu Open and Baidu
webmaster tools.
o Mobile Search
Baidu uses m.baidu.com as its mobile search engine.



6. Yandex
Yandex was started by two Russian developers, and its search engine launched in 1997. It is a
free browser that is available for Windows, macOS, Android, and iOS. It mainly includes
online advertising, app analytics, data management, smart home technology, artificial
intelligence, and a voice assistant called Alice.

The advantage of the Yandex search engine is that it is safe because its browser uses built-in
protection: when we visit a harmful site page, a pop-up appears and the
browser automatically blocks it.

There are the following features of Yandex -


o Map APIs
Yandex offers various Map APIs which allow programs, apps, and other applications to provide
accurate location data for a wide variety of uses. Many developers use these Map APIs to
build apps and services.
o Database Management
Yandex provides data management systems for businesses to collect a large amount of valuable
information.
o App Metrica
Yandex provides various analytics software for tracking app metrics. The analytics tool offers the
ability to properly track and understand the end user's data metrics.
o Security & Privacy
Yandex is the best search engine in terms of security and privacy. It uses various privacy
mechanisms to ensure that the user's data is protected, which helps us to improve the experience
of end-users.



7. Yahoo!
Yahoo! is one of the most widely used Internet search engines and one of the largest web
portals. It provides a structure to view thousands of websites and millions of web pages. Yahoo
email services helped it to capture a huge market. It includes various services such as Yahoo
Answers, Yahoo Groups, Yahoo Search, and Yahoo Messenger, etc.

As the biggest web portal, it offers various following services -

o Yahoo! Finance
o Yahoo! Shopping
o Yahoo! Games
o Yahoo! Travel
o Yahoo! Maps
o Yahoo! Messengers
o Yahoo! Mail
There are the following features of Yahoo!

o Storage Capacity
Yahoo provides a huge amount of storage capacity (25 GB) to store data online on Yahoo. This
stored data can be used anywhere and anytime.
o Flickr
Flickr is the best media platform for uploading, managing, organizing, and sharing photos as well
as videos.
o Latest News
Yahoo allows us to share the latest information with our customers from time to time. This
information is available in the form of photos, videos, audios, and more.
o Privacy & Security
Yahoo! Search engine always takes care of the privacy of its users and provides a secure platform
with all its privacy constraints.

o User friendly
Yahoo! offers extra user-friendly features to its customers. It provides a clear separation
between indexes, sent content, received content, etc. To save memory, it keeps unwanted
emails only for 90 days; after that, it automatically deletes them from the user's account.

Chapter-4

4.1 Cluster Analysis:


The process of grouping a set of physical or abstract objects into classes of similar objects
is called clustering.
A cluster is a collection of data objects that are similar to one another within the same
cluster and are dissimilar to the objects in other clusters.
A cluster of data objects can be treated collectively as one group and so may be considered
as a form of data compression.
Cluster analysis tools based on k-means, k-medoids, and several other methods have also been
built into many statistical analysis software packages or systems, such as S-Plus, SPSS, and
SAS.

4.1.1 Applications:
Cluster analysis has been widely used in numerous applications, including market research,
pattern recognition, data analysis, and image processing.
In business, clustering can help marketers discover distinct groups in their customer bases
and characterize customer groups based on purchasing patterns.
In biology, it can be used to derive plant and animal taxonomies, categorize genes with
similar functionality, and gain insight into structures inherent in populations.
Clustering may also help in the identification of areas of similar land use in an earth
observation database and in the identification of groups of houses in a city according to
house type, value, and geographic location, as well as the identification of groups of
automobile insurance policy holders with a high average claim cost.
Clustering is also called data segmentation in some applications because clustering
partitions large data sets into groups according to their similarity.


Clustering can also be used for outlier detection. Applications of outlier detection include
the detection of credit card fraud and the monitoring of criminal activities in electronic
commerce.

4.1.2 Typical Requirements of Clustering in Data Mining:


 Scalability:
Many clustering algorithms work well on small data sets containing fewer than several
hundred data objects; however, a large database may contain millions of objects. Clustering
on a sample of a given large data set may lead to biased results.
Highly scalable clustering algorithms are needed.
 Ability to deal with different types of attributes:
Many algorithms are designed to cluster interval-based (numerical) data. However,
applications may require clustering other types of data, such as binary, categorical
(nominal), and ordinal data, or mixtures of these data types.
 Discovery of clusters with arbitrary shape:
Many clustering algorithms determine clusters based on Euclidean or Manhattan distance
measures. Algorithms based on such distance measures tend to find spherical clusters with
similar size and density.
However, a cluster could be of any shape. It is important to develop algorithms that can
detect clusters of arbitrary shape.
 Minimal requirements for domain knowledge to determine input parameters:
Many clustering algorithms require users to input certain parameters in cluster analysis
(such as the number of desired clusters). The clustering results can be quite sensitive to
input parameters. Parameters are often difficult to determine, especially for data sets
containing high-dimensional objects. This not only burdens users, but it also makes the
quality of clustering difficult to control.
 Ability to deal with noisy data:
Most real-world databases contain outliers or missing, unknown, or erroneous data.
Some clustering algorithms are sensitive to such data and may lead to clusters of poor
quality.

 Incremental clustering and insensitivity to the order of input records:


Some clustering algorithms cannot incorporate newly inserted data (i.e., database updates)
into existing clustering structures and, instead, must determine a new clustering from
scratch. Some clustering algorithms are sensitive to the order of input data.
That is, given a set of data objects, such an algorithm may return dramatically different
clusterings depending on the order of presentation of the input objects.
It is important to develop incremental clustering algorithms and algorithms that are
insensitive to the order of input.
 High dimensionality:
A database or a data warehouse can contain several dimensions or attributes. Many
clustering algorithms are good at handling low-dimensional data, involving only two to
three dimensions. Human eyes are good at judging the quality of clustering for up to three
dimensions. Finding clusters of data objects in high-dimensional space is challenging,
especially considering that such data can be sparse and highly skewed.
 Constraint-based clustering:
Real-world applications may need to perform clustering under various kinds of constraints.
Suppose that your job is to choose the locations for a given number of new automatic
banking machines (ATMs) in a city. To decide upon this, you may cluster households
while considering constraints such as the city’s rivers and highway networks, and the type
and number of customers per cluster. A challenging task is to find groups of data with good
clustering behavior that satisfy specified constraints.
 Interpretability and usability:
Users expect clustering results to be interpretable, comprehensible, and usable. That is,
clustering may need to be tied to specific semantic interpretations and applications. It is
important to study how an application goal may influence the selection of clustering
features and methods.

4.2 Major Clustering Methods:


 Partitioning Methods
 Hierarchical Methods


 Density-Based Methods
 Grid-Based Methods
 Model-Based Methods

4.2.1 Partitioning Methods:


A partitioning method constructs k partitions of the data, where each partition represents a
cluster and k ≤ n. That is, it classifies the data into k groups, which together satisfy the
following requirements:
Each group must contain at least one object, and
Each object must belong to exactly one group.

A partitioning method creates an initial partitioning. It then uses an iterative relocation
technique that attempts to improve the partitioning by moving objects from one group to
another.

The general criterion of a good partitioning is that objects in the same cluster are close or
related to each other, whereas objects of different clusters are far apart or very different.

4.2.2 Hierarchical Methods:


A hierarchical method creates a hierarchical decomposition of the given set of data objects. A
hierarchical method can be classified as being either agglomerative or divisive, based on
how the hierarchical decomposition is formed.

 The agglomerative approach, also called the bottom-up approach, starts with each
object forming a separate group. It successively merges the objects or groups that are
close to one another, until all of the groups are merged into one or until a termination
condition holds.
 The divisive approach, also called the top-down approach, starts with all of the objects in
the same cluster. In each successive iteration, a cluster is split up into smaller clusters,
until eventually each object is in one cluster, or until a termination condition holds.


Hierarchical methods suffer from the fact that once a step (merge or split) is done, it can never
be undone. This rigidity is useful in that it leads to smaller computation costs by not having
to worry about a combinatorial number of different choices.

There are two approaches to improving the quality of hierarchical clustering:

 Perform careful analysis of object "linkages" at each hierarchical partitioning, such as in
Chameleon, or
 Integrate hierarchical agglomeration and other approaches by first using a
hierarchical agglomerative algorithm to group objects into microclusters, and then
performing macroclustering on the microclusters using another clustering method such as
iterative relocation.
4.2.3 Density-Based Methods:
 Most partitioning methods cluster objects based on the distance between objects. Such
methods can find only spherical-shaped clusters and encounter difficulty at discovering
clusters of arbitrary shapes.
 Other clustering methods have been developed based on the notion of density. Their
general idea is to continue growing the given cluster as long as the density in the
neighborhood exceeds some threshold; that is, for each data point within a given
cluster, the neighborhood of a given radius has to contain at least a minimum number of
points. Such a method can be used to filter out noise (outliers) and discover clusters of
arbitrary shape.
 DBSCAN and its extension, OPTICS, are typical density-based methods that
grow clusters according to a density-based connectivity analysis (a small DBSCAN
sketch follows this list). DENCLUE is a method that clusters objects based on the
analysis of the value distributions of density functions.
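As referenced above, here is a quick way to see density-based clustering in action using
scikit-learn's DBSCAN (assumed installed; the 2-D points are made up). The eps parameter
plays the role of the neighborhood radius and min_samples the minimum number of points −

# DBSCAN sketch with scikit-learn.
import numpy as np
from sklearn.cluster import DBSCAN

X = np.array([[1.0, 1.0], [1.1, 1.0], [0.9, 1.1],    # dense blob 1
              [5.0, 5.0], [5.1, 4.9], [4.9, 5.1],    # dense blob 2
              [9.0, 0.0]])                           # isolated point

# eps = neighborhood radius; min_samples = minimum points in that radius.
labels = DBSCAN(eps=0.5, min_samples=3).fit_predict(X)
print(labels)    # e.g. [0 0 0 1 1 1 -1]; -1 marks noise (the isolated outlier)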
4.2.4 Grid-Based Methods:
 Grid-based methods quantize the object space into a finite number of cells that form a
grid structure.

 All of the clustering operations are performed on the grid structure, i.e., on the quantized
space. The main advantage of this approach is its fast processing time, which is
typically independent of the number of data objects and dependent only on the number
of cells in each dimension in the quantized space.
 STING is a typical example of a grid-based method. WaveCluster applies wavelet
transformation for clustering analysis and is both grid-based and density-based.

4.2.5 Model-Based Methods:


 Model-based methods hypothesize a model for each of the clusters and find the best fit
of the data to the given model.
 A model-based algorithm may locate clusters by constructing a density function that
reflects the spatial distribution of the data points.
 It also leads to a way of automatically determining the number of clusters based on
standard statistics, taking "noise" or outliers into account and thus yielding robust
clustering methods.

4.3 Tasks in Data Mining:


 Clustering High-Dimensional Data
 Constraint-Based Clustering
4.3.1 Clustering High-Dimensional Data:
It is a particularly important task in cluster analysis because many applications
require the analysis of objects containing a large number of features or dimensions.
For example, text documents may contain thousands of terms or keywords as
features, and DNA microarray data may provide information on the expression
levels of thousands of genes under hundreds of conditions.
Clustering high-dimensional data is challenging due to the curse of dimensionality.
Many dimensions may not be relevant. As the number of dimensions increases,
the data become increasingly sparse, so that the distance measurement between pairs
of points becomes meaningless and the average density of points anywhere in the
data is likely to be low. Therefore, a different clustering methodology needs to be
developed for high-dimensional data.

CLIQUE and PROCLUS are two influential subspace clustering methods, which
search for clusters in subspaces of the data, rather than over the entire data space.
Frequent pattern–based clustering, another clustering methodology, extracts distinct
frequent patterns among subsets of dimensions that occur frequently. It uses such
patterns to group objects and generate meaningful clusters.

4.3.2 Constraint-Based Clustering:


It is a clustering approach that performs clustering by incorporating user-specified
or application-oriented constraints.
A constraint expresses a user’s expectation or describes properties of the desired
clustering results, and provides an effective means for communicating with the
clustering process.
Various kinds of constraints can be specified, either by a user or as per application
requirements.
Spatial clustering deals with the existence of obstacles and with clustering under user-
specified constraints. In addition, semi-supervised clustering employs pairwise
constraints in order to improve the quality of the resulting clustering.

4.4 Classical Partitioning Methods:


The most well-known and commonly used partitioning methods are
 The k-Means Method
 The k-Medoids Method
4.4.1 Centroid-Based Technique: The K-Means Method:
The k-means algorithm takes the input parameter, k, and partitions a set of n objects into k
clusters so that the resulting intracluster similarity is high but the intercluster similarity is
low.
Cluster similarity is measured in regard to the mean value of the objects in a cluster, which
can be viewed as the cluster’s centroid or center of gravity.
The k-means algorithm proceeds as follows.


First, it randomly selects k of the objects, each of which initially represents a cluster
mean or center.
For each of the remaining objects, an object is assigned to the cluster to which it is the
most similar, based on the distance between the object and the cluster mean.
It then computes the new mean for each cluster.
This process iterates until the criterion function converges.

Typically, the square-error criterion is used, defined as

E = Σ_{i=1..k} Σ_{p ∈ Ci} |p − mi|²

where E is the sum of the square error for all objects in the data set,
p is the point in space representing a given object, and
mi is the mean of cluster Ci.
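As a small check of the criterion, the sketch below computes E for the two age clusters found
in the earlier K-Means example (an illustration added here, not part of the original notes) −

# Square-error criterion E for a clustering (1-D illustration).
clusters = [[16, 16, 17, 20, 20, 21, 21, 22, 23, 29],
            [36, 41, 42, 43, 44, 45, 61, 62, 66]]

E = 0.0
for c in clusters:
    m = sum(c) / len(c)                  # cluster mean m_i
    E += sum((p - m) ** 2 for p in c)    # squared distances to the mean
print(round(E, 2))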

4.4.1 The k-means partitioning algorithm:


The k-means algorithm for partitioning, where each cluster’s center is represented by the mean
value of the objects in the cluster.


Figure – Clustering of a set of objects based on the k-means method

4.4.2 The k-Medoids Method:

The k-means algorithm is sensitive to outliers because an object with an extremely large
value may substantially distort the distribution of data. This effect is particularly
exacerbated due to the use of the square-error function.
Instead of taking the mean value of the objects in a cluster as a reference point, we can pick
actual objects to represent the clusters, using one representative object per cluster. Each
remaining object is clustered with the representative object to which it is the most similar.
The partitioning method is then performed based on the principle of minimizing the sum of
the dissimilarities between each object and its corresponding reference point. That is, an
absolute-error criterion is used, defined as

E = Σ_{j=1..k} Σ_{p ∈ Cj} |p − oj|

where E is the sum of the absolute error for all objects in the data set,
p is the point in space representing a given object in cluster Cj, and
oj is the representative object of Cj.


The initial representative objects are chosen arbitrarily. The iterative process of replacing
representative objects by non-representative objects continues as long as the quality of the
resulting clustering is improved.
This quality is estimated using a cost function that measures the average
dissimilarity between an object and the representative object of its cluster.
To determine whether a non-representative object, o_random, is a good replacement for a
current representative object, o_j, the following four cases are examined for each of the
non-representative objects.

Case 1:

p currently belongs to representative object o_j. If o_j is replaced by o_random as a
representative object and p is closest to one of the other representative objects, o_i, i ≠ j,
then p is reassigned to o_i.

Case 2:

p currently belongs to representative object o_j. If o_j is replaced by o_random as a
representative object and p is closest to o_random, then p is reassigned to o_random.

Case 3:

p currently belongs to representative object o_i, i ≠ j. If o_j is replaced by o_random as a
representative object and p is still closest to o_i, then the assignment does not change.

Case 4:

p currently belongs to representative object o_i, i ≠ j. If o_j is replaced by o_random as a
representative object and p is closest to o_random, then p is reassigned to o_random.


Figure – Four cases of the cost function for k-medoids clustering

4.4.2 The k-Medoids Algorithm:

The k-medoids algorithm for partitioning based on medoid or central objects.


The k-medoids method is more robust than k-means in the presence of noise and outliers,
because a medoid is less influenced by outliers or other extreme values than a mean. However,
its processing is more costly than the k-means method.

4.5 Hierarchical Clustering Methods:

A hierarchical clustering method works by grouping data objects into a tree of clusters.
The quality of a pure hierarchical clustering method suffers from its inability to
perform adjustment once a merge or split decision has been executed. That is, if a particular
merge or split decision later turns out to have been a poor choice, the method cannot
backtrack and correct it.

Hierarchical clustering methods can be further classified as either agglomerative or divisive,
depending on whether the hierarchical decomposition is formed in a bottom-up or top-down
fashion.

4.5.1 Agglomerative hierarchical clustering:


This bottom-up strategy starts by placing each object in its own cluster and then merges
these atomic clusters into larger and larger clusters, until all of the objects are in a single
cluster or until certain termination conditions are satisfied.
Most hierarchical clustering methods belong to this category. They differ only in their
definition of intercluster similarity.

4.5.2 Divisive hierarchical clustering:


This top-down strategy does the reverse of agglomerative hierarchical clustering by
starting with all objects in one cluster.
It subdivides the cluster into smaller and smaller pieces, until each object forms a cluster
on its own or until it satisfies certain termination conditions, such as a desired number
of clusters being obtained or the diameter of each cluster being within a certain threshold.


4.6 Constraint-Based Cluster Analysis:


Constraint-based clustering finds clusters that satisfy user-specified preferences or constraints.
Depending on the nature of the constraints, constraint-based clustering may adopt rather
different approaches.
There are a few categories of constraints.
 Constraints on individual objects:

We can specify constraints on the objects to beclustered. In a real estate application, for
example, one may like to spatially cluster only those luxury mansions worth over a million
dollars. This constraint confines the setof objects to be clustered. It can easily be handled
by preprocessing after which the problem reduces to an instance ofunconstrained
clustering.

 Constraints on the selection of clustering parameters:

A user may like to set a desired range for each clustering parameter. Clustering parameters are usually quite specific to the given clustering algorithm. Examples of parameters include k, the desired number of clusters in a k-means algorithm, or ε, the radius, and the minimum number of points in the DBSCAN algorithm. Although such user-specified parameters may strongly influence the clustering results, they are usually confined to the algorithm itself. Thus, their fine tuning and processing are usually not considered a form of constraint-based clustering.
 Constraints on distance or similarity functions:

We can specify different distance or similarity functions for specific attributes of the objects to be clustered, or different distance measures for specific pairs of objects. When clustering sportsmen, for example, we may use different weighting schemes for height, body weight, age, and skill level. Although this will likely change the mining results, it may not alter the clustering process per se. However, in some cases, such changes may make the evaluation of the distance function nontrivial, especially when it is tightly intertwined with the clustering process.
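For instance, such a per-attribute weighting scheme can be expressed as a weighted Euclidean distance; the weights below are purely hypothetical:

```python
import numpy as np

def weighted_euclidean(x, y, w):
    """Distance with per-attribute weights (e.g., height, body weight, age, skill)."""
    return np.sqrt(np.sum(w * (x - y) ** 2))

# Hypothetical weights emphasizing skill level over age:
w = np.array([1.0, 1.0, 0.5, 2.0])
d = weighted_euclidean(np.array([180, 75, 24, 8]), np.array([175, 80, 30, 5]), w)
```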
 User-specified constraints on the properties of individual clusters:

A user may like to specify desired characteristics of the resulting clusters, which may strongly influence the clustering process.
 Semi-supervised clustering based on partial supervision:
The quality of unsupervised clustering can be significantly improved using some weak form of supervision. This may be in the form of pairwise constraints (i.e., pairs of objects labeled as belonging to the same or different cluster). Such a constrained clustering process is called semi-supervised clustering.
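Pairwise supervision is usually encoded as must-link and cannot-link pairs. A small sketch (our own helper, shown only to make the constraint notion concrete) checks whether a candidate labeling respects them:

```python
def satisfies_constraints(labels, must_link, cannot_link):
    """labels maps each object to a cluster id; the constraint lists hold object pairs."""
    return (all(labels[a] == labels[b] for a, b in must_link) and
            all(labels[a] != labels[b] for a, b in cannot_link))

# Objects 0 and 1 are labeled "same cluster"; 0 and 2 "different clusters".
print(satisfies_constraints({0: 'A', 1: 'A', 2: 'B'},
                            must_link=[(0, 1)], cannot_link=[(0, 2)]))  # True
```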

4.7 Outlier Analysis:

There exist data objects that do not comply with the general behavior or model of the data.
Such data objects, which are grossly different from or inconsistent with the remaining set
of data, are called outliers.
Many data mining algorithms try to minimize the influence of outliers or eliminate them altogether. This, however, could result in the loss of important hidden information because one person's noise could be another person's signal. In other words, the outliers may be of particular interest, such as in the case of fraud detection, where outliers may indicate fraudulent activity. Thus, outlier detection and analysis is an interesting data mining task, referred to as outlier mining.
It can be used in fraud detection, for example, by detecting unusual usage of credit cards or telecommunication services. In addition, it is useful in customized marketing for identifying the spending behavior of customers with extremely low or extremely high incomes, or in medical analysis for finding unusual responses to various medical treatments.

Outlier mining can be described as follows: Given a set of n data points or objects and k, the expected number of outliers, find the top k objects that are considerably dissimilar, exceptional, or inconsistent with respect to the remaining data. The outlier mining problem can be viewed as two subproblems:
 Define what data can be considered as inconsistent in a given data set, and
 Find an efficient method to mine the outliers so defined.
Types of outlier detection:

 Statistical Distribution-Based Outlier Detection
 Distance-Based Outlier Detection
 Density-Based Local Outlier Detection
 Deviation-Based Outlier Detection

4.7.1 Statistical Distribution-Based Outlier Detection:

The statistical distribution-based approach to outlier detection assumes a distribution or probability model for the given data set (e.g., a normal or Poisson distribution) and then identifies outliers with respect to the model using a discordancy test. Application of the test requires knowledge of the data set parameters (such as the assumed data distribution), knowledge of distribution parameters (such as the mean and variance), and the expected number of outliers.
A statistical discordancy test examines two hypotheses:
 A working hypothesis
 An alternative hypothesis
A working hypothesis, H, is a statement that the entire data set of n objects comes from an initial distribution model, F, that is,

H : oi ∈ F, where i = 1, 2, …, n
The hypothesis is retained if there is no statistically significant evidence supporting its rejection. A discordancy test verifies whether an object, oi, is significantly large (or small) in relation to the distribution F. Different test statistics have been proposed for use as a discordancy test, depending on the available knowledge of the data. Assuming that some statistic, T, has been chosen for discordancy testing, and the value of the statistic for object oi is vi, then the distribution of T is constructed. The significance probability, SP(vi) = Prob(T > vi), is evaluated. If SP(vi) is sufficiently small, then oi is discordant and the working hypothesis is rejected.
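One simple instance of such a test, assuming a normal working hypothesis F and using the absolute z-score as the statistic T (a sketch under those assumptions, not the only possible discordancy test):

```python
import numpy as np
from scipy import stats

def discordant_objects(data, alpha=0.001):
    """Indices whose significance probability SP(v) = Prob(T > v) falls below alpha."""
    mean, std = np.mean(data), np.std(data)
    t = np.abs((data - mean) / std)        # value of the test statistic per object
    sp = 2 * (1 - stats.norm.cdf(t))       # two-sided significance probability
    return np.where(sp < alpha)[0]         # reject the working hypothesis for these
```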
An alternative hypothesis, H, which states that oi comes from another distribution model, G, is then adopted. The result is very much dependent on which model F is chosen, because oi may be an outlier under one model and a perfectly valid value under another. The alternative distribution is very important in determining the power of the test, that is, the probability that the working hypothesis is rejected when oi is really an outlier.
There are different kinds of alternative distributions.
Inherent alternative distribution:
In this case, the working hypothesis that all of the objects come from distribution F is rejected in favor of the alternative hypothesis that all of the objects arise from another distribution, G:

H : oi ∈ G, where i = 1, 2, …, n

F and G may be different distributions or differ only in parameters of the same distribution.
There are constraints on the form of the G distribution in that it must have potential to produce outliers. For example, it may have a different mean or dispersion, or a longer tail.
Mixture alternative distribution:
The mixture alternative states that discordant values are not outliers in the F population, but contaminants from some other population, G. In this case, the alternative hypothesis is

H : oi ∈ (1 − λ)F + λG, where i = 1, 2, …, n
Slippage alternative distribution:

This alternative states that all of the objects (apart from some prescribed small number) arise independently from the initial model, F, with its given parameters, whereas the remaining objects are independent observations from a modified version of F in which the parameters have been shifted.
There are two basic types of procedures for detecting outliers:
Block procedures:
In this case, either all of the suspect objects are treated as outliers or all of them are accepted as consistent.
Consecutive procedures:
An example of such a procedure is the inside-out procedure. Its main idea is that the object that is least likely to be an outlier is tested first. If it is found to be an outlier, then all of the more extreme values are also considered outliers; otherwise, the next most extreme object is tested, and so on. This procedure tends to be more effective than block procedures.

4.7.2 Distance-Based Outlier Detection:

The notion of distance-based outliers was introduced to counter the main limitations imposed by statistical methods. An object, o, in a data set, D, is a distance-based (DB) outlier with parameters pct and dmin, that is, a DB(pct, dmin)-outlier, if at least a fraction, pct, of the objects in D lie at a distance greater than dmin from o. In other words, rather than relying on statistical tests, we can think of distance-based outliers as those objects that do not have enough neighbors, where neighbors are defined based on distance from the given object. In comparison with statistical-based methods, distance-based outlier detection generalizes the ideas behind discordancy testing for various standard distributions. Distance-based outlier detection avoids the excessive computation that can be associated with fitting the observed distribution into some standard distribution and in selecting discordancy tests.
For many discordancy tests, it can be shown that if an object, o, is an outlier according to the given test, then o is also a DB(pct, dmin)-outlier for some suitably defined pct and dmin. For example, if objects that lie three or more standard deviations from the mean are considered to be outliers, assuming a normal distribution, then this definition can be generalized by a DB(0.9988, 0.13σ)-outlier.
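The definition translates directly into a brute-force O(n²) check; a minimal sketch, assuming Euclidean distance:

```python
import numpy as np

def db_outliers(X, pct, dmin):
    """Indices of DB(pct, dmin)-outliers: at least a fraction pct of the other
    objects lie at a distance greater than dmin."""
    n = len(X)
    outliers = []
    for i in range(n):
        dist = np.linalg.norm(X - X[i], axis=1)
        frac_far = (dist > dmin).sum() / (n - 1)   # exclude the object itself
        if frac_far >= pct:
            outliers.append(i)
    return outliers
```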
Several efficient algorithms for mining distance-based outliers have been developed.
Index-based algorithm:
Given a data set, the index-based algorithm uses multidimensional indexing structures, such as R-trees or k-d trees, to search for neighbors of each object o within radius dmin around that object. Let M be the maximum number of objects within the dmin-neighborhood of an outlier. Therefore, once M + 1 neighbors of object o are found, it is clear that o is not an outlier. This algorithm has a worst-case complexity of O(n²k), where n is the number of objects in the data set and k is the dimensionality. The index-based algorithm scales well as k increases. However, this complexity evaluation takes only the search time into account, even though the task of building an index in itself can be computationally intensive.
Nested-loop algorithm:

The nested-loop algorithm has the same computational complexity as the index-based algorithm but avoids index structure construction and tries to minimize the number of I/Os. It divides the memory buffer space into two halves and the data set into several logical blocks. By carefully choosing the order in which blocks are loaded into each half, I/O efficiency can be achieved.
Cell-based algorithm:
To avoid O(n²) computational complexity, a cell-based algorithm was developed for memory-resident data sets. Its complexity is O(c^k + n), where c is a constant depending on the number of cells and k is the dimensionality.

In this method, the data space is partitioned into cells with a side length equal to dmin/(2√k). Each cell has two layers surrounding it. The first layer is one cell thick, while the second is ⌈2√k − 1⌉ cells thick, rounded up to the closest integer. The algorithm counts outliers on a cell-by-cell rather than an object-by-object basis. For a given cell, it accumulates three counts: the number of objects in the cell, in the cell and the first layer together, and in the cell and both layers together. Let's refer to these counts as cell count, cell + 1 layer count, and cell + 2 layers count, respectively.

Let M be the maximum number of objects that can exist in the dmin-neighborhood of an outlier.
An object, o, in the current cell is considered an outlier only if cell + 1 layer count is less than or equal to M. If this condition does not hold, then all of the objects in the cell can be removed from further investigation as they cannot be outliers.
If cell + 2 layers count is less than or equal to M, then all of the objects in the cell are considered outliers. Otherwise, if this number is more than M, then it is possible that some of the objects in the cell may be outliers. To detect these outliers, object-by-object processing is used where, for each object, o, in the cell, objects in the second layer of o are examined. For objects in the cell, only those objects having no more than M points in their dmin-neighborhoods are outliers. The dmin-neighborhood of an object consists of the object's cell, all of its first layer, and some of its second layer.

A variation of the algorithm is linear with respect to n and guarantees that no more than three passes over the data set are required. It can be used for large disk-resident data sets, yet does not scale well for high dimensions.

4.7.3 Density-Based Local Outlier Detection:

Statistical and distance-based outlier detection both depend on the overall, or global, distribution of the given set of data points, D. However, data are usually not uniformly distributed. These methods encounter difficulties when analyzing data with rather different density distributions.
To define the local outlier factor of an object, we need to introduce the concepts of k-distance, k-distance neighborhood, reachability distance, and local reachability density. These are defined as follows:
The k-distance of an object p is the maximal distance that p gets from its k-nearest neighbors. This distance is denoted as k-distance(p). It is defined as the distance, d(p, o), between p and an object o ∈ D, such that for at least k objects, o′ ∈ D, it holds that d(p, o′) ≤ d(p, o). That is, there are at least k objects in D that are as close as or closer to p than o, and for at most k − 1 objects, o″ ∈ D, it holds that d(p, o″) < d(p, o).

That is, there are at most k − 1 objects that are closer to p than o. You may be wondering at this point how k is determined. The LOF method links to density-based clustering in that it sets k to the parameter MinPts, which specifies the minimum number of points for use in identifying clusters based on density.
Here, MinPts (as k) is used to define the local neighborhood of an object, p.
The k-distance neighborhood of an object p is denoted N_k-distance(p)(p), or N_k(p) for short. By setting k to MinPts, we get N_MinPts(p). It contains the MinPts-nearest neighbors of p. That is, it contains every object whose distance from p is not greater than the MinPts-distance of p.
The reachability distance of an object p with respect to object o (where o is within the MinPts-nearest neighbors of p) is defined as:

reach_dist_MinPts(p, o) = max{MinPts-distance(o), d(p, o)}

Intuitively, if an object p is far away from o, then the reachability distance between the two is simply their actual distance. However, if they are sufficiently close (i.e., where p is within the MinPts-distance neighborhood of o), then the actual distance is replaced by the MinPts-distance of o. This helps to significantly reduce the statistical fluctuations of d(p, o) for all of the p close to o.
The higher the value of MinPts is, the more similar is the reachability distance for objects within the same neighborhood.
Intuitively, the local reachability density of p is the inverse of the average reachability distance based on the MinPts-nearest neighbors of p. It is defined as:

lrd_MinPts(p) = |N_MinPts(p)| / ( Σ_{o ∈ N_MinPts(p)} reach_dist_MinPts(p, o) )

The local outlier factor (LOF) of p captures the degree to which we call p an outlier. It is defined as:

LOF(p) = ( Σ_{o ∈ N_MinPts(p)} lrd_MinPts(o) / lrd_MinPts(p) ) / |N_MinPts(p)|

It is the average of the ratio of the local reachability density of p and those of p's MinPts-nearest neighbors. It is easy to see that the lower p's local reachability density is, and the higher the local reachability densities of p's MinPts-nearest neighbors are, the higher LOF(p) is.
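In practice, scikit-learn implements this measure; a short example with toy data, where n_neighbors plays the role of MinPts:

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

X = np.vstack([np.random.randn(100, 2), [[8.0, 8.0]]])  # one obvious outlier
lof = LocalOutlierFactor(n_neighbors=20)
pred = lof.fit_predict(X)                  # -1 marks outliers, 1 marks inliers
scores = -lof.negative_outlier_factor_    # LOF values; well above 1 means outlier
```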
4.7.4 Deviation-Based Outlier Detection:
Deviation-based outlier detection does not use statistical tests or distance-based measures to identify exceptional objects. Instead, it identifies outliers by examining the main characteristics of objects in a group. Objects that "deviate" from this description are considered outliers. Hence, in this approach the term deviations is typically used to refer to outliers. In this section, we study two techniques for deviation-based outlier detection. The first sequentially compares objects in a set, while the second employs an OLAP data cube approach.

Sequential Exception Technique:

The sequential exception technique simulates the way in which humans can distinguish unusual objects from among a series of supposedly like objects. It uses implicit redundancy of the data. Given a data set, D, of n objects, it builds a sequence of subsets, {D1, D2, …, Dm}, of these objects with 2 ≤ m ≤ n such that

Dj−1 ⊂ Dj, where Dj ⊆ D
Dissimilarities are assessed between subsets in the sequence. The technique introduces the following key terms.
Exception set:
This is the set of deviations or outliers. It is defined as the smallest subset of objects whose removal results in the greatest reduction of dissimilarity in the residual set.
Dissimilarity function:
This function does not require a metric distance between the objects. It is any function that, if given a set of objects, returns a low value if the objects are similar to one another. The greater the dissimilarity among the objects, the higher the value returned by the function. The dissimilarity of a subset is incrementally computed based on the subset prior to it in the sequence. Given a subset of n numbers, {x1, …, xn}, a possible dissimilarity function is the variance of the numbers in the set, that is,

(1/n) Σ_{i=1}^{n} (xi − x̄)²

where x̄ is the mean of the n numbers in the set. For character strings, the dissimilarity function may be in the form of a pattern string (e.g., containing wildcard characters) that is used to cover all of the patterns seen so far. The dissimilarity increases when the pattern covering all of the strings in Dj−1 does not cover any string in Dj that is not in Dj−1.
Cardinality function:
This is typically the count of the number of objects in a given set.
Smoothing factor:
This function is computed for each subset in the sequence. It assesses how much the dissimilarity can be reduced by removing the subset from the original set of objects.
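A toy sketch of these three terms for numeric data, using the variance-based dissimilarity above (the exact scaling of the smoothing factor varies by formulation; the one below is an assumption for illustration):

```python
import numpy as np

def dissimilarity(xs):
    return np.var(xs) if len(xs) > 1 else 0.0     # variance of the set

def smoothing_factor(data, subset_idx):
    """Reduction in dissimilarity from removing the subset, scaled by the
    cardinality of the remaining set."""
    rest = np.delete(np.asarray(data), subset_idx)
    return len(rest) * (dissimilarity(np.asarray(data)) - dissimilarity(rest))

data = [10, 11, 9, 10, 50]                         # 50 is the obvious deviation
print(smoothing_factor(data, [4]))                 # large positive value
```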
