Unit-3 DWDM 7TH Sem Cse
What is Clustering?
Clustering is the process of grouping a set of abstract objects into classes of similar objects.
Points to Remember
A cluster of data objects can be treated as one group.
While doing cluster analysis, we first partition the set of data into groups based on data
similarity and then assign the labels to the groups.
The main advantage of clustering over classification is that it is adaptable to changes
and helps single out useful features that distinguish different groups.
Clustering Methods
Clustering methods can be classified into the following categories −
Partitioning Method
Hierarchical Method
Density-based Method
Grid-Based Method
Model-Based Method
Constraint-based Method
Partitioning Method
Suppose we are given a database of ‘n’ objects, and the partitioning method constructs ‘k’
partitions of the data. Each partition represents a cluster, and k ≤ n. That is, it classifies
the data into k groups, which satisfy the following requirements −
Each group contains at least one object.
Each object must belong to exactly one group.
Points to remember −
For a given number of partitions (say k), the partitioning method will create an initial
partitioning.
Then it uses an iterative relocation technique to improve the partitioning by moving
objects from one group to another.
Hierarchical Methods
This method creates a hierarchical decomposition of the given set of data objects. We can
classify hierarchical methods on the basis of how the hierarchical decomposition is formed.
There are two approaches here −
Agglomerative Approach
Divisive Approach
Agglomerative Approach
This approach is also known as the bottom-up approach. Here, we start with each object
forming a separate group. The method keeps merging the objects or groups that are close to one
another, and it keeps doing so until all of the groups are merged into one or until the termination
condition holds.
Divisive Approach
This approach is also known as the top-down approach. Here, we start with all of the objects
in the same cluster. In each successive iteration, a cluster is split into smaller clusters. This is
done until each object is in its own cluster or the termination condition holds. Hierarchical methods are rigid,
i.e., once a merge or split is done, it can never be undone.
Density-based Method
This method is based on the notion of density. The basic idea is to continue growing the given
cluster as long as the density in the neighborhood exceeds some threshold, i.e., for each data
point within a given cluster, the radius of a given cluster has to contain at least a minimum
number of points.
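The neighborhood-density idea above can be sketched with a DBSCAN-style procedure, the classic density-based algorithm: a cluster keeps growing as long as each point's eps-neighborhood contains at least min_pts points. The 1-D data, `eps`, and `min_pts` values below are illustrative assumptions, not part of the text:

```python
def dbscan_1d(points, eps, min_pts):
    """Minimal DBSCAN-style sketch on 1-D data; returns a cluster label per point (-1 = noise)."""
    UNVISITED, NOISE = None, -1
    labels = [UNVISITED] * len(points)
    cluster = 0
    for i in range(len(points)):
        if labels[i] is not UNVISITED:
            continue
        neighbors = [j for j in range(len(points)) if abs(points[i] - points[j]) <= eps]
        if len(neighbors) < min_pts:
            labels[i] = NOISE          # not dense enough to seed a cluster
            continue
        cluster += 1                   # i is a core point: start a new cluster
        labels[i] = cluster
        seeds = [j for j in neighbors if j != i]
        while seeds:                   # grow the cluster while density holds
            j = seeds.pop()
            if labels[j] == NOISE:
                labels[j] = cluster    # border point reached from a core point
            if labels[j] is not UNVISITED:
                continue
            labels[j] = cluster
            j_neighbors = [k for k in range(len(points)) if abs(points[j] - points[k]) <= eps]
            if len(j_neighbors) >= min_pts:   # j is also a core point; expand from it
                seeds.extend(k for k in j_neighbors if labels[k] is UNVISITED)
    return labels

labels = dbscan_1d([1.0, 1.5, 2.0, 8.0, 8.4, 8.8, 25.0], eps=1.0, min_pts=2)
```

With these assumed values the two dense runs form two clusters, while the isolated point 25.0 is labelled noise.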
Grid-based Method
Here, the object space is quantized into a finite number of cells that together form a grid
structure.
Advantages
The major advantage of this method is its fast processing time.
It depends only on the number of cells in each dimension of the quantized space, not on the number of data objects.
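The quantization step can be sketched by hashing objects into grid cells, after which clustering operates on cells rather than individual points. The cell size and points below are illustrative assumptions:

```python
from collections import defaultdict

def grid_cells(points, cell_size):
    """Quantize 2-D points into a finite grid; later clustering works on the cells."""
    cells = defaultdict(list)
    for x, y in points:
        # each point maps to the integer cell coordinates containing it
        cells[(int(x // cell_size), int(y // cell_size))].append((x, y))
    return cells

cells = grid_cells([(0.2, 0.3), (0.4, 0.9), (2.5, 2.5)], cell_size=1.0)
```

The number of cells (here 2) is what drives the processing time, regardless of how many points fall into each cell.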
Model-based methods
In this method, a model is hypothesized for each cluster, and we find the best fit of the data to
the given model. The method locates clusters by constructing a density function that reflects the
spatial distribution of the data points.
This method also provides a way to automatically determine the number of clusters based on
standard statistics, taking outliers or noise into account. It therefore yields robust clustering
methods.
Constraint-based Method
In this method, the clustering is performed by the incorporation of user or application-oriented
constraints. A constraint refers to the user expectation or the properties of desired clustering
results. Constraints provide us with an interactive way of communication with the clustering
process. Constraints can be specified by the user or the application requirement.
Text databases consist of huge collections of documents, gathered from sources such as news
articles, books, digital libraries, e-mail messages, web pages, etc. Due to the increase in the
amount of information, text databases are growing rapidly. In many text databases, the data is
semi-structured.
For example, a document may contain a few structured fields, such as title, author,
publishing_date, etc. But along with the structured data, the document also contains unstructured
text components, such as the abstract and contents. Without knowing what could be in the
documents, it is difficult to formulate effective queries for analyzing and extracting useful
information from the data. Users require tools to compare the documents and rank their
importance and relevance. Therefore, text mining has become popular and an essential theme in
data mining.
Information Retrieval
Information retrieval deals with the retrieval of information from a large number of text-based
documents. Information retrieval systems differ from database systems because the two handle
different kinds of data; typical examples of information retrieval systems include online library
catalogue systems and web search engines.
There are three fundamental measures for assessing the quality of text retrieval −
Precision
Recall
F-score
Precision
Precision is the percentage of retrieved documents that are in fact relevant to the query.
Precision can be defined as −
Precision= |{Relevant} ∩ {Retrieved}| / |{Retrieved}|
Recall
Recall is the percentage of documents that are relevant to the query and were in fact retrieved.
Recall is defined as −
Recall = |{Relevant} ∩ {Retrieved}| / |{Relevant}|
F-score
F-score is the commonly used trade-off measure. An information retrieval system often needs to
trade off precision for recall or vice versa. F-score is defined as the harmonic mean of recall and
precision as follows −
F-score = 2 × (recall × precision) / (recall + precision)
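The three measures can be checked with a small set-based sketch; the document IDs and relevance judgments below are hypothetical:

```python
relevant = {"d1", "d2", "d3", "d5"}   # hypothetical documents judged relevant to the query
retrieved = {"d1", "d2", "d4", "d6"}  # hypothetical documents the system returned

hits = relevant & retrieved                         # |{Relevant} ∩ {Retrieved}| = 2
precision = len(hits) / len(retrieved)              # 2/4 = 0.5
recall = len(hits) / len(relevant)                  # 2/4 = 0.5
f_score = 2 * precision * recall / (precision + recall)  # harmonic mean = 0.5
```

Because the harmonic mean punishes imbalance, F-score is high only when precision and recall are both high.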
Partitioning Method (K-Mean) in Data Mining
Partitioning Method:
This clustering method classifies the information into multiple groups based on the
characteristics and similarity of the data. It is up to the data analyst to specify the number of
clusters to be generated.
In the partitioning method, given a database D containing N objects, the method constructs
K user-specified partitions of the data, in which each partition represents a cluster and a
particular region. Many algorithms come under the partitioning method; some popular ones
are K-Means, PAM (K-Medoids), and CLARA (Clustering Large Applications).
Output:
A dataset of K clusters
Method:
1. Randomly assign K objects from the dataset (D) as cluster centres (C).
2. (Re)assign each object to the cluster whose centre it is most similar to, based on the mean values.
3. Update the cluster means, i.e., recalculate the mean of each cluster with the updated
values.
4. Repeat steps 2 and 3 until no change occurs.
Figure – K-Means clustering flowchart
Example: Suppose we want to group the visitors to a website using just their age as
follows:
16, 16, 17, 20, 20, 21, 21, 22, 23, 29, 36, 41, 42, 43, 44, 45, 61, 62, 66
Initial Cluster:
K=2
Centroid(C1) = 16 [16]
Centroid(C2) = 22 [22]
Note: These two points are chosen randomly from the dataset.
Iteration-1:
C1 = 16.33 [16, 16, 17]
C2 = 37.25 [20, 20, 21, 21, 22, 23, 29, 36, 41, 42, 43, 44, 45, 61, 62, 66]
Iteration-2:
C1 = 19.55 [16, 16, 17, 20, 20, 21, 21, 22, 23]
C2 = 46.90 [29, 36, 41, 42, 43, 44, 45, 61, 62, 66]
Iteration-3:
C1 = 20.50 [16, 16, 17, 20, 20, 21, 21, 22, 23, 29]
C2 = 48.89 [36, 41, 42, 43, 44, 45, 61, 62, 66]
Iteration-4:
C1 = 20.50 [16, 16, 17, 20, 20, 21, 21, 22, 23, 29]
C2 = 48.89 [36, 41, 42, 43, 44, 45, 61, 62, 66]
There is no change between iterations 3 and 4, so we stop. Therefore, the K-Means
algorithm gives us the two clusters (16-29) and (36-66).
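The iterations above can be reproduced with a minimal 1-D k-means sketch in pure Python, using the same ages and the same initial centroids 16 and 22:

```python
def kmeans_1d(data, centroids, max_iter=100):
    """Minimal 1-D k-means: (re)assign each object to the nearest mean, update means, repeat."""
    for _ in range(max_iter):
        # Step 2: (re)assign each object to the cluster with the nearest centre
        clusters = [[] for _ in centroids]
        for x in data:
            nearest = min(range(len(centroids)), key=lambda i: abs(x - centroids[i]))
            clusters[nearest].append(x)
        # Step 3: recalculate the mean of each cluster
        new_centroids = [sum(c) / len(c) for c in clusters]
        if new_centroids == centroids:   # Step 4: stop when no change occurs
            break
        centroids = new_centroids
    return centroids, clusters

ages = [16, 16, 17, 20, 20, 21, 21, 22, 23, 29, 36, 41, 42, 43, 44, 45, 61, 62, 66]
centroids, clusters = kmeans_1d(ages, centroids=[16.0, 22.0])
```

This converges to the same clusters as the worked example, (16-29) and (36-66), with final means 20.5 and 48.89. Note this sketch does not guard against a cluster becoming empty, which real implementations must handle.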
A hierarchical clustering method works by grouping data into a tree of clusters.
Hierarchical clustering begins by treating every data point as a separate cluster. Then it
repeatedly executes the following steps:
1. Identify the two clusters that are closest together, and
2. Merge the two most similar clusters.
We continue these steps until all the clusters are merged together.
In hierarchical clustering, the aim is to produce a hierarchical series of nested clusters. This
hierarchy is represented graphically by a dendrogram, a tree-like diagram that records the
sequence of merges or splits: an inverted tree that describes the order in which clusters are
merged (bottom-up view) or broken up (top-down view).
1. Agglomerative:
Initially, consider every data point as an individual cluster, and at every step merge the
nearest pair of clusters (it is a bottom-up method). At first, every data point is
considered an individual entity or cluster. At every iteration, clusters merge with
other clusters until one cluster is formed.
The algorithm for agglomerative hierarchical clustering is:
1. Consider every data point as an individual cluster.
2. Calculate the similarity of each cluster with all the other clusters (the proximity
matrix).
3. Merge the clusters that are most similar or closest to each other.
4. Recalculate the proximity matrix for the new clusters.
5. Repeat steps 3 and 4 until only a single cluster remains.
Let’s see the graphical representation of this algorithm using a dendrogram.
Note:
This is just a demonstration of how the algorithm works; no calculation has been
performed below, and all the proximities among the clusters are assumed.
Let’s say we have six data points A, B, C, D, E, F.
Figure – Agglomerative Hierarchical clustering
Step-1:
Consider each alphabet as a single cluster and calculate the distance of one cluster
from all the other clusters.
Step-2:
In the second step, comparable clusters are merged together to form a single cluster.
Let's say cluster (B) and cluster (C) are very similar to each other, so we merge
them in this step; similarly with clusters (D) and (E). At last, we get the
clusters
[(A), (BC), (DE), (F)]
Step-3:
We recalculate the proximity according to the algorithm and merge the two nearest
clusters([(DE), (F)]) together to form new clusters as [(A), (BC), (DEF)]
Step-4:
Repeating the same process, the clusters DEF and BC are comparable and are merged
together to form a new cluster. We're now left with clusters [(A), (BCDEF)].
Step-5:
At last the two remaining clusters are merged together to form a single cluster
[(ABCDEF)].
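The merge loop described in steps 1-5 can be sketched with single-linkage agglomerative clustering. The six 1-D points standing in for A-F, and hence their distances, are illustrative assumptions, so the merge order differs slightly from the walkthrough above:

```python
def agglomerative(points):
    """Single-linkage agglomerative clustering: repeatedly merge the closest pair of clusters."""
    clusters = [[p] for p in points]  # every point starts as its own cluster
    merges = []
    while len(clusters) > 1:
        # proximity between two clusters = smallest pairwise distance (single linkage)
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(abs(a - b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((clusters[i], clusters[j], d))   # record this dendrogram step
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return merges

# six points standing in for A, B, C, D, E, F (positions assumed for illustration)
steps = agglomerative([1.0, 2.0, 2.2, 5.0, 5.3, 9.0])
```

Each entry of `steps` is one merge, so six points always yield five merges; reading the list top to bottom reproduces the bottom-up construction of the dendrogram.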
2. Divisive:
Divisive hierarchical clustering is precisely the opposite of agglomerative
hierarchical clustering. In divisive clustering, we consider all of the data points as a
single cluster, and in every iteration we separate out the data points that are least
similar to the rest. In the end, we are left with N clusters.
A search engine is an online answering machine, which is used to search, understand, and
organize content in its database based on the search query (keywords) entered by the
end-users (internet users). To display search results, a search engine first finds the valuable
results in its database, sorts them into an ordered list using its search algorithm, and
displays them to the end-user. The ordered list of results is commonly known as a Search
Engine Results Page (SERP).
Google, Yahoo!, Bing, YouTube, and DuckDuckGo are some popular examples of search
engines.
Search engines offer the following advantages −
1. Time-Saving
A search engine helps us save time by providing instant results, so we do not have to
browse each website individually.
2. Variety of information
The search engine offers a wide variety of resources to obtain relevant and valuable information
from the Internet. By using a search engine, we can get information in various fields such as
education, entertainment, games, etc. The information which we get from the search engine is in
the form of blogs, pdf, ppt, text, images, videos, and audios.
3. Precision
All search engines have the ability to provide more precise results.
4. Free Access
Most search engines, such as Google, Bing, and Yahoo, allow end-users to search their content
for free. There is no restriction on the number of searches, so all end-users (students, job
seekers, IT employees, and others) can spend as much time as they need searching for valuable
content to fulfill their requirements.
5. Advanced Search
Search engines allow us to use advanced search options to get relevant, valuable, and informative
results. Advanced search results make our searches more flexible as well as sophisticated. For
example, when you want to search for a specific site, type "site:" without quotes followed by the
site's web address.
Suppose we want to search for java tutorial on javaTpoint then type "java
site:www.javatpoint.com" to get the advanced result quickly.
To search about education institution sites (colleges and universities) for B.Tech in computer
science engineering, then use "computer science engineering site:.edu." to get the advanced
result.
6. Relevance
Search engines allow us to search for relevant content based on a particular keyword. For
example, the site "javatpoint" scores highly for the term "java tutorial" because a search
engine sorts its result pages by the relevance of the content; that is why we see the
highest-scoring results at the top of the SERP.
Search engines also have some disadvantages −
o Sometimes the search engine takes too much time to display relevant, valuable, and
informative content.
o Search engines, especially Google, frequently update their algorithms, and it is very
difficult to know which algorithm Google is currently running.
o They can make end-users over-reliant, as users turn to search engines even for their
smallest queries.
Components of a Search Engine
1. Web Crawler
Web Crawler is also known as a search engine bot, web robot, or web spider. It plays an
essential role in search engine optimization (SEO) strategy. It is mainly a software component
that traverses the web, then downloads and collects information across the Internet.
The following web crawler features can affect the search results −
o Included Pages
o Excluded Pages
o Document Types
o Frequency of Crawling
2. Database
The search engine database is a type of Non-relational database. It is the place where all the
web information is stored. It has a large number of web resources. Some most popular search
engine databases are Amazon Elastic Search Service and Splunk.
The following two database features can affect the search results:
3. Search Interfaces
Search Interface is one of the most important components of Search Engine. It is an interface
between the user and the database. It basically helps users to search for queries using the
database.
The following search interface features can affect the search results −
o Operators
o Phrase Searching
o Truncation
4. Ranking Algorithms
The ranking algorithm is used by Google to rank web pages according to the Google search
algorithm.
There are the following ranking features that affect the search results -
1. Crawling
Crawling is the first stage in which a search engine uses web crawlers to find, visit, and
download the web pages on the WWW (World Wide Web). Crawling is performed by software
robots, known as "spiders" or "crawlers." These robots are used to review the website content.
2. Indexing
Indexing builds an online library of websites; it is used to sort, store, and organize the content
found during crawling. Once a page is indexed, it can appear as a result for the queries to
which it is most relevant.
To know more about how the search engine works click on the following link -
https://www.javatpoint.com/how-search-engine-works
iii. Index creation
Index creation takes the output from text transformation and creates the indexes or data
structures that enable fast searching.
2. Query process
The query process produces the list of documents based on a user's search query.
i. User interaction
User interaction provides an interface between the users who search the content and the search
engine.
ii. Ranking
The ranking is the core component of the search engine. It takes query data from the user
interaction and generates a ranked list of data based on the retrieval model.
iii. Evaluation
Evaluation is used to measure and monitor the effectiveness and efficiency. The evaluation result
helps us to improve the ranking of the search engine.
What is an algorithm?
An algorithm is a finite set of instructions that we use to solve a problem.
Google's algorithm likewise follows a set of rules to solve its problem. Google's algorithm is
very complex to understand because Google changes it very frequently,
which makes it very tough for users to identify which algorithm Google is currently running.
The search engine uses a combination of algorithms to deliver webpages based on the relevance
rank of a webpage on its search engine results pages (SERPs).
1. Google Panda
Google Panda update was a major change in Google's search results. It is a search filter
introduced on 23 February 2011. The name "Panda" derives from Google engineer Mr.
Navneet Panda, who made it possible for Google to create and implement the Google Panda
update. The aim of the Google Panda update is to reduce the occurrence of low-quality content,
duplicate content, and thin content in the search results, so that unique and valuable
results rise to the top of the search engine rankings.
2. Google Penguin
In April 2012, Google launched the "webspam algorithm update," later called the Penguin
algorithm. Currently, Penguin is part of the core Google search engine algorithm. It is mainly
designed to target link spam and manipulative link-building practices, which Google analyzes
when pages are crawled and indexed.
3. Google Hummingbird
Google Hummingbird was introduced on August 20, 2013. Hummingbird pays more attention
to each word in a search query to bring better results. It is able to understand user intent and
find the content that best matches it. The advantage of the Hummingbird update is that it
provides fast, accurate, and semantic results.
4. Google Payday
Google Payday was introduced on June 11, 2013. It mainly impacted approximately 0.3 percent
of queries in the U.S. The Google Payday update is used to identify and penalize lower-quality
websites that use heavy spam techniques (spammy queries) to increase their rank and
traffic. The advantage of Payday is that it improves the quality of ranking for search queries.
5. Google Pigeon
Google Pigeon is one of the biggest updates in Google's algorithm. Pigeon update launched on
July 24, 2014. This update is designed to provide better local search results by rewarding local
searches that have a strong organic presence with better visibility. It also improves the ranking of
a search parameter based on distance and location.
6. Google RankBrain
Google's RankBrain is a machine learning and artificial intelligence system. It was introduced
in 2015 via a Bloomberg news story. It is Google's third most important ranking signal. It has
the ability to sort content based on accuracy and determine the most relevant results for the
search query entered by the end-users.
Google includes Machine Learning (ML), Artificial Intelligence (AI), and other algorithms to
identify user's behavior and quality of results in which they are interested. Google regularly
improves the search engine algorithm to produce the best results for end-users.
o HTML Improvements
HTML Improvements help to improve the display of the Search Engine Results Page (SERP). It
also helps us to identify issues related to Search Engine Optimization (SEO), such as Missing
Metadata, Duplicated content, and more.
o Search Analytics
Search Analytics is one of the most popular features of Google search engine. It filters data in
multiple ways like pages, queries, and more as well as tells how to get organic traffic from
Google.
o Crawl Errors
Crawl Errors help us to solve the problem related to the crawl section. In crawling pages, all
errors related to Googlebot are shown.
o Instantly matching our search
Google's search engine algorithms help to sort billions of webpages according to end-users'
requirements and present the most relevant, valuable, and useful results to them.
o Calculation
Google allows us to use its platform to perform calculations rather than using the computer's
calculator. To perform a calculation in Google, simply type "2345+234" in Google's
search box and press "Enter." Google then displays the result at the top of the search results.
Note: More than 70% of worldwide internet users use Google to do their searches.
2. Bing
o Image Search
Bing offers a more advanced way to organize image results than Google. It provides more
information to users and various advanced options to filter the photos.
o Video Search
Bing is one of the best search engine platforms for video content. It mostly brings videos from
YouTube to display. Bing allows us to hover the cursor over any videos to display a short clip of
the selected video.
o Computations
Most of the users like the Bing search engine platform for computation tasks. We can simply type
our math query in the search box to get instant results. Using Bing, we find top-rated websites to
solve equations and mathematics tasks.
o Homescreen
Bing attracts users by frequently changing its background images, making the home screen
more attractive for end-users. It also shows smaller pictures at the bottom of the screen with
trending headlines.
3. DuckDuckGo
DuckDuckGo is an internet search engine founded in 2008. It does not track,
collect, or store our personal information. It is the best platform for those who want to keep
their browsing information secure and private. According to research, DuckDuckGo is the
3rd most popular search engine in Australia, and approximately 35 million users use it to
search their queries.
Note: In January 2018, DuckDuckGo private browser was launched on iOS and Android.
o Quick Stopwatch
DuckDuckGo provides both a quick timer and a stopwatch to measure time. To start a quick
stopwatch, we simply type "stopwatch" in the search box.
o Check number of characters
Checking the number of characters is one of the most interesting features of DuckDuckGo.
Using it, we can quickly check the number of characters in our search query. To check the
characters in a search query, simply type "chars" before or after the query.
o Check whether websites are down or not
In DuckDuckGo, we use the query "Is website's name.com down for me" (for example, "Is
javatpoint.com down for me") to check whether a particular website is down or up for us.
o Calendar
DuckDuckGo helps us to find instant answers related to calendars. If you want to see the calendar
of your Date of Birth, then use the keyword "calendar month year" (calendar December 1997)
in the search box.
o Features for Developers
DuckDuckGo also provides features for developers. Some cool developer features of
DuckDuckGo are given below -
o It generates lorem ipsum text and encodes URLs to machine-readable text.
o It helps the developer to convert input stream binary to decimal.
o It shows a list of special characters and their HTML values.
4. YouTube
Based on the Alexa traffic ranking, YouTube is the 2nd largest search engine and the 3rd most
visited website in the world.
There are the following features of YouTube, which are used to improve User Experience -
o Captioning
Captioning helps search engines find videos. It adds more text metadata to the videos,
making them rank higher in query searches.
5. Baidu
Baidu is a search engine introduced in 2000. It is the dominant search engine in China. It
offers a free web browser available for both Windows and Android. It cooperates with
companies like Microsoft, Intel, and Qualcomm. It offers services such as cloud services, social
networking, maps, videos, image search, and many more. The main disadvantage of using
Baidu is that it can transmit the user's personal data to the Baidu server without using any
encryption algorithm.
o Ads
Ads are one of the most important features of Baidu. They help us generate more leads,
optimize our site globally, and use various optimization techniques to keep our content up
to date for end-users.
o Automatic Transcoding
Baidu transcodes non-mobile-friendly websites without any approval from the site owner.
o Click Behaviour - Opens in a new tab
In Baidu, search results always open in a new tab when we click on a selected website.
o Structured Data Implementation
To accumulate structured data, Baidu uses its own properties, such as Baidu Open and Baidu
webmaster tools.
o Mobile Search
Baidu uses m.baidu.com as its mobile search engine.
6. Yandex
The advantage of the Yandex search engine is that it is safe: its browser uses built-in
protection, which means that when we open a harmful site, a pop-up appears and the
browser automatically blocks it.
7. Yahoo!
o Yahoo! Finance
o Yahoo! Shopping
o Yahoo! Games
o Yahoo! Travel
o Yahoo! Maps
o Yahoo! Messengers
o Yahoo! Mail
Yahoo! offers the following features −
o Storage Capacity
Yahoo provides a huge amount of storage capacity (25 GB) to store data online on Yahoo. This
stored data can be used anywhere and anytime.
o Flickr
Flickr is the best media platform for uploading, managing, organizing, and sharing photos as well
as videos.
o Latest News
Yahoo allows us to share the latest information with our customers from time to time. This
information is available in the form of photos, videos, audios, and more.
o Privacy & Security
Yahoo! Search engine always takes care of the privacy of its users and provides a secure platform
with all its privacy constraints.
o User friendly
Yahoo! offers extra user-friendly features to its customers. It provides a clear distinction between
indexes, sent content, received content, etc. To save memory, it keeps unwanted emails for only
90 days; after that, it automatically deletes them from the user's account.
Smartworld.asia Specworld.in
Chapter-4
4.1.1 Applications:
Cluster analysis has been widely used in numerous applications, including market research,
pattern recognition, data analysis, and image processing.
In business, clustering can help marketers discover distinct groups in their customer bases
and characterize customer groups based on purchasing patterns.
In biology, it can be used to derive plant and animal taxonomies, categorize genes with
similar functionality, and gain insight into structures inherent in populations.
Clustering may also help in the identification of areas of similar land use in an earth
observation database; in the identification of groups of houses in a city according to
house type, value, and geographic location; and in the identification of groups of
automobile insurance policy holders with a high average claim cost.
Clustering is also called data segmentation in some applications because clustering
partitions large data sets into groups according to their similarity.
Smartzworld.com 1 jntuworldupdates.org
Clustering can also be used for outlier detection. Applications of outlier detection include
the detection of credit card fraud and the monitoring of criminal activities in electronic
commerce.
Density-Based Methods
Grid-Based Methods
Model-Based Methods
The general criterion of a good partitioning is that objects in the same cluster are close or
related to each other, whereas objects of different clusters are far apart or very different.
The agglomerative approach, also called the bottom-up approach, starts with each
object forming a separate group. It successively merges the objects or groups that are
close to one another, until all of the groups are merged into one or until a termination
condition holds.
The divisive approach, also called the top-down approach, starts with all of the objects in
the same cluster. In each successive iteration, a cluster is split up into smaller clusters,
until eventually each object is in one cluster, or until a termination condition holds.
Hierarchical methods suffer from the fact that once a step (merge or split) is done, it can never
be undone. This rigidity is useful in that it leads to smaller computation costs by not having
to worry about a combinatorial number of different choices.
All of the clustering operations are performed on the grid structure i.e., on the quantized
space. The main advantage of this approach is its fast processing time, which is
typically independent of the number of data objects and dependent only on the number
of cells in each dimension in the quantized space.
STING is a typical example of a grid-based method. Wave Cluster applies wavelet
transformation for clustering analysis and is both grid-based and density-based.
CLIQUE and PROCLUS are two influential subspace clustering methods, which
search for clusters in subspaces of the data, rather than over the entire data space.
Frequent pattern–based clustering, another clustering methodology, extracts distinct
frequent patterns among subsets of dimensions that occur frequently. It uses such
patterns to group objects and generate meaningful clusters.
First, it randomly selects k of the objects, each of which initially represents a cluster
mean or center.
For each of the remaining objects, an object is assigned to the cluster to which it is the
most similar, based on the distance between the object and the cluster mean.
It then computes the new mean for each cluster.
This process iterates until the criterion function converges.
Typically, the square-error criterion is used, defined as
E = Σi=1..k Σp∈Ci |p − mi|²
where E is the sum of the square error for all objects in the data set, p is the point in space
representing a given object, and mi is the mean of cluster Ci.
The k-means algorithm is sensitive to outliers because an object with an extremely large
value may substantially distort the distribution of data. This effect is particularly
exacerbated due to the use of the square-error function.
Instead of taking the mean value of the objects in a cluster as a reference point, we can pick
actual objects to represent the clusters, using one representative object per cluster. Each
remaining object is clustered with the representative object to which it is the most similar.
The partitioning method is then performed based on the principle of minimizing the sum of
the dissimilarities between each object and its corresponding reference point. That is, an
absolute-error criterion is used, defined as
E = Σj=1..k Σp∈Cj |p − oj|
where E is the sum of the absolute error for all objects in the data set and oj is the
representative object of cluster Cj.
The initial representative objects are chosen arbitrarily. The iterative process of replacing
representative objects with non-representative objects continues as long as the quality of the
resulting clustering improves.
This quality is estimated using a cost function that measures the average
dissimilarity between an object and the representative object of its cluster.
To determine whether a non-representative object, orandom, is a good replacement for a
current representative object, oj, the following four cases are examined for each of the
non-representative objects.
Case 1:
p currently belongs to representative object oj. If oj is replaced by orandom as a representative object and p is closest to one of the other representative objects, oi, i≠j, then p is reassigned to oi.
Case 2:
p currently belongs to representative object oj. If oj is replaced by orandom as a representative object and p is closest to orandom, then p is reassigned to orandom.
Case 3:
p currently belongs to representative object oi, i≠j. If oj is replaced by orandom as a representative object and p is still closest to oi, then the assignment does not change.
Case 4:
p currently belongs to representative object oi, i≠j. If oj is replaced by orandom as a representative object and p is closest to orandom, then p is reassigned to orandom.
4.4.2 The k-Medoids Algorithm:
The k-medoids method is more robust than k-means in the presence of noise and outliers, because a medoid is less influenced by outliers or other extreme values than a mean. However, its processing is more costly than the k-means method.
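The swap-based search described by the four cases above can be sketched for one-dimensional data as follows. This is a minimal PAM-style illustration, not the full algorithm; the initialization and data are arbitrary.

```python
def total_cost(points, medoids):
    """Absolute-error criterion E: sum of each object's distance to its nearest medoid."""
    return sum(min(abs(p - m) for m in medoids) for p in points)

def pam(points, k):
    """Greedy k-medoids sketch: swap a representative object oj with a
    nonrepresentative object o_random whenever the swap lowers E."""
    medoids = list(points[:k])  # arbitrary initial representative objects
    improved = True
    while improved:
        improved = False
        for j, oj in enumerate(medoids):
            for o_random in points:
                if o_random in medoids:
                    continue
                trial = medoids[:j] + [o_random] + medoids[j + 1:]
                # Keep the replacement only if the clustering quality improves.
                if total_cost(points, trial) < total_cost(points, medoids):
                    medoids = trial
                    improved = True
    return sorted(medoids)
```

On `[1, 2, 3, 100, 101, 102]` with k = 2, the search settles on the medoids 2 and 101, the actual objects closest to the centers of the two groups.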
A hierarchical clustering method works by grouping data objects into a tree of clusters.
The quality of a pure hierarchical clustering method suffers from its inability to perform adjustment once a merge or split decision has been executed. That is, if a particular merge or split decision later turns out to have been a poor choice, the method cannot backtrack and correct it.
We can specify constraints on the objects to be clustered. In a real estate application, for example, one may like to spatially cluster only those luxury mansions worth over a million dollars. This constraint confines the set of objects to be clustered. It can easily be handled by preprocessing, after which the problem reduces to an instance of unconstrained clustering.
A user may like to set a desired range for each clustering parameter. Clustering parameters are usually quite specific to the given clustering algorithm. Examples of parameters include k, the desired number of clusters in a k-means algorithm, or ε (the neighborhood radius) and the minimum number of points in the DBSCAN algorithm. Although such user-specified parameters may strongly influence the clustering results, they are usually confined to the algorithm itself. Thus, their fine tuning and processing are usually not considered a form of constraint-based clustering.
Constraints on distance or similarity functions:
We can specify different distance or similarity functions for specific attributes of the objects to be clustered, or different distance measures for specific pairs of objects. When clustering sportsmen, for example, we may use different weighting schemes for height, body weight, age, and skill level. Although this will likely change the mining results, it may not alter the clustering process per se. However, in some cases, such changes may make the evaluation of the distance function nontrivial, especially when it is tightly intertwined with the clustering process.
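A per-attribute weighting scheme of this kind can be expressed as a weighted Euclidean distance. The sketch below is illustrative; the attribute ordering and weight values are assumptions, not part of the text.

```python
def weighted_distance(a, b, weights):
    """Weighted Euclidean distance between two attribute vectors a and b.
    A larger weight makes differences in that attribute count more."""
    return sum(w * (x - y) ** 2 for x, y, w in zip(a, b, weights)) ** 0.5

# Hypothetical sportsmen as (height_cm, weight_kg, age, skill) vectors;
# skill is weighted most heavily, age least.
p1 = (180, 75, 24, 8)
p2 = (182, 77, 30, 3)
weights = (0.5, 0.5, 0.1, 5.0)
```

With these weights, the large skill gap between `p1` and `p2` dominates the distance even though their physical attributes are nearly identical.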
DEPT OF CSE & IT
There exist data objects that do not comply with the general behavior or model of the data.
Such data objects, which are grossly different from or inconsistent with the remaining set
of data, are called outliers.
Many data mining algorithms try to minimize the influence of outliers or eliminate them altogether. This, however, could result in the loss of important hidden information because one person's noise could be another person's signal. In other words, the outliers may be of particular interest, such as in the case of fraud detection, where outliers may indicate fraudulent activity. Thus, outlier detection and analysis is an interesting data mining task, referred to as outlier mining.
It can be used in fraud detection, for example, by detecting unusual usage of credit cards or telecommunication services. In addition, it is useful in customized marketing for identifying the spending behavior of customers with extremely low or extremely high incomes, or in medical analysis for finding unusual responses to various medical treatments.
Outlier mining can be described as follows: Given a set of n data points or objects and k, the expected number of outliers, find the top k objects that are considerably dissimilar, exceptional, or inconsistent with respect to the remaining data. The outlier-mining problem can be viewed as two subproblems:
Define what data can be considered as inconsistent in a given data set, and
Find an efficient method to mine the outliers so defined.
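One simple reading of this problem statement is to score each object by its average dissimilarity to all other objects and report the k highest-scoring ones. The scoring function below is an illustrative choice, not the only possible definition of "considerably dissimilar".

```python
def top_k_outliers(points, k):
    """Return the k objects most dissimilar (on average) to the rest of the data.
    Points are plain numbers; dissimilarity is absolute difference."""
    n = len(points)

    def avg_dist(i):
        # Average distance from points[i] to every other object.
        return sum(abs(points[i] - points[j]) for j in range(n) if j != i) / (n - 1)

    ranked = sorted(range(n), key=avg_dist, reverse=True)
    return [points[i] for i in ranked[:k]]
```

For `[1, 2, 3, 4, 100]` with k = 1 this returns `[100]`, the object farthest on average from the remaining data.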
The choice of the alternative distribution is very important in determining the power of the test, that is, the probability that the working hypothesis is rejected when oi is really an outlier.
There are different kinds of alternative distributions.
Inherent alternative distribution:
In this case, the working hypothesis that all of the objects come from distribution F is
rejected in favor of the alternative hypothesis that all of the objects arise from another
distribution, G:
H1 : oi ∈ G, where i = 1, 2, …, n
F and G may be different distributions or differ only in parameters of the same
distribution.
There are constraints on the form of the G distribution in that it must have potential to
produce outliers. For example, it may have a different mean or dispersion, or a longer
tail.
Mixture alternative distribution:
The mixture alternative states that discordant values are not outliers in the F population, but contaminants from some other population, G. In this case, the alternative hypothesis is

H1 : oi ∈ (1 − λ)F + λG, where i = 1, 2, …, n
more extreme values are also considered outliers; otherwise, the next most extreme object is tested, and so on. This procedure tends to be more effective than block procedures.
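A minimal discordancy check of this statistical flavor can be sketched as follows, assuming a normal working distribution F and a simple z-score cutoff; the threshold value is an illustrative assumption rather than a formally derived significance level.

```python
from statistics import mean, stdev

def discordant(values, threshold=2.0):
    """Flag objects whose z-score under the assumed normal working
    distribution F exceeds the threshold, i.e., objects for which the
    working hypothesis looks untenable."""
    mu, sigma = mean(values), stdev(values)
    return [v for v in values if abs(v - mu) / sigma > threshold]
```

On `[10, 11, 9, 10, 12, 11, 50]`, only 50 exceeds the cutoff and is flagged as discordant.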
The nested-loop algorithm has the same computational complexity as the index-based algorithm but avoids index structure construction and tries to minimize the number of I/Os. It divides the memory buffer space into two halves and the data set into several logical blocks. By carefully choosing the order in which blocks are loaded into each half, I/O efficiency can be achieved.
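Setting the block-loading optimization aside, the underlying distance-based outlier test can be sketched in memory as follows. The definition assumed here is the standard distance-based one (an object is a DB(pct, dmin)-outlier if at least a fraction pct of the objects lie farther than dmin from it); the function name and example parameters are illustrative.

```python
def db_outliers(points, pct, dmin):
    """Naive nested-loop detection of DB(pct, dmin)-outliers on plain numbers."""
    n = len(points)
    M = n * (1 - pct)  # max objects allowed within dmin of an outlier
    result = []
    for i, p in enumerate(points):
        # Inner loop: count how many other objects lie within dmin of p.
        neighbors = sum(1 for j in range(n) if j != i and abs(p - points[j]) <= dmin)
        if neighbors <= M:
            result.append(p)
    return result
```

With `pct = 0.8` and `dmin = 5`, the point 100 in `[1, 2, 3, 4, 5, 100]` has no neighbors within 5 and is reported as an outlier, while the tightly packed points are not.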
Cell-based algorithm:
To avoid O(n²) computational complexity, a cell-based algorithm was developed for memory-resident data sets. Its complexity is O(c^k + n), where c is a constant depending on the number of cells and k is the dimensionality.
In this method, the data space is partitioned into cells with a side length equal to dmin/(2√k). Each cell has two layers surrounding it. The first layer is one cell thick, while the second is ⌈2√k − 1⌉ cells thick. Let M be the maximum number of outliers that can exist in the dmin-neighborhood of an outlier.
An object, o, in the current cell is considered an outlier only if the cell + 1 layer count is less than or equal to M. If this condition does not hold, then all of the objects in the cell can be removed from further investigation as they cannot be outliers.
If the cell + 2 layers count is less than or equal to M, then all of the objects in the cell are considered outliers. Otherwise, if this number is more than M, then it is possible that some of the objects in the cell may be outliers. To detect these outliers, object-by-object processing is used where, for each object, o, in the cell, objects in the second layer of o are examined. For objects in the cell, only those objects having no more than M points in their dmin-neighborhoods are outliers. The dmin-neighborhood of an object consists of the object's cell, all of its first layer, and some of its second layer.
A variation to the algorithm is linear with respect to n and guarantees that no more than three passes over the data set are required. It can be used for large disk-resident data sets, yet does not scale well for high dimensions.
That is, there are at most k−1 objects that are closer to p than o. You may be wondering at this point how k is determined. The LOF method links to density-based clustering in that it sets k to the parameter MinPts, which specifies the minimum number of points for use in identifying clusters based on density.
Here, MinPts (as k) is used to define the local neighborhood of an object, p.
The k-distance neighborhood of an object p is denoted N_k-distance(p)(p), or N_k(p) for short. By setting k to MinPts, we get N_MinPts(p). It contains the MinPts-nearest neighbors of p. That is, it contains every object whose distance is not greater than the MinPts-distance of p.
The reachability distance of an object p with respect to object o (where o is within the MinPts-nearest neighbors of p) is defined as

reach_dist_MinPts(p, o) = max{MinPts-distance(o), d(p, o)}.
Intuitively, if an object p is far away from o, then the reachability distance between the two is simply their actual distance. However, if they are sufficiently close (i.e., where p is within the MinPts-distance neighborhood of o), then the actual distance is replaced by the MinPts-distance of o. This helps to significantly reduce the statistical fluctuations of d(p, o) for all of the p close to o.
The higher the value of MinPts is, the more similar is the reachability distance for objects within the same neighborhood.
Intuitively, the local reachability density of p is the inverse of the average reachability distance based on the MinPts-nearest neighbors of p. It is defined as

lrd_MinPts(p) = |N_MinPts(p)| / Σ_{o ∈ N_MinPts(p)} reach_dist_MinPts(p, o)
The local outlier factor (LOF) of p captures the degree to which we call p an outlier. It is defined as

LOF_MinPts(p) = (1 / |N_MinPts(p)|) Σ_{o ∈ N_MinPts(p)} lrd_MinPts(o) / lrd_MinPts(p)

It is the average ratio of the local reachability density of p's MinPts-nearest neighbors to that of p. It is easy to see that the lower p's local reachability density is, and the higher the local reachability densities of p's MinPts-nearest neighbors are, the higher LOF(p) is.
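The definitions above (k-distance, k-distance neighborhood, reachability distance, local reachability density, and LOF) can be computed directly for a small one-dimensional data set. This is an O(n²) sketch for illustration, not an efficient implementation.

```python
def lof_scores(points, min_pts):
    """Compute LOF for every object in a small 1-D data set."""
    n = len(points)
    d = lambda i, j: abs(points[i] - points[j])

    def k_distance(i):
        # Distance to the min_pts-th nearest neighbor of points[i].
        return sorted(d(i, j) for j in range(n) if j != i)[min_pts - 1]

    def neighborhood(i):
        # N_MinPts(i): every object within the MinPts-distance of points[i].
        kd = k_distance(i)
        return [j for j in range(n) if j != i and d(i, j) <= kd]

    def lrd(i):
        # Local reachability density: inverse of average reachability distance.
        nbrs = neighborhood(i)
        return len(nbrs) / sum(max(k_distance(j), d(i, j)) for j in nbrs)

    return [
        sum(lrd(j) for j in neighborhood(i)) / (len(neighborhood(i)) * lrd(i))
        for i in range(n)
    ]
```

On `[1, 2, 3, 4, 5, 100]` with MinPts = 2, the isolated point 100 receives a LOF far above 1, while the densely packed points score near 1. (Duplicate points would make a density infinite; this sketch assumes distinct values.)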
4.7.4 Deviation-Based Outlier Detection:
Deviation-based outlier detection does not use statistical tests or distance-based measures to identify exceptional objects. Instead, it identifies outliers by examining the main characteristics of objects in a group. Objects that "deviate" from this description are considered outliers. Hence, in this approach the term deviations is typically used to refer to outliers. In this section, we study two techniques for deviation-based outlier detection. The first sequentially compares objects in a set, while the second employs an OLAP data cube approach.
The sequential exception technique simulates the way in which humans can distinguish unusual objects from among a series of supposedly like objects. It uses implicit redundancy of the data. Given a data set, D, of n objects, it builds a sequence of subsets, {D1, D2, …, Dm}, of these objects with 2 ≤ m ≤ n, such that Dj−1 ⊂ Dj, where Dj ⊆ D.
Dissimilarities are assessed between subsets in the sequence. The technique introduces the following key terms.
Exception set:
This is the set of deviations or outliers. It is defined as the smallest subset of objects whose removal results in the greatest reduction of dissimilarity in the residual set.
Dissimilarity function:
This function does not require a metric distance between the objects. It is any function that, if given a set of objects, returns a low value if the objects are similar to one another. The greater the dissimilarity among the objects, the higher the value returned by the function. The dissimilarity of a subset is incrementally computed based on the subset prior to it in the sequence. Given a subset of n numbers, {x1, …, xn}, a possible dissimilarity function is the variance of the numbers in the set, that is,

(1/n) Σ_{i=1..n} (xi − x̄)²
where x̄ is the mean of the n numbers in the set. For character strings, the dissimilarity function may be in the form of a pattern string (e.g., containing wildcard characters) that is used to cover all of the patterns seen so far. The dissimilarity increases when the pattern covering all of the strings in Dj−1 does not cover any string in Dj that is not in Dj−1.
Cardinality function:
This is typically the count of the number of objects in a given set.
Smoothing factor:
This function is computed for each subset in the sequence. It assesses how much the dissimilarity can be reduced by removing the subset from the original set of objects.