
MCS-221
Data Warehousing and Data Mining
Indira Gandhi National Open University
School of Computer and Information Sciences

Block

4
CLASSIFICATION, CLUSTERING AND
WEB MINING
UNIT 10
Classification
UNIT 11
Clustering
UNIT 12
Text and Web Mining
PROGRAMME DESIGN COMMITTEE
Prof. (Retd.) S.K. Gupta, IIT, Delhi
Prof. T.V. Vijay Kumar, JNU, New Delhi
Prof. Ela Kumar, IGDTUW, Delhi
Prof. Gayatri Dhingra, GVMITM, Sonipat
Mr. Milind Mahajan, Impressico Business Solutions, New Delhi
Sh. Shashi Bhushan Sharma, Associate Professor, SOCIS, IGNOU
Sh. Akshay Kumar, Associate Professor, SOCIS, IGNOU
Dr. P. Venkata Suresh, Associate Professor, SOCIS, IGNOU
Dr. V.V. Subrahmanyam, Associate Professor, SOCIS, IGNOU
Sh. M.P. Mishra, Assistant Professor, SOCIS, IGNOU
Dr. Sudhansh Sharma, Assistant Professor, SOCIS, IGNOU

COURSE DESIGN COMMITTEE


Prof. T.V. Vijay Kumar, JNU, New Delhi
Dr. Rahul Johri, USICT, GGSIPU, New Delhi
Mr. Vinay Kumar Sharma, NVLI, IGNOU
Sh. Shashi Bhushan Sharma, Associate Professor, SOCIS, IGNOU
Sh. Akshay Kumar, Associate Professor, SOCIS, IGNOU
Dr. P. Venkata Suresh, Associate Professor, SOCIS, IGNOU
Dr. V.V. Subrahmanyam, Associate Professor, SOCIS, IGNOU
Sh. M.P. Mishra, Assistant Professor, SOCIS, IGNOU
Dr. Sudhansh Sharma, Assistant Professor, SOCIS, IGNOU

SOCIS FACULTY
Prof. P. Venkata Suresh, Director, SOCIS, IGNOU
Prof. V.V. Subrahmanyam, SOCIS, IGNOU
Dr. Akshay Kumar, Associate Professor, SOCIS, IGNOU
Dr. Naveen Kumar, Associate Professor, SOCIS, IGNOU (on EOL)
Dr. M.P. Mishra, Associate Professor, SOCIS, IGNOU
Dr. Sudhansh Sharma, Assistant Professor, SOCIS, IGNOU
Dr. Manish Kumar, Assistant Professor, SOCIS, IGNOU

BLOCK PREPARATION TEAM


Course Editor
Prof. Devendra Kumar Tayal
Dept. of Computer Science & Engineering
Indira Gandhi Delhi Technical University for Women, New Delhi

Language Editor
Prof. Parmod Kumar
School of Humanities, IGNOU, New Delhi

Course Writers
Unit 10 & Unit 11: Dr. S. Nagaprasad, Lecturer, Dept. of Computer Science, Tara Govt. Degree and P.G. College, Sangareddy, Telangana
Unit 12: Prof. Archana Singh, Dept. of Information Technology, Amity School of Engineering & Technology, Noida

Course Coordinator: Prof. V.V. Subrahmanyam

Print Production
Mr. Sanjay Aggarwal, Assistant Registrar (Publication), MPDD

July, 2022
Indira Gandhi National Open University, 2022
ISBN-
All rights reserved. No part of this work may be reproduced in any form, by mimeograph or any other means, without permission in writing from
the Indira Gandhi National Open University.
Further information on the Indira Gandhi National Open University courses may be obtained from the University’s office at Maidan Garhi, New
Delhi-110068.
Printed and published on behalf of the Indira Gandhi National Open University, New Delhi by MPDD, IGNOU.
BLOCK INTRODUCTION
The title of the block is Classification, Clustering and Web Mining. The objectives of this
block are to help you understand the underlying concepts of Classification, Clustering,
Text and Web Mining.

The block is organized into 3 units:

Unit 10 covers an overview of Classification, general approaches to solving a classification
problem and details of various classification techniques;

Unit 11 covers an overview of Clustering, categorization of clustering methods, the partitioning
method, hierarchical clustering and outlier analysis; and

Unit 12 covers the introductory topics of Text and Web Mining, web content mining, mining
multimedia data on the web and web usage mining.
 
UNIT 11 CLUSTERING

Structure

11.0 Introduction
11.1 Objectives
11.2 Clustering – An Overview
11.2.1 Applications of Cluster Analysis in Data Mining
11.3 Clustering Methods
11.4 Partitioning Method
11.4.1 k-Means Algorithm
11.4.2 k-Medoids
11.5 Hierarchical Method
11.5.1 Agglomerative Approach
11.5.2 Divisive Approach
11.6 Density Based Method
11.6.1 DBSCAN
11.7 Limitations with Cluster Analysis
11.8 Outlier Analysis
11.9 Summary
11.10 Solutions/Answers
11.11 Further Readings

11.0 INTRODUCTION

In the earlier unit, we studied Classification in Data Mining. We covered the
introductory concepts and general approach to classification, applications of
classification models, various classifiers and their underlying principles, and model
evaluation and selection aspects.

This unit covers another important concept known as Clustering. Clustering is the
process of creating groups in data (of customers, products, employees, text
documents and so on) in such a way that objects falling into one group exhibit many
similar properties with each other and differ from objects that fall into the other
groups created during the process.

It is a technique for turning a collection of otherwise unrelated abstract objects into a set of
related classes, by dividing large collections of data or objects into smaller groups called
clusters. As a stand-alone tool, it provides insight into data distribution; it can also be used
as a pre-processing step for other algorithms.

We will study an overview of clustering, clustering methods, the partitioning method,
hierarchical clustering and outlier analysis.

11.1 OBJECTIVES

After going through this unit, you shall be able to:

 Understand Clustering in Data Mining
 List various types of Clustering Methods
 Understand and demonstrate various Clustering Methods
 Understand the k-Means and k-Medoids algorithms, which fall under the
partitioning method
 Explain the Agglomerative and Divisive methods of hierarchical clustering
 Discuss outlier analysis

11.2 CLUSTERING – AN OVERVIEW


Clustering is the process of grouping a collection of objects (usually represented as
points in a multidimensional space) into classes of similar objects. Cluster analysis is
a very important tool in data analysis. It is a set of methodologies for automatic
classification of a collection of patterns into clusters based on similarity. Intuitively,
patterns within the same cluster are more similar to each other than patterns belonging
to a different cluster. It is important to understand the difference between clustering
(unsupervised classification) and supervised classification.

Typical pattern clustering activity involves the following steps:

 pattern representation (including feature extraction and/or selection),


 definition of a pattern proximity measure appropriate to the data domain,
 clustering
 data abstraction and
 assessment of output.

Cluster analysis is an exploratory discovery process. It can be used to discover
structures in data without providing an explanation/interpretation.

Cluster analysis includes two major aspects: clustering and cluster validation.
Clustering aims at partitioning objects into groups according to certain criteria. To
serve different application purposes, a large number of clustering algorithms have
been developed. However, since there is no general-purpose clustering algorithm that
fits all kinds of applications, an evaluation mechanism is required to assess the quality
of the clustering results produced by different clustering algorithms, or by one
clustering algorithm with different parameters, so that the user may find a clustering
scheme that fits a specific application. This quality assessment process is known as
cluster validation. Cluster analysis is thus an iterative process of clustering and cluster
validation by the user, facilitated by clustering algorithms, cluster validation methods,
visualization and domain knowledge of the databases.

Clustering is an unsupervised algorithm that divides a set of data points into
clusters so that objects within the same cluster are similar to one another. Using clustering,
data can be subdivided into smaller, more manageable groups, with the data in each of
these subgroups forming a single cluster.

 To put it simply, a cluster is a group of related things.
 It is a grouping of objects where the distance between any two members is
less than the distance between a member and any object outside the group.
 It is a segment of multidimensional space with a high density of objects,
relative to the other segments.

A simple example of Clustering is illustrated in Figure 1 as shown below:


 
 

Figure 1: Example of Clustering

Clustering helps in organizing huge volumes of data into clusters and reveals the internal
structure of statistical information. Clustering is intended to segregate the data into
clusters, and it improves the readiness of the data for artificial intelligence
techniques. The clustering process supports knowledge discovery in data; it is used
either as a stand-alone tool to gain insight into the data distribution or as a pre-
processing step for other algorithms.

11.2.1 Applications of Cluster Analysis in Data Mining

Following are some of the applications of Cluster Analysis:

 Clustering analysis is widely utilized in a variety of fields, including data


analysis, market research, pattern identification, and image processing.
 It aids information discovery by grouping documents on the Internet.
 Credit card fraud detection relies on clustering.
 Cluster analysis is a data mining function that provides insight into data
distribution so that the properties of each cluster may be analyzed.
 It can be used in biology to figure out plant and animal taxonomies,
classify genes with similar functions, and have a better understanding of
population structure.
 Earth observation databases use this data to identify similar land regions
and to group houses in a city based on house type, value, and geographic
position.
 It is the backbone of search engine algorithms, where objects that are
similar to each other must be presented together and dissimilar objects
should be ignored. Also, it is required to fetch objects that are closely
related to a search term, if not completely related.
 A similar application of text clustering like search engine can be seen in
academics where clustering can help in the associative analysis of various
documents – which can be in-turn used in – plagiarism, copyright
infringement, patent analysis etc.
 Used in image segmentation in bioinformatics where clustering
algorithms have proven their worth in detecting cancerous cells from
various medical imagery – eliminating the prevalent human errors and
other bias.
 OTT platforms are using clustering in implementing movie
recommendations for its users.
 News summarization can be performed using Cluster analysis where
articles can be divided into a group of related topics.
 Clustering is used to generate recommendations for sports training for
athletes based on their goals and various body-related metrics, and to assign
training regimens to the players accordingly.


 
 
 Marketing and sales applications use clustering to identify the Demand-
Supply gap based on various past metrics – where a definitive meaning
can be given to huge amounts of scattered data.
 Various job search portals use clustering to divide job posting
requirements into organized groups which becomes easier for a job-seeker
to apply and target for a suitable job.
 Resumes of job-seekers can be segmented into groups based on various
factors like skill-sets, experience, strengths, type of projects, expertise
etc., which makes potential employers connect with correct resources.
 Clustering effectively detects hidden patterns, rules, constraints, flow etc.
based on various metrics of traffic density from GPS data and can be used
for segmenting routes and suggesting users with best routes, location of
essential services, search for objects on a map etc.
 Satellite imagery can be segmented to find suitable and arable lands for
agriculture.
 Clustering can help in getting customer persona analysis based on various
metrics of Recency, Frequency, and Monetary metrics and build an
effective User Profile – in-turn this can be used for Customer Loyalty
methods to curb customer churn.
 Document clustering is effectively being used in preventing the spread of
fake news on Social Media.
 Website network traffic can be divided into various segments, which helps in
prioritizing requests heuristically and also helps in detecting and preventing
malicious activities.
 Eateries are using clustering to perform Customer Segmentation which
helped them to target their campaigns effectively and helped increase
their customer engagement across various channels.

11.3 CLUSTERING METHODS


For a successful grouping there are two major goals: (i) similarity between one data
point and another within a group, and (ii) distinction of those similar data points from
other points which clearly differ from them.

The basis of such divisions begins with our ability to scale to large datasets, and that is a
major starting point for us. The data may contain different kinds of attributes, such as
categorical data, continuous data, etc.; dealing with these is the second challenge. The
next challenge is multidimensional data. The clustering algorithm should
successfully cross this hurdle as well.

The clusters are not only there to distinguish data points; they should also be inclusive.
A distance metric helps a lot, but the cluster shape is often limited to a
geometric shape and many important data points get excluded. This problem, too, needs
to be taken care of.

Data is also highly "noisy" in nature, since many unwanted features reside in
the data; this makes it a rather Herculean task to bring out any similarity between
the data points, leading to the creation of improper groups. Towards the
end of the line, we are faced with the challenge of business interpretation. The outputs
from the clustering algorithm should be understandable, should fit the business
criteria and should address the business problem correctly.

To address the challenges stated above, such as scalability, attribute types, dimensionality,
boundary shape, noise, and interpretation, there are various types of clustering methods,
each solving one or more of these problems.


 
 
The following are various types of Clustering methods:
 Partitioning Method
 Hierarchical Method
 Density-based Method
 Grid-Based Method
 Model-Based Method
 Constraint-based Method

11.3.1 Partitioning Method


Partitioning methods break the data down into a set of different clusters. Given n
objects, these methods produce k clusters of data, where k < n, using an
iterative relocation method. Algorithms used in partitioning-based clustering are the
k-means algorithm and the k-medoids algorithm.

11.3.2 Hierarchical Method

Hierarchical methods decompose a data set of n objects into groups forming a
hierarchy of clusters. The tree-like structural representation of hierarchical
clustering is called a dendrogram. The root of the dendrogram represents the
entire dataset and the leaves represent the individual objects in the dataset. The
clustering results are obtained by cutting the dendrogram at different levels. The two
approaches of hierarchical clustering are the Agglomerative (bottom-up) method and the
Divisive (top-down) method.

11.3.3 Density-Based Method


Density-based clustering groups a database based on densities (i.e. a local
cluster criterion). Its major features are:

 It discovers clusters of arbitrary shape.
 It handles noisy data.
 It examines only the local region to determine the density.
 It requires density parameters as a termination condition.

It is categorized into two types, namely:

(a) Density-Based Connectivity: this includes clustering techniques such as DBSCAN
(Density-Based Spatial Clustering of Applications with Noise) and DBCLASD.
(b) Density-Based Function: in the DENCLUE (DENsity-based CLUstEring) method, density
clusters are obtained based on some density functions.

11.3.4 Grid Based Method


The idea of grid-based clustering methods is based on clustering-oriented query
answering in multilevel grid structures. The upper level stores a summary of the
information of its next level; thus the grids form cells between the connected levels.
Many grid-based methods have been proposed, such as STING (Statistical
Information Grid approach), CLIQUE, and combinations of grid- and density-based
techniques. Grid-based methods are efficient, clustering data with a
complexity of O(N). However, the primary issue with grid-based techniques is how to
decide the size of the grids, which depends on the user's experience.

11.3.5 Model Based Clustering Method


Model-based clustering methods are based on the assumption that data are generated
by a mixture of underlying probability distributions, and they optimize the fit between
the data and some mathematical model, for example a statistical approach, a neural
network approach or other AI approaches. When facing an unknown data
distribution, choosing a suitable one from the candidate models is still a major
challenge. On the other hand, clustering based on probability suffers from high
computational cost, especially when the scale of data is very large.
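As an illustration only, the following is a minimal sketch of the model-based idea using a
Gaussian mixture model; it assumes the scikit-learn and NumPy libraries are available, and the
synthetic two-group data and the choice of two components are assumptions for the example,
not part of the course material.

```python
# Hypothetical, minimal model-based clustering sketch using a Gaussian mixture model.
# Assumes scikit-learn and NumPy are installed; data and parameters are illustrative.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Two synthetic groups drawn from different Gaussian distributions
data = np.vstack([
    rng.normal(loc=0.0, scale=1.0, size=(100, 2)),
    rng.normal(loc=5.0, scale=1.5, size=(100, 2)),
])

# Fit a mixture of two Gaussians and assign each point to its most likely component
gmm = GaussianMixture(n_components=2, random_state=0).fit(data)
labels = gmm.predict(data)
print(labels[:10], gmm.means_)
```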

11.3.6 Constraint Based Method


The clustering process, in general, is based on the approach that the data can be
divided into an optimal number of “unknown” groups. The underlying stage of all the
clustering algorithms is to find those hidden patterns and similarities, without any
intervention or predefined conditions. However, in certain business scenarios, we
might be required to partition the data based on certain constraints. Here is where a
supervised version of clustering machine learning techniques comes into play.
A constraint is defined as a desired property of the clustering results, or a user's
expectation of the clusters so formed; this can be in terms of a fixed number of
clusters, the cluster size, or important dimensions (variables) that are required for
the clustering process. Usually, tree-based algorithms like Decision Trees, Random
Forests and Gradient Boosting are made use of to attain constraint-based
clustering. A tree is constructed by splitting without the interference of the constraints
or clustering labels. Then, the leaf nodes of the tree are combined together to form the
clusters while incorporating the constraints and using suitable algorithms.

In the next section, let us study the algorithms available in partitioning method.

11.4 PARTITIONING METHOD


This method is one of the most popular choices for analysts to create clusters. In
partitioning clustering, the clusters are partitioned based upon the characteristics of the
data points, and the number of clusters to be created must be specified in advance.
These clustering algorithms follow an iterative process to reassign the data points
between clusters based upon the distance. The algorithms that fall into this category are
as follows:

11.4.1 k-Means Algorithm


This algorithm clusters the data set by forming clusters iteratively. It is an
unsupervised, iterative algorithm. The main aim of this algorithm is to
find the locations of the clusters so as to minimize the distance between the clusters
and the data points. This algorithm is also called "Lloyd's algorithm": m data points are
clustered to form some number of clusters, say k, where each data point belongs to
the cluster with the closest mean.

Algorithm

1. Define the number of clusters (k) to be produced and choose initial data points as
centroids.
2. The distance from every data point to all the centroids is calculated and the
point is assigned to the cluster with the minimum distance.
3. Follow the above step for all the data points.
4. The average of the data points present in a cluster is calculated and set as the
new centroid for that cluster.
5. Repeat Steps 2 to 4 until the desired clusters are formed, i.e., the centroids no longer change.

The initial centroids are selected randomly, and thus they have a large influence on the
resulting clusters. The complexity of the k-means algorithm is O(tkn), where n is the total
number of data points, k the number of clusters formed, and t the number of iterations needed
to form the clusters [1].
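A minimal NumPy sketch of Lloyd's iteration, mirroring steps 1 to 5 above, is given below; the
synthetic data, the value k = 2 and the helper name kmeans are assumptions made for the
example only.

```python
# A minimal sketch of Lloyd's k-means iteration in NumPy, mirroring steps 1-5 above.
# The synthetic data, k = 2 and the function name are illustrative assumptions.
import numpy as np

def kmeans(points, k, iterations=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: choose k initial centroids at random from the data points
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(iterations):
        # Steps 2-3: assign every point to its nearest centroid (Euclidean distance)
        distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Step 4: recompute each centroid as the mean of the points assigned to it
        new_centroids = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Step 5: stop when the centroids no longer move
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels

data = np.vstack([np.random.randn(50, 2), np.random.randn(50, 2) + 5])
centres, assignments = kmeans(data, k=2)
print(centres)
```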


 
 
Advantages

 Effortless implementation process.


 Dense clusters are produced when the clusters are spherical, as compared to the
hierarchical method.
 Appropriate for large databases.

Disadvantages

 Inappropriate for clusters with different density and size.


 Different runs may not produce identical results, as the outcome depends on the initial centroids.
 Euclidean distance measures can weigh unequally due to underlying factors.
 Unsuccessful for non-linear data set and categorical data.
 Noisy data and outliers are difficult to handle.
 
11.4.2 k-Medoids or PAM (Partitioning Around Medoids)

It is similar in process to the K-means clustering algorithm with the difference being
in the assignment of the center of the cluster. In PAM, the medoid of the cluster has to
be an input data point while this is not true for K-means clustering as the average of
all the data points in a cluster may not belong to an input data point.

In this algorithm, each cluster is represented by one of its objects, which is
located near the centre of the cluster. The process of replacing representative
objects by non-representative objects is repeated as long as the quality of the resulting
clustering improves. This quality is estimated using a cost function which measures the
average dissimilarity between an object and the representative object of its cluster.

The algorithm is implemented in two steps:

Build: the initial medoids are the most centrally located (innermost) objects.

Swap: a medoid is swapped with a non-medoid object as long as the total cost of the
clustering can still be reduced.

Algorithm
1. Initially choose k random points as initial medoids from the given data set.
2. Assign every data point to its closest medoid using a distance metric.
3. The swapping cost is calculated for every selected and non-selected object, given as
TCns, where s is a selected and n a non-selected object.
4. If TCns < 0, s is replaced by n.
5. Repeat steps 2 to 4 until there is no change in the medoids.
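The sketch below is a simplified, alternating-style k-medoids variant in NumPy rather than the
full PAM build/swap procedure described above; the synthetic data, k = 2 and the Euclidean
distance are assumptions for the example.

```python
# A simplified k-medoids sketch in NumPy (alternating update of medoids, not full PAM).
# The synthetic data, k = 2 and the Euclidean distance are illustrative assumptions.
import numpy as np

def k_medoids(points, k, iterations=100, seed=0):
    rng = np.random.default_rng(seed)
    n = len(points)
    # Pre-compute pairwise Euclidean distances between all objects
    dist = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)
    medoids = rng.choice(n, size=k, replace=False)
    for _ in range(iterations):
        # Assign every object to its closest medoid
        labels = dist[:, medoids].argmin(axis=1)
        new_medoids = medoids.copy()
        for j in range(k):
            members = np.where(labels == j)[0]
            if len(members) == 0:
                continue
            # The new medoid is the member minimizing total distance to the others
            costs = dist[np.ix_(members, members)].sum(axis=1)
            new_medoids[j] = members[costs.argmin()]
        if np.array_equal(np.sort(new_medoids), np.sort(medoids)):
            break
        medoids = new_medoids
    labels = dist[:, medoids].argmin(axis=1)
    return medoids, labels

data = np.vstack([np.random.randn(40, 2), np.random.randn(40, 2) + 4])
medoid_indices, assignments = k_medoids(data, k=2)
print(data[medoid_indices])
```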

Four characteristics to be considered are:

 Shift-out membership: Movement of an object from current cluster to another


is allowed.
 Shift-in membership: Movement of an object from outside to current cluster is
allowed.
 Update the current medoids: Current medoid can be replaced by a new
medoid.
 No change: Objects are at their appropriate distances from cluster.

Advantages
 Effortless understanding and implementation process.
 Can run quickly and converge in few steps.
 Arbitrary dissimilarity measures between the objects are allowed.

 
 
 Less sensitive to outliers when compared to k-means.

Disadvantages
 Different initial sets of medoids can produce different clusterings. It is thus advisable
to run the procedure several times with different initial sets.
 Resulting clusters may depend upon units of measurement. Variables of
different magnitude can be standardized.

In the following section, let us focus on the algorithms available in the hierarchical
method.

11.5 HIERARCHICAL METHOD

This method decomposes a set of data items into a hierarchy. Depending on how the
hierarchical breakdown is generated, hierarchical approaches can be put into different
categories. Following are the two approaches:
 Agglomerative Approach
 Divisive Approach

11.5.1 Agglomerative Approach


This algorithm is also referred to as the bottom-up approach. It treats each and
every data point as a single cluster and then repeatedly merges the closest pair of clusters,
based on the similarity (distance) between individual clusters, until a single large cluster is
obtained or some stopping condition is satisfied.

Algorithm
1. Initialize the N data points as N individual clusters.
2. Find the cluster pair with the least (closest) distance and combine
them into one single cluster.
3. Calculate the pair-wise distances between the clusters now present, that is, between the
newly formed cluster and the previously available clusters.
4. Repeat steps 2 and 3 until all data samples are merged into a single large
cluster of size N.
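A minimal sketch of this bottom-up procedure using SciPy's hierarchy module follows; SciPy
is assumed to be installed, and the synthetic data and the 'single' linkage choice are
assumptions for the example.

```python
# A minimal sketch of bottom-up (agglomerative) clustering using SciPy's hierarchy
# module; the data and the 'single' linkage criterion are illustrative assumptions.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

data = np.vstack([np.random.randn(30, 2), np.random.randn(30, 2) + 6])

# Build the full merge tree (dendrogram) by repeatedly joining the closest clusters
merge_tree = linkage(data, method='single')

# Cut the dendrogram so that two flat clusters remain
labels = fcluster(merge_tree, t=2, criterion='maxclust')
print(labels)
```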

Advantages

 Easy to identify nested clusters.


 Gives better results and is easy to implement.
 They are suitable for automation.
 Reduces the effect of initial values of cluster on the clustering results.
 Reduces the computing time and space complexity.

Disadvantages

 It can never undo what was done previously.


 Difficulty in handling clusters of different sizes and convex shapes leads to an
increase in time complexity.
 There is no direct minimization of objective function.
 Sometimes there is difficulty in identifying the exact number of clusters by
the Dendrogram.


 
 
11.5.2 Divisive Approach
This approach is also referred to as the top-down approach. Here, we consider the
entire data sample set as one cluster and continuously split it into smaller
clusters iteratively. This is done until each object is in its own cluster or the termination
condition holds. This method is rigid, because once a merging or splitting is done, it
can never be undone.

Algorithm

1. Initially, start the process with one cluster containing all the samples.
2. Select the largest cluster, i.e., the cluster with the widest diameter.
3. Detect the data point in the cluster found in step 2 with the minimum average
similarity to the other elements in that cluster.
4. The data sample found in step 3 is the first element to be added to the fragment
group.
5. Detect the element in the original group which has the highest average
similarity with the fragment group.
6. If the average similarity of the element obtained in step 5 with the fragment group
is greater than its average similarity with the original group, then assign the
data sample to the fragment group and go to step 5; otherwise do nothing.
7. Repeat steps 2 to 6 until each data point is separated into an individual cluster.

Advantage

 It produces more accurate hierarchies than bottom-up algorithm in some


circumstances.

Disadvantages

 Top down approach is computationally more complex than bottom up


approach because we need a second flat clustering algorithm.
 Use of different distance metrics for measuring distance between clusters may
generate different results.

Let us study DBSCAN algorithm pertaining to the Density-based method in the next
section.

11.6 DENSITY BASED METHOD


The primary idea of density-based methods is that, for each point of a cluster, the
neighborhood of a given radius contains at least a minimum number of points,
i.e. the density in the neighborhood should reach some threshold. However, this idea alone
is based on the assumption that the clusters are of spherical or regular shapes.

11.6.1 DBSCAN (Density-Based Spatial Clustering of Applications with Noise)

DBSCAN was proposed to adopt density-reachability and density-connectivity for
handling arbitrarily shaped clusters and noise. In DBSCAN, a cluster is defined as a
group of data points that is highly dense. DBSCAN uses two parameters:

Eps: the maximum radius of the neighborhood around a point.

MinPts: the minimum number of data points that must lie within the Eps-neighborhood
of a point for that neighborhood to be considered dense.


 
 
The Eps-neighborhood of a point q is defined as NEps(q) = { p ∈ D | dist(p, q) ≤ Eps }.

In order to understand the Density Based Clustering let us follow few definitions:

 Core point: a point whose Eps-neighborhood contains at least MinPts points, where Eps
and MinPts are specified by the user; that is, it is surrounded by a dense neighborhood.
 Border point: a point that lies within the neighborhood of a core point but does not itself
have a dense neighborhood; multiple core points can share the same border point.
 Noise/Outlier: a point that does not belong to any cluster.
 Directly Density Reachable: a point p is directly density reachable from a point q with
respect to Eps and MinPts if p belongs to NEps(q) and q satisfies the core point condition,
i.e. |NEps(q)| ≥ MinPts.
 Density Reachable: a point p is said to be density reachable from a point q with respect
to Eps and MinPts if there is a chain of points p1, p2, ..., pn with p1 = q and pn = p such
that each pi+1 is directly density reachable from pi.

Algorithm

1. To form clusters, initially consider a random point, say point p.
2. The second step is to find all the points that are density reachable from point p with
respect to Eps and MinPts. The following condition is checked in order to form the
cluster:
a. If point p is found to be a core point, a cluster is obtained.
b. If point p is found to be a border point, then no points are density reachable
from point p, and hence we visit the next point of the database.
3. Continue this process until all the points have been processed.
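A minimal DBSCAN sketch using scikit-learn follows; the library is assumed to be installed,
and the synthetic data, eps and min_samples values are assumptions for the example.

```python
# A minimal DBSCAN sketch using scikit-learn; the synthetic data and the
# eps / min_samples values are illustrative assumptions only.
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
dense_blob = rng.normal(loc=0.0, scale=0.3, size=(100, 2))
scattered_noise = rng.uniform(low=-5, high=5, size=(10, 2))
data = np.vstack([dense_blob, scattered_noise])

# eps corresponds to Eps and min_samples to MinPts in the text above
model = DBSCAN(eps=0.5, min_samples=5).fit(data)

# Points labelled -1 are noise/outliers; the other labels are cluster ids
print(set(model.labels_))
```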

Advantages

 It can identify outliers.


 It does not require number of clusters to be specified in advance.

Disadvantages

 If the density of the data keeps changing, it is difficult to find clusters
efficiently.
 It is not well suited to high-dimensional data, and the user has to specify the
parameters in advance.

11.7 LIMITATIONS WITH CLUSTER ANALYSIS

There are two major drawbacks that influence the feasibility of cluster analysis in real
world applications in data mining.

 The first issue is the inefficiency of existing automated clustering algorithms in
dealing with arbitrarily shaped data distributions in the datasets.
 The second issue is that the evaluation of the quality of clustering results by
statistics-based methods is time consuming when the database is large,
primarily due to the very high computational cost of statistics-based
methods for assessing the consistency of the cluster structure between
sampling subsets. The implementation of statistics-based cluster validation
methods does not scale well to very large datasets. On the other hand,
arbitrarily shaped clusters also make the traditional statistical cluster validity
indices ineffective, which makes it difficult to determine the optimal cluster
structure.

In addition, the inefficiency of clustering algorithms in handling arbitrarily shaped
clusters in extremely large datasets directly impacts the effectiveness of cluster validation,
because cluster validation is based on the analysis of clustering results produced by
clustering algorithms. Moreover, most of the existing clustering algorithms tend to
carry out the entire clustering process automatically, i.e., once the user sets the
parameters of the algorithm, the clustering result is produced with no interruption, which
excludes the user until the end. As a result, it is very hard to incorporate user domain
knowledge into the clustering process. Cluster analysis is a multiple-run, iterative
process; without any user domain knowledge, it would be inefficient and unintuitive
to satisfy the specific requirements of application tasks in clustering.

11.8 OUTLIER ANALYSIS


In statistics, an outlier is a data point that deviates significantly from the norm.
Measurement and execution errors are two possible causes. Outlier analysis, also known
as outlier mining, is the process of examining data that stands out for whatever reason.
It is impossible to do data analysis without encountering outliers. An outlier is
important since it may indicate an experimental flaw. There are numerous applications
for outliers, such as identifying fraud and spotting new market trends. Outliers are
frequently mistaken for random fluctuations in the data. Outliers, on the other hand, are
distinct from random noise in the following ways:
 An outlier is a point of observation that differs from the rest because of its
location.
 To better detect outliers, noise should be eliminated.

11.8.1 Outliers in Data Mining

For the most part, data mining algorithms ignore outliers like noise and exceptions.
However, in certain applications like fraud detection, uncommon occurrences can be
just as fascinating as the more common ones, therefore performing an outlier analysis
becomes critical.

Outlier analysis in Data Mining has a variety of uses. Following are a few of these:

 Identifying financial fraud such as credit card hacking or other similar


scams makes use of this technology.
 It’s utilized to keep track of a customer's changing purchase habits.
 It’s used to find and report human-made mistakes in typing.
 It’s utilized for troubleshooting and identifying problems with machines
and systems.

11.8.2 Handling of Outliers in Data Mining

Data Mining must deal with outliers for a variety of reasons. These are a few of the
explanations:
 The outcomes of databases are impacted by outliers.
 Outliers frequently produce good or useful discoveries and conclusions,
allowing researchers to identify various patterns or trends.

 Even in the world of study, outliers can be useful. They can be a lifesaver
when doing research.
 Data mining's most important subfield is outlier analysis.

11.8.3 Outlier Detection

Outliers are generally defined as observations that are exceptionally far from the
mainstream of the data. There is no strict mathematical definition of what constitutes an
outlier; determining whether an observation is an outlier is ultimately a subjective
exercise. An outlier can be interpreted as data or an observation that deviates greatly
from the mean of a given population or set of data. An outlier may occur by chance,
but it may also indicate a measurement error, or the given set of data may have a
heavy-tailed distribution.

Therefore, outlier detection can be defined as the process of detecting and then
excluding outliers from a given set of data. There are no standardized outlier
identification methods because these are mostly dataset-dependent. Outlier detection
as a branch of data processing has many applications in data stream analysis.

11.8.4 Outlier Detection Techniques

To identify outliers in the database, it is important to keep the context in mind and
find the answer to the most basic and relevant question: "Why should I find outliers?"
The context will explain the meaning of your findings.

Remember two important questions about your dataset during outlier identification:

(i) Which and how many features do I consider for outlier detection?
(univariate / multivariate)
(ii) Can I assume a distribution (or distributions) of values for the features I have
selected? (parametric / non-parametric)

There are four Outlier Detection techniques in general.

11.8.4.1 Numeric Outlier

A numeric outlier is the simplest, non-parametric outlier detection technique in a one-
dimensional feature space. Outliers are calculated using the IQR (InterQuartile Range):
the first and third quartiles (Q1, Q3) are calculated, and an outlier is a data point xi
that falls outside the range

Q1 − k(Q3 − Q1) ≤ xi ≤ Q3 + k(Q3 − Q1).

Using the interquartile multiplier value k = 1.5, these limits are the typical upper and
lower whiskers of a box plot.

This technique can be easily implemented on the KNIME Analytics platform using
the Numeric Outliers node.
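For illustration, a minimal NumPy sketch of the same IQR rule follows; the sample values and
the k = 1.5 multiplier are assumptions made for the example.

```python
# A minimal sketch of IQR-based numeric outlier detection with NumPy; the sample
# values and the k = 1.5 multiplier follow the rule of thumb described above.
import numpy as np

values = np.array([10, 12, 11, 13, 12, 11, 95, 10, 12, -40])

q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1
k = 1.5
lower, upper = q1 - k * iqr, q3 + k * iqr

# Any value outside the whisker limits is flagged as an outlier
outliers = values[(values < lower) | (values > upper)]
print(outliers)  # expected to flag 95 and -40
```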

11.8.4.2 Z-Score

The Z-score technique assumes a Gaussian distribution of the data. Outliers are the data
points that lie in the tails of the distribution and are therefore far from the mean.

The z-score of any data point x can be calculated by the following expression, after
making the appropriate transformation to the selected feature of the dataset:

z = (x − μ) / σ,

where μ is the mean and σ is the standard deviation of the feature.

When calculating the z-score for each sample, a threshold must be specified for the data set.
Some good 'rule of thumb' thresholds are 2.5, 3, 3.5 or more standard deviations.
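A minimal z-score check in NumPy is sketched below; the sample values and the choice of a
2.5-standard-deviation threshold (one of the rule-of-thumb limits above) are assumptions for
the example.

```python
# A minimal z-score outlier check with NumPy; the sample values and the 2.5
# standard-deviation threshold are illustrative assumptions.
import numpy as np

values = np.array([10.0, 11.0, 12.0, 10.5, 11.5, 30.0, 11.2, 10.8])

mu, sigma = values.mean(), values.std()
z_scores = (values - mu) / sigma

threshold = 2.5
print(values[np.abs(z_scores) > threshold])  # flags 30.0 on this sample
```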
11.8.4.3 DBSCAN

This outlier detection technique is based on the DBSCAN clustering method described
earlier. DBSCAN is a non-parametric, density-based outlier detection method. Here, all data
points are classified as core points, border points, or noise points, and the noise points are
treated as outliers.

11.8.4.4 Isolation Forest

This non-parametric method is suitable for large datasets with one- or multi-dimensional
features. The isolation number is central to this outlier detection technique: it is
the number of splits required to isolate a data point, and outliers typically need fewer
splits (a smaller isolation number) than normal points.
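As a minimal sketch, the snippet below uses scikit-learn's IsolationForest; the library is
assumed to be installed, and the synthetic data and the contamination value are assumptions
for the example.

```python
# A minimal sketch using scikit-learn's IsolationForest; the synthetic data and the
# contamination value are illustrative assumptions.
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
normal_points = rng.normal(loc=0.0, scale=1.0, size=(200, 2))
anomalies = rng.uniform(low=6, high=8, size=(5, 2))
data = np.vstack([normal_points, anomalies])

# predict() returns -1 for points judged to be outliers and +1 for inliers
forest = IsolationForest(contamination=0.05, random_state=0).fit(data)
print(np.where(forest.predict(data) == -1)[0])
```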

11.8.5 Models for Outlier Detection Analysis

There are many approaches to detecting abnormalities. Outlier detection models can
be classified into the following groups:

11.8.5.1 Extreme Value Analysis

Extreme value analysis is the most basic form of outlier detection and is suitable for
one-dimensional data. In this approach, the largest or smallest values are
considered outliers. The Z-test and the Student's t-test are excellent examples.
These are good heuristics for an initial analysis of the data but they are not of much value
in multivariate settings. Extreme value analysis is often used as a final step when
interpreting the outputs of other outlier detection methods.

11.8.5.2 Linear Models

In this approach, the data is modelled onto a lower-dimensional subspace using
linear correlations. The distance of each data point from the plane that fits this
subspace is calculated, and this distance is used to detect outliers. PCA (Principal
Component Analysis) is an example of a linear model for anomaly detection.

11.8.5.3 Probabilistic and Statistical Models

In this approach, probabilistic and statistical models assume specific distributions of the
data. Expectation-Maximization (EM) methods are used to estimate the parameters of
the model. Finally, the membership probability of each data point under the fitted
distribution is calculated, and the points with the lowest membership probability are
marked as outliers.

11.8.5.4 Proximity-based Models

In this model, the outliers are modelled as points that are isolated from the other
observations. Cluster analysis, density-based analysis, and nearest-neighbour analysis are
the key approaches of this type.

11.8.5.5 Information-theoretical models

In this model, outliers are detected as data points that increase the minimum code length required to describe the data set.

11.8.6 Uses for Detecting Outliers in Data Mining

In Data Mining, it is common to utilize Outlier Detection to find anomalies. In data


mining, it's utilized to find patterns or trends. The following are some examples of
where outlier detection is put to use in data mining:

 Fraud Detection
 Telecom Fraud Detection
 Intrusion Detection in Cyber Security
 Medical Analysis
 Environment Monitoring such as Cyclone, Tsunami, Floods, Drought and
so on
 Noticing unforeseen entries in Databases

Check Your Progress 1:

1. Describe the uses of cluster analysis in data mining.


……………………………………………………………………………………
……………………………………………………………………………………
……………………………………………………………………………………
2. Differentiate between Various Clustering Methods along with their description,
advantages, disadvantages and algorithms available.
……………………………………………………………………………………
……………………………………………………………………………………
…………………………………………………………………………………...
3. Briefly discuss Outlier and Outlier Detection.
……………………………………………………………………………………
……………………………………………………………………………………
……………………………………………………………………………………

11.9 SUMMARY
In this unit, we had studied the introductory topics of clustering, clustering methods,
algorithms associated with clustering and outlier analysis.

In clustering, a group of comparable data objects is classed as a cluster. Cluster
analysis divides data into groupings based on how closely the objects resemble one
another. A label is assigned to each group once the data has been classified into several
groups; this classification aids in adapting to shifts in the data.

Cluster Analysis refers to the process of identifying groups of items that are similar
to one another in some respects but differ from items in other groups. A cluster is a grouping
of comparable items that belong to the same category: a set of items in which the
distance between any two objects in the cluster is less than the distance between any
object in the cluster and any object not placed inside the cluster. It can also be seen as a
high-density region relative to other regions in multidimensional space. Cluster analysis has
wide applications in data mining, information retrieval, biology, medicine, marketing, and
image segmentation.

With the help of clustering algorithms, a user is able to understand natural clusters or
structures underlying a data set. For example, clustering can help marketers discover
distinct groups and characterize customer groups based on purchasing patterns in
business. In biology, it can be used to derive plant and animal taxonomies, categorize
genes with similar functionality, and gain insight into structures inherent in
populations.

Outlier detection from a collection of datasets is a well-known data mining process.
Outliers help in detection of unusual patterns and behaviors of different data points
which can give a useful result for the research.

More on Clustering can be studied in your third-semester course MCS-224 Artificial
Intelligence and Machine Learning.

In the next unit, we will discuss Text and Web Mining.

11.10 SOLUTIONS/ANSWERS

Check Your Progress 1:

1.
Data clustering analysis has a wide range of applications, including image processing,
data analysis, pattern identification, and market research. With the use of Data
clustering, firms can find new client groupings in their database. Buying patterns can
also be used to classify data.

Data mining clustering aids in the classification of animals and plants in biology by
utilizing comparable functions or genes. Finding out more about how species are
structured becomes easier with this information. Data mining clustering also identifies
geographic areas: there are regions in an earth observation database that are similar
to each other, and within a city a particular type of dwelling can be grouped into specific
neighbourhoods. Information can be discovered more easily by classifying online
documents using clustering in data mining. Additionally, clustering is employed in fraud
detection software: if a credit card is being used fraudulently, the pattern of deceit can
be identified using clustering in data mining.

2.

Hierarchical Clustering
  Description: based on a top-to-bottom hierarchy of the data points to create clusters.
  Advantages: easy to implement; the number of clusters need not be specified apriori; dendrograms are easy to interpret.
  Disadvantages: cluster assignment is strict and cannot be undone; high time complexity; cannot work for a larger dataset.
  Algorithms: DIANA, AGNES, hclust, etc.

Partitioning Methods
  Description: based on centroids; data points are assigned to a cluster based on their proximity to the cluster centroid.
  Advantages: easy to implement; faster processing; can work on larger data; easy to interpret the outputs.
  Disadvantages: the number of centroids must be specified apriori; the clusters that get created are of inconsistent sizes and densities; affected by noise and outliers.
  Algorithms: k-means, k-medians, k-modes.

Distribution-based Clustering
  Description: based on the probability distribution of the data; clusters are derived from various metrics like mean, variance, etc.
  Advantages: the number of clusters need not be specified apriori; works on real-time data; the metrics are easy to understand and tune.
  Disadvantages: complex algorithm and slow; cannot be scaled to larger data.
  Algorithms: Gaussian Mixture Models, DBCLASD.

Density-based Clustering (Model-based Methods)
  Description: based on the density of the data points; also known as model-based clustering.
  Advantages: can handle noise and outliers; need not specify the number of clusters at the start; the clusters that are created are highly homogeneous; no restrictions on cluster shapes.
  Disadvantages: complex algorithm and slow; cannot be scaled to larger data.
  Algorithms: DENCAST, DBSCAN.

Constraint-based (Supervised Clustering)
  Description: clustering is directed and controlled by user constraints.
  Advantages: creates a perfect decision boundary; can automatically determine the outcome classes based on constraints; future data can be classified based on the training boundaries.
  Disadvantages: overfitting; high level of misclassification errors; cannot be trained on larger datasets.
  Algorithms: Decision Trees, Random Forest, Gradient Boosting.

3.
An outlier is a data object that deviates significantly from normal objects, as if it were
generated by a different mechanism. An outlier is different from noise, since noise is a
random error or measurement variance and should be removed before outlier detection.
Outlier detection aims to find patterns in data that do not conform to expected
behavior.

Outlier detection is one of the important aspects of data mining to find out those
objects that differ from the behavior of other objects. Finding outliers from a
collection of patterns is a popular problem in the field of data mining. A key challenge
with outlier detection is that it is not a well expressed problem like clustering so
Outlier Detection as a branch of data mining requires more attention. Outlier
Detection methods can identify errors and remove their contaminating effect on the
data set and as such to purify the data for processing. Outlier detection is extensively
used in a wide variety of applications such as military surveillance for enemy
activities to prevent attacks, intrusion detection in cyber security, fraud detection for
credit cards, insurance or health care and fault detection in safety critical systems and
in various kind of images. It is important in analyzing the data due to the fact that they
can translate into actionable information in a wide variety of applications. An irregular
traffic pattern occurrence in a computer network could mean that a hacked computer
is sending out sensitive data to an unauthorized destination.

11.11 FURTHER READINGS

1. Data Mining: Concepts and Techniques, 3rd Edition, Jiawei Han, Micheline
Kamber, Jian Pei, Elsevier, 2012.
2. Data Mining, Charu C. Aggarwal, Springer, 2015.
3. Data Mining and Data Warehousing – Principles and Practical Techniques,
Parteek Bhatia, Cambridge University Press, 2019.
4. Introduction to Data Mining, Pang Ning Tan, Michael Steinbach, Anuj
Karpatne, Vipin Kumar, Pearson, 2018.
5. Data Mining Techniques and Applications: An Introduction, Hongbo Du,
Cengage Learning, 2013.
6. Data Mining : Vikram Pudi and P. Radha Krishna, Oxford, 2009.
7. Data Mining and Analysis – Fundamental Concepts and Algorithms;
Mohammed J. Zaki, Wagner Meira, Jr, Oxford, 2014.

UNIT 10 CLASSIFICATION
10.0 Introduction
10.1 Objectives
10.2 Classification – An Overview
10.3 k-NN Algorithm
10.4 Decision Tree Classifier
10.5 Bayesian Classification
10.6 Support Vector Machines
10.7 Rule Based Classification
10.8 Model Evaluation and Selection
10.9 Summary
10.10 Solutions/Answers
10.11 Further Readings

10.0 INTRODUCTION

In the earlier unit, we studied mining frequent patterns and associations, covering
topics like market basket analysis, classification of frequent pattern mining,
association rule mining, the Apriori algorithm, mining multilevel association rules, etc. In
this unit, we will focus on an important topic of Data Mining called Classification.

Knowledge discovery from datasets is a part of data mining. Data mining tools and
methods are applied to extract patterns and features from large amounts of data, which
can then be applied to other datasets. The classification technique, which can handle a
large range of data, is gaining prominence. It is one of the most commonly used
techniques when it comes to classifying large sets of data. This method of data
analysis includes algorithms adapted to the data quality. The algorithm that performs
the classification is the classifier, while the observations are the instances.

For example, companies use this approach to learn about the behavior and preferences
of their customers. With classification, you can distinguish between data that is useful
to your goal and data that is not relevant. Another example of this would be your own
email service, which can identify spam and important messages.

In this unit you are going to study the concept of classification in data mining and
various classifiers.

10.1 OBJECTIVES

After completing this unit, you will be able to:

 Define Classification in Data Mining


 Understand the approach to data classification
 List various types of Classification Techniques
 Understand and demonstrate various Classification Techniques
 Discuss the advantages, disadvantages and applications of various
Classification Techniques
 Discuss various approaches for Model Evaluation and Model Selection.


 
 

10.2 CLASSIFICATION – AN OVERVIEW

Classification is a data mining process that assigns items in a collection to target
categories or classes. The objective of classification is to accurately predict the target
class for each record in the data. For example, a classification model could be used to
identify loan applicants as low, medium, or high credit risks. A classification model that
predicts credit risk could be developed based on observed data for many loan
applicants over a period of time. In addition to the historical credit rating, the data
might track employment history, home ownership, years of residence, number and type
of investments, and so on. Credit rating would be the target, the other attributes are
the predictors, and the data for each customer constitute a case. Classifications are
discrete and do not imply order. A predictive model with a numerical target uses a
regression algorithm, not a classification algorithm. The simplest type of classification
problem is binary classification, where the target attribute has only two values, YES
or NO. Multiclass targets have more than two values: for example, low, medium,
high, or unknown credit rating.

For example, spam detection in email service providers can be identified as a
classification problem. This is a binary classification since there are only two classes,
spam and not spam. A classifier utilizes some training data to understand how given
input variables relate to the class. In this case, known spam and non-spam emails have
to be used as the training data. When the classifier is trained accurately, it can be used
to detect an unknown email.

10.2.1 General Approach to Classification

Data classification is a two-step process, consisting of:

(i) a learning step (where a classification model is constructed) and


(ii) a classification step (where the model is used to predict class labels for
given data).

In the first step, a classifier is built describing a predetermined set of data classes or
concepts. This is the learning step (or training phase), where a classification algorithm
builds the classifier by analyzing or “learning from” a training set made up of
database tuples and their associated class labels. A tuple, X, is represented by an
n-dimensional attribute vector, X = (x1, x2, ..., xn), depicting n measurements made
on the tuple from n database attributes, respectively A1, A2, ..., An. Each tuple,
X, is assumed to belong to a predefined class as determined by another database
attribute called the class label attribute. The class label attribute is discrete-valued and
unordered. It is categorical (or nominal) in that each value serves as a category or
class. The individual tuples making up the training set are referred to as training
tuples. Because the class label of each training tuple is provided, this step is also
known as supervised learning (i.e., the learning of the classifier is “supervised” in that
it is told to which class each training tuple belongs). It contrasts with unsupervised
learning (or clustering), in which the class label of each training tuple is not known,
and the number or set of classes to be learned may not be known in advance.

In the second step, the model is used for classification. First, the predictive accuracy
of the classifier is estimated. If we were to use the training set to measure the
classifier’s accuracy, this estimate would likely be optimistic, because the classifier
tends to overfit the data (i.e., during learning it may incorporate some particular
anomalies of the training data that are not present in the general data set overall).
Therefore, a test set is used, made up of test tuples and their associated class labels.
They are independent of the training tuples, meaning that they were not used to
construct the classifier. The accuracy of a classifier on a given test set is the
percentage of test set tuples that are correctly classified by the classifier.
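The two-step process can be sketched as follows using scikit-learn; the library, the Iris
dataset, the decision-tree classifier and the 30% test split are assumptions made for the
example, not part of the course text.

```python
# A minimal sketch of the two-step classification process: a learning step on training
# tuples, then accuracy estimation on an independent test set. Dataset and classifier
# choices here are illustrative assumptions.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)

# Hold out test tuples that are independent of the training tuples
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Learning step: build the classifier from the training tuples and their class labels
model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# Classification step: accuracy = percentage of test tuples classified correctly
print(accuracy_score(y_test, model.predict(X_test)))
```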

10.2.2 Applications of Classification Models

The classification in Data Mining has many applications in day-to-day life. A few
Classification Applications in Data Mining are:

 Product Cart Analysis on the eCommerce platform uses the classification


technique to associate the items into groups and create combinations of
products to recommend. This is a very common Classification Application in
Data Mining.
 The weather patterns can be predicted and classified based on parameters such
as temperature, humidity, wind direction, and many more. These
Classification Applications of Data Mining are used in daily life.
 The public health sector classifies the diseases based on the parameters like
spread rate, severity, and a lot more. This helps in charting out strategies to
mitigate diseases. These Classification Applications of Data Mining help in
finding cures.
 Financial institutions use classification to determine defaulters and to help in
identifying loan seekers and other categories. These Classification
Applications in Data Mining make finding the target audience much easier.
 Spam detection in e-mails based on the header and content of the document.
 Classification of students according to their qualifications.
 Patients are classified according to their medical history.
 Classification can be used for the approval of credit.
 Facial key points detection.
 Drugs classification.
 Pedestrian detection in automated car driving.
 Cancer tumor cells identification.
 Sentiment Analysis.

Let us see the definitions for some of the terminologies encountered in classification:

 Classifier: An algorithm that maps the input data to a specific category.


 Classification Model: A classification model tries to draw some conclusion
from the input values given for training. It will predict the class
labels/categories for the new data.
 Feature: A feature is an individual measurable property of a phenomenon
being observed.
 Binary Classification: Classification task with two possible outcomes.
Example: Gender classification (Male / Female)
 Multi-class classification: Classification with more than two classes. In multi
class classification each sample is assigned to one and only one target label.
Example: An animal can be cat or dog but not both at the same time
 Multi-label Classification: Classification task where each sample is mapped
to a set of target labels (more than one class).
Example: A news article can be about sports, a person, and location at the
same time.

10.2.3 Classification Algorithms in Data Mining

Classification is the operation of separating various entities into several classes. These
classes can be defined by business rules, class boundaries, or some mathematical
function. The classification operation may be based on a relationship between a
known class assignment and characteristics of the entity to be classified. This type of
classification is called supervised. If no known examples of a class are available, the
classification is unsupervised. The most common unsupervised classification approach
is clustering, which we will be studying in the next Unit.

Classification algorithm finds relationships between the values of the predictors and
the values of the target. Different Classification algorithms use different techniques
for finding relationships. These relationships are summarized in a model, which can
then be applied to a different dataset in which the class assignments are unknown.
Classification models are tested by comparing the predicted values to known target
values in a set of test data. Classification belongs to the category of supervised
learning, where the targets are also provided along with the input data.

Classification is a highly popular aspect of data mining. As a result, data mining has
many classifiers/classification algorithms:

 Logistic regression
 Linear regression
 K-Nearest Neighbours Algorithm (kNN)
 Decision trees
 Rule-based Classification
 Bayesian Classification
 Random Forest
 Naive Bayes
 Support Vector Machines

We will study the details of some to the popular Classifiers in the following sections.
To start with, we will study k-NN classification technique in the following section.

10.3 k-NN ALGORITHM

The k-Nearest Neighbors (k-NN) algorithm is a data classification method for estimating
the likelihood that a data point will become a member of one group or another based on
what group the data points nearest to it belong to.

The k-nearest neighbor algorithm is a type of supervised machine learning algorithm
used to solve classification and regression problems. However, it's mainly used for
classification problems. k-NN is a lazy learning and non-parametric algorithm.

It's called a lazy learning algorithm or lazy learner because it doesn't perform any
training when you supply the training data. Instead, it just stores the data during the
training time and doesn't perform any calculations. It doesn't build a model until a
query is performed on the dataset.

It's considered a non-parametric method because it doesn't make any assumptions
about the underlying data distribution. Simply put, k-NN tries to determine what
group a data point belongs to by looking at the data points around it.

Consider there are two groups, A and B. To determine whether a data point is in group
A or group B, the algorithm looks at the states of the data points near it. If the
majority of data points are in group A, it's very likely that the data point in question is
in group A and vice versa.

In short, k-NN involves classifying a data point by looking at the nearest annotated
data point, also known as the nearest neighbor.


 
 
Don't confuse k-NN classification with K-means clustering. K-NN is a supervised
classification algorithm that classifies new data points based on the nearest data
points. On the other hand, K-means clustering is an unsupervised clustering algorithm
that groups data into a K number of clusters which you will be learning in the next
Unit.

10.3.1 Working of k-NN

As mentioned above, the k-NN algorithm is predominantly used as a classifier. Let's
take a look at how k-NN works to classify unseen input data points.

k-NN classification is easy to understand and simple to implement. It's ideal in
situations where the data points are well defined or non-linear. In essence, k-NN
performs a voting mechanism to determine the class of an unseen observation. This
means that the class with the majority vote will become the class of the data point in
question. If the value of K is equal to one, then we'll use only the nearest neighbor to
determine the class of a data point. If the value of K is equal to ten, then we'll use the
ten nearest neighbors, and so on.

To put that into perspective, consider an unclassified data point X. There are several
data points with known categories, A and B, in a scatter plot. Suppose the data point X
is placed near group A. As you know, we classify a data point by looking at the nearest
annotated points. If the value of K is equal to one, then we'll use only one nearest
neighbor to determine the group of the data point. In this case, the data point X
belongs to group A as its nearest neighbor is in the same group. If group A has more
than ten data points and the value of K is equal to 10, then the data point X will still
belong to group A as all its nearest neighbors are in the same group.

Suppose another unclassified data point Y is placed between group A and group B. If
K is equal to 10, we pick the group that gets the most votes, meaning that we classify
Y to the group in which it has the most number of neighbors. For example, if Y has
seven neighbors in group B and three neighbors in group A, it belongs to group B.
The fact that the classifier assigns the category with the highest number of votes is
true regardless of the number of categories present.

You might be wondering how the distance metric is calculated to determine whether a
data point is a neighbor or not. There are four ways to calculate the distance measure
between the data point and its nearest neighbor: Euclidean distance, Manhattan
distance, Hamming distance, and Minkowski distance. Of these, Euclidean distance is
the most commonly used distance function or metric.

10.3.2 K-Nearest Neighbors Algorithm

The following is the pseudocode for KNN:


1. Load the data
2. Choose K value
3. For each data point in the data:
o Find the Euclidean distance to all training data samples
o Store the distances on an ordered list and sort it
o Choose the top K entries from the sorted list
o Label the test point based on the majority of classes present in the
selected points
4. End
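
A minimal Python sketch of this pseudocode is given below (the toy arrays, labels and the value of k are illustrative assumptions, not data from this unit):

import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_query, k=3):
    # Step 3a: Euclidean distance from the query point to every training sample
    distances = np.linalg.norm(X_train - x_query, axis=1)
    # Steps 3b and 3c: sort the distances and keep the indices of the k nearest samples
    nearest = np.argsort(distances)[:k]
    # Step 3d: label the query point by majority vote among the k nearest neighbours
    return Counter(y_train[nearest]).most_common(1)[0][0]

# Illustrative data: two features per sample, class labels 'A' and 'B'
X_train = np.array([[1.0, 1.1], [1.2, 0.9], [5.0, 5.2], [5.1, 4.8]])
y_train = np.array(['A', 'A', 'B', 'B'])
print(knn_predict(X_train, y_train, np.array([1.1, 1.0]), k=3))  # expected: 'A'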

To validate the accuracy of the k-NN classification, a confusion matrix is used. Other
statistical methods such as the likelihood-ratio test are also used for validation.


 
 
Here are some of the areas where the k-Nearest Neighbor algorithm can be used:

 Credit rating: The k-NN algorithm helps determine an individual's credit rating by comparing them with others who have similar characteristics.
 Loan approval: Similar to credit rating, the k-nearest neighbor algorithm is
beneficial in identifying individuals who are more likely to default on loans
by comparing their traits with similar individuals.
 Data Preprocessing: Datasets can have many missing values. The k-NN
algorithm is used for a process called missing data imputation that estimates
the missing values.
 Pattern Recognition: The ability of the k-NN algorithm to identify patterns
creates a wide range of applications. For example, it helps detect patterns in
credit card usage and spot unusual patterns. Pattern detection is also useful in
identifying patterns in customer purchase behavior.
 Stock Price Prediction: Since the k-NN algorithm has a flair for predicting
the values of unknown entities, it's useful in predicting the future value of
stocks based on historical data.
 Recommendation Systems: Since k-NN can help find users of similar
characteristics, it can be used in recommendation systems. For example, it can
be used in an online video streaming platform to suggest content a user is
more likely to watch by analyzing what similar users watch.
 Computer Vision: The k-NN algorithm is used for image classification.
Since it’s capable of grouping similar data points, for example, grouping cats
together and dogs in a different class, it’s useful in several computer
vision applications.

10.3.3 Advantages and Disadvantages of KNN

Some of the advantages of using the k-Nearest Neighbors algorithm are:

 It's easy to understand and simple to implement


 It can be used for both classification and regression problems
 It's ideal for non-linear data since there's no assumption about underlying data
 It can naturally handle multi-class cases
 It can perform well with enough representative data.

The disadvantages of using the k-Nearest Neighbors algorithm:

 Associated computation cost is high as it stores all the training data


 Requires high memory storage
 Need to determine the value of K
 Prediction is slow when the number of training samples is large
 Sensitive to irrelevant features

We will study Decision Tree classification technique in the next section.

10.4 DECISION TREE CLASSIFIER

Decision tree classifier is one of the most effective and commonly used prediction and
classification methods. It is a simple and widely used classification technique. It
applies a straightforward idea to solve the classification problem. The Decision Tree
Classifier poses a series of carefully crafted questions about the attributes of the test
record. Each time it receives an answer, a follow-up question is asked until a
conclusion about the class label of the record is reached.


 
 
A decision tree is a flowchart-like tree structure, where each internal node denotes a
test on an attribute, each branch represents an outcome of the test, and leaf nodes
represent classes or class distributions. The top most node in a tree is the root node.
Normally, internal nodes are denoted by rectangles and leaf nodes are denoted by
ovals. A typical decision tree is shown in Figure 1 given below:

Figure 1: Decision Tree for Play Tennis


In order to classify an unknown sample, the attribute values of the sample are tested
against the decision tree. A path is traced from the root to a leaf node that holds the
class prediction for that sample. Decision trees can easily be converted to
classification rules. A basic decision tree algorithm is summarized in the next section.

10.4.1 Basic Algorithm for Inducing a Decision Tree from Training Samples

Below given is the basic decision tree algorithm for learning decision trees.

Algorithm: Generate_decision_tree. Generate a decision tree from the given training data.

Input: The training samples, samples, represented by discrete-valued attributes; the set of candidate attributes, attribute-list.

Output: A decision tree.

(1) create a node N;
(2) If samples are all of the same class, C, then
(3) return N as a leaf node labeled with the class C;
(4) If attribute-list is empty then
(5) return N as a leaf node labeled with the most common class in samples;
//majority voting
(6) select test-attribute, the attribute among attribute-list with the highest
information gain;
(7) label node N with test-attribute;
(8) for each known value ai of test-attribute //partition the samples
(9) grow a branch from node N for the condition test-attribute = ai;
(10) let si be the set of samples in samples for which test-attribute = ai;
//a partition
(11) If si is empty then
(12) attach a leaf labeled with the most common class in samples;
(13) else attach the node returned by Generate_decision_tree(si,attribute-list -
test-attribute);


 
 
The algorithm summarized above is a version of ID3, a well known decision tree
induction algorithm. The basic strategy is as follows:

 The tree starts as a single node representing the training samples (step 1).
 If the samples are all of the same class, then the node becomes a leaf and is
labeled with that class (steps 2 and 3).
 Otherwise, the algorithm uses an entropy based measure known as
information gain as a heuristic for selecting the attribute that will best
separate the samples into individual classes (step 6). This attribute becomes
the “test” or “decision” attribute at the node (step 7). In this version of the
algorithm, all attributes are categorical, that is, discrete-valued. Continuous-
valued attributes must be discretized.
 A branch is created for each known value of the test attributes, and the
samples are partitioned accordingly (steps 8-10).
 The algorithm uses the same process recursively to form a decision tree for
the samples at each partition. Once an attribute has occurred at a node, it need
not be considered in any of the node’s descendents (step 13).

The recursive partitioning stops only when any one of the following conditions is
TRUE:
 All samples for a given node belong to the same class (steps 2 and 3), or
 There are no remaining attributes on which the samples may be further
partitioned (step 4). In this case, majority voting is employed (step 5). This
involves converting the given node into leaf and labeling it with the class in
majority among samples. Alternatively, the class distribution of the node
samples may be stored.
 There are no samples for the branch test-attribute = ai (step 11). In this case, a
leaf is created with the majority class in samples (step 12).
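
As a hedged illustration of the same idea, the sketch below fits scikit-learn's DecisionTreeClassifier with the entropy criterion (assuming scikit-learn is installed). The tiny hand-encoded weather-style dataset and the feature names are invented for demonstration only:

from sklearn.tree import DecisionTreeClassifier, export_text

# Illustrative, hand-encoded data: outlook (0=sunny, 1=overcast, 2=rain), windy (0=false, 1=true)
X = [[0, 0], [0, 1], [1, 0], [2, 0], [2, 1], [1, 1]]
y = ['no', 'no', 'yes', 'yes', 'no', 'yes']

# criterion='entropy' makes the splits follow the information-gain heuristic
tree = DecisionTreeClassifier(criterion='entropy', random_state=0)
tree.fit(X, y)

print(export_text(tree, feature_names=['outlook', 'windy']))  # textual view of the induced tree
print(tree.predict([[1, 0]]))  # overcast and not windy

Note that scikit-learn grows binary trees over numeric features, so categorical attributes have to be encoded before training.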

10.4.2 Attribute Selection Measure

An attribute selection measure is a heuristic for selecting the splitting criterion that
“best” separates a given data partition, D, of class-labeled training tuples into
individual classes. If we were to split D into smaller partitions according to the
outcomes of the splitting criterion, ideally each partition would be pure (i.e., all the
tuples that fall into a given partition would belong to the same class). The attribute
selection measure provides a ranking for each attribute describing the given training
tuples. The attribute having the best score for the measure is chosen as the splitting
attribute for the given tuples. If the splitting attribute is continuous-valued or if we are
restricted to binary trees, then, respectively, either a split point or a splitting subset
must also be determined as part of the splitting criterion.

The three popular attribute selection measures are:

 Information gain
 Gain ratio
 Gini index

In fact, these three are closely related to each other. Information Gain, which is also
known as Mutual Information, is derived from Entropy, a notion that comes from
Information Theory. Gain Ratio is a refinement of Information Gain that was devised to
deal with its predecessor's major weakness, namely its bias towards attributes with many
distinct values. The Gini Index, on the other hand, was developed independently; its
initial intention was to assess the income dispersion of countries, but it was later
adapted to work as a heuristic for splitting optimization.


 
 
10.4.2.1 Entropy

Entropy, or Information Entropy, is a measurement of the uncertainty in data.
Entropy measures the diversity of the labels.

 Low Entropy indicates that the data labels are quite uniform.
Example: Suppose a dataset has 100 samples. Among those, there are 1
Positive and 99 Negative labeled data points. In this case, the Entropy is very
low. In an extreme case, suppose all the 100 samples are Positive, then the
Entropy is at its minimum or zero.

 High Entropy means the labels are highly mixed.
Example: A dataset with 45 Positive samples and 55 Negative samples has a
very high Entropy. The extreme case, when the Entropy is highest happens
when exactly half of the data belongs to each of the labels.

From another point of view, the Entropy measures how hard it is to guess the label of a
randomly drawn sample from the dataset. If most of the data have the same label, say
Positive, the Entropy is low, and we can bet with confidence that the label of a random
sample is Positive. On the flip side, if the Entropy is high, the probabilities of the
sample falling into the two classes are comparable, making it hard to make a guess.

The formula of Entropy is given by:

E(X) = − Σi p(Xi) log2 p(Xi)

where,
X = random variable or process
Xi = possible outcomes
p(Xi) = probability of possible outcomes.

Let’s have a dataset made up of three colors - red, purple, and yellow.

If we have one red, three purple, and four yellow observations in our set, our equation
becomes:
E = −(prlog2pr + pplog2pp + pylog2py)

where pr, pp and py are the probabilities of choosing a red, purple and yellow example
respectively. We have pr=1/8 because only 1/8 of the dataset represents red. 3/8 of the
dataset is purple hence pp=3/8. Finally, py=4/8 since half the dataset is yellow. As
such, we can represent py as py=1/2.

Our equation now becomes:

E= − (1/8 log2 (1/8) + 3/8 log2 (3/8) + 4/8 log2(4/8))

Our entropy would be 1.41.

You might wonder, what happens when all observations belong to the same class? In
such a case, the entropy will always be zero.

E= − (1log21)

=0

 
 

Such a dataset has no impurity. This implies that such a dataset would not be useful
for learning. However, if we have a dataset with say, two classes, half made up of
yellow and the other half being purple, the entropy will be one.

E= − ((0.5log20.5) + (0.5log20.5))

=1

This kind of dataset is good for learning.
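
The arithmetic above can be checked with a short Python function (a sketch; the probability lists mirror the worked examples in this subsection):

import math

def entropy(probabilities):
    # E = -sum(p_i * log2(p_i)); zero probabilities contribute nothing
    return -sum(p * math.log2(p) for p in probabilities if p > 0)

print(round(entropy([1/8, 3/8, 4/8]), 2))  # red/purple/yellow example -> 1.41
print(round(entropy([0.5, 0.5]), 2))       # balanced two-class dataset -> 1.0
# A pure dataset, e.g. entropy([1.0]), evaluates to zero.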

10.4.2.2 Information Gain

The concept of entropy plays an important role in measuring information gain.
Information gain is based on information theory and is used for determining the best
features/attributes that render maximum information about a class. It follows the
concept of entropy, aiming at decreasing the level of entropy from the root node to the
leaf nodes.

Information gain computes the difference between entropy before and after split and
specifies the impurity in class elements.

It can help us determine the quality of splitting, as we shall soon see. The calculation
of information gain should help us understand this concept better.

Gain = Eparent − Echildren

The term Gain represents information gain, Eparent is the entropy of the parent node,
and Echildren is the weighted average entropy of the child nodes. Let's use an example to
visualize information gain and its calculation.

Suppose we have a dataset with two classes. This dataset has 5 purple and 5 yellow
examples. The initial value of entropy will be given by the equation below. Since the
dataset is balanced, we expect the answer to be 1.

Einitial=−((0.5log20.5)+(0.5log20.5))

=1
Say we split the dataset into two branches. One branch ends up having four values
while the other has six. The left branch has four purples while the right one has five
yellows and one purple.

We mentioned that when all the observations belong to the same class, the entropy is
zero since the dataset is pure. As such, the entropy of the left branch Eleft=0. On the
other hand, the right branch has five yellows and one purple.

Thus:

Eright = − (5/6log2(5/6)+1/6log2(1/6))

A perfect split would have five examples on each branch. This is clearly not a perfect
split, but we can determine how good the split is. We know the entropy of each of the
two branches. We weight the entropy of each branch by the number of elements each
contains.

This helps us calculate the quality of the split. The one on the left has 4, while the
other has 6 out of a total of 10. Therefore, the weighting goes as shown below:
Esplit = 0.6∗0.65+0.4∗0

=0.39
The entropy before the split, which we referred to as initial entropy Einitial = 1. After
splitting, the current value is 0.39. We can now get our information gain, which is the
entropy we “lost” after splitting.

Gain=1–0.39
=0.61

The more the entropy removed, the greater the information gain. The higher the
information gain, the better the split.
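
The split above can be verified numerically with a short sketch that reuses an entropy helper (the 4/6 branch sizes and class counts are taken from the worked example):

import math

def entropy(probabilities):
    return -sum(p * math.log2(p) for p in probabilities if p > 0)

e_parent = entropy([5/10, 5/10])   # 5 purple and 5 yellow -> 1.0
e_left = entropy([4/4])            # pure branch of 4 purples -> 0.0
e_right = entropy([5/6, 1/6])      # 5 yellows and 1 purple

# Weight each child's entropy by the fraction of samples it receives
e_children = (4/10) * e_left + (6/10) * e_right
gain = e_parent - e_children
print(round(e_right, 2), round(e_children, 2), round(gain, 2))  # 0.65 0.39 0.61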

10.4.2.3 Gain Ratio

Gain Ratio was proposed by John Ross Quinlan. Gain Ratio or Uncertainty
Coefficient is used to normalize the information gain of an attribute against how much
entropy that attribute has (this attribute entropy is also called the split information).
The formula of Gain Ratio is given by:

Gain Ratio = Information Gain / Split Information

From the above formula, it can be stated that if the split information is very small,
then the gain ratio will be high, and vice versa.

To select the splitting attribute, Quinlan proposed the following procedure:

1. First, determine the information gain of all the attributes, and then compute
the average information gain.
2. Second, calculate the gain ratio of all the attributes whose information gain is
larger than or equal to the computed average information gain, and then pick the
attribute with the highest gain ratio to split on.
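
A small sketch of this normalisation is shown below, assuming a hypothetical attribute that splits 10 training samples into partitions of sizes 4 and 6 and yields the 0.61 information gain computed earlier (the counts are illustrative):

import math

def entropy(probabilities):
    return -sum(p * math.log2(p) for p in probabilities if p > 0)

def gain_ratio(information_gain, partition_sizes):
    total = sum(partition_sizes)
    # Split information: the entropy of the attribute's own value distribution
    split_info = entropy([size / total for size in partition_sizes])
    return information_gain / split_info

print(round(gain_ratio(0.61, [4, 6]), 2))  # roughly 0.63 for this illustration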

10.4.2.4 Gini Index

The Gini index, also called the Gini coefficient or Gini impurity, computes the
probability of a specific element being classified incorrectly when it is chosen at
random. It works on categorical target variables, treats outcomes as either “success”
or “failure”, and hence performs binary splitting only.

The degree of the Gini index varies from 0 to 1, where:

 A value of 0 depicts that all the elements belong to a single class, i.e., only one
class exists there,
 A Gini index of 1 signifies that the elements are randomly distributed across
various classes, and
 A value of 0.5 denotes that the elements are uniformly distributed across two
classes.

It was proposed by Leo Breiman in 1984 as an impurity measure for decision tree
learning (in the CART algorithm). The Gini Index is determined by deducting the sum of
the squared probabilities of each class from one. Mathematically, the Gini Index can be
expressed as:

Gini Index = 1 − Σi Pi²
where Pi denotes the probability of an element being classified for a distinct class.
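
A minimal sketch of this computation (the class-count lists are illustrative):

def gini(class_counts):
    total = sum(class_counts)
    # Gini = 1 - sum of squared class probabilities within the partition
    return 1 - sum((count / total) ** 2 for count in class_counts)

print(gini([10, 0]))              # pure partition -> 0.0
print(gini([5, 5]))               # evenly mixed two-class partition -> 0.5
print(round(gini([1, 3, 4]), 3))  # the red/purple/yellow counts -> about 0.594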

10.4.2.5 Gini Index vs Information Gain

Take a look below at the differences between the Gini Index and Information Gain:
 The Gini Index favors larger partitions and is easy to compute, whereas
Information Gain favors smaller partitions with many distinct values.
 The Gini Index is used by the CART algorithm, whereas Information Gain is
used in the ID3 and C4.5 algorithms.
 The Gini Index operates on categorical target variables in terms of “success”
or “failure” and performs only binary splits, whereas Information Gain computes
the difference between entropy before and after the split and indicates the
impurity in the classes of elements.

10.4.3 Tree Pruning

When a decision tree is built, many of the branches will reflect anomalies in the
training data due to noise or outliers. Tree pruning methods address this problem of
overfitting the data. Such methods typically use statistical measures to remove the
least reliable branches, generally resulting in faster classification and an improvement
in the ability of the tree to correctly classify independent test data.

10.4.3.1 How does Tree Pruning work?

There are two common approaches to tree pruning: prepruning and postpruning. In
the prepruning approach, a tree is “pruned” by halting its construction early (example,
by deciding not to further split or partition the subset of training tuples at a given
node). The second and more common approach is postpruning, which removes
subtrees from a “fully grown” tree. A subtree at a given node is pruned by removing
its branches and replacing it with a leaf.

Alternatively, prepruning and postpruning may be interleaved for a combined
approach. Postpruning requires more computation than prepruning, yet generally
leads to a more reliable tree. No single pruning method has been found to be
superior to all others.

Decision trees can suffer from repetition and replication, which can make them
difficult to interpret. Repetition occurs when an attribute is repeatedly tested
along a given branch of the tree. In replication, duplicate subtrees exist within the
tree.

10.4.4 Advantages and Limitations of Decision Tree Approaches

The advantages of decision tree approaches are:

 Decision trees are simple to understand and interpret.
 They require little data preparation and are able to handle both numerical and
categorical data.
 Decision trees can produce comprehensible rules.
 Decision trees can perform classification without much computation.
 Decision trees can accommodate both categorical and continuous variables.
 Decision trees clearly show which fields for prediction or classification are
most important.
 They are robust in nature; therefore, they perform well even if their
assumptions are somewhat violated by the true model from which the data
were generated.
 Decision trees perform well with large data in a short time.
 Nonlinear relationships between parameters do not affect tree performance.
 The best feature of using trees for analytics - easy to interpret and explain to
executives.

The drawbacks of decision tree approaches are:


 Decision trees are less suited to estimation tasks where the goal is to predict
the value of a continuous attribute.
 Decision trees are vulnerable to errors in problems with many classes and a
relatively small number of training instances.
 Decision trees can be computationally costly to train. Each candidate splitting
field must be sorted at each node before the best split can be identified. Some
algorithms use combinations of fields, and a search must be made for optimal
combined weights. Pruning algorithms can also be costly, as many candidate
sub-trees have to be created and compared.
 Data fragmentation: Each split in a tree leads to a reduced dataset under
consideration. And, hence the model created at the split will potentially
introduce bias.
 High variance and instability: As a result of the greedy strategy applied by
decision trees, the choice of the starting split can greatly impact the final
result, i.e., small changes early on can have big impacts later. If, for example,
you draw two different samples from your population, the starting points for the
two samples could be very different (and may even involve different variables),
and this can lead to totally different trees.

We will study the Bayesian classification technique, which includes Bayes' theorem,
Naïve Bayes, etc., in the next section.

10.5 BAYESIAN CLASSIFICATION

Bayesian classifiers are statistical classifiers. They can predict class membership
probabilities, such as the probability that a given tuple belongs to a particular class.

Bayesian classification is based on Bayes’ theorem, described below. Studies
comparing classification algorithms have found a simple Bayesian classifier known as
the Naïve Bayesian classifier to be comparable in performance with decision tree and
selected neural network classifiers. Bayesian classifiers exhibit high accuracy and
speed when applied to large databases. Naïve Bayesian classifiers assume that the
effect of an attribute value on a given class is independent of the values of the other
attributes. This assumption is called class conditional independence. It is made to
simplify the computations involved and, in this sense, is considered “Naïve.”

10.5.1 Bayes’ Theorem

Bayes’ theorem is named after Thomas Bayes, an English clergyman who did early work
in probability and decision theory during the 18th century.

Let X be a data tuple. In Bayesian terms, X is considered “evidence.” As usual, it is
described by measurements made on a set of n attributes.

Let H be some hypothesis, such as that the data tuple X belongs to a specified class C.
For classification problems, we want to determine P (H|X), the probability that the
hypothesis H holds given the “evidence” or observed data tuple X. In other words, we
are looking for the probability that tuple X belongs to class C, given that we know the
attribute description of X. P (H|X) is the posterior probability, or a posteriori
probability, of H conditioned on X.

For example, suppose our world of data tuples is confined to customers described by
the attributes age and income, respectively, and that X is a 35-year-old customer with
an income of Rs.4,00,000. Suppose that H is the hypothesis that our customer will buy
a computer. Then P(H|X) reflects the probability that customer X will buy a computer
given that we know the customer’s age and income.

In contrast, P(H) is the prior probability, or a priori probability, of H. For our
example, this is the probability that any given customer will buy a computer,
regardless of age, income, or any other information, for that matter. The posterior
probability, P(H|X), is based on more information (e.g., customer information) than
the prior probability, P(H), which is independent of X.

Similarly, P(X|H) is the posterior probability of X conditioned on H. That is, it is the
probability that the customer, X, is 35 years old and earns Rs. 4,00,000, given that we
know the customer will buy a computer.

P(X) is the prior probability of X. Using our example, it is the probability that a
person from our set of customers is 35 years old and earns Rs.4,00,000.

How are these probabilities estimated? P(H), P(X|H), and P(X) may be estimated from
the given data, as we shall see below. Bayes’ theorem is useful in that it provides a
way of calculating the posterior probability, P(H|X), from P(H), P(X|H), and P(X).

Bayes’ theorem is:

P(H|X) = P(X|H) P(H) / P(X)

In the next section, we will look at how Bayes’ theorem is used in the Naive Bayesian
classification.
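
As a hedged numeric illustration of the theorem, the snippet below plugs in invented probabilities for the buys_computer example (the figures are assumptions chosen purely for demonstration):

# Hypothetical figures: 60% of all customers buy a computer, 20% of buyers match
# the profile "35 years old, earns Rs. 4,00,000", and 15% of all customers match it.
p_h = 0.60           # P(H): prior probability of buying a computer
p_x_given_h = 0.20   # P(X|H)
p_x = 0.15           # P(X)

p_h_given_x = p_x_given_h * p_h / p_x   # Bayes' theorem
print(round(p_h_given_x, 2))            # posterior P(H|X) = 0.8 for these numbers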

10.5.2 Naive- Bayesian Classification

The Naïve Bayesian classifier, or simple Bayesian classifier, works as follows:

 Let D be a training set of tuples and their associated class labels.


 Each tuple is represented by an n-dimensional attribute vector, X = (x1, x2,… ,
xn), depicting n measurements made on the tuple from n attributes,
respectively, A1, A2,…, An.
 Suppose that there are m classes, C1, C2, …., Cm. Given a tuple, X, the
classifier will predict that X belongs to the class having the highest posterior
probability, conditioned on X. That is, the naïve Bayesian classifier predicts
that tuple X belongs to the class Ci if and only if
P(Ci|X) > P(Cj|X) for 1 ≤ j ≤ m; j ≠ i

Thus we maximize P(Ci|X). The class Ci for which P(Ci|X) is maximized is called the
maximum posteriori hypothesis. Using Bayes’ theorem,

P(Ci|X) = P(X|Ci) P(Ci) / P(X)

 As P(X) is constant for all classes, only P(X|Ci)P(Ci) need be maximized. If


the class prior probabilities are not known, then it is commonly assumed that
the classes are equally likely, that is, P(C1) = P(C2) = …..= P(Cm), and we
would therefore maximize P(X|Ci). Otherwise, we maximize P(X|Ci)P(Ci).
Note that the class prior probabilities may be estimated by P(Ci)=|Ci,D|/|D|,
where |Ci , D| is the number of training tuples of class Ci in D.

 Given data sets with many attributes, it would be extremely computationally
expensive to compute P(X|Ci). In order to reduce computation in evaluating
P(X|Ci), the naive assumption of class conditional independence is made.
This presumes that the values of the attributes are conditionally independent
of one another, given the class label of the tuple (i.e., that there are no
dependence relationships among the attributes). Thus,

P(X|Ci) = P(x1|Ci) × P(x2|Ci) × … × P(xn|Ci)

We can easily estimate the probabilities P(x1|Ci), P(x2|Ci), …, P(xn|Ci) from the
training tuples. Recall that here xk refers to the value of attribute Ak for tuple X. For
each attribute, we look at whether the attribute is categorical or continuous-valued.
For instance, to compute P(X|Ci), we consider the following:

a) If Ak is categorical, then P(xk|Ci) is the number of tuples of class Ci in D


having the value xk for Ak, divided by |Ci,D|,the number of tuples of class Ci
in D.
b) If Ak is continuous-valued, then we need to do a bit more work, but the
calculation is pretty straightforward. A continuous-valued attribute is typically
assumed to have a Gaussian distribution with a mean μ and standard deviation
σ, defined by

g(x, μ, σ) = (1 / (√(2π) σ)) exp( −(x − μ)² / (2σ²) ), so that P(xk|Ci) = g(xk, μCi, σCi)

For example, let X = (35, Rs. 4,00,000), where A1 and A2 are the attributes age and income,
respectively. Let the class label attribute be buys_computer. The associated class label
for X is yes (i.e., buys_computer = yes). Let's suppose that age has not been
discretized and therefore exists as a continuous-valued attribute. Suppose that from
the training set, we find that customers in D who buy a computer are 38 ± 12 years of
age. In other words, for attribute age and this class, we have μ = 38 years and σ = 12.
We can plug these quantities, along with x1 = 35 for our tuple X, into the equation above
to estimate P(age = 35 | buys_computer = yes).

To predict the class label of X, P(X|Ci)P(Ci) is evaluated for each class Ci. The
classifier predicts that the class label of tuple X is the class Ci if and only if

P(X|Ci)P(Ci) > P(X|Cj)P(Cj) for 1 ≤ j ≤ m, j ≠ i

In other words, the predicted class label is the class Ci for which P(X|Ci)P(Ci) is the
maximum.
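
A compact, hedged illustration with scikit-learn's GaussianNB is given below (assuming scikit-learn is installed); the age/income rows are invented training tuples for the buys_computer example, and income is recorded in thousands of rupees:

import numpy as np
from sklearn.naive_bayes import GaussianNB

# Invented tuples: [age, income in thousands of rupees], label = buys_computer
X = np.array([[25, 300], [35, 400], [45, 500], [20, 150], [52, 700], [23, 200]])
y = np.array(['yes', 'yes', 'yes', 'no', 'yes', 'no'])

model = GaussianNB()            # assumes a Gaussian P(xk|Ci) for each attribute
model.fit(X, y)

query = np.array([[35, 400]])   # the 35-year-old earning Rs. 4,00,000
print(model.predict(query))         # predicted class label
print(model.predict_proba(query))   # posterior probability for each class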

In theory, Bayesian classifiers have the minimum error rate in comparison to all other
classifiers. In practice, however, this is not always the case, owing to inaccuracies in
the assumptions made for their use, such as class conditional independence, and the lack
of available probability data.

Bayesian belief networks are graphical models, which unlike naïve Bayesian
classifiers allow the representation of dependencies among subsets of attributes.
Bayesian belief networks can also be used for classification.

10.5.3 Advantages and Disadvantage of Naïve Bayes classifier

Some of the advantages of the Naïve Bayes Classifier are:

 The Naive Bayes algorithm is a fast, highly scalable algorithm.
 Naive Bayes can be used for binary and multiclass classification. It comes in
different variants, such as GaussianNB, MultinomialNB and BernoulliNB.
 It is a simple algorithm that depends on doing a bunch of counts.
 It is a great choice for text classification problems and a popular choice for spam
email classification.
 It can easily be trained on a small dataset.

The disadvantage of Naïve Bayes Classifier is:

 Naïve Bayes can learn the importance of individual features but cannot determine
the relationships among features.

Common applications of Naïve Bayes algorithm are in Spam filtering. Gmail from
Google uses Naïve Bayes algorithm for filtering spam emails. Sentiment analysis is
another area where Naïve Bayes can calculate the probability of emotions expressed
in the text being positive or negative. Leading web portals may understand the
reaction of customers to their new products based on sentiment analysis.

We will focus on the Support Vector Machines classification technique in the following
section.

10.6 SUPPORT VECTOR MACHINES

Support vector machines (SVM) are a class of statistical models first developed in the
mid-1960s by Vladimir Vapnik. In later years, the model has evolved considerably
into one of the most flexible and effective machine learning tools available. It is a
supervised learning algorithm which can be used to solve both classification and
regression problems, even though the current focus is on classification only.

To put it in a nutshell, this algorithm looks for a separating hyperplane, i.e., a
decision boundary separating members of one class from the other. If such a
hyperplane exists, the work is done! If such a hyperplane does not exist, SVM uses a
nonlinear mapping to transform the training data into a higher dimension. Then it
searches for the linear optimal separating hyperplane. With an appropriate nonlinear
mapping to a sufficiently high dimension, data from two classes can always be
separated by a hyperplane. The SVM algorithm finds this hyperplane using support
vectors and margins. As a training algorithm, SVM may not be very fast compared to
some other classification methods, but owing to its ability to model complex nonlinear
boundaries, SVM has high accuracy. SVM is comparatively less prone to overfitting.
SVM has successfully been applied to handwritten digit recognition, text
classification, speaker identification, etc.

10.6.1 When Data is Linearly Separable

Let us start with a simple two-class problem when data is clearly linearly separable as
shown in the diagram below:

Figure 2: Linearly Separable

Let the i-th data point be represented by (Xi, yi), where Xi represents the feature vector
and yi is the associated class label, taking two possible values +1 or −1. In Figure 2
above, the red balls have class label +1 and the blue balls have class label −1, say. A
straight line can be drawn to separate all the members belonging to class +1 from all
the members belonging to class −1. The two-dimensional data above are clearly linearly
separable.

In fact, an infinite number of straight lines can be drawn to separate the blue balls
from the red balls.

The problem therefore is which among the infinite straight lines is optimal, in the
sense that it is expected to have minimum classification error on a new observation.
The straight line is based on the training sample and is expected to classify one or
more test samples correctly.

As an illustration, if we consider the black, red and green lines in the Figure 2 above,
is any one of them better than the other two? Or are all three of them equally well
suited to classify? How is optimality defined here? Intuitively it is clear that if a line
passes too close to any of the points, that line will be more sensitive to small changes
in one or more points. The green line is close to a red ball. The red line is close to a
blue ball. If the red ball changes its position slightly, it may fall on the other side of
the green line. Similarly, if the blue ball changes its position slightly, it may be
misclassified. Both the green and red lines are more sensitive to small changes in the
observations. The black line on the other hand is less sensitive and less susceptible to
model variance.

In an n-dimensional space, a hyperplane is a flat subspace of dimension n − 1. For
example, in two dimensions a straight line is a one-dimensional hyperplane, as shown
in the Figure 3. In three dimensions, a hyperplane is a flat two-dimensional subspace,
i.e. a plane. Mathematically in n dimensions a separating hyperplane is a linear
combination of all dimensions equated to 0; i.e.,

θ0+θ1x1+θ2x2+…+θnxn=0
The scalar θ0 is often referred to as a bias. If θ0=0, then the hyperplane goes through
the origin.

A hyperplane acts as a separator. The points lying on two different sides of the
hyperplane will make up two different groups.

Basic idea of support vector machines is to find out the optimal hyperplane for
linearly separable patterns. A natural choice of separating hyperplane is optimal
margin hyperplane (also known as optimal separating hyperplane) which is farthest
from the observations. The perpendicular distance from each observation to a given
separating hyperplane is computed. The smallest of all those distances is a measure of
how close the hyperplane is to the group of observations. This minimum distance is
known as the margin. The operation of the SVM algorithm is based on finding the
hyperplane that gives the largest minimum distance to the training examples, i.e. to
find the maximum margin. This is known as the maximal margin classifier.

A separating hyperplane in two dimensions can be expressed as

θ0+θ1x1+θ2x2=0
Hence, any point that lies above the hyperplane, satisfies

θ0+θ1x1+θ2x2>0
and any point that lies below the hyperplane, satisfies

θ0+θ1x1+θ2x2<0
The coefficients or weights θ1 and θ2 can be adjusted so that the boundaries of the
margin can be written as

H1: θ0+θ1x1i+θ2x2i ≥ 1, for yi=+1


H2: θ0+θ1x1i+θ2x2i ≤ −1, for yi=−1
This is to ascertain that any observation that falls on or above H1 belongs to class +1
and any observation that falls on or below H2, belongs to class -1. Alternatively, we
may write

yi(θ0+θ1x1i+θ2x2i) ≥ 1, for every observation


The boundaries of the margins, H1 and H2, are themselves hyperplanes too. The
training data that falls exactly on the boundaries of the margin are called the support
vectors as they support the maximal margin hyperplane in the sense that if these
points are shifted slightly, then the maximal margin hyperplane will also shift.
Note that the maximal margin hyperplane depends directly only on these support
vectors.

If any of the other points change, the maximal margin hyperplane does not change,
until the movement affects the boundary conditions or the support vectors. The
support vectors are the most difficult to classify and give the most information
regarding classification. Since the support vectors lie on or closest to the decision
boundary, they are the most essential or critical data points in the training set.
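
The role of the hyperplane as a simple sign test can be sketched in a few lines of Python (the weights and the test points below are arbitrary illustrations, not fitted values):

import numpy as np

theta0 = -4.0                  # bias
theta = np.array([1.0, 1.0])   # weights theta_1 and theta_2

def side(x):
    # Positive result -> the +1 side of the hyperplane, negative -> the -1 side
    return np.sign(theta0 + np.dot(theta, x))

print(side(np.array([3.0, 3.0])))   # 1.0, above the line x1 + x2 = 4
print(side(np.array([1.0, 1.0])))   # -1.0, below the line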


Figure 3: Support Vectors

For a general n-dimensional feature space, the defining equation becomes:

yi(θ0+θ1x1i+θ2x2i+…+θnxni) ≥ 1, for every observation


If the vector of the weights is denoted by Θ and |Θ| is the norm of this vector, then it is
easy to see that the size of the maximal margin is 2/|Θ|. Finding the maximal margin
hyperplanes and support vectors is a problem of convex quadratic optimization. It is
important to note that the complexity of SVM is characterized by the number of
support vectors, rather than the dimension of the feature space. That is the reason
SVM has a comparatively less tendency to overfit. If all data points other than the
support vectors are removed from the training data set, and the training algorithm is
repeated, the same separating hyperplane would be found. The number of support
vectors provides an upper bound to the expected error rate of the SVM classifier,
which happens to be independent of data dimensionality. An SVM with a small
number of support vectors has good generalization, even when the data has high
dimensionality.

10.6.2 Support Vector Classifier

The maximal margin classifier is a very natural way to perform classification, if a
separating hyperplane exists. However, the existence of such a hyperplane may not be
guaranteed, or even if it exists, the data may be noisy, so that the maximal margin classifier
provides a poor solution. In such cases, the concept can be extended where a
hyperplane exists which almost separates the classes, using what is known as a soft
margin. The generalization of the maximal margin classifier to the non-separable case
is known as the support vector classifier, where a small proportion of the training
sample is allowed to cross the margins, or even the separating hyperplane. Rather than
looking for the largest possible margin so that every observation is on the correct side
of the margin, thereby making the margins very narrow or non-existent, some
observations are allowed to be on the incorrect side of the margins. The margin
is soft as a small number of observations violate the margin. The softness is controlled
by slack variables which control the position of the observations relative to the
margins and separating hyperplane. The support vector classifier maximizes a soft
margin. The optimization problem can be modified as:

yi(θ0+θ1x1i+θ2x2i+⋯+θnxni) ≥ 1 − εi, for every observation

Here, εi ≥ 0 is the slack corresponding to the i-th observation, and the total slack is
penalized in the objective through a regularization parameter C set by the user. A larger
value of C leads to a larger penalty for errors.

However there will be situations when a linear boundary simply does not work.

10.6.3 When Data is Not Linearly Separable

SVM is quite intuitive when the data is linearly separable. However, when they are
not, as shown in the Figure 4 below, SVM can be extended to perform well.

Figure 4: Linearly Non Separable

There are two main steps for nonlinear generalization of SVM. The first step involves
transformation of the original training (input) data into a higher dimensional data
using a nonlinear mapping. Once the data is transformed into the new higher
dimension, the second step involves finding a linear separating hyperplane in the new
space. The maximal marginal hyperplane found in the new space corresponds to a
nonlinear separating hyper-surface in the original space.
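
Both situations can be tried out with a brief scikit-learn sketch (assuming scikit-learn is available; the concentric-circles data is generated for illustration rather than taken from the figures):

from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Concentric circles: a classic pattern that no straight line can separate
X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

linear_svm = SVC(kernel='linear', C=1.0).fit(X_train, y_train)
rbf_svm = SVC(kernel='rbf', C=1.0).fit(X_train, y_train)   # nonlinear mapping via the RBF kernel

print('linear kernel accuracy:', linear_svm.score(X_test, y_test))
print('rbf kernel accuracy:', rbf_svm.score(X_test, y_test))

On such data, the RBF kernel is expected to separate the classes far better than the linear one.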

Let us study Rule-based classification technique in the next section.

10.7 RULE-BASED CLASSIFICATION

A rule-based classifier makes use of a set of IF-THEN rules for classification. We can
express a rule in the following form:

IF condition THEN conclusion

Let us consider a rule R1,


R1: IF age = youth AND student = yes
THEN buy_computer = yes
Rule-Based Classifier classifies records by using a collection of “if…then…” rules.

Rule Notation: (Condition) → Class Label

Example 1:
(BloodType = Warm) ∧ (LayEggs = Yes) → Birds
(TaxableIncome < 50K) ∨ (Refund = Yes) → Evade = No


Example 2:

R1: (Give Birth = no) & (Can Fly = yes) → Birds


R2: (Give Birth = no) & (Live in Water = yes) → Fishes
R3: (Give Birth = yes) & (Blood Type = warm) → Mammals
R4: (Give Birth = no) & (Can Fly = no) → Reptiles
R5: (Live in Water = sometimes) → Amphibians
10.7.1 Application of Rule-Based Classifier
A rule r covers an instance x if the attributes of the instance satisfy the condition of
the rule.

The rule R1 above covers a hawk => Bird
The rule R3 covers the grizzly bear => Mammal
A lemur triggers rule R3, so it is classified => Mammal
A turtle triggers both R4 and R5
A dogfish shark matches none of the rules.
10.7.2 Advantages of Rule-Based Classifiers
Following are the advantages of Rule-Based Classifiers:

 They are as highly expressive as decision trees and easy to interpret
 They are easy to generate
 They can classify new instances rapidly
 Their performance is comparable to decision trees

10.7.3 Coverage and Accuracy


Go through the example given below to understand coverage and accuracy.

Example 3:

Coverage of a rule:

 Fraction of all records that satisfy the antecedent of a rule


 Count(instances with antecedent) / Count(training set)
 Example: for the rule (Status = 'Single') -> no, Coverage = 4/10 = 40%

Accuracy of a rule:

 Fraction of records that satisfy the antecedent that also satisfy the consequent
of a rule
 Count (instances with antecedent AND consequent) / Count(instances with
antecedent)
 Example: for the rule (Status = 'Single') -> no, accuracy = 2/4 = 50%

Important points to be remembered are:

 The Left Hand Side is rule antecedent or condition


 The Right Hand Side is rule consequent
 Coverage of a rule - Fraction of records that satisfy the antecedent of a rule
 Accuracy of a rule - Fraction of records that satisfy both the antecedent and
consequent of a rule.
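
The two measures can be computed directly from a record set; the short sketch below uses invented records chosen to reproduce the Status = 'Single' figures above:

# Each record is (status, class label); the values are invented for illustration
records = [
    ('Single', 'no'), ('Single', 'no'), ('Single', 'yes'), ('Single', 'yes'),
    ('Married', 'no'), ('Married', 'no'), ('Married', 'no'),
    ('Divorced', 'yes'), ('Divorced', 'no'), ('Married', 'no'),
]

antecedent = [r for r in records if r[0] == 'Single']   # records satisfying the condition
consequent = [r for r in antecedent if r[1] == 'no']    # condition AND conclusion satisfied

coverage = len(antecedent) / len(records)
accuracy = len(consequent) / len(antecedent)
print(f"coverage = {coverage:.0%}, accuracy = {accuracy:.0%}")  # coverage = 40%, accuracy = 50%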

10.7.4 Characteristics of Rule Sets


If we convert the result of a decision tree to classification rules, these rules will be
mutually exclusive and exhaustive at the same time.

Mutually Exclusive rules
 Classifier contains mutually exclusive rules if the rules are independent of
each other.
 Every record is covered by at most one rule.

Exhaustive rules
 Classifier has exhaustive coverage if it accounts for every possible
combination of attribute values.
 Each record is covered by at least one rule.

These rules can be simplified. However, simplified rules may no longer be mutually
exclusive since a record may trigger more than one rule. Simplified rules may no
longer be exhaustive either since a record may not trigger any rules.
Handling rules that are not mutually exclusive
 A record may trigger more than one rule
 Solutions for these are:
o Ordered rule set
o Unordered rule set – use voting schemes

Handling rules that are not exhaustive


 A record may not trigger any rules
 Solution for this is
o Use a default class

10.7.5 Ordered Rule Set


An ordered rule set is known as a decision list. Rules are rank ordered according to
their priority. For example, when a test record is presented to the classifier, it is
assigned to the class label of the highest ranked rule it has triggered. If none of the
rules fired, it is assigned to the default class.
From the rules given in Example 2,
1. R1: (Give Birth = no) & (Can Fly = yes) → Birds
2. R2: (Give Birth = no) & (Live in Water = yes) → Fishes
3. R3: (Give Birth = yes) & (Blood Type = warm) → Mammals
4. R4: (Give Birth = no) & (Can Fly = no) → Reptiles
5. R5: (Live in Water = sometimes) → Amphibians
A turtle triggers both R4 and R5, but by the rule order, the conclusion is => Reptiles
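
A decision list of this kind can be sketched as a loop that fires the first matching rule and otherwise falls back to a default class (the dictionary encoding of the animal attributes is an assumption made for illustration):

# Rules from Example 2 as (condition, class label) pairs, in priority order
rules = [
    (lambda a: a['give_birth'] == 'no' and a['can_fly'] == 'yes', 'Birds'),
    (lambda a: a['give_birth'] == 'no' and a['live_in_water'] == 'yes', 'Fishes'),
    (lambda a: a['give_birth'] == 'yes' and a['blood_type'] == 'warm', 'Mammals'),
    (lambda a: a['give_birth'] == 'no' and a['can_fly'] == 'no', 'Reptiles'),
    (lambda a: a['live_in_water'] == 'sometimes', 'Amphibians'),
]

def classify(animal, default='Unknown'):
    for condition, label in rules:
        if condition(animal):     # fire the highest-ranked rule that matches
            return label
    return default                # no rule fired -> default class

turtle = {'give_birth': 'no', 'can_fly': 'no', 'live_in_water': 'sometimes', 'blood_type': 'cold'}
print(classify(turtle))  # 'Reptiles', because R4 is ranked above R5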

10.7.6 Rule Ordering Schemes

If more than one rule is triggered, it needs conflict resolution in the following ways:
 Size ordering - assign the highest priority to the triggering rule that has the
“toughest” requirement (i.e., with the most attribute tests)
 Class-based ordering - decreasing order of prevalence or misclassification
cost per class
 Rule-based ordering (decision list) - rules are organized into one long
priority list, according to some measure of rule quality or by experts

Consider, for example, a rule set with the following accuracy and coverage values:

 R1: Accuracy = 100%, coverage = 30%


 R2: Accuracy = 100%, coverage = 10%
 R3: Accuracy = 100%, coverage = 30%
 R4: Accuracy = 100%, coverage = 30%
10.7.7 Building Classification Rules
There are two methods to build the classification rules as shown below:
10.7.7.1 Direct Method
In the direct method, rules are extracted directly from the data.
Examples: Sequential-Covering algorithms such as RIPPER, CN2 and Holte's 1R
10.7.7.2 Indirect Method
In the indirect method, rules are extracted from other classification models (e.g.
decision trees, neural networks, etc.).
Example: C4.5 Rules
The direct method extracts rules directly from the data. Sequential-Covering
algorithms such as the CN2 algorithm and the RIPPER algorithm are common direct
methods for building classification rules.

For a 2-class problem, choose one of the classes as the positive class and the other as
the negative class. Typically, the class with the smaller number of cases is chosen as
the positive class.
 Learn rules for the positive class
 Negative class becomes the default class.
To generalize for multi-class problem,
 Order the classes according to increasing class prevalence (fraction of
instances that belong to a particular class),
 Learn the rule set for smallest class first, treat the rest as negative class.
 Repeat with next smallest class as positive class.
 Thus the largest class will become the default class.
Sequential-Covering in a Nut-shell

 Start from an empty rule


 Grow a rule using the Learn-One-Rule function (Rule Growing)
 Remove training records covered by the rule (Instance Elimination)
 Repeat Step (2) and (3) until stopping criterion is met
 (Optional) Rule Pruning

More formally, sequential-covering can be represented as:


Sequential-Covering(Target_attribute, Attributes, Examples, Threshold)
 Learned_rules ← { }
 Rule ← Learn-One-Rule(Target_attribute, Attributes, Examples)
 While Performance(Rule, Examples) > Threshold:
o Learned_rules ← Learned_rules + Rule
o Examples ← Examples − {examples correctly classified by Rule}
o Rule ← Learn-One-Rule(Target_attribute, Attributes, Examples)
 Learned_rules ← sort Learned_rules according to Performance over Examples
 Return Learned_rules

However, because Sequential-Covering does not backtrack, this algorithm is not
guaranteed to find the smallest or best set of rules. Thus the Learn-One-Rule function
must be very effective.
Rule Growing
In rule growing, there are two approaches, namely Top-down and Bottom-up.
Top-down -- General-to-specific
The example in Figure 5(a) shows a specific rule covering 3 cases, all classified as
'No'. The conjunct to add is picked based on the number of cases covered and the accuracy.
Bottom-up -- Specific-to-general
The example in Figure 5(b) shows two rules (top); the income part does not
distinguish between them (85K or 90K), and thus the rule can be simplified.


Figure 5: (a) General-to-specific (b) Specific-to-general

Rule Evaluation
Rule evaluation is all about how we determine the best next rule.
FOIL (First Order Inductive Learner) is an early rule-based learning algorithm, and its
information gain measure is commonly used to evaluate candidate rules.
R0: {} => class (initial rule)
R1: {A} => class (rule after adding conjunct--a condition component in the
antecedent)
Information Gain(R0, R1) = t *[ log (p1/(p1+n1)) – log (p0/(p0 + n0)) ]
where,
t= number of positive instances covered by both R0 and R1
p0= number of positive instances covered by R0
n0=number of negative instances covered by R0
p1=number of positive instances covered by R1
n1= number of negative instances covered by R1
Gain(R0,R1) is similar to the entropy gain calculations used in decision trees.
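
FOIL's information gain can be computed directly from these counts; the numbers used below are illustrative assumptions rather than output from a real rule learner:

import math

def foil_gain(p0, n0, p1, n1):
    # When R1 specialises R0, the positives covered by both rules equal p1
    t = p1
    return t * (math.log2(p1 / (p1 + n1)) - math.log2(p0 / (p0 + n0)))

# Illustrative counts: R0 covers 100 positives and 400 negatives; after adding a
# conjunct, R1 still covers 90 positives but only 10 negatives
print(round(foil_gain(p0=100, n0=400, p1=90, n1=10), 2))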
Instance Elimination
The following questions explain why instance elimination is needed:
• Why do we need to eliminate instances? Otherwise, the next rule would be identical to
the previous rule.
• Why do we remove positive instances? To ensure that the next rule is different.
• Why do we remove negative instances? To prevent underestimating the accuracy of the rule.

Direct Method: RIPPER (Repeated Incremental Pruning to Produce Error
Reduction)
For a 2-class problem, choose one of the classes as the positive class and the other as
the negative class. Typically, the class with the smaller number of cases is chosen as
the positive class.
 Learn rules for positive class
 Negative class become the default class
To generalize for multi-class problem,
 Order the classes according to increasing class prevalence (fraction of
instances that belong to a particular class)
 Learn the rule set for smallest class first, treat the rest as negative class
 Repeat with next smallest class as positive class
 Thus the largest class will become the default class

Growing a Rule
 Start from empty rule
 Repeat: Add conjuncts as long as they improve FOIL’s information gain
o Stop when rule no longer covers negative examples
o We can get a rather extensive rule such as ABCD -> y
 Prune the rule immediately using incremental reduced error pruning
o Measure for pruning: v = (p-n)/(p+n)
 p: number of positive examples covered by the rule in the
validation set
 n: number of negative examples covered by the rule in the
validation set
o Pruning method: delete any final sequence of conditions that
maximizes v
o Example: if the grown rule is ABCD -> y, check to prune D then CD,
then BCD

Building a Rule Set


Uses the sequential covering algorithm
 Finds the best rule that covers the current set of positive examples
 Eliminate both positive and negative examples covered by the rule

Each time a rule is added to the rule set, compute the new description length
 Stop adding new rules when the new description length is d bits longer than
the smallest description length obtained so far (default setting for d = 64 bits)
 Alternatively stop when the error rate exceeds 50%
Optimize the rule set:
 For each rule r in the rule set R
o Consider 2 alternative rules:
 Replacement rule (r*): grow new rule from scratch
 Revised rule(r′): add conjuncts to extend the rule r
o Compare the rule set for r against the rule set for r* and r′
o Choose rule set that minimizes MDL principle (minimum description
length-- a measure of model complexity)
 Repeat rule generation and rule optimization for the remaining positive
examples

Indirect Method: C4.5 Rules

Extract rules from an unpruned decision tree

For each rule, r: A → y,

 consider an alternative rule r′: A′→ y where A′ is obtained by removing one


of the conjuncts in A
 Compare the pessimistic error rate for r against that of each alternative rule r′
 Prune if one of the alternative rules has a lower pessimistic error rate
 Repeat until we can no longer improve generalization error

Instead of ordering the rules, order subsets of rules (class ordering)

 Each subset is a collection of rules with the same rule consequent (class)
 Compute description length of each subset
o Description length = L(error) + g *L(model)
o g is a parameter that takes into account the presence of redundant
attributes in a rule set (default value = 0.5)
o Similar to the generalization error calculation of a decision tree

10.7.8 Advantages of Rule-Based Classifiers

Following are some of the advantages of Rule-Based classifiers:

 They have the characteristics quite similar to decision trees


 These classifiers are as highly expressive as decision trees
 They are easy to interpret
 Their performance is comparable to decision trees
 They can handle redundant attributes
 They are better suited for handling imbalanced classes
 However, it is harder for them to handle missing values in the test set

Evaluating the performance of a data mining technique is a fundamental aspect of data
mining. In the following section, let us focus on this aspect.

10.8 MODEL EVALUATION AND SELECTION

Evaluating the performance of a data mining technique is a fundamental aspect of
data mining. The evaluation method is the yardstick to examine the efficiency and
performance of any model. The evaluation is important for understanding the quality
of the model or technique, for refining parameters in the iterative process of learning
and for selecting the most acceptable model or technique from a given set of models
or techniques. There are several criteria for evaluating models for different tasks and
other criteria that can be important as well, such as the computational complexity or
the comprehensibility of the model.

Model selection is a technique for selecting the best model after the individual
models are evaluated based on the required criteria. Model selection is the problem
of choosing one from among a set of candidate models. It is common to choose a
model that performs the best on a hold-out test dataset or to estimate model
performance using a resampling technique, such as k-fold cross-validation.
An alternative approach to model selection involves using probabilistic statistical
measures that attempt to quantify both the model performance on the training dataset
and the complexity of the model. Examples include the Akaike Information Criterion
(AIC), the Bayesian Information Criterion (BIC) and the Minimum Description Length (MDL).
The benefit of these information criterion statistics is that they do not require a hold-
out test set, although a limitation is that they do not take the uncertainty of the
models into account and may end up selecting models that are too simple.

Model evaluation is a method of assessing the correctness of models on test data.
The test data consists of data points that have not been seen by the model before.

10.8.1 Types of Model Selection

There are many common approaches that may be used for model selection. For example, in the case of supervised learning, the three most common approaches are:

 Train, validation, and test datasets
 Resampling methods
 Probabilistic statistics

The simplest reliable method of model selection involves fitting candidate models on
a training set, tuning them on the validation dataset, and selecting a model that
performs the best on the test dataset according to a chosen metric, such as accuracy
or error. A problem with this approach is that it requires a lot of data.
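As a rough illustration of this approach, the following sketch uses scikit-learn (the library referred to later in this unit); the synthetic dataset, the choice of classifier and the split sizes are assumptions made only for the example.

# A minimal sketch of the train/validation/test approach, assuming scikit-learn
# is installed; the dataset, model and split sizes are illustrative only.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)

# First hold out a test set, then carve a validation set out of the remainder.
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X_rest, y_rest, test_size=0.25, random_state=42)

model = DecisionTreeClassifier(max_depth=5, random_state=42)
model.fit(X_train, y_train)                                                  # fit on the training set
print("validation accuracy:", accuracy_score(y_val, model.predict(X_val)))   # compare/tune models here
print("test accuracy:", accuracy_score(y_test, model.predict(X_test)))       # report once, at the end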

Resampling techniques attempt to achieve the same as the train/validation/test approach to model selection, although using a smaller dataset. An example is k-fold cross-validation, where a training set is split into many train/test pairs and a model is fit and evaluated on each. This is repeated for each model, and the model with the best average score across the k folds is selected. A problem with this and the prior approach is that only model performance is assessed, regardless of model complexity.

A third approach to model selection attempts to combine the complexity of the model with the performance of the model into a single score, and then select the model that minimizes or maximizes that score.

Resampling methods

Resampling methods, as the name suggests, are simple techniques of rearranging data
samples to inspect if the model performs well on data samples that it has not been
trained on. In other words, resampling helps us understand if the model will
generalize well.

Cross-validation

Cross-validation is a technique for estimating the generalization performance of a predictive model. The main idea behind CV is to split the data, once or several times, for estimating the risk of each algorithm: part of the data (the training sample) is used for training each algorithm, and the remaining part (the validation sample) is used for estimating the risk of the algorithm. Then, CV selects the algorithm with the smallest estimated risk. Cross-validation is an alternative to random sub-sampling.

Holdout Method

Hold-out or (simple) validation relies on a single split of data. The holdout method is
the simplest kind of cross validation. The data set is separated into two sets, called the
training set and the testing set. The function approximator fits a function using the

training set only. Then the function approximator is asked to predict the output values
for the data in the testing set (it has never seen these output values before). The errors
it makes are accumulated as before to give the mean absolute test set error, which is
used to evaluate the model. The advantage of this method is that it is usually
preferable to the residual method and takes no longer to compute. However, its
evaluation can have a high variance. The evaluation may depend heavily on which
data points end up in the training set and which end up in the test set, and thus the
evaluation may be significantly different depending on how the division is made.

Random Sub-sampling

The hold-out method can be repeated several times to improve the estimation of a classifier's performance. This approach is known as random sub-sampling. Random sub-sampling still encounters some of the problems associated with the holdout method because it does not utilize as much data as possible for training. It also has no control over the number of times each record is used for testing and training. Consequently, some records might be used for training more often than others.

k-fold Cross-validation

It is one way to improve over the holdout method. The data set is divided into k
subsets, and the holdout method is repeated k times. Each time, one of the k
subsets is used as the test set and the other k-1 subsets are put together to form a
training set. Then the average error across all k trials is computed. The advantage
of this method is that it matters less how the data gets divided. Every data point
gets to be in a test set exactly once, and gets to be in a training set k-1 times. The
variance of the resulting estimate is reduced as k is increased. The disadvantage of
this method is that the training algorithm has to be rerun from scratch k times,
which means it takes k times as much computation to make an evaluation. A
variant of this method is to randomly divide the data into a test and training set k
different times. The advantage of doing this is that you can independently choose
how large each test set is and how many trials you average over.
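A minimal sketch of k-fold cross-validation with scikit-learn is given below; the synthetic dataset, the classifier and the choice of k = 5 are assumptions made only for illustration.

# k-fold cross-validation: k train/test splits, one average score per model.
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

kf = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=kf)

# The average accuracy across the k folds is used to compare candidate models.
print(scores, scores.mean())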

Leave-one-out Method

When k-fold cross-validation is taken to its logical extreme, K equals N, the number of data points in the set. That means that, N separate times, the function approximator is trained on all the data except for one point and a prediction is made for that point. As before, the average error is computed and
used to evaluate the model. The evaluation given by leave-one-out cross
validation error (LOO-XVE) is good, but at first pass it seems very expensive to
compute. Fortunately, locally weighted learners can make LOO predictions just as
easily as they make regular predictions. That means computing the LOO-XVE
takes no more time than computing the residual error and it is a much better way
to evaluate models.
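A sketch of leave-one-out evaluation with scikit-learn follows; note that, for a dataset of N points, it fits the model N times, so the built-in Iris dataset used here is deliberately small, and the choice of classifier is only an assumption.

# Leave-one-out cross-validation: k equals the number of data points N.
from sklearn.datasets import load_iris
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
scores = cross_val_score(KNeighborsClassifier(n_neighbors=3), X, y, cv=LeaveOneOut())
print(scores.mean())   # fraction of points predicted correctly when held out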

Random Split

Random Splits are used to randomly sample a percentage of data into training, testing,
and preferably validation sets. The advantage of this method is that there is a good
chance that the original population is well represented in all the three sets. In more
formal terms, random splitting will prevent a biased sampling of data. It is very
important to note the use of the validation set in model selection. The validation set is
the second test set and one might ask, why have two test sets?

In the process of feature selection and model tuning, the test set is used for model
evaluation. This means that the model parameters and the feature set are selected such
that they give an optimal result on the test set. Thus, the validation set which has
completely unseen data points (not been used in the tuning and feature selection
modules) is used for the final evaluation.

Time-Based Split

There are some types of data where random splits are not possible. For example, if we
have to train a model for weather forecasting, we cannot randomly divide the data into
training and testing sets. This will jumble up the seasonal pattern. Such data is often
referred to by the term, Time Series.

In such cases, a time-wise split is used. The training set can have data for the last three
years and 10 months of the present year. The last two months can be reserved for the
testing or validation set.

There is also a concept of window sets, where the model is trained up to a particular date and tested on future dates iteratively, such that the training window keeps growing by one day (consequently, the test set also shrinks by a day). The advantage of this method is that it stabilizes the model and prevents overfitting when the test set is very small, for example, 3 to 7 days.

However, the drawback of time-series data is that the events or data points are
not mutually independent. One event might affect every data input that follows after.
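A sketch of such an expanding-window, time-ordered split using scikit-learn's TimeSeriesSplit is shown below; the dummy monthly series is an assumption made for the example.

# Time-based splitting: the training window always precedes the test window in time.
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(48).reshape(-1, 1)   # e.g. 48 consecutive months of one feature
y = np.arange(48)                  # the corresponding target values

tscv = TimeSeriesSplit(n_splits=4)
for train_idx, test_idx in tscv.split(X):
    # The training window grows with every split; the test window stays ahead of it.
    print("train:", train_idx.min(), "-", train_idx.max(),
          "| test:", test_idx.min(), "-", test_idx.max())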

K-Fold Cross-Validation

The cross-validation technique works by randomly shuffling the dataset and then
splitting it into k groups. Thereafter, on iterating over each group, the group needs to
be considered as a test set while all other groups are clubbed together into the training
set. The model is tested on the test group and the process continues for k groups.

Thus, by the end of the process, one has k different results on k different test groups.
The best model can then be selected easily by choosing the one with the highest score.

Stratified K-Fold

The process for stratified k-fold is similar to that of k-fold cross-validation, with one single point of difference: unlike in k-fold cross-validation, the values of the target variable are taken into consideration in stratified k-fold.

If for instance, the target variable is a categorical variable with 2 classes, then
stratified k-fold ensures that each test fold gets an equal ratio of the two classes when
compared to the training set.

This makes the model evaluation more accurate and the model training less biased.
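The sketch below, again assuming scikit-learn and a made-up imbalanced dataset, shows how StratifiedKFold keeps the class ratio roughly constant in every test fold.

# Stratified k-fold: every fold preserves the class proportions of the full dataset.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold

X, y = make_classification(n_samples=300, weights=[0.9, 0.1], random_state=1)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)
for train_idx, test_idx in skf.split(X, y):
    print("positive ratio in this test fold:", np.mean(y[test_idx]))  # stays close to 0.1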

Bootstrap

Bootstrap is one of the most powerful ways to obtain a stabilized model. It is close to
the random splitting technique since it follows the concept of random sampling.

The first step is to select a sample size (which is usually equal to the size of the original dataset). Thereafter, a data point is randomly selected from the original dataset and added to the bootstrap sample. After the addition, the data point is put back into the original dataset, so that it can be drawn again (sampling with replacement). This process is repeated N times, where N is the sample size.
Therefore, it is a resampling technique that creates the bootstrap sample by sampling
data points from the original dataset with replacement. This means that the bootstrap
sample can contain multiple instances of the same data point. The model is trained on
the bootstrap sample and then evaluated on all those data points that did not make it to
the bootstrapped sample. These are called the out-of-bag samples.
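A rough sketch of bootstrap evaluation is given below; it draws a sample with replacement using scikit-learn's resample utility and evaluates the model on the out-of-bag points. The dataset and classifier are assumptions made only for the example.

# Bootstrap: train on a sample drawn with replacement, test on the out-of-bag points.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier
from sklearn.utils import resample
from sklearn.metrics import accuracy_score

X, y = make_classification(n_samples=400, random_state=7)
idx = np.arange(len(X))

boot_idx = resample(idx, replace=True, n_samples=len(X), random_state=7)
oob_idx = np.setdiff1d(idx, boot_idx)          # points never drawn into the bootstrap sample

model = DecisionTreeClassifier(random_state=7).fit(X[boot_idx], y[boot_idx])
print("out-of-bag accuracy:", accuracy_score(y[oob_idx], model.predict(X[oob_idx])))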

Probabilistic measures

Probabilistic Measures do not just take into account the model performance but also
the model complexity. Model complexity is the measure of the model’s ability to
capture the variance in the data.

For example, a highly biased model like the linear regression algorithm is less
complex and on the other hand, a neural network is very high on complexity.

Another important point to note here is that the model performance taken into account
in probabilistic measures is calculated from the training set only. A hold-out test set is
typically not required.

A notable disadvantage, however, lies in the fact that probabilistic measures do not consider the uncertainty of the models and have a chance of selecting simpler models over complex models.

There are three statistical approaches to estimating how well a given model fits a
dataset and how complex the model is. And each can be shown to be equivalent or
proportional to each other, although each was derived from a different framing or field
of study.

They are:

 Akaike Information Criterion (AIC), derived from frequentist probability
 Bayesian Information Criterion (BIC), derived from Bayesian probability
 Minimum Description Length (MDL), derived from information theory

Akaike Information Criterion (AIC)

It is common knowledge that no model is completely accurate. There is always some information loss, which can be measured using the KL information metric. Kullback-Leibler (KL) divergence is the measure of the difference between the probability distributions of two variables.

The statistician Hirotugu Akaike considered the relationship between KL information and maximum likelihood (in maximum-likelihood estimation, one wishes to maximize the conditional probability of observing a data point X, given the parameters and a specified probability distribution) and developed the concept of the Information Criterion (IC). Therefore, Akaike's IC, or AIC, is a measure of information loss. This is how the discrepancy between two different models is captured, and the model with the least information loss is suggested as the model of choice.

The criterion is commonly computed as

AIC = 2K - 2 ln(L)

where,
K = number of independent variables or predictors
L = maximum likelihood of the model
N = number of data points in the training set (N enters through the small-sample correction of AIC, AICc = AIC + 2K(K + 1)/(N - K - 1), which is especially helpful in the case of small datasets)

The limitation of AIC is that it is not very good at generalizing models, as it tends to
select complex models that lose less training information.

Bayesian Information Criterion (BIC)

BIC was derived from the Bayesian probability concept and is suited for models that
are trained under the maximum likelihood estimation.

It is computed as

BIC = K ln(N) - 2 ln(L)

where,
K = number of independent variables
L = maximum likelihood
N = number of samples/data points in the training set

BIC penalizes the model for its complexity and is preferably used when the size of the
dataset is not very small (otherwise it tends to settle on very simple models).
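As a rough sketch of how these criteria are obtained in practice, the snippet below computes AIC and BIC for an ordinary least-squares fit, using the Gaussian log-likelihood implied by the residuals; the formulas follow the standard definitions given above, and the data are made up.

# Computing AIC and BIC for a least-squares linear model (Gaussian log-likelihood).
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.3, size=100)

Xd = np.column_stack([np.ones(len(X)), X])               # add an intercept column
beta, *_ = np.linalg.lstsq(Xd, y, rcond=None)
resid = y - Xd @ beta

n, k = len(y), Xd.shape[1]
sigma2 = np.sum(resid ** 2) / n
log_lik = -0.5 * n * (np.log(2 * np.pi * sigma2) + 1)    # maximized Gaussian log-likelihood

aic = 2 * k - 2 * log_lik
bic = k * np.log(n) - 2 * log_lik
print(aic, bic)   # lower values indicate a better complexity/fit trade-off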

Minimum Description Length (MDL)

MDL is derived from information theory, which deals with quantities such as entropy that measure the average number of bits required to represent an event from a probability distribution or a random variable.

MDL, or the minimum description length, is the minimum number of such bits required to represent the model together with its predictions. It is computed as

MDL = L(h) + L(D | h)

where,
h = the model
D = the predictions made by the model
L(h) = number of bits required to represent the model
L(D | h) = number of bits required to represent the predictions from the model

The model with the smallest total description length is preferred.

10.8.2 Models Evaluation

Models can be evaluated using multiple metrics. However, the right choice of an evaluation metric is crucial and often depends upon the problem that is being solved. A clear understanding of a wide range of metrics helps the evaluator find an appropriate match between the problem statement and the metric.

Classification metrics

For every classification model prediction, a matrix called the confusion matrix can be
constructed which demonstrates the number of test cases correctly and incorrectly
classified.

Confusion Matrix

A binary classification model classifies each instance into one of two classes; say a
true and a false class. This gives rise to four possible classifications for each instance:
a true positive, a true negative, a false positive, or a false negative. This situation can
be depicted as a confusion matrix (also called contingency table) given in Figure 6.
The confusion matrix juxtaposes the observed classifications for a phenomenon
(columns) with the predicted classifications of a model (rows). In Figure 6, the cells on the main diagonal therefore count the correctly classified instances (true positives and true negatives), while the off-diagonal cells count the misclassifications (false positives and false negatives).

Precision

Precision is the metric used to identify the correctness of classification. It is defined as

Precision = TP / (TP + FP)

Intuitively, this equation is the ratio of correct positive classifications to the total
number of predicted positive classifications. The greater the fraction, the higher is the
precision, which means better is the ability of the model to correctly classify the
positive class. In the problem of predictive maintenance (where one must predict in
advance when a machine needs to be repaired), precision comes into play. The cost of
maintenance is usually high and thus, incorrect predictions can lead to a loss for the
company. In such cases, the ability of the model to correctly classify the positive class
and to lower the number of false positives is paramount.

Recall

Recall tells us the number of positive cases correctly identified out of the total number of positive cases:

Recall = TP / (TP + FN)

In a fraud detection problem, for example, recall is very useful because a high recall value indicates that most of the actual fraud cases were identified out of the total number of frauds.

F1- Score

The F1-score, also known as the F-score, is the harmonic mean of recall and precision and therefore balances out the strengths of each:

F1 = 2 * (Precision * Recall) / (Precision + Recall)

It is useful in cases where both recall and precision are valuable – as in the identification of plane parts that might require repairing. Here, precision will be required to save on the company's cost (because plane parts are extremely expensive) and recall will be required to ensure that the machinery is stable and not a threat to human lives.

The G-score is the geometric mean of precision and recall:

G = sqrt(Precision * Recall)
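A short sketch computing the confusion matrix, precision, recall and F1-score with scikit-learn is given below; the two label vectors are invented purely for illustration (note that scikit-learn places the actual classes on the rows of its confusion matrix, which may be the transpose of the layout used in Figure 6).

# Confusion matrix and the metrics derived from it, for a toy set of predictions.
from sklearn.metrics import confusion_matrix, precision_score, recall_score, f1_score

y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0]

print(confusion_matrix(y_true, y_pred))                 # rows: actual class, columns: predicted class
print("precision:", precision_score(y_true, y_pred))    # TP / (TP + FP)
print("recall   :", recall_score(y_true, y_pred))       # TP / (TP + FN)
print("F1-score :", f1_score(y_true, y_pred))           # harmonic mean of precision and recall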

Receiver Operating Curves (ROC)

Central to constructing, deploying, and using classification models is the question of model performance assessment. Traditionally this is accomplished by using metrics derived from the confusion matrix or contingency table. However, it has been recognized that (a) a scalar is a poor summary of the performance of a model, in particular when deploying non-parametric models such as artificial neural networks or decision trees, and (b) some performance metrics derived from the confusion matrix are sensitive to data anomalies such as class skew.

It has been observed that Receiver Operating Characteristic (ROC) curves visually
convey the same information as the confusion matrix in a much more intuitive and
robust fashion. ROC curves are two-dimensional graphs that visually depict the
performance and performance trade-off of a classification model. ROC curves were
originally designed as tools in communication theory to visually determine optimal
operating points for signal discriminators.

Two new performance metrics have to be introduced here in order to construct ROC curves (they are defined here in terms of the confusion matrix): the true positive rate (TPR) and the false positive rate (FPR):

 True positive rate = TP / (TP + FN) = 1 − false negative rate
 False positive rate = FP / (FP + TN) = 1 − true negative rate
ROC graphs are constructed by plotting the true positive rate against the false positive rate (Figure 7(a)).

A number of regions of interest can be identified in a ROC graph. The diagonal line from the bottom left corner to the top right corner denotes random classifier performance; that is, a classification model mapped onto this line produces as many false positive responses as it produces true positive responses. To the bottom left of the random performance line is the conservative performance region. Classifiers in this region commit few false positive errors.

In the extreme case, denoted by the point in the bottom left corner, a conservative classification model will classify all instances as negative. In this way it will not commit any false positives, but it will also not produce any true positives. The region of classifiers with liberal performance occupies the top of the graph. These classifiers have a good true positive rate but also commit substantial numbers of false positive errors.

Again, in the extreme case denoted by the point in the top right corner, we have
classification models that classify every instance as positive. In that way, the
classifier will not miss any true positives but it will also commit a very large number
of false positives. Classifiers that fall in the region to the right of the random
performance line have a performance worse than random performance, that is, they
consistently produce more false positive responses than true positive responses.

However, because ROC graphs are symmetric along the random performance line,
inverting the responses of a classifier in the “worse than random performance” region
will turn it into a well performing classifier in one of the regions above the random
performance line. Finally, the point in the top left corner denotes perfect
classification: 100% true positive rate and 0% false positive rate.



Figure 7: ROC curves: (a) Regions of a ROC graph (b) An almost perfect classifier
(c) A reasonable classifier (d) A poor classifier

The point marked with A is the classifier from the earlier discussion, with a TPR = 0.90 and an FPR = 0.35. Note that the classifier is mapped to the same point in the ROC graph regardless of whether we use the original test set or the test set with the sampled-down negative class, illustrating the fact that ROC graphs are not sensitive to class skew. Classifiers mapped onto a ROC graph can be ranked according to their distance to the 'perfect performance' point. In Figure 7(a), classifier A is considered to be superior to a hypothetical classifier B because A is closer to the top left corner.

The true power of ROC curves, however, comes from the fact that they characterize the performance of a classification model as a curve rather than a single point on the ROC graph. Figure 7 also shows some typical examples of ROC curves. Part (b) depicts the ROC curve of an almost perfect classifier, where the performance curve almost touches the 'perfect performance' point in the top left corner. Parts (c) and (d) depict ROC curves of inferior classifiers. At this level the curves provide a convenient visual representation of the performance of various models, where it is easy to spot optimal versus sub-optimal models.
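The sketch below, assuming scikit-learn and a synthetic dataset, computes the points of a ROC curve and the area under it (AUC) from the predicted class probabilities of a classifier.

# Computing a ROC curve and its area from predicted probabilities.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve, roc_auc_score

X, y = make_classification(n_samples=1000, random_state=3)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=3)

probs = LogisticRegression(max_iter=1000).fit(X_tr, y_tr).predict_proba(X_te)[:, 1]

fpr, tpr, thresholds = roc_curve(y_te, probs)   # one (FPR, TPR) point per decision threshold
print("AUC:", roc_auc_score(y_te, probs))       # 1.0 = perfect classifier, 0.5 = random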

Log Loss

Log loss is a very effective classification metric and is equivalent to -1* log
(likelihood function) where the likelihood function suggests how likely the model
thinks the observed set of outcomes was. Since the likelihood function provides very
small values, a better way to interpret them is by converting the values to log and the
negative is added to reverse the order of the metric such that a lower loss score
suggests a better model.
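For a binary problem the metric can be written as

Log loss = -(1/N) * Σ [ y_i * log(p_i) + (1 - y_i) * log(1 - p_i) ]

where y_i is the true label (0 or 1) and p_i is the predicted probability of the positive class. A tiny sketch with scikit-learn's log_loss function follows; the probability values are invented for illustration.

# Log loss for a handful of predictions (lower is better).
from sklearn.metrics import log_loss

y_true = [1, 0, 1, 1]
y_prob = [0.9, 0.2, 0.6, 0.4]    # predicted probabilities of class 1
print(log_loss(y_true, y_prob))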

Gain and Lift Charts

Gain and lift charts are tools that evaluate model performance just like the confusion
matrix but with a subtle, yet significant difference. The confusion matrix determines
the performance of the model on the whole population or the entire test set, whereas
the gain and lift charts evaluate the model on portions of the whole population.
Therefore, we have a score (y-axis) for every % of the population (x-axis). Lift charts
measure the improvement that a model brings in compared to random predictions. The
improvement is referred to as the ‘lift’.

K-S Chart

The K-S chart or Kolmogorov-Smirnov chart determines the degree of separation between two distributions – the positive class distribution and the negative class distribution. The higher the difference, the better the model is at separating the positive and negative cases.

Check Your Progress 1:

1. Define Classification in Data Mining.


……………………………………………………………………………………
……………………………………………………………………………………
……………………………………………………………………………………
2. What are the steps involved in building a Classification Model? Explain.
……………………………………………………………………………………
……………………………………………………………………………………
…………………………………………………………………………………...
3. List the advantages, disadvantages and applications of Decision Tree
Classification Technique.
……………………………………………………………………………………
……………………………………………………………………………………
……………………………………………………………………………………

4. List the advantages, disadvantages and applications of the Naive Bayes Classification Algorithm.
……………………………………………………………………………………
……………………………………………………………………………………
……………………………………………………………………………………
5. List the advantages, disadvantages and applications of SVM Classification
Algorithm.
…………………………………………………………………………………
…………………………………………………………………………………
………………………………………………………………………………… 

10.9 SUMMARY

In this unit, we covered classification, classification techniques, and model evaluation and selection.
Classification is one of the most commonly used data mining techniques. It can be used with both categorical and numerical attributes. The goal is to predict the class labels of new, unseen observations by using training data consisting of labeled examples. This method makes use of an algorithm to identify patterns in the training data that are predictive of new observations.
Classification Techniques are methods of data analysis that can be used to determine
the categorization of an individual based on their personal attributes. These techniques
help us better understand individuals by grouping them together depending on their
lifestyle, habits, and traits.
Decision Tree: A decision tree is a class discriminator that iteratively splits the
training set until each partition contains only or primarily samples from one class. A
split point is a test that describes how data is partitioned in each non-leaf node of the
tree based on one or more qualities.
Naive Bayes: Naive Bayes is a probabilistic model that is widely used in machine learning. In this model, a probability is calculated for each class to determine the categorization, which is then used to predict the class of a new instance.
Rule-Based Classification: Classification rules are “if-then” rules, where the “if” part is a condition. Individual rules are ranked: rule-based ordering ranks the rules according to their quality, while class-based ordering groups together rules that belong to the same class. A good rule should be error-free and cover as many cases as possible.
Support Vector Machine: The Support Vector Machine (SVM) is a classification technique that can be used to build both classifiers and non-parametric regression models. SVM works by finding an optimal hyperplane that separates objects of different classes in the input space based on their training samples. It is a classification method for both linear and non-linear data: it transforms the original training data into a higher dimension using a non-linear mapping and then searches, in this higher-dimensional space, for the linear optimal separating hyperplane (i.e., the “decision boundary”).

With a suitable non-linear mapping to a sufficiently high dimension, data from two classes can always be separated by a hyperplane. SVM uses support vectors (“important” training tuples) and margins (defined by the support vectors) to discover this hyperplane. SVM is used for classification as well as prediction.

More on classification can be studied in your third semester course MCS-224 Artificial Intelligence and Machine Learning.

In the next unit, we will focus on Clustering.

10.10 SOLUTIONS/ANSWERS
Check Your Progress 1:
1.
Classification is a process that assigns an object or event to one of the predefined
classes in a group. It’s based on their characteristics in order to be able to predict their
future behavior. Classification methods are used when the data set has already been
divided into groups before the classification process begins. The accuracy often
depends on the preprocessing of the data which involves data cleaning (missing
values, null values and blank values), data integration from multiple sources, data
transformation and discretization.

Classification is a single step in the data mining process. It is used for organizing
objects based on some key features. Several approaches such as the K-Nearest
Neighbors classification, Decision Tree Learning and Support Vector Machines are
employed for data mining classification.

2.
Following are the steps involved in building a classification model:
 Initialize the classifier to be used.
 Train the classifier: all classifiers in scikit-learn use a fit(X, y) method to fit (train) the model on the given training data X and training labels y.
 Predict the target: given an unlabeled observation X, predict(X) returns the predicted label y.
 Evaluate the classifier model.

3.
Advantages of Decision Tree Classification Algorithm
 This algorithm allows for an uncomplicated representation of data. So, it is
easier to interpret and explain it to executives.
 Decision Trees mimic the way humans make decisions in everyday life.
 They smoothly handle qualitative target variables.
 They handle non-linear data effectively.

Disadvantages of Decision Tree Classification Algorithm
 They may create complex trees which sometimes become irrelevant.
 They do not have the same level of prediction accuracy as compared to other algorithms.

Applications of Decision Tree Classification Algorithm
 Sentiment Analysis: It is used as a classification algorithm in text mining to determine a customer's sentiment towards a product.
 Product Selection: Companies can use decision trees to realize which product will give them higher profits on launching.

4.
Advantages of Naive Bayes Classification Algorithm
 It is simple, and its implementation is straightforward.
 The time required by the machine to learn the pattern using this classifier is
less.
 It performs well in the case where the input variables have categorical values.
 It gives good results for complex real-world problems.
 It performs well in the case of multi-class classification.

Disadvantages of Naive Bayes Classification Algorithm
 It assumes independence among feature variables which may not always be
the case.
 We often refer to it as a bad estimator, and hence the probabilities are not
always of great significance.
 If, during the training time, the model was not aware of any of the categorical
variables and that variable is passed during testing, the model assigns 0 (zero)
likelihood and thus substitutes zero probability referred to as 'zero frequency.'
One can avoid this situation by using smoothing procedures such as Laplace
estimation.

Applications of Naive Bayes Classification Algorithm
 Spam Classification: Identifying whether an e-mail is a spam or not based on
the content of the e-mail
 Live Prediction System: This model is relatively fast and thus predicts the
target variable in real-time.
 Sentiment Analysis: Recognising feedback on a product and classifying it as
“positive” or “negative”.
 Multi-Class Prediction: Naive Bayes works well for multi-class classification
machine learning problems.
5.
Advantages of SVM Classification Algorithm
 It makes training the dataset easy.
 It performs well when the data is high-dimensional.

Disadvantages of SVM Classification Algorithm
 It doesn't perform well when the data has noisy elements.
 It is sensitive to kernel functions, so they have to be chosen wisely.

Applications of SVM Classification Algorithm
 Face Detection: It is used to read through images (an array of pixel numbers)
and identify whether it contains a face or not based on usual human features.
 Image Classification: SVM is one of the image classification algorithms used
to classify images based on their characteristics.
 Handwritten Character Recognition: We can use it to identify handwritten
characters.

10.11 FURTHER READINGS

1. Data Mining: Concepts and Techniques, 3rd Edition, Jiawei Han, Micheline
Kamber, Jian Pei, Elsevier, 2012.
2. Data Mining, Charu C. Aggarwal, Springer, 2015.
3. Data Mining and Data Warehousing – Principles and Practical Techniques,
Parteek Bhatia, Cambridge University Press, 2019.
4. Introduction to Data Mining, Pang Ning Tan, Michael Steinbach, Anuj
Karpatne, Vipin Kumar, Pearson, 2018.
5. Data Mining Techniques and Applications: An Introduction, Hongbo Du,
Cengage Learning, 2013.
6. Data Mining : Vikram Pudi and P. Radha Krishna, Oxford, 2009.
7. Data Mining and Analysis – Fundamental Concepts and Algorithms;
Mohammed J. Zaki, Wagner Meira, Jr, Oxford, 2014.


UNIT 12 TEXT AND WEB MINING


Structure

12.0 Introduction
12.1 Objectives
12.2 Text Mining and its Applications
12.3 Text Preprocessing
12.4 BoW and TF-IDF For Creating Features from Text
12.4.1 Bag of Words
12.4.2 Vector Space Modeling for Representing Text Documents
12.4.3 Term Frequency-Inverse Document Frequency
12.5 Dimensionality Reduction
12.5.1 Techniques for Dimensionality Reduction
12.5.1.1 Feature Selection Techniques
12.5.1.2 Feature Extraction Techniques
12.6 Web Mining
12.6.1 Features of Web Mining
12.6.2 Web Mining Tasks
12.6.3 Applications of Web Mining
12.7 Types of Web Mining
12.7.1 Web Content Mining
12.7.2 Web Structure Mining
12.7.3 Web Usage Mining
12.8 Mining Multimedia Data on the Web
12.9 Automatic Classification of Web Documents
12.10 Summary
12.11 Solutions/Answers
12.12 Further Readings

12.0 INTRODUCTION

In the earlier unit, we had studied about the Clustering. In this unit let us focus on the
text and web mining aspects. This unit covers the introduction to text mining, text data
analysis and information retrieval, text mining approaches and topics related to web
mining.

12.1 OBJECTIVES
After going through this unit, you should be able to:
 understand the significance of Text Mining
 describe the dimensionality reduction of text
 narrate text mining approaches
 discuss the purpose of web mining and web structure mining
 describe mining the multimedia data on the web and web usage mining.


12.2 TEXT MINING AND ITS APPLICATIONS

Text mining, also known as text data mining, is the process of transforming
unstructured text into a structured format to identify meaningful patterns and new
insights. By applying advanced analytical techniques, such as Naïve Bayes, Support
Vector Machines (SVM), and other deep learning algorithms, companies are able to
explore and discover hidden relationships within their unstructured data.

Text is a one of the most common data types within databases. Depending on the
database, this data can be organized as:

 Structured data: This data is standardized into a tabular format with numerous
rows and columns, making it easier to store and process for analysis and
machine learning algorithms. Structured data can include inputs such as
names, addresses, and phone numbers.
 Unstructured data: This data does not have a predefined data format. It can include text from sources like social media or product reviews, or rich media formats like video and audio files.
 Semi-structured data: As the name suggests, this data is a blend between
structured and unstructured data formats. While it has some organization, it
doesn’t have enough structure to meet the requirements of a relational
database. Examples of semi-structured data include XML, JSON and HTML
files.

Since 80% of data in the world resides in an unstructured format, text mining is an
extremely valuable practice within organizations. Text mining tools and Natural
Language Processing (NLP) techniques, like information extraction, allow us to
transform unstructured documents into a structured format to enable analysis and the
generation of high-quality insights. This, in turn, improves the decision-making of
organizations, leading to better business outcomes.

For example, consider the tweets or messages on WhatsApp, Facebook, Instagram or text messages. The majority of this data exists in textual form, which is highly unstructured in nature. In order to produce significant and actionable insights from such text data, it is important to get acquainted with the techniques of text analysis.

Text analysis or text mining is the process of deriving meaningful information from natural language. It usually involves structuring the input text, deriving patterns within the structured data, and finally evaluating the interpreted output. Compared with the kind of data stored in databases, text is unstructured, amorphous and difficult to deal with algorithmically. Nevertheless, in modern culture, text is the most common vehicle for the formal exchange of information. As text mining refers to the process of deriving high-quality information from text, the overall goal here is to turn the text into data for analysis.

Text mining has various areas to explore, as described below.

Information Extraction is the technique of extracting information from unstructured or semi-structured data contained in electronic documents. The process identifies entities in the unstructured text documents, classifies them, and stores them in databases.

Natural Language Processing (NLP): Human language can be found in WhatsApp chats, blogs, social media reviews or reviews written in offline documents, and it is processed by applying NLP, or natural language processing. NLP refers to the artificial intelligence method of communicating with an intelligent system using natural language. By utilizing NLP and its components, one can organize massive chunks of textual data, perform numerous automated tasks, and solve a wide range of problems such as automatic summarization, machine translation, speech recognition and topic segmentation.

Data Mining: Data mining refers to the extraction of useful data, hidden patterns from
large data sets. Data mining tools can predict behaviors and future trends that allow
businesses to make a better data-driven decision. Data mining tools can be used to
resolve many business problems that have traditionally been too time-consuming.

Information Retrieval: Information retrieval deals with retrieving useful data from
data that is stored in our systems. Alternately, as an analogy, we can view search
engines that happen on websites such as e-commerce sites or any other sites as part of
information retrieval.

Text mining often includes the following techniques:

 Information extraction is a technique for extracting domain-specific information from texts. Text fragments are mapped to field or template slots that have a definite semantic meaning.
 Text summarization involves identifying, summarizing and organizing related text so that users can efficiently deal with information in large documents.
 Text categorization involves organizing documents into a taxonomy, thus allowing for more efficient searches. It involves the assignment of subject descriptors, classification codes or abstract concepts to complete texts.
 Text clustering involves automatically clustering documents into groups where documents within each group share common features.

12.2.1 Applications of Text Mining

Following are some of the applications of Text Mining:

 Customer service: There are various ways in which we invite customer feedback from our users. When combined with text analytics tools, feedback systems such as chatbots, customer surveys, Net Promoter Scores, online
systems such as chatbots, customer surveys, Net-Promoter Scores, online
reviews, support tickets, and social media profiles, enable companies to
improve their customer experience with speed. Text mining and sentiment
analysis can provide a mechanism for companies to prioritize key pain points
for their customers, allowing businesses to respond to urgent issues in real-
time and increase customer satisfaction.
 Risk management: Text mining also has applications in risk management. It
can provide insights around industry trends and financial markets by
monitoring shifts in sentiment and by extracting information from analyst
reports and whitepapers. This is particularly valuable to banking institutions as
this data provides more confidence when considering business investments
across various sectors.

 Maintenance: Text mining provides a rich and complete picture of the operation and functionality of products and machinery. Over time, text mining
automates decision making by revealing patterns that correlate with problems
and preventive and reactive maintenance procedures. Text analytics helps
maintenance professionals unearth the root cause of challenges and failures
faster.
 Healthcare: Text mining techniques have been increasingly valuable to
researchers in the biomedical field, particularly for clustering information.
Manual investigation of medical research can be costly and time-consuming;
text mining provides an automation method for extracting valuable information
from medical literature.
 Spam filtering: Spam frequently serves as an entry point for hackers to infect
computer systems with malware. Text mining can provide a method to filter
and exclude these e-mails from inboxes, improving the overall user experience
and minimizing the risk of cyber-attacks to end users.

12.2.2 Text Analytics

Text mining emphasizes the process, whereas text analytics emphasizes the result. Both text mining and analytics aim to turn text data into high-quality information or actionable knowledge.

Text analytics is a sub-set of Natural Language Processing (NLP) that aims to automate the extraction and classification of actionable insights from unstructured text such as emails, tweets, chats, tickets, reviews, and survey responses scattered all over the internet.

Text analytics or text mining is multi-faceted and anchors NLP to gather and process
text and other language data to deliver meaningful insights.

12.2.3 Need for Text Analytics

Need for Text Analytics is to:

Maintain Consistency: Manual tasks are repetitive and tiring. Humans tend to make errors while performing such tasks, and, on top of everything else, performing such tasks is time-consuming. Cognitive bias is another factor that hinders consistency in data analysis. Leveraging advanced algorithms like text analytics techniques enables quick and collective analysis to be performed rationally, and provides reliable and consistent data.

Scalability: With text analytics techniques, enormous data across social media, emails,
chats, websites, and documents can be structured and processed without difficulty,
helping businesses improve efficiency with more information.

Real-time Analysis: Real-time data in today's world is a game-changer. Evaluating this information with text analytics allows businesses to detect and attend to urgent matters without delay. Applications of text analytics enable monitoring and automated flagging of tweets, shares and likes, and spotting expressions and sentiments that convey urgency or negativity.


The simplest traditional process of text mining is: text preprocessing, text transformation (attribute generation), feature selection (attribute selection), data mining, and evaluation. In the next sections we will study them one by one.

Check Your Progress 1:

1) Define structured, un-structured and semi-structured data with some examples for each.
……………………………………………………………………………………………
……………………………………………………………………………………………
……………………………………………………………………………………………
2) Differentiate between Text Mining and Text Analytics.
……………………………………………………………………………………………
……………………………………………………………………………………………
……………………………………………………………………………………………

12.3 TEXT PREPROCESSING

Text preprocessing is an approach for cleaning and preparing text data for use in a
specific context. Developers use it in almost all natural language processing (NLP)
pipelines, including voice recognition software, search engine lookup, and machine
learning model training. It is an essential step because text data can vary. From its
format (website, text message, voice recognition) to the people who create the text
(language, dialect), there are plenty of things that can introduce noise into your data.
The ultimate goal of cleaning and preparing text data is to reduce the text to only the
words that you need for your NLP goals.

Noise Removal: Text cleaning is a technique that developers use in a variety of domains. Depending on the goal of your project and where you get your data from, you may want to remove unwanted information, such as:

 Punctuation and accents
 Special characters
 Numeric digits
 Leading, ending, and vertical whitespace
 HTML formatting

The type of noise that you need to remove from text usually depends on its source.

Stages such as stemming, lemmatization, and text normalization make the vocabulary
size more manageable and transform the text into a more standard form across a
variety of documents acquired from different sources.


Figure 1: Text Preprocessing

Once you have a clear idea of the type of application you are developing and the
source and nature of text data, you can decide on which preprocessing stages can be
added to your NLP pipeline. Most of the NLP toolkits on the market include options
for all of the preprocessing stages discussed above.

An NLP pipeline for document classification might include steps such as sentence
segmentation, word tokenization, lowercasing, stemming or lemmatization, stop word
removal, spelling correction and Normalization as shown in Fig 1. Some or all of
these commonly used text preprocessing stages are used in typical NLP systems,
although the order can vary depending on the application.

a) Segmentation

Segmentation involves breaking up text into corresponding sentences. While this may
seem like a trivial task, it has a few challenges. For example, in the English language,
a period normally indicates the end of a sentence, but many abbreviations, including
“Inc.,” “Calif.,” “Mr.,” and “Ms.,” and all fractional numbers contain periods and
introduce uncertainty unless the end-of-sentence rules accommodate those exceptions.

b) Tokenization

For many natural language processing tasks, we need access to each word in a string.
To access each word, we first have to break the text into smaller components. The
method for breaking text into smaller components is called tokenization and the
individual components are called tokens as shown in Fig 2.

A few common operations that require tokenization include:

 Finding how many words or sentences appear in a text
 Determining how many times a specific word or phrase exists
 Accounting for which terms are likely to co-occur

While tokens are usually individual words or terms, they can also be sentences or
other size pieces of text.

Many NLP toolkits allow users to input multiple criteria based on which word
boundaries are determined. For example, you can use a whitespace or punctuation to
determine if one word has ended and the next one has started. Again, in some
instances, these rules might fail. For example, don’t, it’s, etc. are words themselves
that contain punctuation marks and have to be dealt with separately.

Figure 2: Tokenization
c) Normalization

Tokenization and noise removal are staples of almost all text pre-processing pipelines.
However, some data may require further processing through text normalization.
Text normalization is a catch-all term for various text pre-processing tasks. A few of them are covered below:

 Upper or lowercasing
 Stopword removal
 Stemming – bluntly removing prefixes and suffixes from a word
 Lemmatization – replacing a single-word token with its root

Change Case

Changing the case involves converting all text to lowercase or uppercase so that all
word strings follow a consistent format. Lowercasing is the more frequent choice in
NLP software.

Spell Correction

Many NLP applications include a step to correct the spelling of all words in the text.

Stop-Words Removal
“Stop words” are frequently occurring words used to construct sentences. In the
English language, stop words include is, the, are, of, in, and and. For some NLP
applications, such as document categorization, sentiment analysis, and spam filtering,
these words are redundant, and so are removed at the preprocessing stage. See the
Table 1 below given the sample text with stop words and without stop words.

Table 1: Sample Text with Stop Words and without Stop Words

Sample Text with Stop Words                    Without Stop Words
TextMining – A technique of data mining        TextMining, technique, datamining,
for analysis of web data                       analysis, web, data
The movie was awesome                          Movie, awesome
The product quality is bad                     Product, quality, bad


Stemming

The term word stem is borrowed from linguistics and used to refer to the base or root
form of a word. For example, learn is a base word for its variants such as learn,
learns, learning, and learned.

Stemming is the process of converting all words to their base form, or stem. Normally,
a lookup table is used to find the word and its corresponding stem. Many search
engines apply stemming for retrieving documents that match user queries. Stemming
is also used at the preprocessing stage for applications such as emotion identification
and text classification. An example is given in the Fig 3.

Figure 3: Example of Stemming


Lemmatization

Lemmatization is a more advanced form of stemming and involves converting all words to their corresponding root form, called the “lemma.” While stemming reduces all words to their stem via a lookup table, it does not employ any knowledge of the parts of speech or the context of the word. This means stemming can’t distinguish which meaning of the word right is intended in the sentences “Please turn right at the next light” and “She is always right.”

The stemmer would stem right to right in both sentences; the lemmatizer would treat
right differently based upon its usage in the two phrases.

A lemmatizer also converts different word forms or inflections to a standard form. For
example, it would convert less to little, wrote to write, slept to sleep, etc.

A lemmatizer works with more rules of the language and contextual information than
does a stemmer. It also relies on a dictionary to look up matching words. Because of
that, it requires more processing power and time than a stemmer to generate output.
For these reasons, some NLP applications only use a stemmer and not a
lemmatizer. In the below given Fig 4, difference between lemmatization and
stemming is illustrated.

Figure 4: Illustration of Lemmatization and Stemming


Parts of Speech Tagging

One of the more advanced text preprocessing techniques is parts of speech (POS)
tagging. This step augments the input text with additional information about the
sentence’s grammatical structure. Each word is, therefore, inserted into one of the
predefined categories such as a noun, verb, adjective, etc. This step is also sometimes
referred to as grammatical tagging.
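A compact sketch of these preprocessing stages using the NLTK toolkit is given below; NLTK is only one of several possible libraries, the example sentence is made up, and the nltk.download() calls fetch the required language resources on first use.

# Tokenization, stop-word removal, stemming, lemmatization and POS tagging with NLTK.
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer

for pkg in ["punkt", "stopwords", "wordnet", "averaged_perceptron_tagger"]:
    nltk.download(pkg, quiet=True)          # fetch resources on the first run

text = "The movies were not scary and the viewers were sleeping."
tokens = [t.lower() for t in word_tokenize(text) if t.isalpha()]   # tokenize, lowercase, drop punctuation

stop_words = set(stopwords.words("english"))
tokens = [t for t in tokens if t not in stop_words]                # stop-word removal

print([PorterStemmer().stem(t) for t in tokens])                   # stemming
print([WordNetLemmatizer().lemmatize(t) for t in tokens])          # lemmatization
print(nltk.pos_tag(tokens))                                        # parts-of-speech tagging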

12.4 TEXT TRANSFORMATION USING BoW AND TF-IDF

We understand a sentence in a fraction of a second, but machines simply cannot process text data in raw form. They need us to break down the text into a numerical format that is easily readable by the machine. This is where the concepts of Bag-of-Words (BoW) and Term Frequency-Inverse Document Frequency (TF-IDF) come into play. Both BoW and TF-IDF are techniques that help us convert text sentences into numeric vectors.
For example, consider some sample reviews of a movie. The reviews of the viewers can be:

 Review 1: This movie is very scary and long
 Review 2: This movie is not scary and is slow
 Review 3: This movie is spooky and good

You can easily observe three different opinions from three different viewers, and you can see thousands of reviews about a movie on the internet. All this user-generated text can help us gauge how a movie has performed. However, the three reviews mentioned above cannot be given directly to a machine learning engine to analyze whether they are positive or negative. So, we apply text representation techniques like Bag of Words.

12.4.1 Bag of words (BoW)

It is a model in which text is represented in the form of numbers. The Bag of Words (BoW) model is the simplest form of text representation in numbers: as the term itself suggests, we represent a sentence as a bag-of-words vector (a string of numbers).

Consider once again the 3 movie reviews:

 Review 1: This movie is very scary and long
 Review 2: This movie is not scary and is slow
 Review 3: This movie is spooky and good

We will first build a vocabulary from all the unique words in the above three reviews. The
vocabulary consists of these 11 words: ‘This’, ‘movie’, ‘is’, ‘very’, ‘scary’, ‘and’, ‘long’,
‘not’, ‘slow’, ‘spooky’, ‘good’.


We can now take each of these words and mark their occurrence in the three movie
reviews above with 1s and 0s. This will give us 3 vectors for 3 reviews as shown in the
Table 2 below:

Table 2: Vector Representation for the Reviews

             1    2     3  4    5     6   7    8   9    10     11
             This movie is very scary and long not slow spooky good   Length of the Review (in words)
Review 1     1    1     1  1    1     1   1    0   0    0      0      7
Review 2     1    1     2  0    1     1   0    1   1    0      0      8
Review 3     1    1     1  0    0     1   0    0   0    1      1      6

Vector of Review 1: [1 1 1 1 1 1 1 0 0 0 0]

Vector of Review 2: [1 1 2 0 1 1 0 1 1 0 0]

Vector of Review 3: [1 1 1 0 0 1 0 0 0 1 1]

And that’s the core idea behind a Bag of Words (BoW) model.
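The same bag-of-words vectors can be produced with scikit-learn's CountVectorizer, as sketched below; note that its default tokenizer lowercases the text, so the ordering of the learned vocabulary may differ slightly from Table 2.

# Building bag-of-words vectors for the three reviews with scikit-learn.
from sklearn.feature_extraction.text import CountVectorizer

reviews = ["This movie is very scary and long",
           "This movie is not scary and is slow",
           "This movie is spooky and good"]

vectorizer = CountVectorizer()
bow = vectorizer.fit_transform(reviews)        # sparse document-term count matrix

print(vectorizer.get_feature_names_out())      # the learned vocabulary
print(bow.toarray())                           # one count vector per review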

Drawbacks of using a BoW

In the above example, we can have vectors of length 11. However, we start facing
issues when we come across new sentences:

 If the new sentences contain new words, then our vocabulary size would
increase and thereby, the length of the vectors would increase too.
 Additionally, the vectors would also contain many 0s, thereby resulting in a
sparse matrix (which is what we would like to avoid)
 We are retaining no information on the grammar of the sentences nor on the
ordering of the words in the text.

12.4.2 Vector Space Modeling for Representing Text Documents

The fundamental idea of a vector space model for text is to treat each distinct term as its own dimension. So, let's say you have a document D of length M words, and we say wi is the ith word in D, where i ∈ [1...M]. Furthermore, the set of distinct words among the wi forms a set called the vocabulary or, more evocatively, the term space, often denoted V.

Here’s an example:

Let our actual document D be: "He is neither a friend nor is he a foe"

Then M=10, and w3="neither". Our term space consists of all distinct terms
in D: V={"He","is","neither","a","friend","nor","foe"}


Now, let's impose an (arbitrary) ordering on V, so that we form a basis B of terms. In this basis, vi refers to the ith term in the vocabulary (i.e., we convert the Python "set" V to a Python "sequence" B). Think B = list(V):

B := ["He","is","neither","a","friend","nor","foe"]

What we have done is define a basis for a vector space. In this example, we have
defined a 7-dimensional vector space, where each term vi represents an orthogonal
axis in a coordinate system much like the traditional x,y,z axes.

With this space, we now have a convenient way of describing documents: Each
document can be represented as a 7-dimensional vector (n1,...,n7) where ni is
the number of times term vi occurs in D (also called the "term frequency"). In our example, we would represent D by projecting it onto our basis B, resulting in the following vector:

D||B = (2,2,1,2,1,1,1)

This representation forms the core of most text mining methods. For example, you can
measure similarity between two documents as the cosine of the angle between their
associated vectors. There are many more uses of this method for encoding documents
(e.g., see TF-IDF as a refinement of the basic vector space model which is given
below).
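As a small illustration of working with such vectors, the sketch below computes the cosine similarity between two term-frequency vectors with NumPy; the second toy document is an assumption added for the example.

# Cosine similarity between two documents represented in the same term space.
import numpy as np

# Term space (basis): ["He", "is", "neither", "a", "friend", "nor", "foe"]
d1 = np.array([2, 2, 1, 2, 1, 1, 1])   # "He is neither a friend nor is he a foe"
d2 = np.array([1, 1, 0, 1, 1, 0, 0])   # e.g. "He is a friend"

cosine = d1 @ d2 / (np.linalg.norm(d1) * np.linalg.norm(d2))
print(cosine)   # 1.0 = same direction (very similar), 0.0 = no terms in common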

12.4.3 Term Frequency-Inverse Document Frequency (TF-IDF)

Term frequency–inverse document frequency is a numerical statistic that is intended to reflect how important a word is to a document in a collection or corpus.

Term Frequency (TF)

Let's first understand Term Frequency (TF). It is a measure of how frequently a term, t, appears in a document, d:

TF(t, d) = n / (total number of terms in the document d)

Here, in the numerator, n is the number of times the term “t” appears in the document “d”. Thus, each document and term would have its own TF value.

Consider the 3 reviews as shown below:

 Review 1: This movie is very scary and long
 Review 2: This movie is not scary and is slow
 Review 3: This movie is spooky and good

We will again use the same vocabulary we had built in the Bag-of-Words model to
show how to calculate the TF for Review #2:

Review 2: This movie is not scary and is slow


Here,

 Vocabulary: ‘This’, ‘movie’, ‘is’, ‘very’, ‘scary’, ‘and’, ‘long’, ‘not’, ‘slow’,
‘spooky’, ‘good’
 Number of words in Review 2 = 8
 TF for the word ‘this’ = (number of times ‘this’ appears in review 2)/(number
of terms in review 2) = 1/8
Similarly,
 TF(‘movie’) = 1/8
 TF(‘is’) = 2/8 = 1/4
 TF(‘very’) = 0/8 = 0
 TF(‘scary’) = 1/8
 TF(‘and’) = 1/8
 TF(‘long’) = 0/8 = 0
 TF(‘not’) = 1/8
 TF(‘slow’) = 1/8
 TF( ‘spooky’) = 0/8 = 0
 TF(‘good’) = 0/8 = 0

We can calculate the term frequencies for all the terms and all the reviews in this manner.

Inverse Document Frequency (IDF)

IDF is a measure of how important a term is. We need the IDF value because computing just the TF alone is not sufficient to understand the importance of words:

IDF(t) = log(number of documents / number of documents containing the term t)

We can calculate the IDF values for all the words in Review 2:
IDF(‘this’) = log(number of documents/number of documents containing the word ‘this’) = log(3/3) = log(1) = 0

Similarly,

• IDF(‘movie’) = log(3/3) = 0
• IDF(‘is’) = log(3/3) = 0
• IDF(‘not’) = log(3/1) = log(3) = 0.48
• IDF(‘scary’) = log(3/2) = 0.18
• IDF(‘and’) = log(3/3) = 0
• IDF(‘slow’) = log(3/1) = 0.48

We can calculate the IDF values for each word like this (the logarithm here is taken to
base 10). Thus, the IDF values can be obtained for the entire vocabulary.

Hence, we see that common words like “is”, “this”, “and”, etc., have their IDF reduced
to 0 and carry little importance, while words like “scary”, “long”, “good”, etc. are more
important and thus have a higher IDF value.

We can now compute the TF-IDF score for each word in the corpus. Words with a
higher score are more important, and those with a lower score are less important:

TF-IDF(t, d) = TF(t, d) * IDF(t)

We can now calculate the TF-IDF score for every word in Review 2:

TF-IDF(‘this’, Review 2) = TF(‘this’, Review 2) * IDF(‘this’) = 1/8 * 0 = 0
Similarly,

• TF-IDF(‘movie’, Review 2) = 1/8 * 0 = 0
• TF-IDF(‘is’, Review 2) = 1/4 * 0 = 0
• TF-IDF(‘not’, Review 2) = 1/8 * 0.48 = 0.06
• TF-IDF(‘scary’, Review 2) = 1/8 * 0.18 = 0.023
• TF-IDF(‘and’, Review 2) = 1/8 * 0 = 0
• TF-IDF(‘slow’, Review 2) = 1/8 * 0.48 = 0.06

Similarly, we can calculate the TF-IDF scores for all the words with respect to all the
reviews.


We have now obtained the TF-IDF scores for our vocabulary. TF-IDF gives larger
values to less frequent words and is high when both the IDF and TF values are high,
i.e., when the word is rare across the documents combined but frequent within a single
document.
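
The hand calculations above can be reproduced with a short Python sketch; it uses the
base-10 logarithm so that the IDF values match the figures quoted above (up to
rounding), and the three review strings are taken directly from the example.

import math

reviews = [
    "This movie is very scary and long",
    "This movie is not scary and is slow",
    "This movie is spooky and good",
]
docs = [review.lower().split() for review in reviews]
n_docs = len(docs)

def tf(term, doc):
    return doc.count(term) / len(doc)           # term frequency within one document

def idf(term):
    df = sum(1 for doc in docs if term in doc)  # number of documents containing the term
    return math.log10(n_docs / df)

# TF-IDF for every distinct word of Review 2 (index 1)
for term in sorted(set(docs[1])):
    print(term, round(tf(term, docs[1]) * idf(term), 3))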

12.5 DIMENSIONALITY REDUCTION

The number of input features, variables, or columns present in a given dataset is
known as dimensionality, and the process to reduce these features is called
dimensionality reduction.

Dimensionality reduction is the process of reducing the number of random variables
or attributes under consideration. High-dimensional data reduction, as part of a data
pre-processing step, is extremely important in many real-world applications, and it has
emerged as one of the significant tasks in data mining. For example, you may have a
dataset with hundreds of features (columns in your database). Dimensionality reduction
means reducing those features or attributes by combining or merging them in such a
way that not much of the significant characteristics of the original dataset is lost. One of
the major problems that occurs with high-dimensional data is widely known as the
“Curse of Dimensionality”. This pushes us to reduce the dimensions of our data if we
want to use them for analysis.

Curse of Dimensionality

Handling high-dimensional data is very difficult in practice; this difficulty is commonly
known as the curse of dimensionality. As the dimensionality of the input dataset
increases, any machine learning algorithm and model becomes more complex. As the
number of features increases, the number of samples needed to learn reliably also
increases, and the chance of overfitting grows. A machine learning model trained on
high-dimensional data therefore tends to overfit and give poor performance on new
data.

Hence, it is often required to reduce the number of features, which can be done with
dimensionality reduction.


Some benefits of applying dimensionality reduction technique to the given dataset are
given below:

• By reducing the dimensions of the features, the space required to store the dataset
  also gets reduced.
• Less computation (training) time is required with a reduced number of features.
• Reduced dimensions of the dataset help in visualizing the data quickly.
• It removes redundant features (if present) by taking care of multi-collinearity.

12.5.1 Techniques for Dimensionality Reduction

Dimensionality reduction is accomplished based on either feature selection or feature
extraction.

Feature selection is based on omitting those features from the available measurements
which do not contribute to class separability. In other words, redundant and irrelevant
features are ignored.

Feature extraction, on the other hand, considers the whole information content and
maps the useful information content into a lower dimensional feature space.

One can differentiate the techniques used for dimensionality reduction as linear
techniques and non-linear techniques as well. But here those techniques will be
described based on the feature selection and feature extraction standpoint.

As a stand-alone task, feature selection can be unsupervised (e.g. Variance
Thresholds) or supervised (e.g. Genetic Algorithms). You can also combine multiple
methods if needed.

12.5.1.1 Feature Selection Techniques

a) Variance Thresholds

This technique looks at how much a given feature varies from one observation to
another; if the variance of a feature falls below a given threshold, that feature is
removed. Features that don’t change much don’t add much effective information. Using
variance thresholds is an easy and relatively safe way to reduce dimensionality at the
start of your modeling process. But this alone will not be sufficient if you want to reduce
the dimensions substantially, as it is highly subjective and you need to tune the variance
threshold manually. This kind of feature selection can be implemented using both
Python and R.
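
A minimal sketch of this idea using scikit-learn's VarianceThreshold is shown below; the
toy matrix X and the 0.01 threshold are illustrative assumptions, and in practice the
threshold has to be tuned manually as noted above.

import numpy as np
from sklearn.feature_selection import VarianceThreshold

X = np.array([[0.0, 2.1, 1.0],
              [0.0, 1.9, 2.0],
              [0.0, 2.3, 3.0]])                 # the first column never changes

selector = VarianceThreshold(threshold=0.01)    # features with variance <= 0.01 are dropped
X_reduced = selector.fit_transform(X)
print(selector.variances_)                      # per-feature variances
print(X_reduced.shape)                          # the constant column has been removed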

b) Correlation Thresholds

Here each pair of features is checked to see whether the two features are closely
correlated with each other. If they are, using both features has roughly the same effect
on the final output as using only one of them. Which one should you remove? Well,
you’d first calculate all pair-wise correlations. Then, if the correlation between a pair of
features is above a given threshold, you’d remove the one that has the larger mean
absolute correlation with the other features. Like the previous technique, this is also
based on intuition, and hence the burden of tuning the threshold so that useful
information is not discarded falls upon the user. For those reasons, algorithms with
built-in feature selection, or algorithms like PCA (Principal Component Analysis), are
often preferred over this one.
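
The sketch below, which assumes pandas and a synthetic three-column data frame,
illustrates the pair-wise procedure just described; the 0.9 cut-off is an arbitrary
illustrative threshold.

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
a = rng.normal(size=200)
df = pd.DataFrame({"a": a,
                   "b": a * 0.98 + rng.normal(scale=0.05, size=200),  # nearly a copy of "a"
                   "c": rng.normal(size=200)})                        # independent feature

corr = df.corr().abs()
to_drop = set()
for i, col_i in enumerate(corr.columns):
    for col_j in corr.columns[i + 1:]:
        if corr.loc[col_i, col_j] > 0.9:               # highly correlated pair found
            # drop the feature with the larger mean absolute correlation overall
            mean_i = corr[col_i].drop(col_i).mean()
            mean_j = corr[col_j].drop(col_j).mean()
            to_drop.add(col_i if mean_i > mean_j else col_j)

print(df.drop(columns=list(to_drop)).columns.tolist())  # surviving features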

c) Genetic Algorithms

They are search algorithms that are inspired by evolutionary biology and natural
selection, combining mutation and cross-over to efficiently traverse large solution
spaces. Genetic Algorithms are used to find an optimal binary vector, where each bit
is associated with a feature. If the bit of this vector equals 1, then the feature is
allowed to participate in classification. If the bit is a 0, then the corresponding feature
does not participate. In feature selection, “genes” represent individual features and the
“organism” represents a candidate set of features. Each organism in the “population”
is graded on a fitness score such as model performance on a hold-out set. The fittest
organisms survive and reproduce, repeating until the population converges on a
solution some generations later.

d) Stepwise Regression

In statistics, stepwise regression is a method of fitting regression models in which the
choice of predictive variables is carried out by an automatic procedure. In each step, a
variable is considered for addition to or removal from the set of explanatory variables
based on some pre-specified criterion. Usually, this takes the form of a sequence of
F-tests or t-tests, but other criteria are possible, such as adjusted R², the Akaike
information criterion, the Bayesian information criterion, etc.

This has two types: forward and backward. For forward stepwise search, you start
without any features. Then, you’d train a 1-feature model using each of your candidate
features and keep the version with the best performance. You’d continue adding
features, one at a time, until your performance improvements stall. Backward stepwise
search is the same process, just reversed: start with all features in your model and then
remove one at a time until performance starts to drop substantially.

This is a greedy algorithm and commonly gives lower performance than methods with
built-in (embedded) feature selection, such as regularized models.
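
As an illustration, scikit-learn's SequentialFeatureSelector performs this kind of greedy
forward (or backward) search, scoring candidate subsets by cross-validation rather than
by the classical F-tests; the diabetes dataset, the linear model and the choice of four
features are all just illustrative assumptions.

from sklearn.datasets import load_diabetes
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

X, y = load_diabetes(return_X_y=True)
selector = SequentialFeatureSelector(LinearRegression(),
                                     n_features_to_select=4,
                                     direction="forward",   # or "backward"
                                     cv=5)
selector.fit(X, y)
print(selector.get_support())   # boolean mask of the selected features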

12.5.1.2 Feature Extraction Techniques

Feature extraction is for creating a new, smaller set of features that still captures most
of the useful information. This can come as supervised (e.g. LDA) and unsupervised
(e.g. PCA) methods.

a) Linear Discriminant Analysis (LDA)

LDA uses the information from multiple features to create new axes and projects the
data onto these axes in such a way as to minimize the within-class variance and
maximize the distance between the means of the classes. LDA is a supervised method,
so it can only be used with labeled data. It is based on statistical properties of your data,
calculated for each class. For a single input variable (x), these are the mean and the
variance of the variable for each class. For multiple variables, the same properties are
calculated over the multivariate Gaussian, namely the class means and the covariance
matrix. The LDA transformation is also dependent on scale, so you should normalize
your dataset first.
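
A small sketch using scikit-learn's LinearDiscriminantAnalysis on the Iris data is given
below; the dataset and the choice of two components are illustrative, and note that LDA
can produce at most (number of classes - 1) components.

from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X = StandardScaler().fit_transform(X)              # LDA is scale-dependent, so normalize first

lda = LinearDiscriminantAnalysis(n_components=2)   # at most (number of classes - 1) components
X_lda = lda.fit_transform(X, y)                    # the class labels y are required
print(X_lda.shape)                                 # (150, 2)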

b) Principal Component Analysis (PCA)

PCA is a dimensionality reduction technique that identifies important relationships in
our data, transforms the existing data based on these relationships, and then quantifies
the importance of these relationships so that we can keep the most important ones. To
remember this definition, we can break it down into four steps:

1. We identify the relationships among features through a Covariance Matrix.
2. Through the linear transformation or eigen-decomposition of the Covariance
   Matrix, we get eigenvectors and eigenvalues.
3. We then transform our data using the eigenvectors into principal components.
4. Lastly, we quantify the importance of these relationships using the eigenvalues
   and keep the important principal components.

The new features that are created by PCA are orthogonal, which means that they are
uncorrelated. Furthermore, they are ranked in order of their “explained variance”: the
first principal component (PC1) explains the most variance in your dataset, PC2
explains the second-most variance, and so on. You can reduce dimensionality by
limiting the number of principal components to keep, based on cumulative explained
variance. The PCA transformation is also dependent on scale, so you should normalize
your dataset first. Note that PCA finds linear correlations between the given features,
so it is helpful only if some of the variables in your dataset are linearly correlated.
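
The sketch below shows one common way of doing this with scikit-learn's PCA: passing
a fraction to n_components keeps just enough components to reach that cumulative
explained variance (the Iris data and the 0.95 figure are illustrative choices).

from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, _ = load_iris(return_X_y=True)
X = StandardScaler().fit_transform(X)      # PCA is scale-dependent, so normalize first

pca = PCA(n_components=0.95)               # keep components up to 95% cumulative variance
X_pca = pca.fit_transform(X)
print(pca.explained_variance_ratio_)       # PC1 explains the most variance, then PC2, ...
print(X_pca.shape)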

c) t-distributed Stochastic Neighbor Embedding (t-SNE)

t-SNE is a non-linear dimensionality reduction technique which is typically used to
visualize high-dimensional datasets. Some of the main application areas of t-SNE are
Natural Language Processing (NLP), speech processing, etc.

t-SNE works by minimizing the divergence between a distribution constituted by the
pairwise probability similarities of the input features in the original high-dimensional
space and its equivalent in the reduced low-dimensional space. t-SNE uses the
Kullback-Leibler (KL) divergence to measure the dissimilarity of the two distributions,
and the KL divergence is then minimized using gradient descent.

Here the lower-dimensional space is modeled using a Student's t-distribution, while the
similarities in the higher-dimensional space are modeled using a Gaussian distribution.
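
A rough sketch using scikit-learn's TSNE on the digits dataset is shown below; the
dataset, the perplexity of 30 and the PCA initialization are illustrative choices, not
prescriptions.

from sklearn.datasets import load_digits
from sklearn.manifold import TSNE

X, y = load_digits(return_X_y=True)                 # 64-dimensional inputs
embedding = TSNE(n_components=2, perplexity=30,
                 init="pca", random_state=0).fit_transform(X)
print(embedding.shape)                              # (1797, 2) points ready for a scatter plot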

d) Autoencoders

Autoencoders are a family of machine learning algorithms which can be used as a
dimensionality reduction technique. Autoencoders use non-linear transformations to
project data from a high dimension to a lower one. They are neural networks that are
trained to reconstruct their original inputs. Basically, an autoencoder consists of two
parts:

1. Encoder: takes the input data and compresses it, removing as much of the noise
   and unhelpful information as possible. The output of the encoder stage is usually
   called the bottleneck or latent space.
2. Decoder: takes the encoded latent space as input and tries to reproduce the
   original autoencoder input using just its compressed form (the encoded latent
   space). A toy sketch is given after this list.
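
Putting the two parts together, here is a toy Keras (TensorFlow) sketch of an autoencoder
used for dimensionality reduction; the layer sizes, the 8-dimensional bottleneck and the
random stand-in data are all illustrative assumptions.

import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

input_dim, latent_dim = 64, 8
inputs = keras.Input(shape=(input_dim,))
encoded = layers.Dense(32, activation="relu")(inputs)
bottleneck = layers.Dense(latent_dim, activation="relu")(encoded)   # latent space
decoded = layers.Dense(32, activation="relu")(bottleneck)
outputs = layers.Dense(input_dim, activation="sigmoid")(decoded)    # reconstruction

autoencoder = keras.Model(inputs, outputs)
encoder = keras.Model(inputs, bottleneck)           # reusable as a dimensionality reducer
autoencoder.compile(optimizer="adam", loss="mse")

X = np.random.rand(1000, input_dim).astype("float32")   # stand-in for real data
autoencoder.fit(X, X, epochs=5, batch_size=32, verbose=0)
X_reduced = encoder.predict(X)                      # 8-dimensional representation
print(X_reduced.shape)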

You can read more about these techniques in the MCS-224 Artificial Intelligence and
Machine Learning course.

12.6 WEB MINING

Web mining, as the name suggests, involves the mining of web data: the extraction of
information from websites using data mining techniques. It is an application of data
mining techniques. The items generally mined in web pages are hyperlinks, the text or
content of web pages, and user activity linking web pages of the same website or of
different websites. User activities are stored in a web server log file. Web mining can
be described as discovering interesting and useful information from web content and
usage.

12.6.1 Features of Web Mining

Following are some of the essential features of Web Mining:

• Web search, e.g. Google, Yahoo, MSN, Ask, Froogle (comparison shopping),
  job ads (Flipdog).
• Web data is not like relational data; it has text content and a linkage structure.
• On the WWW, user-generated data is increasing rapidly. Google's usage logs,
  for instance, are huge in size; the data generated per day can be compared with
  the largest data warehouses.
• Web mining can react in real time to dynamic patterns generated on the web,
  with no direct human interaction involved.
• Web server: it maintains the entries of web page visits in the log file. These web
  log entries help to identify loyal or potential customers of e-commerce websites
  or companies.
• The web is considered as a graph-like structure, where pages are the nodes and
  hyperlinks are the edges:
  o Pages = nodes, hyperlinks = edges
  o Content is ignored
  o Directed graph
• High linkage:
  o 8-10 links per page on average
  o Power-law degree distribution

12.6.2 Web Mining Tasks

Web Mining performs various tasks such as:

1) Discovering patterns existing in websites, like customer buying behavior or
   navigation of web sites.
2) Helping to retrieve faster results for the queries or search text posted on search
   engines like Google, Yahoo, etc.
3) Classifying web documents according to the searches performed on e-commerce
   websites, which helps to increase business and transactions.

12.6.3 Applications of Web Mining

Some of the Applications of Web Mining are as follows:

• Personalized customer experience in Business-to-Consumer (B2C) sites
• Web search
• Web-wide tracking (tracking an individual across all sites he visits, an intriguing
  and controversial technology)
• Understanding web communities
• Understanding auction behaviour
• Personalized portals for the web
• Recommendations: e.g. Netflix, Amazon
• Improving conversion rate: next best product to offer
• Advertising, e.g. Google AdSense
• Fraud detection
• Improving web site design and performance

12.7 TYPES OF WEB MINING

There are three types of web mining, as shown in the following Fig 5:

• Web Content Mining: text, image, audio, video, structured records
• Web Structure Mining: document structure, inter-document hyperlinks,
  intra-document hyperlinks
• Web Usage Mining: web server logs, application server logs, application level
  logs

Figure 5: Three types of Web Mining


12.7.1 Web Content Mining

Web content mining is the process of extracting useful information from the contents
of web documents. Content data is the collection of facts a web page is designed to
contain. It may consist of text, images, audio, video, or structured records such as lists
and tables. Application of text mining to web content has been the most widely
researched. Issues addressed in text mining include topic discovery and tracking,
extracting association patterns, clustering of web documents and classification of web
pages. Research activities on this topic have drawn heavily on techniques developed in
other disciplines such as Information Retrieval (IR) and Natural Language Processing
(NLP). While there exists a significant body of work in extracting knowledge from
images in the fields of image processing and computer vision, the application of these
techniques to web content mining has been limited.

12.7.2 Web Structure Mining

The structure of a typical web graph consists of web pages as nodes, and hyperlinks as
edges connecting related pages. Web structure mining is the process of discovering
structure information from the web. This can be further divided into two kinds based
on the kind of structure information used.

Hyperlinks

A hyperlink is a structural unit that connects a location in a web page to a different
location, either within the same web page or on a different web page. A hyperlink that
connects to a different part of the same page is called an intra-document hyperlink,
and a hyperlink that connects two different pages is called an inter-document
hyperlink.

Document Structure

In addition, the content within a web page can also be organized in a tree-structured
format, based on the various HTML and XML tags within the page. Mining efforts
here have focused on automatically extracting Document Object Model (DOM)
structures out of documents.

12.7.3 Web Usage Mining

Web usage mining is the application of data mining techniques to discover interesting
usage patterns from web usage data, in order to understand and better serve the needs
of web-based applications. Usage data captures the identity or origin of web users
along with their browsing behavior at a web site. Web usage mining itself can be
classified further depending on the kind of usage data considered:

Web Server Data

User logs are collected by the web server and typically include IP address, page
reference and access time.


Application Server Data

Commercial application servers such as WebLogic and StoryServer have significant
features to enable e-commerce applications to be built on top of them with little effort.
A key feature is the ability to track various kinds of business events and log them in
application server logs.

Application Level Data

New kinds of events can be defined in an application, and logging can be turned on for
them, generating histories of these events. It must be noted, however, that many end
applications require a combination of one or more of the techniques applied in the
above categories.

12.8 MINING MULTIMEDIA DATA ON THE WEB

Websites are flooded with multimedia data such as video, audio, images, and graphs.
This multimedia data has its own characteristics: videos, images, audio, and pictures
each have different methods of archiving and retrieving information. Because
multimedia data on the web has different properties, typical multimedia data mining
techniques cannot be applied directly. Web-based multimedia comes with text and
links, and these are important features for organizing web pages. Better organization of
web pages helps in effective search operations. Web page layout mining can be applied
to segregate web pages into sets of multimedia semantic blocks and to separate them
from non-multimedia web pages. There are a few web-based mining terminologies and
algorithms to understand.

PageRank: This measure counts how many pages link to a given web page and thereby
indicates the importance of that page. The Google search engine uses the PageRank
algorithm and ranks a web page as very significant if it is frequently linked to by other
web pages. PageRank works on the concept of a probability distribution representing
the likelihood that a person clicking links at random would reach a particular page. An
equal distribution over pages is assumed at the beginning of the computation, and the
measure is then refined over several iterations; repeating the ranking process makes the
computed rank of each web page reflect its true value more closely.
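
The sketch below implements this iterative computation (power iteration with a damping
factor of 0.85) on a tiny hypothetical four-page link graph; both the graph and the
damping factor are illustrative.

import numpy as np

links = {0: [1, 2], 1: [2], 2: [0], 3: [2]}   # page -> pages it links to (hypothetical graph)
n, d = 4, 0.85                                # number of pages, damping factor

# Column-stochastic transition matrix: M[j, i] = 1/outdegree(i) if page i links to page j
M = np.zeros((n, n))
for i, outs in links.items():
    for j in outs:
        M[j, i] = 1.0 / len(outs)

rank = np.full(n, 1.0 / n)                    # start from an equal distribution
for _ in range(50):                           # iterate until the ranks stabilise
    rank = (1 - d) / n + d * M @ rank

print(np.round(rank, 3))                      # higher values = more important pages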

HITS: This measure is used to rate web pages. It was developed by Jon Kleinberg. It
determines hubs and authorities from the web link structure; hubs and authorities define
a recursive relationship between web pages.

• The algorithm exploits the web link structure and speeds up the search for
  relevant web pages. Given a query to a search engine, the set of highly relevant
  web pages is called the Root set; these are potential Authorities.
• Pages that are not very relevant themselves but point to pages in the Root set are
  called Hubs. Thus, an Authority is a page that many hubs link to, whereas a Hub
  is a page that links to many authorities. A small iterative sketch is given after
  this list.
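
The following sketch iterates the mutually reinforcing hub and authority updates on the
same hypothetical four-page graph used in the PageRank sketch above; it is a simplified
illustration, not the full query-dependent HITS procedure.

import numpy as np

links = {0: [1, 2], 1: [2], 2: [0], 3: [2]}   # hypothetical graph: page -> pages it links to
n = 4
A = np.zeros((n, n))                          # adjacency matrix: A[i, j] = 1 if i links to j
for i, outs in links.items():
    A[i, outs] = 1

hubs = np.ones(n)
auths = np.ones(n)
for _ in range(50):
    auths = A.T @ hubs                        # a good authority is linked to by good hubs
    hubs = A @ auths                          # a good hub links to good authorities
    auths /= np.linalg.norm(auths)            # normalise to keep the scores bounded
    hubs /= np.linalg.norm(hubs)

print(np.round(auths, 3), np.round(hubs, 3))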

Page Layout Analysis: It extracts and maintains the page-to-block and block-to-page
relationships from the link structure of web pages.


Vision page segmentation (VIPS) algorithm: It first extracts all the suitable blocks
from the HTML Document Object Model (DOM) tree, and then it finds the separators
between these blocks. Here separators denote the horizontal or vertical lines in a Web
page that do not visually cross any blocks. Based on these separators, the semantic tree of
the Web page is constructed. A Web page can be represented as a set of blocks (leaf
nodes of the semantic tree). Compared with DOM-based methods, the segments
obtained by VIPS are more semantically aggregated. Noisy information, such as
navigation, advertisement, and decoration can be easily removed because these
elements are often placed in certain positions on a page. Contents with different topics
are distinguished as separate blocks.

You can understand this simply by considering the following points:

• A web page contains links, and links contained in different semantic blocks
  point to pages on different topics.
• Calculate the significance of a web page using algorithms such as PageRank or
  HITS.
• Split pages into semantic blocks.
• Apply link analysis at the semantic block level.

For example, this is clearly shown in Fig 6 below: the links in different blocks point to
pages with different topics. In this example, one link points to a page about
entertainment and another link points to a page about sports.

Figure 6: Example of a sample web page (new.yahoo.com), showing a web page with different semantic blocks
(red, green, and brown rectangular boxes). Every block has a different importance in the web page. The links
in different blocks point to pages with different topics.

To analyze web pages containing multimedia data, there is a technique known as link
analysis. It uses the two most significant algorithms, PageRank and HITS, to analyze
the significance of web pages. This technique treats each page as a single node in the
web graph. But since a web page with multimedia contains a lot of data and links, it
cannot be treated as a single node in the graph. In this case, the web page is partitioned
into blocks using vision page segmentation, also called the VIPS algorithm. After
extracting all the required information, a semantic graph can be built over the World
Wide Web in which each node represents a semantic topic or the semantic structure of
a web page.

The VIPS algorithm also helps in determining the text associated with the blocks of a
web page. This closely related text provides a content or textual description and is used
to build an image index. Web image search can then be performed using any traditional
text-based search technique; search engines such as Google and Yahoo use this kind of
approach for web image search.

Block-level Link Analysis: The block-to-block model is quite useful for web image
retrieval and web page categorization. It uses two kinds of relationships, i.e., block-to-
page and page-to-block. Let us see some definitions. Let P denote the set of all the web
pages,

P = {p1, p2, ..., pk}, where k is the number of web pages.

Let B denote the set of all the blocks,

B = {b1, b2, …, bn}, where n is the number of blocks.

It is important to note that for each block there is exactly one page that contains that
block; bi ∈ pj means that block i is contained in page j.

Block-Based Link Structure Analysis: This can be explained using matrix notation.
Consider Z, the block-to-page matrix with dimension n × k. Z can be formally defined
as follows:

Zij = 1/si if there is a link from block i to page j, and Zij = 0 otherwise,

where si is the number of pages that block i links to. Zij can also be viewed as the
probability of jumping from block i to page j.
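
A tiny numpy sketch of this definition is given below for a hypothetical example with
n = 3 blocks and k = 4 pages; the block-to-page links are made up purely for illustration.

import numpy as np

block_links = {0: [0, 1], 1: [2], 2: [1, 2, 3]}   # block i -> pages it links to
n, k = 3, 4

Z = np.zeros((n, k))
for i, pages in block_links.items():
    s_i = len(pages)                  # number of pages that block i links to
    for j in pages:
        Z[i, j] = 1.0 / s_i           # probability of jumping from block i to page j

print(Z)                              # each non-empty row sums to 1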

The block-to-page relationship gives a more accurate and robust representation of the
link structure of the web than HITS, which at times deviates from the textual
information on the web. It is used to organize web image pages. The image graph
deduced in this way can be used to achieve high-quality web image clustering results.
The web page graph for web images can be constructed by considering measures that
capture the relationships between blocks and images: block-to-image, image-to-block,
page-to-block and block-to-page.

12.9 AUTOMATIC CLASSIFICATION OF WEB


DOCUMENTS

The categorization of web pages into their respective subjects or domains is called
classification of web documents. For example, the following Fig 7 shows various
categories like books, electronics, etc. Let us say you are doing online shopping on the
Amazon website, which has a very large number of web pages; when you search for
electronics, the web page containing information on electronics is displayed. This is a
classification of products which is done on the basis of textual and image contents.

Figure 7: Types of Web Documents Containing Different Types of Data

The problem with the classification of web documents is that constructing a model each
time, by applying algorithms to classify the documents, is a mammoth task. The large
number of unorganized web pages may also contain redundant documents.

The automated document classification of web pages is based on the textual content.
The model requires an initial training phase in which document classifiers are built for
each category based on training examples.

Fig 8 shows that documents can be collected from different sources. After the
collection of documents, data cleansing is performed using extraction, transformation
and loading techniques. The documents can then be grouped according to a similarity
measure (grouping documents according to the similarity between them) and TF-IDF.
A machine learning model is created and executed, and different clusters are generated.

Figure 8: Automatic Classification of Web documents

Automated document classification identifies documents and groups the relevant ones
together without any external effort. There are various tools available in the market,
such as RapidMiner, Azure Machine Learning Studio, Amazon SageMaker, KNIME
and Python. The trained model automatically reads the data from documents (PDF,
DOC, PPT) and classifies it according to the category of the document. Such models
are trained using Machine Learning and Natural Language Processing techniques, and
domain experts help perform this task efficiently.
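
As a hedged illustration of the overall idea (TF-IDF features feeding a classifier), here is
a small scikit-learn sketch; the toy product descriptions, the labels and the choice of
logistic regression are all illustrative assumptions, not the workflow of any particular
tool mentioned above.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

docs = ["cheap wireless earbuds and phone chargers",
        "hardcover fantasy novel with maps",
        "gaming laptop with fast graphics card",
        "paperback detective story, second-hand"]
labels = ["electronics", "books", "electronics", "books"]

model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(docs, labels)
print(model.predict(["used science fiction paperback"]))   # expected to print ['books']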

Benefits of Automatic Document Classification System

1) It is a more efficient system of classification, as it produces more accurate results
   and speeds up the process of classification.
2) The system incurs lower operational costs.
3) It makes data storage and retrieval easy.
4) It organizes files and documents in a better, more streamlined way.

Check Your Progress 3

1) What are the techniques to analyze the web usage pattern?

……………………………………………………………………………………………
……………………………………………………………………………………………
…………………………………………………………………………………………….
2) What are the other applications of Web Mining which were not mentioned?
……………………………………………………………………………………………
……………………………………………………………………………………………
……………………………………………………………………………………………
3) What are the differences between Block HITS and HITS?
……………………………………………………………………………………………
……………………………………………………………………………………………
…………………………………………………………………………………………….
4) List some challenges in Web Mining.
……………………………………………………………………………………………
……………………………………………………………………………………………
…………………………………………………………………………………………….

12.10 SUMMARY

In this unit, we studied the important concepts of Text Mining and Web Mining.

Text mining, also referred to as text analysis, is the process of obtaining meaningful
information from large collections of unstructured data. By automatically identifying
patterns, topics, and relevant keywords, text mining uncovers relevant insights that
can help you answer specific questions. Text mining makes it possible to detect trends
and patterns in data that can help businesses support their decision-making processes.
Embracing a data-driven strategy allows companies to understand their customers’
problems, needs, and expectations, detect product issues, conduct market research, and
identify the reasons for customer churn, among many other things.

Web mining is the application of data mining techniques to extract knowledge from
web data, including web documents, hyperlinks between documents, usage logs of
web sites, etc.


12.11 SOLUTIONS/ANSWERS
Check Your Progress 1:

1) Structured data: This data is standardized into a tabular format with numerous
   rows and columns, making it easier to store and process for analysis and machine
   learning algorithms. Structured data can include inputs such as names, addresses,
   and phone numbers.

   Unstructured data: This data does not have a predefined format. It can include
   text from sources like social media or product reviews, or rich media formats
   like video and audio files.

   Semi-structured data: As the name suggests, this data is a blend between
   structured and unstructured data formats. While it has some organization, it
   doesn’t have enough structure to meet the requirements of a relational database.
   Examples of semi-structured data include XML, JSON and HTML files.

2) The terms text mining and text analytics are largely synonymous in everyday
   conversation, but they can have more nuanced meanings. Text mining and text
   analysis identify textual patterns and trends within unstructured data through the
   use of machine learning, statistics, and linguistics. By transforming the data into
   a more structured format through text mining and text analysis, more quantitative
   insights can be found through text analytics. Data visualization techniques can
   then be harnessed to communicate the findings to wider audiences.

Check Your Progress 3:

1) Techniques used to analyze web usage patterns are as follows:

   • Session and web page visitor analysis: The web log file contains records of
     users visiting web pages, the frequency of visits, the days, and the duration for
     which the user stays on a web page.
   • OLAP (Online Analytical Processing): OLAP can be performed on different
     parts of log-related data over a certain interval of time.
   • Web structure mining: It produces a structural summary of the web pages. It
     identifies a web page and the direct or indirect links of that page with others.
     It helps companies to identify the commercial links of business websites.

2) Applications of Web Mining are:

   • Digital marketing
   • Data analysis of website and application performance
   • User behavior analysis
   • Advertising and campaign performance analysis

3) The main differences between BLHITS (Block HITS) and HITS are:

   BLHITS                                   HITS
   Links are from blocks to pages           Links are from pages to pages
   Root is the top-ranked blocks            Root is the top-ranked pages
   Analyses only top-ranked block links     Analyses all the links of all the pages
   Content analysis at block level          Content analysis at page level

4) Challenges in Web Mining are:

   • The web page link structure is quite complex to analyze, as each web page is
     linked with many other web pages. There exist a huge number of documents in
     the digital library of the web, and the data in this library is not organized.
   • Web data is uploaded to web pages dynamically and on a regular basis.
   • There is a diversity of client networks with different interests, backgrounds,
     and usage purposes, and the network is growing rapidly.
   • Another challenge is to extract the data relevant to a subject, domain or user.

12.12 FURTHER READINGS

1. Mining The Web: Discovering Knowledge From Hypertext Data, Chakrabarti
   Soumen, Elsevier Science, 2014.
2. Data Mining, Charu C. Aggarwal, Springer, 2015.
3. Data Mining: Concepts and Techniques, 3rd Edition, Jiawei Han, Micheline
   Kamber, Jian Pei, Elsevier, 2012.
