
RTM Nagpur University-MBA SEM III

Specialization: Business Analytics Subject: Data Mining Concepts


Question Bank

Module-1 Questions
Q.1 What are the two primary goals for data mining? List and explain in brief the
Data Mining tasks used to achieve prediction and description
Ans The two primary goals of data mining tend to be prediction and description.
Prediction involves using some variables or fields in the data set to predict
unknown or future values of other variables of interest. Description, on the other
hand, focuses on finding patterns describing the data that can be interpreted by
humans. Therefore, it is possible to put data-mining activities into one of two
categories:

1. Predictive data mining, which produces the model of the system described by
the given data set, or

2. Descriptive data mining, which produces new, nontrivial information based on the available data set.

On the predictive end of the spectrum, the goal of data mining is to produce a
model, expressed as an executable code, which can be used to perform
classification, prediction, estimation, or other similar tasks. On the other,
descriptive end of the spectrum, the goal is to gain an understanding of the
analyzed system by uncovering patterns and relationships in large data sets.

The goals of prediction and description are achieved by using the following primary data-mining tasks:
1. Classification—Discovery of a predictive learning function that classifies a
data item into one of several predefined classes.

2. Regression—Discovery of a predictive learning function, which maps a data item to a real-value prediction variable.

3. Clustering—A common descriptive task in which one seeks to identify a finite set of categories or clusters to describe the data.

4. Summarization—An additional descriptive task that involves methods for finding a compact description for a set (or subset) of data.

5. Dependency modeling—Finding a local model that describes significant dependencies between variables or between the values of a feature in a data set or in a part of a data set.

6. Change and deviation detection—Discovering the most significant changes in the data set.

Q.2 Explain the Typical Data Mining Process, with suitable diagram
Ans The problem of discovering or estimating dependencies from data or discovering
totally new data is only one part of the general experimental procedure used by
scientists, engineers, and others who apply standard steps to draw conclusions
from the data. The general experimental procedure adapted to data-mining
problems involves the following steps:

State the problem: In this step, a modeler usually specifies a set of variables for
the unknown dependency and, if possible, a general form of this dependency as
an initial hypothesis. There may be several hypotheses formulated for a single
problem at this stage. The first step requires the combined expertise of an
application domain and a data-mining model. In practice, it usually means a close
interaction between the data-mining expert and the application expert.

Collect the data: This step is concerned with how the data are generated and
collected. In general, there are two distinct possibilities. The first is when the data-
generation process is under the control of an expert (modeler): this approach is
known as a designed experiment. The second possibility is when the expert cannot
influence the data- generation process: this is known as the observational
approach.

Preprocessing the data: In the observational setting, data are usually “collected”
from the existing databases, data warehouses, and data marts. Data preprocessing
usually includes at least two common tasks:
1. Outlier detection (and removal): outliers result from measurement errors
and coding and recording errors and, sometimes, are natural, abnormal
values. Such non-representative samples can seriously affect the model
produced later
2. Scaling, encoding, and selecting features: application-specific encoding
methods usually achieve dimensionality reduction by providing a smaller
number of informative features for subsequent data modeling.

Estimate the model: An appropriate method is selected and implemented for prediction and description. The selection and implementation of the appropriate data-mining technique is the main task in this phase. This process is not straightforward; usually, in practice, the implementation is based on several models, and selecting the best one is an additional task.

Interpret the model and draw conclusions: models need to be interpretable in
order to be useful because humans are not likely to base their decisions on
complex “black-box” models. Note that the goals of accuracy of the model and
accuracy of its interpretation are somewhat contradictory.

Q.3 What care needs to be taken for enhancing data quality?
Ans There are a number of indicators of data quality that have to be taken care of in
the preprocessing phase of a data-mining process:

1. The data should be accurate. The analyst has to check that the name is spelled
correctly, the code is in a given range, the value is complete, and so on.

2. The data should be stored according to data type. The analyst must ensure that
the numerical value is not presented in character form, that integers are not in the
form of real numbers, and so on.

3. The data should have integrity. Updates should not be lost because of conflicts
among different users; robust backup and recovery procedures should be
implemented if they are not already part of the Data Base Management System
(DBMS).

4. The data should be consistent. The form and the content should be the same
after integration of large data sets from different sources.

5. The data should not be redundant. In practice, redundant data should be minimized, and reasoned duplication should be controlled, or duplicated records should be eliminated.

6. The data should be timely. The time component of data should be recognized
explicitly from the data or implicitly from the manner of its organization.

7. The data should be well understood. Naming standards are a necessary but not
the only condition for data to be well understood. The user should know that the
data corresponds to an established domain.

8. The data set should be complete. Missing data, which occurs in reality, should
be minimized. Missing data could reduce the quality of a global model. On the
other hand, some data-mining techniques are robust enough to support analyses
of data sets with missing values

Q.4 What are major types of data transformation techniques used in data
warehousing?
Ans There are four main types of transformations, and each has its own characteristics:

1. Simple transformations—These transformations are the building blocks of all other more complex transformations. This category includes manipulation of data that is focused on one field at a time, without taking into account its values in related fields. Examples include changing the data type of a field or replacing an encoded field value with a decoded value.

2. Cleansing and scrubbing—These transformations ensure consistent formatting and usage of a field or of related groups of fields. This can include proper formatting of address information, for example. This class of transformations also includes checks for valid values in a particular field, usually checking the range or choosing from an enumerated list.

3. Integration—This is a process of taking operational data from one or more sources and mapping it, field by field, onto a new data structure in the data
warehouse. The common identifier problem is one of the most difficult integration
issues in building a data warehouse. Essentially, this situation occurs when there
are multiple system sources for the same entities and there is no clear way to
identify those entities as the same. In reality, it is common that some of these
values are contradictory, and resolving a conflict is not a straightforward process.
Just as difficult as having conflicting values is having no value for a data element in
a warehouse.

4. Aggregation and summarization—These are methods of condensing instances of data found in the operational environment into fewer instances
in the warehouse environment. Although the terms aggregation and
summarization are often used interchangeably in the literature, we believe that
they do have slightly different meanings in the data-warehouse context.
Summarization is a simple addition of values along one or more data dimensions,
e.g. adding up daily sales to produce monthly sales. Aggregation refers to the
addition of different business elements into a common total; it is highly domain
dependent. For example, aggregation is adding daily product sales and monthly
consulting sales to get the combined monthly total.
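To make the distinction concrete, here is a minimal pandas sketch (with hypothetical daily sales figures) of the summarization example above, rolling daily sales up into monthly totals:

```python
import pandas as pd

# Hypothetical daily sales records; column names are illustrative only.
daily = pd.DataFrame({
    "date": pd.date_range("2024-01-01", periods=90, freq="D"),
    "sales": range(90),
})

# Summarization: add up daily sales along the time dimension to get monthly sales.
monthly = daily.groupby(daily["date"].dt.to_period("M"))["sales"].sum()
print(monthly)
```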

Q.5 What is data normalization? Demonstrate three data normalization techniques assuming your own suitable data
Ans Some data-mining methods, typically those that are based on distance
computation between points in an n-dimensional space, may need normalized
data for best results. The measured values can be scaled to a specific range, e.g.,
[−1, 1] or [0, 1]. If the values are not normalized, the distance measures will
overweight those features that have, on an average, larger values. The three major methods are decimal scaling, min-max normalization, and standard deviation (z-score) normalization. The following example demonstrates the concept of normalization:

Data | Decimal Scaling | Min-Max Normalization | Standard Deviation Normalization
Formula | v' = v / 10^k | v' = (v - min) / (max - min) | v' = (v - mean) / std
-3 | -0.3 | 0.267 | -0.464
-6 | -0.6 | 0.067 | -0.954
5 | 0.5 | 0.800 | 0.845
8 | 0.8 | 1.000 | 1.336
-7 | -0.7 | 0.000 | -1.118
2 | 0.2 | 0.600 | 0.354
(Min = -7, Max = 8, Mean = -0.167, Std. Dev. = 6.113)
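A short NumPy sketch (using the six sample values above) shows how each of the three normalization formulas can be computed; ddof=1 is used so the sample standard deviation matches the 6.113 shown in the table:

```python
import numpy as np

data = np.array([-3, -6, 5, 8, -7, 2], dtype=float)

# Decimal scaling: divide by 10^k so that the largest absolute value is < 1 (k = 1 here).
decimal_scaled = data / 10 ** np.ceil(np.log10(np.abs(data).max()))

# Min-max normalization to the range [0, 1].
min_max = (data - data.min()) / (data.max() - data.min())

# Standard deviation (z-score) normalization; ddof=1 gives the sample std of 6.113.
z_score = (data - data.mean()) / data.std(ddof=1)

print(decimal_scaled, min_max, z_score, sep="\n")
```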

Q.6 How is missing data dealt with in data mining?
Ans For many real-world applications of data mining, even when there are huge
amounts of data, the subset of cases with complete data may be relatively small.
Available samples and also future cases may have values missing. The simplest
solution for this problem is the reduction of the data set and the elimination of all
samples with missing values. That is possible when large data sets are available,
and missing values occur only in a small percentage
of samples. If we do not drop the samples with missing values, then we have to find values for them.

First Approach: a data miner, together with the domain expert, can manually
examine samples that have no values and enter a reasonable, probable, or
expected value based on a domain experience. The method is straightforward for
small numbers of missing values and relatively small data sets.

Second Approach: This is based on a formal, often automatic replacement of missing values with some constants, such as the following:
1. Replace all missing values with a single global constant (a selection of a
global constant is highly application dependent).
2. Replace a missing value with its feature mean.
3. Replace a missing value with its feature mean for the given class (this
approach is possible only for classification problems where samples are
classified in advance)

Third Approach: The data miner can generate a predictive model to predict each
of the missing values. For example, if three features A, B, and C are given for each
sample, then, based on samples that have all three values as a training set, the
data miner can generate a model of correlation between features. Different
techniques such as regression, Bayesian formalism, clustering, or decision-tree
induction may be used depending on data types.
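As a brief illustration of the second approach, the following pandas sketch (with a hypothetical 'income' feature and 'class' label) replaces missing values with the feature mean and with the per-class feature mean:

```python
import numpy as np
import pandas as pd

# Hypothetical data set with missing values in the 'income' feature.
df = pd.DataFrame({
    "income": [30.0, 45.0, np.nan, 52.0, 38.0, np.nan],
    "class":  ["A",  "A",  "A",    "B",  "B",  "B"],
})

# 2. Replace a missing value with the overall feature mean.
df["income_mean"] = df["income"].fillna(df["income"].mean())

# 3. Replace a missing value with the feature mean for the sample's class
#    (possible only when class labels are known in advance).
df["income_class_mean"] = df["income"].fillna(
    df.groupby("class")["income"].transform("mean")
)
print(df)
```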

Q.7 What are outliers? Discuss the ways to deal with outliers
Ans In large data sets, there exist samples that do not comply with the general behavior
of the data model. Such samples, which are significantly different or inconsistent
with the remaining set of data, are called outliers. Outliers can be caused by
measurement error or they may be the result of inherent data variability.
Many data-mining algorithms try to minimize the influence of outliers on the final
model or to eliminate them in the preprocessing phases. Outliers arise due to
mechanical faults, changes in system behavior, fraudulent behavior, human error,
or instrument error or simply through natural deviations in populations.
Some data-mining applications are focused on outlier detection, and it is the
essential result of a data analysis. The process consists of two main steps:

5|Page
(1) build a profile of the “normal” behavior and
(2) use the “normal” profile to detect outliers

Outlier detection and potential removal from a data set can be described as a
process of the selection of k out of n samples that are considerably dissimilar, exceptional, or inconsistent with respect to the remaining data (k ≪ n). The problem
of defining outliers is nontrivial, especially in multidimensional samples. Main
types of outlier detection schemes are:

• Graphical or visualization techniques: Examples of visualization methods include boxplot (1D), scatter plot (2D), and spin plot (3D).

• Statistical-based techniques: Examples include univariate analysis and bivariate analysis.

• Distance-based techniques: Examples include procedures based on complexity calculations.

• Model-based techniques: Examples include clustering algorithms such as BIRCH and DBSCAN, kNN classification algorithms, and different neural networks.
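A small sketch of a statistical-based scheme (NumPy, with an illustrative threshold of two standard deviations) that follows the two-step process above: profile the "normal" behavior, then flag values that deviate from it:

```python
import numpy as np

# Hypothetical 1D sample with one obviously deviating value.
values = np.array([2.1, 2.4, 1.9, 2.2, 2.0, 9.8, 2.3])

# Step 1: build a profile of "normal" behaviour (mean and standard deviation).
mean, std = values.mean(), values.std(ddof=1)

# Step 2: use the profile to flag values more than 2 standard deviations from the mean.
threshold = 2.0
z_scores = (values - mean) / std
outliers = values[np.abs(z_scores) > threshold]
print(outliers)  # flags 9.8 for this sample
```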

Module-2 Questions
Q.1 Discuss the major comparison parameters used in data reduction techniques
Ans Performing standard data-reduction operations (deleting rows, columns, or
values) as a preparation for data mining, we need to know what we gain and/or
lose with these activities. The overall comparison involves the following
parameters for analysis:

1. Computing time—Simpler data, a result of the data-reduction process, can hopefully lead to a reduction in the time taken for data mining. In most cases, we
cannot afford to spend too much time on the data-preprocessing phases, including
a reduction of data dimensions, although the more time we spend in preparation
the better the outcome.

2. Predictive/descriptive accuracy—This is the dominant measure for most data-mining models since it measures how well the data is summarized and generalized into the model. We generally expect that by using only relevant features, a data-mining algorithm can not only learn faster but also with higher accuracy. Irrelevant data may mislead a learning process and a final model, while redundant data may complicate the task of learning and cause unexpected data-mining results.

3. Representation of the data-mining model—The simplicity of representation, obtained usually with data reduction, often implies that a model can be better
understood. The simplicity of the induced model and other results depends on its
representation. Therefore, if the simplicity of representation improves, a relatively
small decrease in accuracy may be tolerable. The need for a balanced view
between accuracy and simplicity is necessary, and dimensionality reduction is one
of the mechanisms for obtaining this balance.

It would be ideal if we could achieve reduced time, improved accuracy, and simplified representation at the same time, using dimensionality reduction.

Q.2 Discuss in brief the recommended characteristics of data reduction algorithms
Ans Recommended characteristics of data-reduction algorithms that may be
guidelines for designers of these techniques are as follows:
1. Measurable quality—The quality of approximated results using a reduced
data set can be determined precisely.
2. Recognizable quality—The quality of approximated results can be easily
determined at run time of the data-reduction algorithm, before application
of any data-mining procedure.
3. Monotonicity—The algorithms are usually iterative, and the quality of results
is a nondecreasing function of time and input data quality.
4. Consistency—The quality of results is correlated with computation time and
input data quality.
5. Diminishing returns—The improvement in the solution is large in the early
stages (iterations) of the computation, and it diminishes over time.
6. Interruptability—The algorithm can be stopped at any time and provide some
answers.
7. Preemptability—The algorithm can be suspended and resumed with minimal
overhead.

Q.3 Illustrate different types of learning
Ans There are two common types of inductive-learning methods:
1. Supervised learning (or learning with a teacher), and
2. Unsupervised learning (or learning without a teacher).
Supervised learning is used to estimate an unknown dependency from known
input–output samples. Classification and regression are common tasks supported
by this type of inductive learning. Supervised learning assumes the existence of a
teacher—fitness function or some other external method of estimating the
proposed model. The term “supervised” denotes that the output values for
training samples are known (i.e., provided by a “teacher”).
The following block diagram illustrates this form of learning. In conceptual terms, we may think of the teacher as having knowledge of the environment.

a) Supervised Learning b) Unsupervised Learning

The environment with its characteristics and model is, however, unknown to the
learning system. The parameters of the learning system are adjusted under the
combined influence of the training samples and the error signal. The error signal
is defined as the difference between the desired response and the actual response
of the learning system. Knowledge of the environment available to the teacher is
transferred to the learning system through the training samples, which adjust the
parameters of the learning system. It is a closed-loop feedback system, but the
unknown environment is not in the loop. As a performance measure for the
system, we may think in terms of the mean squared error or the sum of squared
errors over the training samples.
This function may be visualized as a multidimensional error surface, with the free
parameters of the learning system as coordinates. Any learning operation under
supervision is represented as a movement of a point on the error surface. For the
system to improve the performance over time and therefore learn from the
teacher, the operating point on an error surface has to move down successively
toward a minimum of the surface. The minimum point may be a local minimum or
a global minimum. The basic characteristics of optimization methods such as
stochastic approximation, iterative approach, and greedy optimization have been
given in the previous section. An adequate set of input–output samples will move
the operating point toward the minimum, and a supervised learning system will
be able to perform such tasks as pattern classification and function approximation.

Q.4 From the data, how do we determine what kind of learning task is defined for our application?
Ans When the data are preprocessed and when we know what kind of learning task is
defined for our application, a list of data-mining methodologies and corresponding
computer-based tools is available. Depending on the characteristics of the
problem at hand and the available data set, we have to make a decision about the
application of one or more of the data-mining and knowledge-discovery
techniques, which include the following:

1. Statistical methods where the typical techniques are Bayesian inference, logistic
regression, ANOVA analysis, and log-linear models.

2. Cluster analysis, the common techniques of which are divisive algorithms, agglomerative algorithms, partitional clustering, and incremental clustering.

3. Decision trees and decision rules are the set of methods of inductive learning
developed mainly in artificial intelligence. Typical techniques include the CLS
method, the ID3 algorithm, the C4.5 algorithm, and the corresponding pruning
algorithms.

4. Association rules represent a set of relatively new methodologies that include algorithms such as market basket analysis, the Apriori algorithm, and WWW path-traversal patterns.

5. Artificial neural networks, where common examples are multilayer perceptrons with backpropagation learning, Kohonen networks, or convolutional neural
networks.

6. Genetic algorithms are very useful as a methodology for solving hard optimization problems, and they are often a part of a data-mining algorithm.

7. Fuzzy inference systems are based on the theory of fuzzy sets and fuzzy logic.
Fuzzy modeling and fuzzy decision-making are steps very often included in the
data-mining process.

8. N-dimensional visualization methods are usually skipped in the literature as a standard data-mining methodology, although useful information may be
discovered using these techniques and tools. Typical data-mining visualization
techniques are geometric, icon-based, pixel-oriented, and hierarchical techniques.

Q.5 Illustrate the SVM procedure in detail
Ans SVMs were developed to solve the classification problem, but recently they have
been extended to the domain of regression problems (for prediction of continuous
variables). SVMs can be applied to regression problems by the introduction of an
alternative loss function that is modified to include a distance measure. The term
SVM refers to both classification and regression methods, and the terms
support vector classification (SVC) and support vector regression (SVR) may be
used for more precise specification. An SVM is a supervised learning algorithm
creating learning functions from a set of labeled training data.

An SVM’s classification function is based on the concept of decision planes that define
decision boundaries between classes of samples. A simple example is shown in
following figure

a) A decision plane in 2D space is a line b) How to select an optimal decision boundary

Assume we wish to perform a classification, and our data has a categorical target
variable with two categories. Also assume that there are two input attributes with
continuous values. If we plot the data points using the value of one attribute on
the X axis and the other on the Y axis, we might end up with an image such as
shown in Figure 4.16b. In this problem the goal is to separate the two classes by a
function that is induced from available examples. The goal is to produce a classifier
that will work well on unseen examples, i.e. it generalizes well. The main idea is
that the decision boundary should be as far away as possible from the data points
of both classes. Therefore a linear SVM classifier is termed the optimal separating
hyperplane with the maximum margin, as can be seen in the following figure.
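A hedged scikit-learn sketch (on synthetic two-attribute, two-class data) of fitting a linear SVC, which searches for the maximum-margin separating hyperplane described above:

```python
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Two input attributes, two target categories (synthetic data for illustration).
X, y = make_blobs(n_samples=100, centers=2, n_features=2, random_state=0)

# A linear SVC looks for the optimal separating hyperplane with the maximum margin.
clf = SVC(kernel="linear", C=1.0)
clf.fit(X, y)

print("Number of support vectors:", clf.support_vectors_.shape[0])
print("Prediction for a new sample:", clf.predict([[0.0, 2.0]]))
```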

Q.6 Compare supervised and unsupervised learning
Ans
1. Supervised learning algorithms are trained using labeled data; unsupervised learning algorithms are trained using unlabeled data.
2. A supervised learning model takes direct feedback to check whether it is predicting the correct output; an unsupervised learning model does not take any feedback.
3. A supervised learning model predicts the output; an unsupervised learning model finds the hidden patterns in data.
4. In supervised learning, input data is provided to the model along with the output; in unsupervised learning, only input data is provided to the model.
5. The goal of supervised learning is to train the model so that it can predict the output when it is given new data; the goal of unsupervised learning is to find the hidden patterns and useful insights from the unknown dataset.
6. Supervised learning needs supervision to train the model; unsupervised learning does not need any supervision.
7. Supervised learning can be categorized into classification and regression problems; unsupervised learning can be classified into clustering and association problems.
8. Supervised learning is used for cases where we know the input as well as the corresponding outputs; unsupervised learning is used for cases where we have only input data and no corresponding output data.
9. A supervised learning model produces an accurate result; an unsupervised learning model may give a less accurate result in comparison.
10. Supervised learning is not close to true artificial intelligence, as we first train the model for each data point and only then can it predict the correct output; unsupervised learning is closer to true artificial intelligence, as it learns the way a child learns daily routine things from experience.
11. Supervised learning includes algorithms such as Linear Regression, Logistic Regression, Support Vector Machine, multi-class classification, decision trees, and Bayesian logic; unsupervised learning includes algorithms such as clustering, KNN, and the Apriori algorithm.

Q.7 Outline the kNN algorithm in all aspects of Machine Learning
Ans 1. K-Nearest Neighbour is one of the simplest Machine Learning algorithms
based on Supervised Learning technique.
2. K-NN algorithm assumes the similarity between the new case/data and
available cases and puts the new case into the category that is most similar
to the available categories.
3. K-NN algorithm stores all the available data and classifies a new data point
based on the similarity. This means that when new data appears, it can be easily classified into a well-suited category by using the K-NN algorithm.
4. K-NN algorithm can be used for Regression as well as for Classification but
mostly it is used for the Classification problems.
5. K-NN is a non-parametric algorithm, which means it does not make any
assumption on underlying data.
6. It is also called a lazy learner algorithm because it does not learn from the
training set immediately instead it stores the dataset and at the time of
classification, it performs an action on the dataset.

7. KNN algorithm at the training phase just stores the dataset and when it
gets new data, it classifies that data into the category that is most similar to the new data.
8. Example: Suppose there are two categories, i.e., Category A and Category
B, and we have a new data point x1, so this data point will lie in which of
these categories. To solve this type of problem, we need a K-NN algorithm.
With the help of K-NN, we can easily identify the category or class of a
particular dataset. Consider the below diagram.

The K-NN working can be explained on the basis of the below algorithm:
Step-1: Select the number K of the neighbors
Step-2: Calculate the Euclidean distance of K number of neighbors
Step-3: Take the K nearest neighbors as per the calculated Euclidean distance.
Step-4: Among these k neighbors, count the number of the data points in each
category.
Step-5: Assign the new data points to that category for which the number of the
neighbor is maximum.
Step-6: Our model is ready
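A compact NumPy sketch (with hypothetical 2-D points and K = 3) that follows the steps above: compute Euclidean distances, take the K nearest neighbours, and assign the majority category:

```python
import numpy as np
from collections import Counter

# Hypothetical training data: 2-D points with categories "A" and "B".
X_train = np.array([[1, 1], [1, 2], [2, 1], [6, 6], [7, 6], [6, 7]], dtype=float)
y_train = np.array(["A", "A", "A", "B", "B", "B"])

def knn_predict(x_new, k=3):
    # Step 2: Euclidean distance from the new point to every training point.
    distances = np.linalg.norm(X_train - x_new, axis=1)
    # Step 3: take the k nearest neighbours.
    nearest = y_train[np.argsort(distances)[:k]]
    # Steps 4-5: count the categories among the neighbours and assign the majority.
    return Counter(nearest).most_common(1)[0][0]

print(knn_predict(np.array([2.0, 2.0])))  # "A" for this toy data
print(knn_predict(np.array([6.5, 6.0])))  # "B" for this toy data
```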

Module-3 Questions
Q.1 Demonstrate the concept and application of decision trees
Ans The decision-tree representation is the most widely used logic method. Decision trees are supervised learning methods that are constructed from a set of input–output samples. It is an efficient nonparametric method for classification and
regression. A decision tree is a hierarchical model for supervised learning where
the local region is identified in a sequence of recursive splits through decision
nodes with test function. A typical decision-tree learning system adopts a top
down strategy that searches for a solution in a part of the search space. It
guarantees that a simple, but not necessarily the simplest, tree will be found. A
decision tree consists of nodes where attributes are tested. In a univariate tree,
for each internal node, the test uses only one of the attributes for testing. The
outgoing branches of a node correspond to all the possible
outcomes of the test at the node. A simple decision tree for classification of
samples with two input attributes X and Y is given in Figure below

All samples with feature values X > 1 and Y = B belong to Class2, while the samples
with values X < 1 belong to Class1, whatever the value for feature Y. The samples,
at a nonleaf node in the tree structure, are thus partitioned along the branches,
and each child node gets its corresponding subset of samples.
The algorithmic procedure is as follows: an attribute is selected to partition these samples. For each value of the attribute, a branch is created, and the
corresponding subset of samples that have the attribute value specified by the
branch is moved to the newly created child node. The algorithm is applied
recursively to each child node until all samples at a node are of one class. Every
path to the leaf in the decision tree represents a classification rule. Note that the
critical decision in such a top-down decision-tree-generation algorithm is the
choice of attribute at a node.
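A minimal scikit-learn sketch (toy data loosely mirroring the X/Y example above, with the categorical attribute Y encoded as 0 = A, 1 = B) showing how a univariate decision tree is induced and printed as rules:

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Toy samples: attribute X is numeric, attribute Y is encoded (0 = A, 1 = B).
X = [[0.5, 0], [0.7, 1], [2.0, 0], [3.0, 1], [2.5, 1], [0.2, 1]]
y = ["Class1", "Class1", "Class1", "Class2", "Class2", "Class1"]

tree = DecisionTreeClassifier(criterion="entropy", random_state=0)
tree.fit(X, y)

# Every path from the root to a leaf corresponds to a classification rule.
print(export_text(tree, feature_names=["X", "Y"]))
```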

Q.2 Generate the decision tree for the following
Ans
Q.3 How is decision-tree pruning done?
Ans Discarding one or more subtrees and replacing them with leaves simplifies a decision
tree, and that is the main task in decision-tree pruning. In replacing the subtree
with a leaf, the algorithm expects to lower the predicted error rate and increase
the quality of a classification model. But computation of error rate is not simple.
An error rate based only on a training data set does not provide a suitable
estimate. One possibility to estimate the predicted error rate is to use a new,
additional set of test samples if they are available or to use the cross-validation
techniques. This technique divides initially available samples into equal-sized
blocks, and, for each block, the tree is constructed from all samples except this
block and tested with a given block of samples. With the available training and
testing samples, the basic idea of decision-tree pruning is to remove parts of the
tree (subtrees) that do not contribute to the classification accuracy of unseen
testing samples, producing a less complex and thus more comprehensible tree.
There are two ways in which the recursive-partitioning method can be modified:

1. Deciding not to divide a set of samples any further under some conditions. The
stopping criterion is usually based on some statistical tests, such as the χ2 test: If
there are no significant differences in classification accuracy before and after
division, then represent a current node as a leaf. The decision is made in advance,
before splitting, and therefore this approach is called prepruning.

2. Removing retrospectively some of the tree structure using selected accuracy criteria. The decision in this process of postpruning is made after the tree has been built.
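One practical route to postpruning is cost-complexity pruning; the hedged scikit-learn sketch below (on synthetic data) removes subtrees by choosing a ccp_alpha value and checks accuracy on held-out test samples:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Grow a full tree, then obtain the candidate pruning levels (alphas).
full_tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
path = full_tree.cost_complexity_pruning_path(X_train, y_train)

# Refit a pruned tree for each alpha and keep the one with the best test accuracy.
best = max(
    (DecisionTreeClassifier(random_state=0, ccp_alpha=a).fit(X_train, y_train)
     for a in path.ccp_alphas),
    key=lambda t: t.score(X_test, y_test),
)
print("Leaves before:", full_tree.get_n_leaves(), "after pruning:", best.get_n_leaves())
```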

Q.4 Discuss the limitations of Decision Trees and Decision Rules
Ans Decision-rule and decision-tree-based models are relatively simple and readable,
and their generation is very fast. Unlike many statistical approaches, a logical
approach does not depend on assumptions about distribution of attribute values
or independence of attributes. Also, this method tends to be more robust across
tasks than most other statistical methods. But there are also some disadvantages
and limitations of a logical approach, as mentioned below

Not good for regression: Decision trees are less suited to predicting precise continuous outcomes, since their predictions are piecewise constant. On high-dimensional datasets, the model may also become over-fit on the training set, overstating the accuracy of predictions on the training set and preventing the model from accurately predicting results on the test set.

Expensive: The cost of creating a decision tree is high since each node requires
field sorting. In other algorithms, a mixture of several fields is used at the same
time, resulting in even higher expenses. Pruning methods are also expensive due
to the large number of candidate subtrees that must be produced and compared.

Independence between samples: Each training example must be completely independent of the other samples in the dataset. If they are related in some
manner, the model will try to give those specific training instances more weight.
As a result, no matched data or repeated measurements should be used as training
data.
Unstable: Because slight changes in the data can result in an entirely different
tree being constructed, decision trees can be unstable. The use of decision trees
within an ensemble helps to solve this difficulty.

Greedy Approach: To form a binary tree, the input space must be partitioned
correctly. The greedy algorithm used for this is recursive binary splitting. It is a
numerical procedure that entails the alignment of various values. Data will be split

according to the first best split, and only that path will be used to split the data.
However, various pathways of the split could be more instructive; thus, that split
may not be the best.

Q.5 Comment with suitable explanation on “ANN offers useful properties and
capabilities in Machine Learning Process”
Ans It is apparent that an ANN derives its computing power through, first, its massive
parallel distributed structure and, second, its ability to learn and therefore to
generalize. Generalization refers to the ANN producing reasonable outputs for
new inputs not encountered during a learning process. The use of ANNs offers
several useful properties and capabilities
Nonlinearity: An artificial neuron as a basic unit can be a linear or nonlinear
processing element, but the entire ANN is highly nonlinear. It is a special kind of
nonlinearity in the sense that it is distributed throughout the network. This
characteristic is especially important because an ANN models the inherently nonlinear real-world mechanisms responsible for generating the data used for learning.
Learning from examples: An ANN modifies its interconnection weights by applying
a set of training or learning samples. The final effects of a learning process are
tuned parameters of a network (the parameters are distributed through the main
components of the established model), and they represent implicitly stored
knowledge for the problem at hand.
Adaptivity: An ANN has a built-in capability to adapt its interconnection weights
to changes in the surrounding environment. In particular, an ANN trained to
operate in a specific environment can be easily retrained to deal with changes in
its environmental conditions. Moreover, when it is operating in a nonstationary
environment, an ANN can be designed to adapt its parameters in real time.
Evidential response: In the context of data classification, an ANN can be designed
to provide information not only about which particular class to select for a given
sample but also about confidence in the decision made. This latter information may
be used to reject ambiguous data, should they arise, and therefore improve the
classification performance or performances of the other tasks modeled by the
network.
Fault tolerance: An ANN has the potential to be inherently fault tolerant or
capable of robust computation. Its performances do not degrade significantly
under adverse operating conditions such as disconnection of neurons and noisy or
missing data. There is some empirical evidence for robust computation, but usually
it is uncontrolled.
Uniformity of analysis and design: Basically, ANNs enjoy universality as
information processors. The same principles, notation, and the same steps in
methodology are used in all domains involving application of ANNs.

Q.6 Illustrate the feedforward and recurrent architectures of ANN
Ans The architecture of an ANN is defined by the characteristics of a node and the
characteristics of the node’s connectivity in the network. The basic characteristics
of a single node have been given in a previous section, and in this section the
parameters of connectivity will be introduced. Typically, network architecture is
specified by the number of inputs to the network, the number of outputs, the total
number of elementary nodes that are usually equal processing elements for the

entire network, and their organization and interconnections. Neural networks are
generally classified into two categories on the basis of the type of
interconnections: feedforward and recurrent.
The network is feedforward if the processing propagates strictly from the input side to the output side, without any loops or feedback. In a layered
representation of the feedforward neural network, there are no links between
nodes in the same layer; outputs of nodes in a specific layer are always connected
as inputs to nodes in succeeding layers. This representation is preferred because
of its modularity, i.e., nodes in the same layer have the same functionality or
generate the same level of abstraction about input vectors. If there is a feedback
link that forms a circular path in a network (usually with a delay element as a
synchronization component), then the network is recurrent. Examples of ANNs
belonging to both classes are given in Figure below

Feedforward Network Recurrent Network


Although many neural-network models have been proposed in both classes, the
multilayer feedforward network with a backpropagation-learning mechanism is
the most widely used model in terms of practical applications.
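A brief scikit-learn sketch (on synthetic data) of that most widely used model, a multilayer feedforward network trained with backpropagation:

```python
from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=200, n_features=4, random_state=1)

# One hidden layer of 10 nodes; the weights are adjusted by backpropagation of the error signal.
net = MLPClassifier(hidden_layer_sizes=(10,), max_iter=2000, random_state=1)
net.fit(X, y)

print("Training accuracy:", net.score(X, y))
```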

Q.7 What is pattern recognition? What types of pattern recognition algorithms are
used in machine learning?
Ans Pattern Recognition is defined as the process of identifying the trends (global or
local) in the given pattern. A pattern can be defined as anything that follows a
trend and exhibits some kind of regularity. The recognition of patterns can be done
physically, mathematically, or by the use of algorithms. When we talk about
pattern recognition in machine learning, it indicates the use of powerful
algorithms for identifying the regularities in the given data. Pattern recognition is
widely used in the new age technical domains like computer vision, speech
recognition, face recognition, etc.
Types of Pattern Recognition Algorithms in Machine Learning
1. Supervised Algorithms
Pattern recognition with a supervised approach is called classification. These algorithms use a two-stage methodology for identifying the patterns: the first stage is the development/construction of the model, and the second stage involves prediction for new or unseen objects. The key features of this concept are listed below.
• Partition the given data into two sets- Training and Test set
• Train the model using a suitable machine learning algorithm such as SVM
(Support Vector Machines), decision trees, random forest, etc.
• Training is the process through which the model learns or recognizes the
patterns in the given data for making suitable predictions.

16 | P a g e
• The test set contains data with known outcomes, held out from training.
• It is used for validating the predictions made by the trained model.
• The model is trained on the training set and tested on the test set.
• The performance of the model is evaluated based on correct predictions
made.
• The trained and tested model developed for recognizing patterns using
machine learning algorithms is called a classifier.
• This classifier is used to make predictions for unseen data/objects.

2. Unsupervised Algorithms
In contrast to the supervised algorithms for pattern recognition, which make use of training and testing sets, these algorithms use a group-by approach. They observe the patterns
in the data and group them based on the similarity in their features such as
dimension to make a prediction. Let’s say that we have a basket of different kinds
of fruits such as apples, oranges, pears, and cherries. We assume that we do not
know the names of the fruits. We keep the data as unlabeled. Now, suppose we
encounter a situation where someone comes and tells us to identify a new fruit
that was added to the basket. In such a case we make use of a concept called
clustering.
• Clustering combines or groups items having the same features.
• No previous knowledge is available for identifying a new item.
• They use machine learning algorithms like hierarchical and k-means
clustering.
• Based on the features or properties of the new object, it is assigned to a
group to make a prediction.

Module-4 Questions
Q.1 What is Market Basket Analysis? In what way does it help retailers? What are the essential steps for implementing MBA?
Ans Frequent itemset mining leads to the discovery of associations and correlations
between items in huge transactional or relational datasets. The disclosure of
“Correlation Relationships” among huge amounts of transaction records can help
in many decision-making processes.
A popular example of frequent itemset mining is Market Basket Analysis. This
process identifies customer buying habits by finding associations between the
different items that customers place in their “shopping baskets”. The discovery of
this kind of association will be helpful for retailers or marketers to develop
marketing strategies by gaining insight into which items are frequently bought
together by customers.
For example, if customers are buying milk, how likely are they to also buy bread
(and which kind of bread) on the same trip to the supermarket?
There are many advantages to implementing Market Basket Analysis in marketing.
Market Basket Analysis (MBA) can be applied to data of customers from the point
of sale (PoS) systems.
It helps retailers with:
• Increasing customer engagement
• Boosting sales and increasing RoI
• Improving customer experience
• Optimizing marketing strategies and campaigns
• Helping to understand customers better
• Identifying customer behavior and patterns

The essential steps for implementing MBA can be given as:-

• First, define the minimum support and confidence for the association rule.
• Find out all the subsets in the transactions with higher support(sup) than
the minimum support.
• Find all the rules for these subsets with higher confidence than minimum
confidence.
• Sort these association rules in decreasing order.
• Analyze the rules along with their confidence and support.

Q.2 State basic definitions and Rule evaluation metrics of “Association Rule Mining”
Ans For the sake of better understanding and explanation, please consider the simple transaction database below:
Transaction-ID Items
1 Bread, Milk
2 Bread, Diapers, Beer, Eggs
3 Milk, Diaper, Beer, Coke
4 Bread, Milk, Diaper, Beer
5 Bread, Milk, Diaper, Coke

Association Rule Mining: Basic Definitions


Before defining the rules of Association Rule Mining, let us first have a look at the
basic definitions.
Support Count(σ): It accounts for the frequency of occurrence of an itemset.

Here σ({Milk, Bread, Diaper})=2

Frequent Itemset: It represents an itemset whose support is greater than or equal to the minimum threshold.

Association Rule: It represents an implication expression of the form X -> Y. Here X and Y represent any 2 itemsets. Example: {Milk, Diaper} -> {Beer}

Association Rule Mining: Rule Evaluation Metrics

The rule evaluation metrics used in Association Rule Mining are as follows:

Support(s): It is the number of transactions that include items from the {X} and {Y}
parts of the rule as a percentage of total transactions. It can be represented in the
form of a percentage of all transactions that shows how frequently a group of
items occurs together.

Support = σ(X∪Y) ÷ (total number of transactions): It is the fraction of transactions that include both X and Y.

Confidence(c): This ratio represents the number of transactions containing all of the items in {X} and {Y} relative to the number of transactions containing the items in {X}.

Conf(X=>Y) = Supp(X∪Y) ÷ Supp(X): It measures how often the items in Y appear in transactions that also contain the items in X.

Lift(l): The lift of the rule X=>Y is the confidence of the rule divided by the expected confidence, assuming that the itemsets X and Y are independent of one another. The expected confidence is simply the frequency (support) of {Y}.

Lift(X=>Y) = Conf(X=>Y) ÷ Supp(Y): Lift values near 1 indicate that X and Y almost
always appear together as expected. Lift values greater than 1 indicate that they
appear together more than expected, and lift values less than 1 indicate that they
appear less than expected. Greater lift values indicate a more powerful
association.
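A short Python sketch computing the three metrics for the rule {Milk, Diaper} -> {Beer} over the five transactions listed above (item names normalized to 'Diaper'):

```python
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def support_count(itemset):
    # Number of transactions containing every item of the itemset.
    return sum(1 for t in transactions if itemset <= t)

X, Y = {"Milk", "Diaper"}, {"Beer"}
n = len(transactions)

support = support_count(X | Y) / n                    # 2/5 = 0.4
confidence = support_count(X | Y) / support_count(X)  # 2/3 ≈ 0.67
lift = confidence / (support_count(Y) / n)            # 0.67 / 0.6 ≈ 1.11
print(support, confidence, lift)
```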

Q.3 Apply the Apriori algorithm for the following data

with minimum support count as 2 and minimum confidence as 50%


Ans Step-1: K=1

(I) Create a table containing support count of each item present in dataset – Called
C1(candidate set)

(II) compare candidate set item’s support count with minimum support count(here
min_support=2 if support_count of candidate set items is less than min_support
then remove those items). This gives us itemset L1.

Step-2: K=2
Generate candidate set C2 using L1 (this is called join step). Condition of joining
Lk-1 and Lk-1 is that it should have (K-2) elements in common.
Check whether all subsets of an itemset are frequent or not, and if not frequent, remove that itemset. (Example: the subsets of {I1, I2} are {I1} and {I2}; they are frequent. Check for each itemset.)
Now find support count of these itemsets by searching in dataset.

(II) compare candidate (C2) support count with minimum support count(here
min_support=2 if support_count of candidate set item is less than min_support
then remove those items) this gives us itemset L2.

Step-3:
Generate candidate set C3 using L2 (join step). Condition of joining Lk-1 and Lk-1
is that it should have (K-2) elements in common. So here, for L2, first element
should match.
So the itemsets generated by joining L2 are {I1, I2, I3}, {I1, I2, I5}, {I1, I3, I5}, {I2, I3, I4}, {I2, I4, I5}, {I2, I3, I5}.

Check if all subsets of these itemsets are frequent or not and, if not, remove that itemset. (Here the subsets of {I1, I2, I3} are {I1, I2}, {I2, I3}, {I1, I3}, which are frequent. For {I2, I3, I4}, the subset {I3, I4} is not frequent, so remove it. Similarly check for every itemset.)
Find the support count of the remaining itemsets by searching in the dataset.

(II) Compare candidate (C3) support count with minimum support count(here
min_support=2 if support_count of candidate set item is less than min_support
then remove those items) this gives us itemset L3.

Step-4:
Generate candidate set C4 using L3 (join step). Condition of joining Lk-1 and Lk-1
(K=4) is that, they should have (K-2) elements in common. So here, for L3, first 2
elements (items) should match.
Check all subsets of these itemsets are frequent or not (Here itemset formed by
joining L3 is {I1, I2, I3, I5} so its subset contains {I1, I3, I5}, which is not frequent).
So there is no itemset in C4.
We stop here because no further frequent itemsets are found.

Thus, we have discovered all the frequent item-sets. Now generation of strong
association rule comes into picture. For that we need to calculate confidence of
each rule.

Confidence –
A confidence of 50% means that 50% of the customers who purchased milk and bread also bought butter.

Confidence(A->B)=Support_count(A∪B)/Support_count(A)

So here, by taking an example of any frequent itemset, we will show the rule
generation.

Itemset {I1, I2, I3} //from L3

SO rules can be
[I1^I2]=>[I3] //confidence = sup(I1^I2^I3)/sup(I1^I2) = 2/4*100=50%
[I1^I3]=>[I2] //confidence = sup(I1^I2^I3)/sup(I1^I3) = 2/4*100=50%
[I2^I3]=>[I1] //confidence = sup(I1^I2^I3)/sup(I2^I3) = 2/4*100=50%
[I1]=>[I2^I3] //confidence = sup(I1^I2^I3)/sup(I1) = 2/6*100=33%
[I2]=>[I1^I3] //confidence = sup(I1^I2^I3)/sup(I2) = 2/7*100=28%
[I3]=>[I1^I2] //confidence = sup(I1^I2^I3)/sup(I3) = 2/6*100=33%

So if the minimum confidence is 50%, then the first three rules can be considered strong association rules.
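For comparison, a hedged sketch using the mlxtend library (its long-standing TransactionEncoder, apriori, and association_rules interface; treat argument names as assumptions if your version differs). Since the transaction table for this question did not survive extraction, a nine-transaction dataset consistent with the support counts used in the worked answer above (e.g. sup(I1, I2, I3) = 2, sup(I1) = 6, sup(I2) = 7) is assumed:

```python
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, association_rules

# Assumed transactions (the original table was lost in extraction);
# they are consistent with the support counts used in the worked answer.
transactions = [
    ["I1", "I2", "I5"], ["I2", "I4"], ["I2", "I3"],
    ["I1", "I2", "I4"], ["I1", "I3"], ["I2", "I3"],
    ["I1", "I3"], ["I1", "I2", "I3", "I5"], ["I1", "I2", "I3"],
]

te = TransactionEncoder()
onehot = pd.DataFrame(te.fit(transactions).transform(transactions), columns=te.columns_)

# Minimum support count of 2 out of 9 transactions, minimum confidence of 50%.
frequent = apriori(onehot, min_support=2 / 9, use_colnames=True)
rules = association_rules(frequent, metric="confidence", min_threshold=0.5)
print(rules[["antecedents", "consequents", "support", "confidence"]])
```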

Q.4 Compare the Apriori and FP Growth Algorithms
Ans
1. Apriori is an array-based algorithm; FP Growth is a tree-based algorithm.
2. Apriori uses join and prune techniques; FP Growth constructs a conditional frequent-pattern tree from the database that satisfies minimum support.
3. Apriori uses a breadth-first search algorithm; FP Growth uses a depth-first search algorithm.
4. Apriori uses a level-wise approach, generating patterns containing 1 item, then 2 items, and so on; FP Growth utilizes a pattern-growth approach, meaning that it only considers patterns actually existing in the database.
5. In Apriori, candidate generation is extremely slow and runtime increases exponentially with the number of items; in FP Growth, runtime increases linearly, depending on the number of transactions and items.
6. In Apriori, candidate generation is very parallelizable; in FP Growth, the data are very interdependent and each node needs the root.
7. Apriori requires large memory space due to the large number of candidates generated; FP Growth requires less memory space due to its compact structure and the absence of candidate generation.
8. Apriori scans the database multiple times for generating candidate sets; FP Growth scans the dataset only twice for constructing the frequent-pattern tree.

Q.5 What is the FP Growth algorithm? List the advantages and disadvantages of the FP Growth algorithm.
Ans The FP-Growth Algorithm is an alternative way to find frequent item sets without
using candidate generation, thus improving performance. To do so, it uses a divide-and-conquer strategy. The core of this method is the usage of a special data
structure named frequent-pattern tree (FP-tree), which retains the item set
association information
The FP growth algorithm works as follows
o First, it compresses the input database creating an FP-tree instance to
represent frequent items.
o After this first step, it divides the compressed database into a set of
conditional databases, each associated with one frequent pattern.
o Finally, each such database is mined separately.

Using this strategy, the FP-Growth reduces the search costs by recursively looking
for short patterns and then concatenating them into the long frequent patterns.

In large databases, holding the FP-tree in main memory may be impossible. A strategy to cope with this problem is to partition the database into a set of smaller databases (called projected databases) and then construct an FP-tree from each of these smaller databases.

Advantages of FP Growth Algorithm
• This algorithm needs to scan the database twice when compared to Apriori,
which scans the transactions for each iteration.
• The pairing of items is not done in this algorithm, making it faster.
• The database is stored in a compact version in memory.
• It is efficient and scalable for mining both long and short frequent patterns.

Disadvantages of FP-Growth Algorithm
• FP Tree is more cumbersome and difficult to build than Apriori.
• It may be expensive.
• The algorithm may not fit in the shared memory when the database is
large.

Q.6 What are the different types of association rules in data mining? Briefly mention the algorithms used for association rule mining
Ans There are typically three different types of association rules in data mining. They
are
• Multi-relational association rules
• Generalized Association rule
• Quantitative Association Rules

Multi-Relational Association Rule


Also known as MRAR, multi-relational association rule is defined as a new class of
association rules that are usually derived from different or multi-relational
databases. Each rule under this class has one entity with different relationships
that represent the indirect relationships between entities.

Generalized Association Rule


Moving on to the next type of association rule, the generalized association rule is
largely used for getting a rough idea about the interesting patterns that often tend
to stay hidden in data.

Quantitative Association Rules


This particular type is one of the most distinctive of the association rule types listed here. What sets it apart from the others is the presence of
numeric attributes in at least one attribute of quantitative association rules. This
is in contrast to the generalized association rule, where the left and right sides
consist of categorical attributes.

Algorithms of Association Rule Mining in Data Mining

There are mainly three different types of algorithms that can be used to generate association rules in data mining. Let’s take a look at them.

Apriori Algorithm
Apriori algorithm identifies the frequent individual items in a given database and
then expands them to larger item sets, keeping in check that the item sets appear
sufficiently often in the database.

Eclat Algorithm
The ECLAT algorithm, also known as Equivalence Class Clustering and bottom-up Lattice Traversal, is another widely used method for association rule mining. Some even consider it to be a better and more efficient version of the Apriori algorithm.

FP-Growth Algorithm
Also known as the recurring pattern, this algorithm is particularly useful for finding
frequent patterns without the need for candidate generation. It mainly operates in two stages, namely FP-tree construction and the extraction of frequent itemsets.

Module-5 Questions
Q.1 Define web mining.
Web mining is decomposed into some major subtasks; what are they? Explain them in brief.
Ans
Web mining may be defined as the use of data-mining techniques to automatically
discover and extract information from Web documents and services. It refers to
the overall process of discovery, not just to the application of standard data-mining
tools.

The process of web mining is decomposed into four major subtasks, as mentioned below:

1. Resource finding—This is the process of retrieving data, which is either online or offline, from the multimedia sources on the Web, such as news articles, forums,
blogs, and the text content of HTML documents obtained by removing the HTML
tags.

2. Information selection and preprocessing—This is the process by which different kinds of original data retrieved in the previous subtask are transformed. These
transformations could be either a kind of preprocessing such as removing stop
words, stemming, etc. or a preprocessing aimed at obtaining the desired
representation, such as finding phrases in the training corpus, representing the
text in the first-order logic form, etc.

3. Generalization—Generalization is the process of automatically discovering general patterns within individual Web sites as well as across multiple sites.
Different general-purpose machine-learning techniques, data-mining techniques,
and specific Web-oriented methods are used.

4. Analysis—This is a task in which validation and/or interpretation of the mined patterns is performed.

Q.2 Compare Web Content, Web Structure and Web Usage
Ans
Criterion: View of data
• Web Content (IR view): unstructured, semi-structured
• Web Content (DB view): semi-structured, Web site as DB
• Web Structure: link structure
• Web Usage: interactivity

Criterion: Main data
• Web Content (IR view): text documents, hypertext documents
• Web Content (DB view): hypertext documents
• Web Structure: link structure
• Web Usage: server logs, browser logs

Criterion: Method
• Web Content (IR view): machine learning, statistical (including NLP)
• Web Content (DB view): proprietary algorithms, association rules
• Web Structure: proprietary algorithms
• Web Usage: machine learning, statistical, association rules

Criterion: Representation
• Web Content (IR view): bag of words, n-gram terms, phrases, concepts or ontology, relational
• Web Content (DB view): edge-labeled graph, relational
• Web Structure: graph
• Web Usage: relational table, graph

Criterion: Application categories
• Web Content (IR view): categorization, clustering, finding extraction rules, finding patterns in text
• Web Content (DB view): finding frequent substructures, Web site schema discovery
• Web Structure: categorization, clustering
• Web Usage: site construction, adaptation and management

Q.3 Narrate the working of the HITS algorithm
Ans In the HITS algorithm, the first step is to retrieve the most relevant pages to the
search query. This set is called the root set and can be obtained by taking the top
pages returned by a text-based search algorithm. A base set is generated by
augmenting the root set with all the web pages that are linked from it and some
of the pages that link to it. The web pages in the base set and all hyperlinks among
those pages form a focused subgraph. The HITS computation is performed only on
this focused subgraph. According to Kleinberg the reason for constructing a base
set is to ensure that most (or many) of the strongest authorities are included.

Authority and hub values are defined in terms of one another in a mutual
recursion. An authority value is computed as the sum of the scaled hub values that
point to that page. A hub value is the sum of the scaled authority values of the
pages it points to. Some implementations also consider the relevance of the linked
pages.

The algorithm performs a series of iterations, each consisting of two basic steps:

• Authority update: Update each node's authority score to be equal to the sum of the hub scores of each node that points to it. That is, a node is given
a high authority score by being linked from pages that are recognized as
Hubs for information.
• Hub update: Update each node's hub score to be equal to the sum of the
authority scores of each node that it points to. That is, a node is given a
high hub score by linking to nodes that are considered to be authorities on
the subject.
The Hub score and Authority score for a node is calculated with the following
algorithm:

• Start with each node having a hub score and authority score of 1.
• Run the authority update rule
• Run the hub update rule

• Normalize the values by dividing each Hub score by square root of the sum
of the squares of all Hub scores, and dividing each Authority score by
square root of the sum of the squares of all Authority scores.
• Repeat from the second step as necessary.
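A simple self-contained Python sketch (on a small hypothetical link graph) of the iteration described above, including the normalization step:

```python
import math

# Hypothetical focused subgraph: page -> pages it links to.
links = {"p1": ["p2", "p3"], "p2": ["p3"], "p3": ["p1"], "p4": ["p3"]}
pages = list(links)

hub = {p: 1.0 for p in pages}
auth = {p: 1.0 for p in pages}

for _ in range(20):
    # Authority update: sum of the hub scores of pages that point to the node.
    auth = {p: sum(hub[q] for q in pages if p in links[q]) for p in pages}
    # Hub update: sum of the authority scores of pages the node points to.
    hub = {p: sum(auth[q] for q in links[p]) for p in pages}
    # Normalize both score vectors by their Euclidean norms.
    a_norm = math.sqrt(sum(v * v for v in auth.values()))
    h_norm = math.sqrt(sum(v * v for v in hub.values()))
    auth = {p: v / a_norm for p, v in auth.items()}
    hub = {p: v / h_norm for p, v in hub.items()}

print("authorities:", auth)
print("hubs:", hub)
```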

Q.4 Discuss the PageRank algorithm and its implementation
Ans PageRank (PR) is an algorithm used by Google Search to rank websites in their
search engine results. PageRank was named after Larry Page, one of the founders
of Google. PageRank is a way of measuring the importance of website pages.
According to Google:

“PageRank works by counting the number and quality of links to a page to determine a rough estimate of how important the website is. The underlying
assumption is that more important websites are likely to receive more links from
other websites.”

The PageRank algorithm outputs a probability distribution used to represent the likelihood that a person randomly clicking on links will arrive at any particular page.
PageRank can be calculated for collections of documents of any size. It is assumed
in several research papers that the distribution is evenly divided among all
documents in the collection at the beginning of the computational process. The
PageRank computations require several passes, called “iterations”, through the
collection to adjust approximate PageRank values to more closely reflect the
theoretical true value.
Assume a small universe of four web pages: A, B, C, and D. Links from a page to
itself, or multiple outbound links from one single page to another single page, are
ignored. PageRank is initialized to the same value for all pages. In the original form
of PageRank, the sum of PageRank over all pages was the total number of pages
on the web at that time, so each page in this example would have an initial value
of 1. However, later versions of PageRank, and the remainder of this section,
assume a probability distribution between 0 and 1. Hence the initial value for each
page in this example is 0.25.
The PageRank transferred from a given page to the targets of its outbound links
upon the next iteration is divided equally among all outbound links.
If the only links in the system were from pages B, C, and D to A, each link would
transfer 0.25 PageRank to A upon the next iteration, for a total of 0.75.
PR(A) = PR(B) + PR(C) + PR(D),
Suppose instead that page B had a link to pages C and A, page C had a link to page
A, and page D had links to all three pages. Thus, upon the first iteration, page B
would transfer half of its existing value, or 0.125, to page A and the other half, or
0.125, to page C. Page C would transfer all of its existing value, 0.25, to the only
page it links to, A. Since D had three outbound links, it would transfer one-third of
its existing value, or approximately 0.083, to A. At the completion of this iteration,
page A will have a PageRank of approximately 0.458.

In other words, the PageRank conferred by an outbound link is equal to the document’s own PageRank score divided by its number of outbound links L(v).

In the general case, the PageRank value for any page u can be expressed as:

PR(u) = Σ PR(v) / L(v), summed over all pages v in the set Bu

i.e. the PageRank value for a page u is dependent on the PageRank values for each page v contained in the set Bu (the set containing all pages linking to page u), divided by the number L(v) of links from page v. The algorithm also involves a damping factor d (commonly 0.85) in the calculation of PageRank; it models the probability that a random surfer keeps following links rather than jumping to a random page, giving PR(u) = (1 - d)/N + d · Σ PR(v)/L(v), where N is the total number of pages.
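A short Python sketch reproduces the four-page example above (B links to C and A, C links to A, D links to A, B and C); one undamped iteration gives PR(A) ≈ 0.458 as stated, and the comment shows where the damping factor enters:

```python
# Outbound links for the four-page example in the text; A has no outbound links here.
links = {"A": [], "B": ["C", "A"], "C": ["A"], "D": ["A", "B", "C"]}
pr = {page: 0.25 for page in links}  # initial probability distribution

# One undamped iteration: each page passes PR/L(page) to each of its link targets.
new_pr = {page: 0.0 for page in links}
for page, outlinks in links.items():
    for target in outlinks:
        new_pr[target] += pr[page] / len(outlinks)

print(round(new_pr["A"], 3))  # ~0.458, matching the worked example above

# With a damping factor d (e.g. 0.85) the update becomes:
#   PR(u) = (1 - d) / N + d * sum(PR(v) / L(v) for v in Bu)
```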

Q.5 What is Text Mining?
Ans Text mining, also known as text data mining, is the process of transforming
unstructured text into a structured format to identify meaningful patterns and
new insights. By applying advanced analytical techniques, such as Naïve Bayes,
Support Vector Machines (SVM), and other deep learning algorithms, companies
are able to explore and discover hidden relationships within their unstructured
data.

Text is a one of the most common data types within databases. Depending on the
database, this data can be organized as:

Structured data: This data is standardized into a tabular format with numerous
rows and columns, making it easier to store and process for analysis and machine
learning algorithms. Structured data can include inputs such as names, addresses,
and phone numbers.

Unstructured data: This data does not have a predefined data format. It can
include text from sources like social media or product reviews, or rich media formats like video and audio files.

Semi-structured data: As the name suggests, this data is a blend between structured and unstructured data formats. While it has some organization, it
doesn’t have enough structure to meet the requirements of a relational database.
Examples of semi-structured data include XML, JSON and HTML files.

Since roughly 80% of data in the world resides in an unstructured format, text mining is an extremely valuable practice within
organizations. Text mining tools and natural language processing (NLP)
techniques, like information extraction, allow us to transform unstructured
documents into a structured format to enable analysis and the generation of high-
quality insights. This, in turn, improves the decision-making of organizations,
leading to better business outcomes.
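As an illustrative sketch of such a pipeline (scikit-learn, with made-up review snippets), unstructured text is turned into a structured TF-IDF matrix and classified with Naïve Bayes:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Made-up product-review snippets with sentiment labels, for illustration only.
texts = ["great product, works well", "terrible quality, broke fast",
         "really happy with this purchase", "waste of money, very disappointed"]
labels = ["positive", "negative", "positive", "negative"]

# TF-IDF converts unstructured text into a structured numeric matrix;
# Multinomial Naive Bayes then learns to classify the documents.
model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(texts, labels)

print(model.predict(["happy with the quality"]))
```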
