Question Bank DMC
Module-1 Questions
Q.1 What are the two primary goals for data mining? List and explain in brief the
Data Mining tasks used to achieve prediction and description
Ans The two primary goals of data mining tend to be prediction and description.
Prediction involves using some variables or fields in the data set to predict
unknown or future values of other variables of interest. Description, on the other
hand, focuses on finding patterns describing the data that can be interpreted by
humans. Therefore, it is possible to put data-mining activities into one of two
categories:
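1. Predictive data mining, which produces the model of the system described by
the given data set, or
2. Descriptive data mining, which produces new, nontrivial information based on
the available data set.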
On the predictive end of the spectrum, the goal of data mining is to produce a
model, expressed as an executable code, which can be used to perform
classification, prediction, estimation, or other similar tasks. On the other,
descriptive end of the spectrum, the goal is to gain an understanding of the
analyzed system by uncovering patterns and relationships in large data sets.
The goals of prediction and description are achieved by using the following
primary data-mining tasks
1. Classification—Discovery of a predictive learning function that classifies a
data item into one of several predefined classes.
Q.2 Explain the Typical Data Mining Process, with suitable diagram
Ans The problem of discovering or estimating dependencies from data or discovering
totally new data is only one part of the general experimental procedure used by
scientists, engineers, and others who apply standard steps to draw conclusions
from the data. The general experimental procedure adapted to data-mining
problems involves the following steps:
State the problem: In this step, a modeler usually specifies a set of variables for
the unknown dependency and, if possible, a general form of this dependency as
an initial hypothesis. There may be several hypotheses formulated for a single
problem at this stage. The first step requires the combined expertise of an
application domain and a data-mining model. In practice, it usually means a close
interaction between the data-mining expert and the application expert.
Collect the data: This step is concerned with how the data are generated and
collected. In general, there are two distinct possibilities. The first is when the data-
generation process is under the control of an expert (modeler): this approach is
known as a designed experiment. The second possibility is when the expert cannot
influence the data-generation process: this is known as the observational
approach.
Preprocessing the data: In the observational setting, data are usually “collected”
from the existing databases, data warehouses, and data marts. Data preprocessing
usually includes at least two common tasks:
1. Outlier detection (and removal): outliers result from measurement errors
and coding and recording errors and, sometimes, are natural, abnormal
values. Such non-representative samples can seriously affect the model
produced later
2. Scaling, encoding, and selecting features: application-specific encoding
methods usually achieve dimensionality reduction by providing a smaller
number of informative features for subsequent data modeling.
Interpret the model and draw conclusions: models need to be interpretable in
order to be useful because humans are not likely to base their decisions on
complex “black-box” models. Note that the goals of accuracy of the model and
accuracy of its interpretation are somewhat contradictory.
Q.3 What care needs to be taken to enhance data quality?
Ans There are a number of indicators of data quality that have to be taken care of in
the preprocessing phase of a data-mining process:
1. The data should be accurate. The analyst has to check that the name is spelled
correctly, the code is in a given range, the value is complete, and so on.
2. The data should be stored according to data type. The analyst must ensure that
the numerical value is not presented in character form, that integers are not in the
form of real numbers, and so on.
3. The data should have integrity. Updates should not be lost because of conflicts
among different users; robust backup and recovery procedures should be
implemented if they are not already part of the Data Base Management System
(DBMS).
4. The data should be consistent. The form and the content should be the same
after integration of large data sets from different sources.
6. The data should be timely. The time component of data should be recognized
explicitly from the data or implicitly from the manner of its organization.
7. The data should be well understood. Naming standards are a necessary but not
the only condition for data to be well understood. The user should know that the
data corresponds to an established domain.
8. The data set should be complete. Missing data, which occurs in reality, should
be minimized. Missing data could reduce the quality of a global model. On the
other hand, some data-mining techniques are robust enough to support analyses
of data sets with missing values
Q.4 What are major types of data transformation techniques used in data
warehousing?
Ans There are four main types of transformations, and each has its own characteristics:
related fields. Examples include changing the data type of a field or replacing an
encoded field value with a decoded value.
Value   Decimal scaling   Min-max [0, 1]   Z-score
-3      -0.3              0.266666667      -0.463506141
-6      -0.6              0.066666667      -0.954277348
5       0.5               0.8              0.84521708
8       0.8               1                1.335988287
-7      -0.7              0                -1.117867751
2       0.2               0.6              0.354445872
Min = -7, Max = 8, Mean = -0.166666667, Std. Dev. = 6.112828042
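As a rough check, the normalized values for the six samples listed above can be recomputed in a few lines of Python. This is only a sketch: it assumes the three normalizations are decimal scaling, min-max scaling to [0, 1], and z-score (standard deviation) normalization with the sample standard deviation, and the function names are illustrative.

```python
from statistics import mean, stdev

values = [-3, -6, 5, 8, -7, 2]

def decimal_scaling(vs):
    # Divide by the smallest power of 10 that brings every |v| below 1.
    k = 0
    while max(abs(v) for v in vs) / 10 ** k >= 1:
        k += 1
    return [v / 10 ** k for v in vs]

def min_max(vs):
    # Map the smallest value to 0 and the largest value to 1.
    lo, hi = min(vs), max(vs)
    return [(v - lo) / (hi - lo) for v in vs]

def z_score(vs):
    # Center on the mean and divide by the sample standard deviation.
    m, s = mean(vs), stdev(vs)
    return [(v - m) / s for v in vs]

for name, fn in [("decimal", decimal_scaling), ("min-max", min_max), ("z-score", z_score)]:
    print(name, [round(x, 4) for x in fn(values)])
```

Under these assumptions min-max scaling always lands in [0, 1]: the minimum (-7) maps to 0 and the maximum (8) maps to 1.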
First Approach: a data miner, together with the domain expert, can manually
examine samples that have no values and enter a reasonable, probable, or
expected value based on a domain experience. The method is straightforward for
small numbers of missing values and relatively small data sets.
Third Approach: The data miner can generate a predictive model to predict each
of the missing values. For example, if three features A, B, and C are given for each
sample, then, based on samples that have all three values as a training set, the
data miner can generate a model of correlation between features. Different
techniques such as regression, Bayesian formalism, clustering, or decision-tree
induction may be used depending on data types
Q.7 What are outliers? Discuss the ways to deal with outliers
Ans In large data sets, there exist samples that do not comply with the general behavior
of the data model. Such samples, which are significantly different or inconsistent
with the remaining set of data, are called outliers. Outliers can be caused by
measurement error or they may be the result of inherent data variability.
Many data-mining algorithms try to minimize the influence of outliers on the final
model or to eliminate them in the preprocessing phases. Outliers arise due to
mechanical faults, changes in system behavior, fraudulent behavior, human error,
or instrument error or simply through natural deviations in populations.
Some data-mining applications are focused on outlier detection, and it is the
essential result of a data analysis. The process consists of two main steps:
(1) build a profile of the “normal” behavior and
(2) use the “normal” profile to detect outliers
Outlier detection and potential removal from a data set can be described as a
process of the selection of k out of n samples that are considerably dissimilar,
exceptional, or inconsistent with respect to the remaining data (k ≪ n). The problem
of defining outliers is nontrivial, especially in multidimensional samples. Main
types of outlier detection schemes are:
Module-2 Questions
Q.1 Discuss the major comparison parameters used in data reduction techniques
Ans Performing standard data-reduction operations (deleting rows, columns, or
values) as a preparation for data mining, we need to know what we gain and/or
lose with these activities. The overall comparison involves the following
parameters for analysis:
Q.3 Illustrate different types of learning
Ans There are two common types of the inductive-learning methods known:
1. Supervised learning (or learning with a teacher), and
2. Unsupervised learning (or learning without a teacher).
Supervised learning is used to estimate an unknown dependency from known
input–output samples. Classification and regression are common tasks supported
by this type of inductive learning. Supervised learning assumes the existence of a
teacher—fitness function or some other external method of estimating the
proposed model. The term “supervised” denotes that the output values for
training samples are known (i.e., provided by a “teacher”).
The following block diagram illustrates this form of learning. In conceptual terms, we
may think of the teacher as having knowledge of the environment.
The environment with its characteristics and model is, however, unknown to the
learning system. The parameters of the learning system are adjusted under the
combined influence of the training samples and the error signal. The error signal
is defined as the difference between the desired response and the actual response
of the learning system. Knowledge of the environment available to the teacher is
transferred to the learning system through the training samples, which adjust the
parameters of the learning system. It is a closed-loop feedback system, but the
unknown environment is not in the loop. As a performance measure for the
system, we may think in terms of the mean squared error or the sum of squared
errors over the training samples.
This function may be visualized as a multidimensional error surface, with the free
parameters of the learning system as coordinates. Any learning operation under
supervision is represented as a movement of a point on the error surface. For the
system to improve the performance over time and therefore learn from the
teacher, the operating point on an error surface has to move down successively
toward a minimum of the surface. The minimum point may be a local minimum or
a global minimum. The basic characteristics of optimization methods such as
stochastic approximation, iterative approach, and greedy optimization have been
given in the previous section. An adequate set of input–output samples will move
the operating point toward the minimum, and a supervised learning system will
be able to perform such tasks as pattern classification and function approximation.
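The movement of the operating point down the error surface can be sketched with plain gradient descent on the mean squared error of a one-input linear model. The model form, learning rate, number of epochs, and toy samples below are assumptions made for illustration; the text does not prescribe a particular optimizer.

```python
def mse(w, b, samples):
    # Mean squared error over the training samples.
    return sum((y - (w * x + b)) ** 2 for x, y in samples) / len(samples)

def train(samples, lr=0.05, epochs=2000):
    w, b = 0.0, 0.0  # free parameters = coordinates on the error surface
    n = len(samples)
    for _ in range(epochs):
        # Error signal: desired response minus actual response.
        errors = [(y - (w * x + b), x) for x, y in samples]
        grad_w = -2 * sum(e * x for e, x in errors) / n
        grad_b = -2 * sum(e for e, _ in errors) / n
        # Step downhill on the error surface.
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# Hypothetical input-output samples generated by y = 2x + 1.
samples = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0), (3.0, 7.0)]
w, b = train(samples)
print(round(w, 4), round(b, 4), mse(w, b, samples))
```

Because this error surface is a convex bowl, the only minimum is the global one; with nonlinear models the same procedure may stop in a local minimum instead.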
Q.4 From Data how to acknowledge what kind of learning task is defined for our
application?
Ans When the data are preprocessed and when we know what kind of learning task is
defined for our application, a list of data-mining methodologies and corresponding
computer-based tools is available. Depending on the characteristics of the
problem at hand and the available data set, we have to make a decision about the
application of one or more of the data-mining and knowledge-discovery
techniques, which include the following:
1. Statistical methods where the typical techniques are Bayesian inference, logistic
regression, ANOVA analysis, and log-linear models.
3. Decision trees and decision rules are the set of methods of inductive learning
developed mainly in artificial intelligence. Typical techniques include the CLS
method, the ID3 algorithm, the C4.5 algorithm, and the corresponding pruning
algorithms.
7. Fuzzy inference systems are based on the theory of fuzzy sets and fuzzy logic.
Fuzzy modeling and fuzzy decision-making are steps very often included in the
data-mining process.
SVM’s classification function is based on the concept of decision planes that define
decision boundaries between classes of samples. A simple example is shown in
following figure
Assume we wish to perform a classification, and our data has a categorical target
variable with two categories. Also assume that there are two input attributes with
continuous values. If we plot the data points using the value of one attribute on
the X axis and the other on the Y axis, we might end up with an image such as
shown in Figure 4.16b. In this problem the goal is to separate the two classes by a
function that is induced from available examples. The goal is to produce a classifier
that will work well on unseen examples, i.e. it generalizes well. The main idea is
that the decision boundary should be as far away as possible from the data points
of both classes. Therefore a linear SVM classifier is termed the optimal separating
hyperplane with the maximum margin, which can be seen in the following figure.
Supervised: Input data is provided to the model along with the output.
Unsupervised: Only input data is provided to the model.

Supervised: The goal is to train the model so that it can predict the output
when it is given new data.
Unsupervised: The goal is to find hidden patterns and useful insights in an
unknown dataset.

Supervised: Needs supervision to train the model.
Unsupervised: Does not need any supervision to train the model.

Supervised: Can be categorized into classification and regression problems.
Unsupervised: Can be categorized into clustering and association problems.

Supervised: Used in cases where we know the inputs as well as the
corresponding outputs.
Unsupervised: Used in cases where we have only input data and no corresponding
output data.

Supervised: Typically produces more accurate results.
Unsupervised: May give less accurate results than supervised learning.

Supervised: Is not close to true artificial intelligence, because we first
train the model on each data sample and only then can it predict the correct
output.
Unsupervised: Is closer to true artificial intelligence, because it learns the
way a child learns daily routine things from experience.

Supervised: Includes algorithms such as linear regression, logistic regression,
support vector machine, multi-class classification, decision tree, Bayesian
logic, etc.
Unsupervised: Includes algorithms such as clustering (for example, k-means) and
the Apriori algorithm.
7. At the training phase, the KNN algorithm just stores the dataset; when it
gets new data, it classifies that data into the category that is most
similar to the new data.
8. Example: Suppose there are two categories, Category A and Category B, and
we have a new data point x1. In which of these categories does this data
point lie? To solve this type of problem, we need a K-NN algorithm.
With the help of K-NN, we can easily identify the category or class of a
particular data point. Consider the diagram below.
The K-NN working can be explained on the basis of the below algorithm:
Step-1: Select the number K of the neighbors
Step-2: Calculate the Euclidean distance from the new data point to each point
in the dataset.
Step-3: Take the K nearest neighbors as per the calculated Euclidean distance.
Step-4: Among these K neighbors, count the number of data points in each
category.
Step-5: Assign the new data point to the category for which the number of
neighbors is maximum.
Step-6: Our model is ready
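The steps above can be sketched in a few lines of Python. This is a minimal illustration, not a tuned implementation: the training points, the query point x1, and the choice K = 3 are hypothetical, and ties between categories are broken arbitrarily.

```python
from collections import Counter
from math import dist  # Euclidean distance (Python 3.8+)

def knn_classify(train, new_point, k=3):
    # Step 2: distance from the new point to every stored sample.
    distances = sorted((dist(features, new_point), label) for features, label in train)
    # Step 3: keep the K nearest neighbors.
    nearest = distances[:k]
    # Steps 4-5: count labels among the neighbors and take the majority.
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical stored dataset with two categories, A and B.
train = [
    ((1.0, 1.0), "A"), ((1.5, 2.0), "A"), ((2.0, 1.0), "A"),
    ((6.0, 6.0), "B"), ((7.0, 7.5), "B"), ((6.5, 5.0), "B"),
]
x1 = (2.0, 2.0)
print(knn_classify(train, x1, k=3))
```

Note that "training" here is only storing `train`; all the work happens at query time, which is why K-NN is often described as a lazy learner.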
Module-3 Questions
Q.1 Demonstrate the concept and application of decision trees
Ans The decision-tree representation is the most widely used logic method.
Decision-tree learners are supervised learning methods that construct decision
trees from a set of input-output samples. The decision tree is an efficient
nonparametric method for classification and regression. It is a hierarchical
model for supervised learning where the local region is identified in a
sequence of recursive splits through decision nodes with test functions. A
typical decision-tree learning system adopts a top-down strategy that searches
for a solution in a part of the search space. It guarantees that a simple, but
not necessarily the simplest, tree will be found. A decision tree consists of
nodes where attributes are tested. In a univariate tree, the test at each
internal node uses only one of the attributes. The outgoing branches of a node
correspond to all the possible outcomes of the test at the node. A simple
decision tree for classification of samples with two input attributes X and Y
is given in the figure below.
All samples with feature values X > 1 and Y = B belong to Class2, while the samples
with values X < 1 belong to Class1, whatever the value for feature Y. The samples,
at a nonleaf node in the tree structure, are thus partitioned along the branches,
and each child node gets its corresponding subset of samples.
The algorithmic procedure is as follows. An attribute is selected to partition these
samples. For each value of the attribute, a branch is created, and the
corresponding subset of samples that have the attribute value specified by the
branch is moved to the newly created child node. The algorithm is applied
recursively to each child node until all samples at a node are of one class. Every
path to the leaf in the decision tree represents a classification rule. Note that the
critical decision in such a top-down decision-tree-generation algorithm is the
choice of attribute at a node.
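The recursive procedure described above can be sketched as follows. The text does not say how the attribute is chosen at a node, so this sketch assumes an ID3-style choice (the attribute with the lowest weighted entropy after the split); the toy samples and all function names are hypothetical.

```python
from collections import Counter
from math import log2

def entropy(rows):
    counts = Counter(label for _, label in rows)
    return -sum(c / len(rows) * log2(c / len(rows)) for c in counts.values())

def partition(rows, attr):
    # One branch (subset of samples) per value of the attribute.
    groups = {}
    for features, label in rows:
        groups.setdefault(features[attr], []).append((features, label))
    return groups

def best_attribute(rows, attributes):
    # Assumed criterion: minimize the weighted entropy of the child nodes.
    def remainder(attr):
        return sum(len(g) / len(rows) * entropy(g) for g in partition(rows, attr).values())
    return min(attributes, key=remainder)

def build_tree(rows, attributes):
    labels = [label for _, label in rows]
    # Stop when all samples at the node are of one class (or nothing is left to test).
    if len(set(labels)) == 1 or not attributes:
        return Counter(labels).most_common(1)[0][0]  # leaf
    attr = best_attribute(rows, attributes)
    remaining = [a for a in attributes if a != attr]
    return (attr, {value: build_tree(subset, remaining)
                   for value, subset in partition(rows, attr).items()})

def classify(tree, features):
    # Follow one root-to-leaf path, i.e. one classification rule.
    while isinstance(tree, tuple):
        attr, branches = tree
        tree = branches[features[attr]]
    return tree

# Hypothetical categorical samples: (features, class).
rows = [
    ({"X": "low", "Y": "A"}, "Class1"),
    ({"X": "low", "Y": "B"}, "Class1"),
    ({"X": "high", "Y": "A"}, "Class1"),
    ({"X": "high", "Y": "B"}, "Class2"),
    ({"X": "high", "Y": "B"}, "Class2"),
]
tree = build_tree(rows, ["X", "Y"])
print(tree)
print(classify(tree, {"X": "high", "Y": "B"}))
```

The sketch handles only categorical attributes and attribute values seen during training; numeric thresholds and pruning are left out.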
1. Deciding not to divide a set of samples any further under some conditions. The
stopping criterion is usually based on some statistical tests, such as the χ² test: If
there are no significant differences in classification accuracy before and after
division, then represent a current node as a leaf. The decision is made in advance,
before splitting, and therefore this approach is called prepruning.
Not good for Regression: Logistic regression is a statistical analysis approach that
uses independent features to try to predict precise probability outcomes. On high-
dimensional datasets, this may cause the model to be over-fit on the training set,
overstating the accuracy of predictions on the training set, and so preventing the
model from accurately predicting results on the test set.
Expensive: The cost of creating a decision tree is high since each node requires
field sorting. In other algorithms, a mixture of several fields is used at the same
time, resulting in even higher expenses. Pruning methods are also expensive due
to the large number of candidate subtrees that must be produced and compared.
Greedy Approach: To form a binary tree, the input space must be partitioned
correctly. The greedy algorithm used for this is recursive binary splitting. It is a
numerical procedure that entails the alignment of various values. Data will be split
according to the first best split, and only that path will be used to split the data.
However, various pathways of the split could be more instructive; thus, that split
may not be the best.
Q.5 Comment with suitable explanation on “ANN offers useful properties and
capabilities in Machine Learning Process”
Ans It is apparent that an ANN derives its computing power through, first, its massive
parallel distributed structure and, second, its ability to learn and therefore to
generalize. Generalization refers to the ANN producing reasonable outputs for
new inputs not encountered during a learning process. The use of ANNs offers
several useful properties and capabilities
Nonlinearity: An artificial neuron as a basic unit can be a linear or nonlinear
processing element, but the entire ANN is highly nonlinear. It is a special kind of
nonlinearity in the sense that it is distributed throughout the network. This
characteristic is especially important, for ANN models the inherently nonlinear
real-world mechanisms responsible for generating data for learning.
Learning from examples: An ANN modifies its interconnection weights by applying
a set of training or learning samples. The final effects of a learning process are
tuned parameters of a network (the parameters are distributed through the main
components of the established model), and they represent implicitly stored
knowledge for the problem at hand.
Adaptivity: An ANN has a built-in capability to adapt its interconnection weights
to changes in the surrounding environment. In particular, an ANN trained to
operate in a specific environment can be easily retrained to deal with changes in
its environmental conditions. Moreover, when it is operating in a nonstationary
environment, an ANN can be designed to adapt its parameters in real time.
Evidential response: In the context of data classification, an ANN can be designed
to provide information not only about which particular class to select for a given
sample but also about confidence in the decision made. This latter information may
be used to reject ambiguous data, should they arise, and therefore improve the
classification performance or performances of the other tasks modeled by the
network.
Fault tolerance: An ANN has the potential to be inherently fault tolerant or
capable of robust computation. Its performances do not degrade significantly
under adverse operating conditions such as disconnection of neurons and noisy or
missing data. There is some empirical evidence for robust computation, but usually
it is uncontrolled.
Uniformity of analysis and design: Basically, ANNs enjoy universality as
information processors. The same principles, notation, and the same steps in
methodology are used in all domains involving application of ANNs.
entire network, and their organization and interconnections. Neural networks are
generally classified into two categories on the basis of the type of
interconnections: feedforward and recurrent.
The network is feedforward if the processing propagates from the input side to the
output side unidirectionally, without any loops or feedbacks. In a layered
representation of the feedforward neural network, there are no links between
nodes in the same layer; outputs of nodes in a specific layer are always connected
as inputs to nodes in succeeding layers. This representation is preferred because
of its modularity, i.e., nodes in the same layer have the same functionality or
generate the same level of abstraction about input vectors. If there is a feedback
link that forms a circular path in a network (usually with a delay element as a
synchronization component), then the network is recurrent. Examples of ANNs
belonging to both classes are given in Figure below
Q.7 What is pattern recognition? What types of pattern recognition algorithms are
used in machine learning?
Ans Pattern Recognition is defined as the process of identifying the trends (global or
local) in the given pattern. A pattern can be defined as anything that follows a
trend and exhibits some kind of regularity. The recognition of patterns can be done
physically, mathematically, or by the use of algorithms. When we talk about
pattern recognition in machine learning, it indicates the use of powerful
algorithms for identifying the regularities in the given data. Pattern recognition is
widely used in the new age technical domains like computer vision, speech
recognition, face recognition, etc.
Types of Pattern Recognition Algorithms in Machine Learning
1. Supervised Algorithms
Pattern recognition using a supervised approach is called classification. These
algorithms use a two-stage methodology for identifying the patterns. The first
stage is the development/construction of the model, and the second stage involves
the prediction for new or unseen objects. The key features involving this concept
are listed below.
• Partition the given data into two sets- Training and Test set
• Train the model using a suitable machine learning algorithm such as SVM
(Support Vector Machines), decision trees, random forest, etc.
• Training is the process through which the model learns or recognizes the
patterns in the given data for making suitable predictions.
• The test set contains samples whose correct outputs are already known.
• It is used for validating the predictions made by the model trained on the
training set.
• The model is trained on the training set and tested on the test set.
• The performance of the model is evaluated based on correct predictions
made.
• The trained and tested model developed for recognizing patterns using
machine learning algorithms is called a classifier.
• This classifier is used to make predictions for unseen data/objects.
2. Unsupervised Algorithms
In contrast to supervised pattern-recognition algorithms, which make use of training and
test sets, these algorithms use a group-by approach. They observe the patterns
in the data and group them based on the similarity in their features such as
dimension to make a prediction. Let’s say that we have a basket of different kinds
of fruits such as apples, oranges, pears, and cherries. We assume that we do not
know the names of the fruits. We keep the data as unlabeled. Now, suppose we
encounter a situation where someone comes and tells us to identify a new fruit
that was added to the basket. In such a case we make use of a concept called
clustering.
• Clustering combines or groups items having the same features.
• No previous knowledge is available for identifying a new item.
• They use machine learning algorithms like hierarchical and k-means
clustering.
• Based on the features or properties of the new object, it is assigned to a
group to make a prediction.
Module-4 Questions
Q.1 What is Market Basket Analysis? In what way does it help retailers? What are
the essential steps for implementing MBA?
Ans Frequent itemset mining leads to the discovery of associations and correlations
between items in huge transactional or relational datasets. The disclosure of
“Correlation Relationships” among huge amounts of transaction records can help
in many decision-making processes.
A popular example of frequent itemset mining is Market Basket Analysis. This
process identifies customer buying habits by finding associations between the
different items that customers place in their “shopping baskets”. The discovery of
this kind of association will be helpful for retailers or marketers to develop
marketing strategies by gaining insight into which items are frequently bought
together by customers.
For example, if customers are buying milk, how likely are they to also buy bread
(and which kind of bread) on the same trip to the supermarket?
There are many advantages to implementing Market Basket Analysis in marketing.
Market Basket Analysis (MBA) can be applied to customer data from point of
sale (PoS) systems.
It helps retailers by:
• Increasing customer engagement
• Boosting sales and increasing RoI
• Improving customer experience
• Optimizing marketing strategies and campaigns
• Helping them understand customers better
• Identifying customer behavior and patterns
The essential steps for implementing MBA are:
• First, define the minimum support and confidence for the association rule.
• Find out all the subsets in the transactions with higher support(sup) than
the minimum support.
• Find all the rules for these subsets with higher confidence than minimum
confidence.
• Sort these association rules in decreasing order.
• Analyze the rules along with their confidence and support.
Q.2 State basic definitions and Rule evaluation metrics of “Association Rule Mining”
Ans For the sake of better understanding and explanation, consider the simple
transaction database below:
Transaction-ID   Items
1                Bread, Milk
2                Bread, Diaper, Beer, Eggs
3                Milk, Diaper, Beer, Coke
4                Bread, Milk, Diaper, Beer
5                Bread, Milk, Diaper, Coke
Support(s): The number of transactions that include all items from both the {X}
and {Y} parts of the rule, expressed as a percentage of the total number of
transactions. It shows how frequently the group of items occurs together.
Confidence(c): The ratio of the number of transactions that include all of the
items in both {X} and {Y} to the number of transactions that include the items
in {X}.
Lift(l): The lift of the rule X=>Y is the confidence of the rule divided by the
expected confidence. The expected confidence is the confidence the rule would
have if the itemsets X and Y were independent of one another, which is simply
the frequency (support) of {Y}.
Lift(X=>Y) = Conf(X=>Y) ÷ Supp(Y): Lift values near 1 indicate that X and Y
appear together about as often as expected under independence. Lift values
greater than 1 indicate that they appear together more than expected, and lift
values less than 1 indicate that they appear together less than expected.
Greater lift values indicate a stronger association.
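The three metrics can be worked through on the five-transaction database given earlier in this answer. The sketch below is illustrative only: the rule {Milk, Diaper} => {Beer} is a hypothetical choice, "Diapers" is treated as the same item as "Diaper", and the function names are not from the source.

```python
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def support(itemset):
    # Fraction of transactions that contain every item in the itemset.
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(x, y):
    # Of the transactions containing X, the fraction that also contain Y.
    return support(x | y) / support(x)

def lift(x, y):
    # Confidence relative to what independence of X and Y would predict.
    return confidence(x, y) / support(y)

x, y = {"Milk", "Diaper"}, {"Beer"}
print(support(x | y), confidence(x, y), lift(x, y))
```

Here the rule's support is 2/5, its confidence is 2/3, and its lift is about 1.11, so Beer shows up slightly more often with {Milk, Diaper} than independence would suggest.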
(I) Create a table containing the support count of each item present in the
dataset. This is called C1 (the candidate set).
(II) Compare each candidate item's support count with the minimum support count
(here min_support = 2). If the support count of a candidate item is less than
min_support, remove that item. This gives us itemset L1.
Step-2: K=2
Generate candidate set C2 using L1 (this is called the join step). The condition
for joining Lk-1 with Lk-1 is that the itemsets should have (K-2) elements in
common.
Check whether all subsets of each itemset are frequent; if not, remove that
itemset. (For example, the subsets of {I1, I2} are {I1} and {I2}, which are
frequent. Check this for each itemset.)
Now find the support count of these itemsets by searching the dataset.
(II) Compare the candidate (C2) support counts with the minimum support count
(here min_support = 2). If the support count of a candidate itemset is less
than min_support, remove that itemset. This gives us itemset L2.
Step-3:
Generate candidate set C3 using L2 (join step). The condition for joining Lk-1
with Lk-1 is that the itemsets should have (K-2) elements in common. So here,
for L2, the first element should match.
So the itemsets generated by joining L2 are {I1, I2, I3}, {I1, I2, I5},
{I1, I3, I5}, {I2, I3, I4}, {I2, I4, I5}, {I2, I3, I5}.
Check whether all subsets of these itemsets are frequent; if not, remove that
itemset. (Here the subsets of {I1, I2, I3} are {I1, I2}, {I2, I3}, {I1, I3},
which are frequent. For {I2, I3, I4}, the subset {I3, I4} is not frequent, so
remove it. Check every itemset similarly.)
Find the support count of the remaining itemsets by searching the dataset.
(II) Compare the candidate (C3) support counts with the minimum support count
(here min_support = 2). If the support count of a candidate itemset is less
than min_support, remove that itemset. This gives us itemset L3.
Step-4:
Generate candidate set C4 using L3 (join step). The condition for joining Lk-1
with Lk-1 (K=4) is that the itemsets should have (K-2) elements in common. So
here, for L3, the first 2 elements (items) should match.
Check whether all subsets of these itemsets are frequent. (Here the itemset
formed by joining L3 is {I1, I2, I3, I5}; its subsets include {I1, I3, I5},
which is not frequent.) So there is no itemset in C4.
We stop here because no further frequent itemsets are found.
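The level-wise join, prune, and count loop can be sketched in Python. The transaction table for this example is not reproduced in this section, so the nine transactions below are an assumption, chosen because they reproduce the support counts quoted in this answer (for example, 6 for I1, 7 for I2, and 2 for {I1, I2, I3}). The join is simplified to "union of two frequent (K-1)-itemsets that yields a K-itemset", which may propose extra candidates that the prune step then discards.

```python
from itertools import combinations

# Assumed dataset (not given in this section); min_support = 2 as in the text.
transactions = [
    {"I1", "I2", "I5"}, {"I2", "I4"}, {"I2", "I3"},
    {"I1", "I2", "I4"}, {"I1", "I3"}, {"I2", "I3"},
    {"I1", "I3"}, {"I1", "I2", "I3", "I5"}, {"I1", "I2", "I3"},
]
MIN_SUPPORT = 2

def support_count(itemset):
    return sum(itemset <= t for t in transactions)

def apriori():
    items = sorted({i for t in transactions for i in t})
    # L1: frequent single items.
    level = [frozenset([i]) for i in items
             if support_count(frozenset([i])) >= MIN_SUPPORT]
    frequent, k = {}, 1
    while level:
        for s in level:
            frequent[s] = support_count(s)
        k += 1
        prev = set(level)
        # Join step: two (k-1)-itemsets sharing k-2 items give a k-itemset.
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must be frequent.
        candidates = {c for c in candidates
                      if all(frozenset(s) in prev for s in combinations(c, k - 1))}
        # Count step: keep candidates that meet the minimum support.
        level = [c for c in candidates if support_count(c) >= MIN_SUPPORT]
    return frequent

frequent = apriori()
for itemset, count in sorted(frequent.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), count)
```

On this assumed dataset the loop stops at K = 4, because the only candidate, {I1, I2, I3, I5}, contains the infrequent subset {I1, I3, I5}.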
Thus, we have discovered all the frequent item-sets. Now generation of strong
association rule comes into picture. For that we need to calculate confidence of
each rule.
Confidence –
A confidence of 50% means that 50% of the customers who purchased milk and
bread also bought butter.
Confidence(A->B)=Support_count(A∪B)/Support_count(A)
So here, by taking an example of any frequent itemset, we will show the rule
generation.
So the rules can be
[I1^I2]=>[I3] //confidence = sup(I1^I2^I3)/sup(I1^I2) = 2/4*100 = 50%
[I1^I3]=>[I2] //confidence = sup(I1^I2^I3)/sup(I1^I3) = 2/4*100 = 50%
[I2^I3]=>[I1] //confidence = sup(I1^I2^I3)/sup(I2^I3) = 2/4*100 = 50%
[I1]=>[I2^I3] //confidence = sup(I1^I2^I3)/sup(I1) = 2/6*100 = 33%
[I2]=>[I1^I3] //confidence = sup(I1^I2^I3)/sup(I2) = 2/7*100 = 29%
[I3]=>[I1^I2] //confidence = sup(I1^I2^I3)/sup(I3) = 2/6*100 = 33%
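The six confidences above can be reproduced directly from the support counts quoted in the rules. In this sketch the support counts are copied from those rules, while the 50% minimum-confidence threshold and the function name are assumptions made for illustration.

```python
from itertools import combinations

# Support counts as quoted in the rules above.
support_count = {
    frozenset({"I1"}): 6, frozenset({"I2"}): 7, frozenset({"I3"}): 6,
    frozenset({"I1", "I2"}): 4, frozenset({"I1", "I3"}): 4,
    frozenset({"I2", "I3"}): 4, frozenset({"I1", "I2", "I3"}): 2,
}

def rules(itemset, min_conf):
    itemset = frozenset(itemset)
    found = []
    # Every non-empty proper subset can serve as the antecedent A of A => (itemset - A).
    for r in range(1, len(itemset)):
        for antecedent in combinations(sorted(itemset), r):
            a = frozenset(antecedent)
            conf = support_count[itemset] / support_count[a]
            if conf >= min_conf:
                found.append((sorted(a), sorted(itemset - a), conf))
    return found

# Assumed threshold: keep rules with confidence of at least 50%.
for a, b, conf in rules({"I1", "I2", "I3"}, min_conf=0.5):
    print(a, "=>", b, f"{conf:.0%}")
```

With a 50% threshold only the three rules with two-item antecedents survive; lowering the threshold to 0 lists all six.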
Q.4 Compare the Apriori and FP Growth algorithms
Ans
Apriori: It is an array-based algorithm.
FP Growth: It is a tree-based algorithm.

Apriori: It uses join and prune techniques.
FP Growth: It constructs a conditional frequent pattern tree from the database
that satisfies minimum support.

Apriori: It uses a breadth-first search.
FP Growth: It uses a depth-first search.

Apriori: It uses a level-wise approach, generating patterns containing 1 item,
then 2 items, and so on.
FP Growth: It uses a pattern-growth approach, meaning that it only considers
patterns actually existing in the database.

Apriori: Candidate generation is extremely slow. Runtime increases
exponentially with the number of different items.
FP Growth: Runtime increases linearly with the number of transactions and
items.

Apriori: Candidate generation is highly parallelizable.
FP Growth: Data are very interdependent; each node needs the root.

Apriori: It requires large memory space because of the large number of
candidates generated.
FP Growth: It requires less memory space because of its compact structure and
the absence of candidate generation.

Apriori: It scans the database multiple times to generate candidate sets.
FP Growth: It scans the dataset only twice to construct the frequent pattern
tree.
Using this pattern-growth strategy, FP-Growth reduces the search costs by recursively
looking for short patterns and then concatenating them into long frequent patterns.
Advantages of FP Growth Algorithm
• This algorithm needs to scan the database only twice, whereas Apriori scans the
transactions in every iteration.
• No pairing of items (candidate generation) is done in this algorithm, which makes
it faster.
• The database is stored in a compact form in memory.
• It is efficient and scalable for mining both long and short frequent patterns.
Q.6 What are different types of Association rules in data mining? Briefly mention
about the algorithms used for Association Rule mining
Ans There are typically three different types of association rules in data mining. They
are
• Multi-relational association rules
• Generalized Association rule
• Quantitative Association Rules
Apriori Algorithm
Apriori algorithm identifies the frequent individual items in a given database and
then expands them to larger item sets, keeping in check that the item sets appear
sufficiently often in the database.
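A minimal level-wise sketch of this idea in Python is given below. The transactions and the min_support value are hypothetical and serve only as an illustration; the function name apriori is likewise an assumption, not a library call.

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Return {frequent itemset: support count}, growing itemsets one item at a time."""
    transactions = [frozenset(t) for t in transactions]

    def count(candidate):
        return sum(1 for t in transactions if candidate <= t)

    items = {item for t in transactions for item in t}
    level = {frozenset([i]) for i in items if count(frozenset([i])) >= min_support}
    frequent = {}
    k = 1
    while level:
        for itemset in level:
            frequent[itemset] = count(itemset)
        k += 1
        # Join: unions of two frequent (k-1)-itemsets that contain exactly k items
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        # Prune by the Apriori property, then check the support in the data
        level = {c for c in candidates
                 if all(frozenset(s) in frequent for s in combinations(c, k - 1))
                 and count(c) >= min_support}
    return frequent

# Hypothetical market-basket data
baskets = [["milk", "bread", "butter"], ["milk", "bread"],
           ["bread", "butter"], ["milk", "bread", "butter"]]
frequent_itemsets = apriori(baskets, min_support=2)
for itemset, support in frequent_itemsets.items():
    print(sorted(itemset), support)
```

Each pass over `level` corresponds to one expansion step: only itemsets whose subsets were all frequent at the previous level are counted against the database.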
Eclat Algorithm
The ECLAT algorithm, short for Equivalence Class Clustering and bottom-up Lattice
Traversal, is another widely used method for association rule mining. Some even
consider it to be a better and more efficient version of the Apriori algorithm.
FP-growth Algorithm
Also known as the frequent pattern growth algorithm, this algorithm is particularly
useful for finding frequent patterns without the need for candidate generation. It
mainly operates in two stages, namely FP-tree construction and extraction of the
frequent item sets.
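The first stage, FP-tree construction, can be sketched as follows. This is only an illustrative sketch of the two database scans: the class FPNode, the function build_fp_tree, and the sample transactions are all hypothetical, and the second stage (mining the tree through conditional pattern bases) is not shown.

```python
from collections import Counter

class FPNode:
    """One node of the FP-tree: an item, its count, and its child nodes."""
    def __init__(self, item, parent=None):
        self.item = item
        self.count = 0
        self.parent = parent
        self.children = {}

def build_fp_tree(transactions, min_support):
    # Scan 1: count the support of every single item
    counts = Counter(item for t in transactions for item in set(t))
    frequent = {i: c for i, c in counts.items() if c >= min_support}

    root = FPNode(None)
    # Scan 2: insert each transaction, keeping only frequent items,
    # ordered by descending support (ties broken alphabetically)
    for t in transactions:
        items = sorted((i for i in set(t) if i in frequent),
                       key=lambda i: (-frequent[i], i))
        node = root
        for item in items:
            if item not in node.children:
                node.children[item] = FPNode(item, node)
            node = node.children[item]
            node.count += 1
    return root, frequent

# Hypothetical transactions
transactions = [["a", "b"], ["b", "c"], ["a", "b", "c"], ["b"]]
root, frequent = build_fp_tree(transactions, min_support=2)
b = root.children["b"]
print(b.count, b.children["a"].count, b.children["c"].count)
```

Transactions that share a prefix of frequent items share a path in the tree, which is why the structure stays compact.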
Module-5 Questions
Q.1 Define Web mining.
Web mining is decomposed into several major subtasks. What are they? Explain
in brief.
Ans
Web mining may be defined as the use of data-mining techniques to automatically
discover and extract information from Web documents and services. It refers to
the overall process of discovery, not just to the application of standard data-mining
tools.
Representation | Bag of words, n-gram terms, phrases, concepts or ontology, relational | Edge-labeled graph, relational | Graph | Relational table, graph
Application categories | Categorization, clustering, finding extraction rules, finding patterns in text | Finding frequent substructures, Web site schema discovery | Categorization, clustering | Site construction, adaptation and management
Authority and hub values are defined in terms of one another in a mutual
recursion. An authority value is computed as the sum of the scaled hub values that
point to that page. A hub value is the sum of the scaled authority values of the
pages it points to. Some implementations also consider the relevance of the linked
pages.
The algorithm performs a series of iterations, each consisting of two basic steps:
• Start with each node having a hub score and authority score of 1.
• Run the authority update rule
• Run the hub update rule
• Normalize the values by dividing each Hub score by square root of the sum
of the squares of all Hub scores, and dividing each Authority score by
square root of the sum of the squares of all Authority scores.
• Repeat from the second step as necessary.
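The iteration described above can be sketched in Python as follows. The graph, the function name hits, and the fixed number of iterations are assumptions made for this illustration; the optional relevance weighting mentioned earlier is not modeled.

```python
import math

def hits(graph, iterations=20):
    """graph maps each page to the list of pages it links to.
    Returns (hub, authority) score dictionaries."""
    nodes = set(graph) | {t for targets in graph.values() for t in targets}
    hub = {n: 1.0 for n in nodes}    # start with every score equal to 1
    auth = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        # Authority update: sum of the hub scores of the pages pointing to the page
        auth = {n: 0.0 for n in nodes}
        for src, targets in graph.items():
            for t in targets:
                auth[t] += hub[src]
        # Hub update: sum of the authority scores of the pages it points to
        hub = {n: sum(auth[t] for t in graph.get(n, ())) for n in nodes}
        # Normalize by the square root of the sum of squares
        a_norm = math.sqrt(sum(v * v for v in auth.values())) or 1.0
        h_norm = math.sqrt(sum(v * v for v in hub.values())) or 1.0
        auth = {n: v / a_norm for n, v in auth.items()}
        hub = {n: v / h_norm for n, v in hub.items()}
    return hub, auth

# Hypothetical link graph: A and B both link to C
hub, auth = hits({"A": ["C"], "B": ["C"], "C": []})
print(hub)
print(auth)
```

In this small graph, C is the only authority and A and B are equally good hubs, so the scores stop changing after the first iteration.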
i.e., the PageRank value for a page u depends on the PageRank value of each page v
contained in the set Bu (the set containing all pages linking to page u), divided
by the number L(v) of links from page v. The algorithm involves a damping factor
for the calculation of the PageRank. It works like an income tax that the
government deducts from a person even though it is the one paying that person.
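A minimal iterative sketch of this computation in Python is shown below. It assumes the commonly used damped form PR(u) = (1 - d)/N + d * sum over v in Bu of PR(v)/L(v), with N the number of pages; the damping value d = 0.85, the even redistribution of rank from pages without outgoing links, the function name pagerank, and the sample graph are all assumptions for illustration.

```python
def pagerank(links, damping=0.85, iterations=50):
    """links maps each page v to the list of pages it links to."""
    pages = set(links) | {u for targets in links.values() for u in targets}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page receives the undamped base share (1 - d) / N
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for v in pages:
            targets = links.get(v, [])
            if targets:
                # Page v passes PR(v) / L(v) to each page it links to
                share = damping * rank[v] / len(targets)
                for u in targets:
                    new_rank[u] += share
            else:
                # Assumption: a page with no outgoing links spreads its rank evenly
                for u in pages:
                    new_rank[u] += damping * rank[v] / n
        rank = new_rank
    return rank

# Hypothetical link graph: A and B link to C, and C links back to A
ranks = pagerank({"A": ["C"], "B": ["C"], "C": ["A"]})
print(ranks)
```

Page B has no incoming links, so it keeps only the base share, while A and C exchange most of the rank between them.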
Text is one of the most common data types within databases. Depending on the
database, this data can be organized as:
Structured data: This data is standardized into a tabular format with numerous
rows and columns, making it easier to store and process for analysis and machine
learning algorithms. Structured data can include inputs such as names, addresses,
and phone numbers.
Unstructured data: This data does not have a predefined data format. It can
include text from sources like social media or product reviews, or rich media
formats like video and audio files.
Since roughly 80% of data in the world resides in an unstructured format,
text mining is an extremely valuable practice within
organizations. Text mining tools and natural language processing (NLP)
techniques, like information extraction, allow us to transform unstructured
documents into a structured format to enable analysis and the generation of high-
quality insights. This, in turn, improves the decision-making of organizations,
leading to better business outcomes.