DPM 8
Muchake Brian
Phone: 0701178573
Email: bmuchake@gmail.com, bmuchake@cis.mak.ac.ug,
• Now, we need to classify a new data point, shown as a black dot at (60, 60), into either the blue or the red class. We assume K = 3, i.e. the algorithm finds the three nearest data points, as shown in the next diagram −
Data Mining Classification Algorithms [Cont’d]
• We can see in the diagram the three nearest neighbors of the new data point. Two of the three lie in the red class, so the black dot is also assigned to the red class.
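• Below is a minimal Python sketch of the K = 3 example above, assuming scikit-learn is available; the coordinates of the red and blue training points are made up for illustration.

    from sklearn.neighbors import KNeighborsClassifier

    # Hypothetical training points: two classes, "red" and "blue"
    X_train = [[55, 65], [58, 70], [62, 55], [40, 40], [45, 35], [35, 45]]
    y_train = ["red", "red", "blue", "blue", "red", "blue"]

    knn = KNeighborsClassifier(n_neighbors=3)   # K = 3
    knn.fit(X_train, y_train)

    # Classify the new point at (60, 60): the majority class among its
    # three nearest neighbours decides the label (here, "red").
    print(knn.predict([[60, 60]]))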
Data Mining Classification Algorithms [Cont’d]
Pros
• It is a very simple algorithm to understand and interpret.
• It is very useful for nonlinear data because the algorithm makes no assumptions about the data.
• It is a versatile algorithm, as it can be used for classification as well as regression.
• It has relatively high accuracy, but there are much better supervised learning models than KNN.
Data Mining Classification Algorithms [Cont’d]
Cons
• It is a computationally somewhat expensive algorithm because it stores all of the training data.
• It requires more memory than other supervised learning algorithms.
• Prediction is slow when the number of training examples is large.
• It is very sensitive to the scale of the data as well as to irrelevant features.
Data Mining Classification Algorithms [Cont’d]
Applications of KNN
The following are some of the areas in which KNN can be applied successfully −
• Banking System
• KNN can be used in a banking system to predict whether an individual is fit for loan approval, i.e. whether that individual has characteristics similar to those of defaulters.
• Calculating Credit Ratings
• KNN algorithms can be used to find an individual’s credit rating by comparing the individual with persons having similar traits.
• Politics
• With the help of KNN algorithms, we can classify a potential voter into classes like “Will Vote”, “Will Not Vote”, “Will Vote for Party ‘Congress’”, “Will Vote for Party ‘BJP’”.
• Other areas in which the KNN algorithm can be used are speech recognition, handwriting detection, image recognition and video recognition.
Data Mining Classification Algorithms [Cont’d]
3. Support-vector machine (SVM)
•In machine learning, support-vector machines (SVMs, also support-vector networks) are supervised learning models with associated learning algorithms that analyze data for classification and regression analysis.
•A Support Vector Machine (SVM) performs classification by finding the hyperplane that
maximizes the margin between the two classes. The vectors (cases) that define the hyperplane
are the support vectors.
•A Support Vector Machine (SVM) is a discriminative classifier formally defined by a separating hyperplane. In other words, given labeled training data (supervised learning), the algorithm outputs an optimal hyperplane which categorizes new examples. In two-dimensional space this hyperplane is a line dividing the plane into two parts, with each class lying on either side.
Data Mining Classification Algorithms [Cont’d]
• SVM was first introduced by Vapnik and has been a very effective method for regression, classification and general pattern recognition.
• It is considered a good classifier because of its high generalization performance without the need to add a priori knowledge, even when the dimension of the input space is very high.
• The aim of SVM is to find the best classification function to distinguish between members of the two classes in the training data.
• The metric for the concept of the “best” classification function can be realized geometrically.
• A support vector machine (SVM) learns a hyperplane to classify data into 2 classes. At a high level, SVM performs a task similar to C4.5, except that SVM does not use decision trees at all.
• A hyperplane is a function, like the equation for a line.
• In fact, for a simple classification task with just 2 features, the hyperplane can be a line.
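• The following is a minimal sketch of a linear SVM on two features, assuming scikit-learn and a toy two-class dataset (not from the slides); with only 2 features the separating hyperplane is just a line.

    import numpy as np
    from sklearn.svm import SVC

    # Toy data: two well-separated classes
    X = np.array([[1, 2], [2, 3], [3, 3], [6, 5], [7, 8], [8, 8]])
    y = np.array([0, 0, 0, 1, 1, 1])

    clf = SVC(kernel="linear")       # finds the maximum-margin hyperplane
    clf.fit(X, y)

    print(clf.support_vectors_)      # the cases (vectors) that define the hyperplane
    print(clf.predict([[5, 5]]))     # classify a new example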
Data Mining Classification Algorithms [Cont’d]
4. Bayesian Networks
•A Bayesian network (BN) consists of a directed, acyclic graph and a probability distribution for each
node in that graph given its immediate predecessors.
•A Bayes network classifier is based on a Bayesian network which represents a joint probability distribution over a set of categorical attributes.
•It consists of two parts: the directed acyclic graph G, consisting of nodes and arcs, and the conditional probability tables.
•The nodes represent attributes, whereas the arcs indicate direct dependencies.
•The density of the arcs in a BN is one measure of its complexity. Sparse BNs can represent simple
probabilistic models (e.g., naïve Bayes models and hidden Markov models), whereas dense BNs can
capture highly complex models. Thus, BNs provide a flexible method for probabilistic modelling.
Data Mining Classification Algorithms [Cont’d]
• Bayesian classification is based on Bayes' Theorem. Bayesian classifiers are statistical classifiers. Bayesian classifiers can predict class membership probabilities, such as the probability that a given tuple belongs to a particular class.
• Naive Bayes classifiers are a collection of classification algorithms based on Bayes’ Theorem. It is not a single algorithm but a family of algorithms that share a common principle: every pair of features being classified is assumed to be independent of each other given the class.
• This is one of the easiest algorithms to apply: it is easy to construct, needs no complicated iterative parameter estimation schemes, and can be applied to huge data sets. Because of this simplicity, even unskilled users can understand why the classifications are made.
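• A minimal sketch of a naive Bayes classifier, assuming scikit-learn; the age/income tuples and the yes/no labels are invented purely for illustration.

    from sklearn.naive_bayes import GaussianNB

    # Toy tuples: [age, income], labelled "yes"/"no"
    X = [[25, 40000], [35, 60000], [45, 80000], [20, 20000], [50, 90000], [30, 30000]]
    y = ["no", "yes", "yes", "no", "yes", "no"]

    nb = GaussianNB()                # features treated as independent given the class
    nb.fit(X, y)

    # Class-membership probabilities and the predicted class for a new tuple
    print(nb.predict_proba([[40, 70000]]))
    print(nb.predict([[40, 70000]]))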
Data Mining Classification Algorithms [Cont’d]
Models:
• Pruned Naive Bayes (Naive Bayes Build)
• Simplified decision tree (Single Feature Build)
• Boosted (Multi Feature Build)
The advantages of Bayesian Networks:
• Visually represent all the relationships between the variables
• Easy to recognize the dependence and independence between nodes.
• Can handle incomplete data
• scenarios where it is not practical to measure all variables (costs, not enough sensors, etc.)
• Help to model noisy systems.
• Can be used for any system model - from all known parameters to no known parameters.
Data Mining Classification Algorithms [Cont’d]
The limitations of Bayesian Networks:
• All branches must be calculated in order to calculate the probability of any one branch.
• The quality of the results of the network depends on the quality of the prior beliefs or
model.
• Calculation can be NP-hard
• Calculations and probabilities using Bayes' rule and marginalization can become complex and subtle, and care must be taken to calculate them properly.
Data Mining Classification Algorithms [Cont’d]
5.Neural Networks
•Artificial neural networks mimic the pattern-finding capacity of the human brain, and hence some researchers have suggested applying neural network algorithms to pattern-mapping. Neural networks have been applied successfully in a few applications that involve classification.
•An artificial neural network (ANN), often just called a "neural network" (NN), is a mathematical or computational model based on biological neural networks; in other words, it is an emulation of a biological neural system.
•It consists of an interconnected group of artificial neurons and processes information using a
connectionist approach to computation. In most cases an ANN is an adaptive system that changes its
structure based on external or internal information that flows through the network during the learning
phase.
Data Mining Classification Algorithms [Cont’d]
• Neural networks are used to model complex relationships between inputs and outputs or to find patterns in data.
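• A minimal sketch of a small feed-forward neural network classifier, assuming scikit-learn's MLPClassifier and a synthetic dataset (the slides name no specific library or architecture).

    from sklearn.neural_network import MLPClassifier
    from sklearn.datasets import make_classification

    # Synthetic two-class data with 4 input features
    X, y = make_classification(n_samples=200, n_features=4, random_state=0)

    # One hidden layer of 8 neurons; the weights adapt during the learning phase
    nn = MLPClassifier(hidden_layer_sizes=(8,), max_iter=1000, random_state=0)
    nn.fit(X, y)

    print(nn.score(X, y))            # accuracy on the training data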
6. Logistic Regression
• Definition: Logistic regression is a machine learning algorithm for classification. In this algorithm, the
probabilities describing the possible outcomes of a single trial are modelled using a logistic function.
• Advantages: Logistic regression is designed for this purpose (classification), and is most useful for
understanding the influence of several independent variables on a single outcome variable.
• Disadvantages: Works only when the predicted variable is binary, assumes all predictors are
independent of each other, and assumes data is free of missing values.
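• A minimal sketch of logistic regression for a binary outcome, assuming scikit-learn and synthetic data; the fitted coefficients indicate the influence of each independent variable.

    from sklearn.linear_model import LogisticRegression
    from sklearn.datasets import make_classification

    X, y = make_classification(n_samples=200, n_features=3, n_informative=3,
                               n_redundant=0, random_state=1)

    logreg = LogisticRegression()
    logreg.fit(X, y)

    print(logreg.coef_)                  # influence of each predictor on the outcome
    print(logreg.predict_proba(X[:3]))   # modelled probabilities for the first 3 rows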
Data Mining Classification Algorithms [Cont’d]
7. Stochastic Gradient Descent
•Definition: Stochastic gradient descent is a simple and very efficient approach to fit linear
models. It is particularly useful when the number of samples is very large. It supports different
loss functions and penalties for classification.
•Advantages: Efficiency and ease of implementation.
•Disadvantages: Requires a number of hyper-parameters and is sensitive to feature scaling.
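• A minimal sketch of a linear classifier fitted with stochastic gradient descent, assuming scikit-learn; the features are standardized first because, as noted above, SGD is sensitive to feature scaling.

    from sklearn.datasets import make_classification
    from sklearn.linear_model import SGDClassifier
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler

    X, y = make_classification(n_samples=500, n_features=10, random_state=2)

    # hinge loss with an L2 penalty; scaling is part of the pipeline
    model = make_pipeline(StandardScaler(),
                          SGDClassifier(loss="hinge", penalty="l2"))
    model.fit(X, y)

    print(model.score(X, y))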
Data Mining Classification Algorithms [Cont’d]
8. Random Forest
•Definition: A random forest classifier is a meta-estimator that fits a number of decision trees on various sub-samples of the dataset and uses averaging to improve the predictive accuracy of the model and to control over-fitting. The sub-sample size is the same as the original input sample size, but the samples are drawn with replacement.
•Advantages: Reduction in over-fitting; a random forest classifier is more accurate than a single decision tree in most cases.
•Disadvantages: Slow real-time prediction, difficult to implement, and a complex algorithm.
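• A minimal sketch of a random forest classifier, assuming scikit-learn and synthetic data: many decision trees are fitted on bootstrap sub-samples and their predictions are combined.

    from sklearn.ensemble import RandomForestClassifier
    from sklearn.datasets import make_classification

    X, y = make_classification(n_samples=300, n_features=6, random_state=3)

    # 100 trees, each trained on a bootstrap sample drawn with replacement
    rf = RandomForestClassifier(n_estimators=100, random_state=3)
    rf.fit(X, y)

    print(rf.score(X, y))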
Data Mining Classification Algorithms [Cont’d]
9. ID3 Algorithm
•This algorithm starts with the original set S as the root node. On every iteration, it goes through every unused attribute of the set and computes the entropy of that attribute, then chooses the attribute with the smallest entropy (equivalently, the largest information gain).
•The set S is then split by the selected attribute to produce subsets of the data. The algorithm then recurses on each subset, considering only attributes never selected before. Recursion on a subset may halt in one of these cases:
• Every element in the subset belongs to the same class (+ or -); the node is then turned into a leaf and labeled with the class of the examples.
• There are no more attributes to select, but the examples still do not belong to the same class; the node is then turned into a leaf and labeled with the most common class of the examples in that subset.
Data Mining Classification Algorithms [Cont’d]
• There are no examples in the subset; this happens when no example in the parent set matches a specific value of the selected attribute.
• For example, if there was no example with marks >= 100, a leaf is created and labeled with the most common class of the examples in the parent set.
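• A minimal sketch of the entropy and information-gain calculation that ID3 uses to pick the splitting attribute; the tiny "outlook" dataset is invented for illustration.

    from collections import Counter
    from math import log2

    def entropy(labels):
        """Shannon entropy of a list of class labels."""
        total = len(labels)
        return -sum((c / total) * log2(c / total) for c in Counter(labels).values())

    def information_gain(rows, labels, attribute):
        """Expected reduction in entropy after splitting on `attribute`."""
        subsets = {}
        for row, label in zip(rows, labels):
            subsets.setdefault(row[attribute], []).append(label)
        remainder = sum(len(s) / len(labels) * entropy(s) for s in subsets.values())
        return entropy(labels) - remainder

    rows = [{"outlook": "sunny"}, {"outlook": "sunny"},
            {"outlook": "rain"}, {"outlook": "rain"}]
    labels = ["-", "-", "+", "+"]

    # 1.0 here: splitting on "outlook" separates the two classes perfectly
    print(information_gain(rows, labels, "outlook"))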
10. C4.5 Algorithm
•Classifiers are tools in data mining that take as input a collection of cases, where each case belongs to one of a small number of classes and is described by its values for a fixed set of attributes.
•The output is a classifier that can accurately predict the class to which a new case belongs. C4.5 makes use of decision trees, where the initial tree is obtained using a divide-and-conquer algorithm.
Data Mining Classification Algorithms [Cont’d]
• C4.5 is an algorithm used to generate a decision tree developed by Ross Quinlan. C4.5 is an extension of Quinlan's
earlier ID3 algorithm. The decision trees generated by C4.5 can be used for classification, and for this reason, C4.5 is
often referred to as a statistical classifier. In 2011, authors of the Weka machine learning software described the C4.5
algorithm as "a landmark decision tree program that is probably the machine learning workhorse most widely used in
practice to date".
C4.5 made a number of improvements to ID3. Some of these are:
• Handling both continuous and discrete attributes - In order to handle continuous attributes, C4.5 creates a threshold
and then splits the list into those whose attribute value is above the threshold and those that are less than or equal
to it.
• Handling training data with missing attribute values - C4.5 allows attribute values to be marked as ? for missing.
Missing attribute values are simply not used in gain and entropy calculations.
• Handling attributes with differing costs.
• Pruning trees after creation - C4.5 goes back through the tree once it's been created and attempts to remove
branches that do not help by replacing them with leaf nodes.
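• As a rough illustration of the continuous-attribute handling described above, the sketch below tries candidate thresholds on a toy numeric attribute and keeps the one whose two branches (<= threshold and > threshold) have the lowest weighted entropy; the data are invented.

    from collections import Counter
    from math import log2

    def entropy(labels):
        n = len(labels)
        return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

    values = [1.2, 2.5, 3.1, 4.8, 5.0, 6.7]        # a continuous attribute
    labels = ["no", "no", "no", "yes", "yes", "yes"]

    def weighted_entropy(threshold):
        below = [l for v, l in zip(values, labels) if v <= threshold]
        above = [l for v, l in zip(values, labels) if v > threshold]
        return (len(below) * entropy(below) + len(above) * entropy(above)) / len(labels)

    best = min(values[:-1], key=weighted_entropy)
    print("best threshold:", best)                  # 3.1 splits the classes cleanly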
Data Mining Association Rule Algorithms
• An association rule is a rule which implies certain association relationships among a set of objects (such as "occur together" or "one implies the other") in a database.
• Given a set of transactions, where each transaction is a set of literals (called items), an association rule is an expression of the form X ⇒ Y, where X and Y are sets of items. The intuitive meaning of such a rule is that transactions of the database which contain X tend to contain Y.
• An example of an association rule is: "30% of transactions that contain beer also contain diapers; 2% of all transactions contain both of these items". Here 30% is called the confidence of the rule, and 2% the support of the rule. The problem is to find all association rules that satisfy user-specified minimum support and minimum confidence constraints.
• Association Rule is one of the very important concepts of machine learning being used in market
basket analysis.
Data Mining Association Rule Algorithms [Cont’d]
• Market Basket Analysis is the study of customer transaction databases to determine dependencies between the various items they purchase at different times.
• Association rule learning is a rule-based machine learning method for discovering interesting relations between variables in large databases. It identifies frequent if-then associations, called association rules, which consist of an antecedent (if) and a consequent (then).
• For example: “If tea and milk, then sugar” (“If tea and milk are purchased, then sugar would
also be bought by the customer”)
• Antecedent: Tea and Milk
• Consequent: Sugar.
Data Mining Association Rule Algorithms [Cont’d]
There are three common metrics to measure association:
• Support is an indication of how frequently the items appear in the data. Mathematically, support is the fraction of the total number of transactions in which the item set occurs: Support(X) = (number of transactions containing X) / (total number of transactions).
• Confidence indicates how often the if-then statement is found to be true. Confidence is the conditional probability of occurrence of the consequent given the antecedent: Confidence(X ⇒ Y) = Support(X ∪ Y) / Support(X).
Data Mining Association Rule Algorithms [Cont’d]
Lift can be used to compare confidence with expected confidence. It says how likely item Y is to be purchased when item X is purchased, while controlling for how popular item Y is. Mathematically, Lift(X ⇒ Y) = Confidence(X ⇒ Y) / Support(Y).
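A minimal sketch of computing support, confidence and lift for the rule {tea, milk} ⇒ {sugar}, using a small made-up transaction list rather than real data:

    # Each transaction is a set of items
    transactions = [
        {"tea", "milk", "sugar"},
        {"tea", "milk"},
        {"tea", "milk", "sugar", "bread"},
        {"bread", "butter"},
        {"milk", "sugar"},
    ]

    def support(itemset):
        """Fraction of transactions that contain every item of the itemset."""
        return sum(itemset <= t for t in transactions) / len(transactions)

    antecedent, consequent = {"tea", "milk"}, {"sugar"}

    sup = support(antecedent | consequent)         # 0.4
    conf = sup / support(antecedent)               # 0.4 / 0.6 = 0.67
    lift = conf / support(consequent)              # 0.67 / 0.6 = 1.11
    print(sup, conf, lift)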
1. Apriori Algorithm
• Apriori uses a breadth-first search strategy to count the support of itemsets and uses a
candidate generation function which exploits the downward closure property of support.
• Apriori is an algorithm for frequent item set mining and association rule learning over
relational databases.
Data Mining Association Rule Algorithms [Cont’d]
• It proceeds by identifying the frequent individual items in the database and extending them to
larger and larger item sets as long as those item sets appear sufficiently often in the database.
• The frequent item sets determined by Apriori can be used to determine association rules which
highlight general trends in the database: this has applications in domains such as market basket
analysis.
• The Apriori algorithm was proposed by Agrawal and Srikant in 1994. Apriori is designed to
operate on databases containing transactions (for example, collections of items bought by
customers, or details of a website frequentation or IP addresses).
• Other algorithms are designed for finding association rules in data having no transactions
(Winepi and Minepi), or having no timestamps (DNA sequencing). Each transaction is seen as a
set of items (an itemset).
Data Mining Association Rule Algorithms [Cont’d]
• Given a support threshold C, the Apriori algorithm identifies the item sets which are subsets of at least C transactions in the database.
• Apriori uses a "bottom up" approach, where frequent subsets are extended one item at a time
(a step known as candidate generation), and groups of candidates are tested against the
data. The algorithm terminates when no further successful extensions are found.
• Apriori uses breadth-first search and a hash tree structure to count candidate item sets efficiently. It generates candidate item sets of length k from item sets of length k − 1.
• Then it prunes the candidates which have an infrequent sub-pattern. According to the downward closure lemma, the candidate set contains all frequent k-length item sets. After that, it scans the transaction database to determine frequent item sets among the candidates.
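• A minimal sketch of Apriori's bottom-up, breadth-first search, assuming a toy transaction list and a minimum support count of 2: frequent itemsets of one length are joined into candidates one item longer, counted against the data, and the infrequent ones are discarded.

    transactions = [{"beer", "diapers"}, {"beer", "bread"},
                    {"beer", "diapers", "milk"}, {"bread", "milk"}]
    min_count = 2

    def count(itemset):
        return sum(itemset <= t for t in transactions)

    items = sorted({i for t in transactions for i in t})
    frequent = [frozenset([i]) for i in items if count(frozenset([i])) >= min_count]
    all_frequent, k = list(frequent), 2

    while frequent:
        # candidate generation: join frequent (k-1)-itemsets into k-itemsets,
        # then keep only those that occur often enough in the data
        candidates = {a | b for a in frequent for b in frequent if len(a | b) == k}
        frequent = [c for c in candidates if count(c) >= min_count]
        all_frequent.extend(frequent)
        k += 1

    print(all_frequent)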
Data Mining Association Rule Algorithms [Cont’d]
2. ECLAT Algorithm
•The ECLAT algorithm stands for Equivalence Class Clustering and bottom-up Lattice Traversal. It is one of the popular methods of association rule mining, and a more efficient and scalable version of the Apriori algorithm.
•While the Apriori algorithm works in a horizontal sense, imitating the breadth-first search of a graph, the ECLAT algorithm works in a vertical manner, just like the depth-first search of a graph. This vertical approach makes ECLAT a faster algorithm than Apriori.
•The basic idea is to use intersections of transaction-id sets (tidsets) to compute the support value of a candidate, while avoiding the generation of subsets which do not exist in the prefix tree. In the first call of the function, all single items are used along with their tidsets. The function is then called recursively, and in each recursive call each item-tidset pair is verified and combined with other item-tidset pairs. This process continues until no candidate item-tidset pairs can be combined.
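• A minimal sketch of ECLAT's vertical layout, with invented transactions and a minimum support count of 2: each item is mapped to its tidset, and the support of a candidate pair is the size of the intersection of the two tidsets.

    transactions = {1: {"a", "b", "c"}, 2: {"a", "b"}, 3: {"a", "c"}, 4: {"b", "c"}}
    min_count = 2

    # vertical layout: item -> set of transaction ids (tidset) containing it
    tidsets = {}
    for tid, items in transactions.items():
        for item in items:
            tidsets.setdefault(item, set()).add(tid)

    # support of a 2-itemset {x, y} is the size of tidset(x) ∩ tidset(y)
    for x in sorted(tidsets):
        for y in sorted(tidsets):
            if x < y:
                common = tidsets[x] & tidsets[y]
                if len(common) >= min_count:
                    print({x, y}, "support count:", len(common))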
Data Mining Association Rule Algorithms [Cont’d]
3. FP-growth algorithm
•FP stands for frequent pattern.
•In the first pass, the algorithm counts the occurrences of items (attribute-value pairs) in the dataset of
transactions, and stores these counts in a 'header table'. In the second pass, it builds the FP-tree structure
by inserting transactions into a trie.
•Items in each transaction have to be sorted by descending order of their frequency in the dataset before
being inserted so that the tree can be processed quickly. Items in each transaction that do not meet the
minimum support requirement are discarded. If many transactions share most frequent items, the FP-tree
provides high compression close to tree root.
• Recursive processing of this compressed version of the main dataset grows frequent item sets directly, instead of generating candidate items and testing them against the entire database (as in the Apriori algorithm).
Data Mining Association Rule Algorithms [Cont’d]
• In data mining, the task of finding frequent patterns in large databases is very important and has been studied on a large scale in the past few years. Unfortunately, this task is computationally expensive, especially when a large number of patterns exist.
• The FP-Growth Algorithm, proposed by Han et al., is an efficient and scalable method for mining the complete set of frequent patterns by pattern fragment growth, using an extended prefix-tree structure for storing compressed and crucial information about frequent patterns, named the frequent-pattern tree (FP-tree).
Data Mining Sequential Pattern Algorithms
Sequential pattern mining is a topic of data mining concerned with finding statistically relevant patterns
between data examples where the values are delivered in a sequence.
It is usually presumed that the values are discrete, and thus time series mining is closely related, but
usually considered a different activity. Sequential pattern mining is a special case of structured data
mining.
There are several key traditional computational problems addressed within this field. These include
building efficient databases and indexes for sequence information, extracting the frequently occurring
patterns, comparing sequences for similarity, and recovering missing sequence members.
In general, sequence mining problems can be classified as string mining which is typically based on string
processing algorithms and itemset mining which is typically based on association rule learning.
Local process models extend sequential pattern mining to more complex patterns that can include
(exclusive) choices, loops, and concurrency constructs in addition to the sequential ordering construct.
Data Mining Sequential Pattern Algorithms [Cont’d]
With a great variation of products and user buying behaviors, the shelf on which products are displayed is one of the most important resources in a retail environment.
Retailers can not only increase their profit but also decrease costs by proper management of shelf-space allocation and product display. To solve this problem, George and Binu (2013) proposed an approach that mines user buying patterns using the PrefixSpan algorithm and places products on shelves based on the order of the mined purchasing patterns.
1. GSP algorithm
The GSP algorithm (Generalized Sequential Pattern algorithm) is an algorithm used for sequence mining. The algorithms for solving sequence mining problems are mostly based on the Apriori (level-wise) algorithm. One way to use the level-wise paradigm is to first discover all the frequent items in a level-wise fashion.
This simply means counting the occurrences of all singleton elements in the database.
Data Mining Sequential Pattern Algorithms [Cont’d]
Then, the transactions are filtered by removing the non-frequent items. At the end of this step,
each transaction consists of only the frequent elements it originally contained. This modified
database becomes an input to the GSP algorithm. This process requires one pass over the
whole database.
The GSP algorithm makes multiple passes over the database. In the first pass, all single items (1-sequences) are counted. From the frequent items, a set of candidate 2-sequences is formed, and another pass is made to identify their frequency. The frequent 2-sequences are used to generate the candidate 3-sequences, and this process is repeated until no more frequent sequences are found. There are two main steps in each pass of the algorithm: candidate generation and support counting.
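A minimal sketch of the first two GSP passes over a toy sequence database (minimum support count of 2, single-item elements only, which is a simplification of full GSP): frequent 1-sequences are found first, then candidate 2-sequences are generated from them and counted.

    from itertools import product

    sequences = [["a", "b", "c"], ["a", "c"], ["b", "c"], ["a", "b"]]
    min_count = 2

    def support_count(pattern):
        """Number of sequences that contain the pattern's items in order."""
        def contains(seq):
            pos = 0
            for item in pattern:
                try:
                    pos = seq.index(item, pos) + 1
                except ValueError:
                    return False
            return True
        return sum(contains(s) for s in sequences)

    items = sorted({i for s in sequences for i in s})
    freq1 = [i for i in items if support_count([i]) >= min_count]       # pass 1
    cand2 = [[x, y] for x, y in product(freq1, repeat=2)]               # join step
    freq2 = [p for p in cand2 if support_count(p) >= min_count]         # pass 2
    print(freq1, freq2)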
2. SPADE algorithm (Sequential PAttern Discovery using Equivalence classes)