DPM 8
Muchake Brian
Phone: 0701178573
Email: bmuchake@gmail.com, bmuchake@cis.mak.ac.ug,
• Now, we need to classify a new data point, shown as a black dot at (60, 60), into either the blue or the red class. We assume K = 3, i.e. the algorithm finds the three nearest data points, as shown in the next diagram −
Data Mining Classification Algorithms [Cont’d]
• We can see in the diagram the three nearest neighbors of the new data point. Two of the three lie in the red class, so the black dot is also assigned to the red class.
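• Below is a minimal Python sketch of the K = 3 example above, assuming scikit-learn is available; the coordinates of the red and blue training points are made up for illustration.

    from sklearn.neighbors import KNeighborsClassifier

    # Hypothetical training points: two classes, "red" and "blue"
    X_train = [[55, 65], [58, 70], [62, 55], [40, 40], [45, 35], [35, 45]]
    y_train = ["red", "red", "blue", "blue", "red", "blue"]

    knn = KNeighborsClassifier(n_neighbors=3)   # K = 3
    knn.fit(X_train, y_train)

    # Classify the new point at (60, 60): the majority class among its
    # three nearest neighbours decides the label (here, "red").
    print(knn.predict([[60, 60]]))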
Data Mining Classification Algorithms [Cont’d]
Pros
• It is a very simple algorithm to understand and interpret.
• It is very useful for nonlinear data because the algorithm makes no assumptions about the data.
• It is a versatile algorithm, as it can be used for classification as well as regression.
• It has relatively high accuracy, but there are much better supervised learning models than KNN.
Data Mining Classification Algorithms [Cont’d]
Cons
• It is a computationally somewhat expensive algorithm because it stores all of the training data.
• It requires more memory than other supervised learning algorithms.
• Prediction is slow when the number of training examples is large.
• It is very sensitive to the scale of the data as well as to irrelevant features.
Data Mining Classification Algorithms [Cont’d]
Applications of KNN
The following are some of the areas in which KNN can be applied successfully −
• Banking System
• KNN can be used in a banking system to predict whether an individual is fit for loan approval, i.e. whether that individual has characteristics similar to those of defaulters.
• Calculating Credit Ratings
• KNN algorithms can be used to find an individual’s credit rating by comparing the individual with persons having similar traits.
• Politics
• With the help of KNN algorithms, we can classify a potential voter into classes like “Will Vote”, “Will Not Vote”, “Will Vote for Party ‘Congress’”, “Will Vote for Party ‘BJP’”.
• Other areas in which the KNN algorithm can be used are speech recognition, handwriting detection, image recognition and video recognition.
Data Mining Classification Algorithms [Cont’d]
3. Support-vector machine (SVM)
•In machine learning, support-vector machines (SVMs, also support-vector networks) are supervised learning models with associated learning algorithms that analyze data for classification and regression analysis.
•A Support Vector Machine (SVM) performs classification by finding the hyperplane that
maximizes the margin between the two classes. The vectors (cases) that define the hyperplane
are the support vectors.
•A Support Vector Machine (SVM) is a discriminative classifier formally defined by a separating hyperplane. In other words, given labeled training data (supervised learning), the algorithm outputs an optimal hyperplane which categorizes new examples. In two-dimensional space this hyperplane is a line dividing the plane into two parts, with each class lying on either side.
Data Mining Classification Algorithms [Cont’d]
• SVM was first introduced by Vapnik and has been a very effective method for regression, classification and general pattern recognition.
• It is considered a good classifier because of its high generalization performance without the need to add a priori knowledge, even when the dimension of the input space is very high.
• The aim of SVM is to find the best classification function to distinguish between members of the two classes in the training data.
• The metric for the concept of the “best” classification function can be realized geometrically.
• A support vector machine (SVM) learns a hyperplane to classify data into 2 classes. At a high level, SVM performs a task similar to C4.5, except that SVM does not use decision trees at all.
• A hyperplane is a function, like the equation for a line.
• In fact, for a simple classification task with just 2 features, the hyperplane can be a line.
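• The following is a minimal sketch of a linear SVM on two features, assuming scikit-learn and a toy two-class dataset (not from the slides); with only 2 features the separating hyperplane is just a line.

    import numpy as np
    from sklearn.svm import SVC

    # Toy data: two well-separated classes
    X = np.array([[1, 2], [2, 3], [3, 3], [6, 5], [7, 8], [8, 8]])
    y = np.array([0, 0, 0, 1, 1, 1])

    clf = SVC(kernel="linear")       # finds the maximum-margin hyperplane
    clf.fit(X, y)

    print(clf.support_vectors_)      # the cases (vectors) that define the hyperplane
    print(clf.predict([[5, 5]]))     # classify a new example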
Data Mining Classification Algorithms [Cont’d]
4. Bayesian Networks
•A Bayesian network (BN) consists of a directed, acyclic graph and a probability distribution for each
node in that graph given its immediate predecessors.
•A Bayes network classifier is based on a Bayesian network which represents a joint probability distribution over a set of categorical attributes.
•It consists of two parts: the directed acyclic graph G, consisting of nodes and arcs, and the conditional probability tables.
•The nodes represent attributes, whereas the arcs indicate direct dependencies.
•The density of the arcs in a BN is one measure of its complexity. Sparse BNs can represent simple
probabilistic models (e.g., naïve Bayes models and hidden Markov models), whereas dense BNs can
capture highly complex models. Thus, BNs provide a flexible method for probabilistic modelling.
Data Mining Classification Algorithms [Cont’d]
• Bayesian classification is based on Bayes' Theorem. Bayesian classifiers are statistical classifiers. Bayesian classifiers can predict class membership probabilities, such as the probability that a given tuple belongs to a particular class.
• Naive Bayes classifiers are a collection of classification algorithms based on Bayes’ Theorem. It is not a single algorithm but a family of algorithms that share a common principle: every pair of features being classified is assumed to be independent of each other given the class.
• This is one of the easiest algorithms to apply: it is easy to construct, needs no complicated iterative parameter estimation schemes, and can be applied to huge data sets. Because of this simplicity, even unskilled users can understand why the classifications are made.
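• A minimal sketch of a naive Bayes classifier, assuming scikit-learn; the age/income tuples and the yes/no labels are invented purely for illustration.

    from sklearn.naive_bayes import GaussianNB

    # Toy tuples: [age, income], labelled "yes"/"no"
    X = [[25, 40000], [35, 60000], [45, 80000], [20, 20000], [50, 90000], [30, 30000]]
    y = ["no", "yes", "yes", "no", "yes", "no"]

    nb = GaussianNB()                # features treated as independent given the class
    nb.fit(X, y)

    # Class-membership probabilities and the predicted class for a new tuple
    print(nb.predict_proba([[40, 70000]]))
    print(nb.predict([[40, 70000]]))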
Data Mining Classification Algorithms [Cont’d]
Models:
• Pruned Naive Bayes (Naive Bayes Build)
• Simplified decision tree (Single Feature Build)
• Boosted (Multi Feature Build)
The advantages of Bayesian Networks:
• Visually represent all the relationships between the variables
• Easy to recognize the dependence and independence between nodes.
• Can handle incomplete data
• scenarios where it is not practical to measure all variables (costs, not enough sensors, etc.)
• Help to model noisy systems.
• Can be used for any system model - from all known parameters to no known parameters.
Data Mining Classification Algorithms [Cont’d]
The limitations of Bayesian Networks:
• All branches must be calculated in order to calculate the probability of any one branch.
• The quality of the results of the network depends on the quality of the prior beliefs or
model.
• Calculation can be NP-hard
• Calculations and probabilities using Bayes' rule and marginalization can become complex and subtle, and care must be taken to calculate them properly.
Data Mining Classification Algorithms [Cont’d]
5.Neural Networks
•Artificial neural networks mimic the pattern-finding capacity of the human brain, and hence some researchers have suggested applying neural network algorithms to pattern-mapping. Neural networks have been applied successfully in a few applications that involve classification.
•An artificial neural network (ANN), often just called a "neural network" (NN), is a mathematical or computational model based on biological neural networks; in other words, it is an emulation of a biological neural system.
•It consists of an interconnected group of artificial neurons and processes information using a
connectionist approach to computation. In most cases an ANN is an adaptive system that changes its
structure based on external or internal information that flows through the network during the learning
phase.
Data Mining Classification Algorithms [Cont’d]
• Neural networks are used to model complex relationships between inputs and outputs or to find patterns in data.
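• A minimal sketch of a small feed-forward neural network classifier, assuming scikit-learn's MLPClassifier and a synthetic dataset (the slides name no specific library or architecture).

    from sklearn.neural_network import MLPClassifier
    from sklearn.datasets import make_classification

    # Synthetic two-class data with 4 input features
    X, y = make_classification(n_samples=200, n_features=4, random_state=0)

    # One hidden layer of 8 neurons; the weights adapt during the learning phase
    nn = MLPClassifier(hidden_layer_sizes=(8,), max_iter=1000, random_state=0)
    nn.fit(X, y)

    print(nn.score(X, y))            # accuracy on the training data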
6. Logistic Regression
• Definition: Logistic regression is a machine learning algorithm for classification. In this algorithm, the
probabilities describing the possible outcomes of a single trial are modelled using a logistic function.
• Advantages: Logistic regression is designed for this purpose (classification), and is most useful for
understanding the influence of several independent variables on a single outcome variable.
• Disadvantages: Works only when the predicted variable is binary, assumes all predictors are
independent of each other, and assumes data is free of missing values.
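• A minimal sketch of logistic regression for a binary outcome, assuming scikit-learn and synthetic data; the fitted coefficients indicate the influence of each independent variable.

    from sklearn.linear_model import LogisticRegression
    from sklearn.datasets import make_classification

    X, y = make_classification(n_samples=200, n_features=3, n_informative=3,
                               n_redundant=0, random_state=1)

    logreg = LogisticRegression()
    logreg.fit(X, y)

    print(logreg.coef_)                  # influence of each predictor on the outcome
    print(logreg.predict_proba(X[:3]))   # modelled probabilities for the first 3 rows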
Data Mining Classification Algorithms [Cont’d]
7. Stochastic Gradient Descent
•Definition: Stochastic gradient descent is a simple and very efficient approach to fit linear
models. It is particularly useful when the number of samples is very large. It supports different
loss functions and penalties for classification.
•Advantages: Efficiency and ease of implementation.
•Disadvantages: Requires a number of hyper-parameters and is sensitive to feature scaling.
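• A minimal sketch of a linear classifier fitted with stochastic gradient descent, assuming scikit-learn; the features are standardized first because, as noted above, SGD is sensitive to feature scaling.

    from sklearn.datasets import make_classification
    from sklearn.linear_model import SGDClassifier
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler

    X, y = make_classification(n_samples=500, n_features=10, random_state=2)

    # hinge loss with an L2 penalty; scaling is part of the pipeline
    model = make_pipeline(StandardScaler(),
                          SGDClassifier(loss="hinge", penalty="l2"))
    model.fit(X, y)

    print(model.score(X, y))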
Data Mining Classification Algorithms [Cont’d]
8. Random Forest
•Definition: A random forest classifier is a meta-estimator that fits a number of decision trees on various sub-samples of the dataset and uses averaging to improve the predictive accuracy of the model and to control over-fitting. The sub-sample size is the same as the original input sample size, but the samples are drawn with replacement.
•Advantages: Reduction in over-fitting; a random forest classifier is more accurate than a single decision tree in most cases.
•Disadvantages: Slow real-time prediction, difficult to implement, and a complex algorithm.
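• A minimal sketch of a random forest classifier, assuming scikit-learn and synthetic data: many decision trees are fitted on bootstrap sub-samples and their predictions are combined.

    from sklearn.ensemble import RandomForestClassifier
    from sklearn.datasets import make_classification

    X, y = make_classification(n_samples=300, n_features=6, random_state=3)

    # 100 trees, each trained on a bootstrap sample drawn with replacement
    rf = RandomForestClassifier(n_estimators=100, random_state=3)
    rf.fit(X, y)

    print(rf.score(X, y))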
Data Mining Classification Algorithms [Cont’d]
9. ID3 Algorithm
•This algorithm starts with the original set S as the root node. On every iteration, it goes through every unused attribute of the set and computes the entropy of that attribute, then chooses the attribute with the smallest entropy (equivalently, the largest information gain).
•The set S is then split by the selected attribute to produce subsets of the data. The algorithm then recurses on each subset, considering only attributes never selected before. Recursion on a subset may halt in one of these cases:
• Every element in the subset belongs to the same class (+ or -); the node is then turned into a leaf and labeled with the class of the examples.
• There are no more attributes to select, but the examples still do not belong to the same class; the node is then turned into a leaf and labeled with the most common class of the examples in that subset.
Data Mining Classification Algorithms [Cont’d]
• There are no examples in the subset; this happens when no example in the parent set matches a specific value of the selected attribute.
• For example, if there was no example with marks >= 100, a leaf is created and labeled with the most common class of the examples in the parent set.
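• A minimal sketch of the entropy and information-gain calculation that ID3 uses to pick the splitting attribute; the tiny "outlook" dataset is invented for illustration.

    from collections import Counter
    from math import log2

    def entropy(labels):
        """Shannon entropy of a list of class labels."""
        total = len(labels)
        return -sum((c / total) * log2(c / total) for c in Counter(labels).values())

    def information_gain(rows, labels, attribute):
        """Expected reduction in entropy after splitting on `attribute`."""
        subsets = {}
        for row, label in zip(rows, labels):
            subsets.setdefault(row[attribute], []).append(label)
        remainder = sum(len(s) / len(labels) * entropy(s) for s in subsets.values())
        return entropy(labels) - remainder

    rows = [{"outlook": "sunny"}, {"outlook": "sunny"},
            {"outlook": "rain"}, {"outlook": "rain"}]
    labels = ["-", "-", "+", "+"]

    # 1.0 here: splitting on "outlook" separates the two classes perfectly
    print(information_gain(rows, labels, "outlook"))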
10. C4.5 Algorithm
•Classifiers are tools in data mining that take as input a collection of cases, where each case belongs to one of a small number of classes and is described by its values for a fixed set of attributes.
•The output is a classifier that can accurately predict the class to which a new case belongs. C4.5 makes use of decision trees, where the initial tree is obtained using a divide-and-conquer algorithm.
Data Mining Classification Algorithms [Cont’d]
• C4.5 is an algorithm used to generate a decision tree developed by Ross Quinlan. C4.5 is an extension of Quinlan's
earlier ID3 algorithm. The decision trees generated by C4.5 can be used for classification, and for this reason, C4.5 is
often referred to as a statistical classifier. In 2011, authors of the Weka machine learning software described the C4.5
algorithm as "a landmark decision tree program that is probably the machine learning workhorse most widely used in
practice to date".
C4.5 made a number of improvements to ID3. Some of these are:
• Handling both continuous and discrete attributes - In order to handle continuous attributes, C4.5 creates a threshold
and then splits the list into those whose attribute value is above the threshold and those that are less than or equal
to it.
• Handling training data with missing attribute values - C4.5 allows attribute values to be marked as ? for missing.
Missing attribute values are simply not used in gain and entropy calculations.
• Handling attributes with differing costs.
• Pruning trees after creation - C4.5 goes back through the tree once it's been created and attempts to remove
branches that do not help by replacing them with leaf nodes.
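• As a rough illustration of the continuous-attribute handling described above, the sketch below tries candidate thresholds on a toy numeric attribute and keeps the one whose two branches (<= threshold and > threshold) have the lowest weighted entropy; the data are invented.

    from collections import Counter
    from math import log2

    def entropy(labels):
        n = len(labels)
        return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

    values = [1.2, 2.5, 3.1, 4.8, 5.0, 6.7]        # a continuous attribute
    labels = ["no", "no", "no", "yes", "yes", "yes"]

    def weighted_entropy(threshold):
        below = [l for v, l in zip(values, labels) if v <= threshold]
        above = [l for v, l in zip(values, labels) if v > threshold]
        return (len(below) * entropy(below) + len(above) * entropy(above)) / len(labels)

    best = min(values[:-1], key=weighted_entropy)
    print("best threshold:", best)                  # 3.1 splits the classes cleanly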
Data Mining Association Rule Algorithms
• An association rule is a rule which implies certain association relationships among a set of objects (such as "occur together" or "one implies the other") in a database.
• Given a set of transactions, where each transaction is a set of literals (called items), an association rule is an expression of the form X ⇒ Y, where X and Y are sets of items. The intuitive meaning of such a rule is that transactions of the database which contain X tend to contain Y.
• An example of an association rule is: "30% of transactions that contain beer also contain diapers; 2% of all transactions contain both of these items". Here 30% is called the confidence of the rule, and 2% the support of the rule. The problem is to find all association rules that satisfy user-specified minimum support and minimum confidence constraints.
• Association Rule is one of the very important concepts of machine learning being used in market
basket analysis.
Data Mining Association Rule Algorithms [Cont’d]
• Market Basket Analysis is the study of customer transaction databases to determine dependencies between the various items they purchase at different times.
• Association rule learning is a rule-based machine learning method for discovering interesting relations between variables in large databases. It identifies frequent if-then associations, called association rules, which consist of an antecedent (if) and a consequent (then).
• For example: “If tea and milk, then sugar” (“If tea and milk are purchased, then sugar would
also be bought by the customer”)
• Antecedent: Tea and Milk
• Consequent: Sugar.
Data Mining Association Rule Algorithms [Cont’d]
There are three common metrics to measure association:
• Support is an indication of how frequently the items appear in the data. Mathematically, support is the fraction of the total number of transactions in which the item set occurs: Support(X) = (number of transactions containing X) / (total number of transactions).
• Confidence indicates how often the if-then statement is found to be true. Confidence is the conditional probability of occurrence of the consequent given the antecedent: Confidence(X ⇒ Y) = Support(X ∪ Y) / Support(X).
Data Mining Association Rule Algorithms [Cont’d]
Lift can be used to compare confidence with expected confidence. It says how likely item Y is to be purchased when item X is purchased, while controlling for how popular item Y is. Mathematically, Lift(X ⇒ Y) = Confidence(X ⇒ Y) / Support(Y).
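A minimal sketch of computing support, confidence and lift for the rule {tea, milk} ⇒ {sugar}, using a small made-up transaction list rather than real data:

    # Each transaction is a set of items
    transactions = [
        {"tea", "milk", "sugar"},
        {"tea", "milk"},
        {"tea", "milk", "sugar", "bread"},
        {"bread", "butter"},
        {"milk", "sugar"},
    ]

    def support(itemset):
        """Fraction of transactions that contain every item of the itemset."""
        return sum(itemset <= t for t in transactions) / len(transactions)

    antecedent, consequent = {"tea", "milk"}, {"sugar"}

    sup = support(antecedent | consequent)         # 0.4
    conf = sup / support(antecedent)               # 0.4 / 0.6 = 0.67
    lift = conf / support(consequent)              # 0.67 / 0.6 = 1.11
    print(sup, conf, lift)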
1. Apriori Algorithm
• Apriori uses a breadth-first search strategy to count the support of itemsets and uses a
candidate generation function which exploits the downward closure property of support.
• Apriori is an algorithm for frequent item set mining and association rule learning over
relational databases.
Data Mining Association Rule Algorithms [Cont’d]
• It proceeds by identifying the frequent individual items in the database and extending them to
larger and larger item sets as long as those item sets appear sufficiently often in the database.
• The frequent item sets determined by Apriori can be used to determine association rules which
highlight general trends in the database: this has applications in domains such as market basket
analysis.
• The Apriori algorithm was proposed by Agrawal and Srikant in 1994. Apriori is designed to
operate on databases containing transactions (for example, collections of items bought by
customers, or details of a website frequentation or IP addresses).
• Other algorithms are designed for finding association rules in data having no transactions
(Winepi and Minepi), or having no timestamps (DNA sequencing). Each transaction is seen as a
set of items (an itemset).
Data Mining Association Rule Algorithms [Cont’d]
• Given a support threshold C, the Apriori algorithm identifies the item sets which are subsets of at least C transactions in the database.
• Apriori uses a "bottom up" approach, where frequent subsets are extended one item at a time
(a step known as candidate generation), and groups of candidates are tested against the
data. The algorithm terminates when no further successful extensions are found.
• Apriori uses breadth-first search and a hash tree structure to count candidate item sets efficiently. It generates candidate item sets of length k from item sets of length k − 1.
• Then it prunes the candidates which have an infrequent sub-pattern. According to the downward closure lemma, the candidate set contains all frequent k-length item sets. After that, it scans the transaction database to determine frequent item sets among the candidates.
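• A minimal sketch of Apriori's bottom-up, breadth-first search, assuming a toy transaction list and a minimum support count of 2: frequent itemsets of one length are joined into candidates one item longer, counted against the data, and the infrequent ones are discarded.

    transactions = [{"beer", "diapers"}, {"beer", "bread"},
                    {"beer", "diapers", "milk"}, {"bread", "milk"}]
    min_count = 2

    def count(itemset):
        return sum(itemset <= t for t in transactions)

    items = sorted({i for t in transactions for i in t})
    frequent = [frozenset([i]) for i in items if count(frozenset([i])) >= min_count]
    all_frequent, k = list(frequent), 2

    while frequent:
        # candidate generation: join frequent (k-1)-itemsets into k-itemsets,
        # then keep only those that occur often enough in the data
        candidates = {a | b for a in frequent for b in frequent if len(a | b) == k}
        frequent = [c for c in candidates if count(c) >= min_count]
        all_frequent.extend(frequent)
        k += 1

    print(all_frequent)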
Data Mining Association Rule Algorithms [Cont’d]
2. ECLAT Algorithm
•The ECLAT algorithm stands for Equivalence Class Clustering and bottom-up Lattice Traversal. It is one of the popular methods of association rule mining, and a more efficient and scalable version of the Apriori algorithm.
•While the Apriori algorithm works in a horizontal sense, imitating the breadth-first search of a graph, the ECLAT algorithm works in a vertical manner, just like the depth-first search of a graph. This vertical approach makes ECLAT a faster algorithm than Apriori.
•The basic idea is to use intersections of transaction-id sets (tidsets) to compute the support value of a candidate, while avoiding the generation of subsets which do not exist in the prefix tree. In the first call of the function, all single items are used along with their tidsets. The function is then called recursively, and in each recursive call each item-tidset pair is verified and combined with other item-tidset pairs. This process continues until no candidate item-tidset pairs can be combined.
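• A minimal sketch of ECLAT's vertical layout, with invented transactions and a minimum support count of 2: each item is mapped to its tidset, and the support of a candidate pair is the size of the intersection of the two tidsets.

    transactions = {1: {"a", "b", "c"}, 2: {"a", "b"}, 3: {"a", "c"}, 4: {"b", "c"}}
    min_count = 2

    # vertical layout: item -> set of transaction ids (tidset) containing it
    tidsets = {}
    for tid, items in transactions.items():
        for item in items:
            tidsets.setdefault(item, set()).add(tid)

    # support of a 2-itemset {x, y} is the size of tidset(x) ∩ tidset(y)
    for x in sorted(tidsets):
        for y in sorted(tidsets):
            if x < y:
                common = tidsets[x] & tidsets[y]
                if len(common) >= min_count:
                    print({x, y}, "support count:", len(common))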
Data Mining Association Rule Algorithms [Cont’d]
3. FP-growth algorithm
•FP stands for frequent pattern.
•In the first pass, the algorithm counts the occurrences of items (attribute-value pairs) in the dataset of
transactions, and stores these counts in a 'header table'. In the second pass, it builds the FP-tree structure
by inserting transactions into a trie.
•Items in each transaction have to be sorted by descending order of their frequency in the dataset before
being inserted so that the tree can be processed quickly. Items in each transaction that do not meet the
minimum support requirement are discarded. If many transactions share most frequent items, the FP-tree
provides high compression close to tree root.
• Recursive processing of this compressed version of the main dataset grows frequent item sets directly, instead of generating candidate items and testing them against the entire database (as in the Apriori algorithm).
Data Mining Association Rule Algorithms [Cont’d]
• In data mining, the task of finding frequent patterns in large databases is very important and has been studied on a large scale in the past few years. Unfortunately, this task is computationally expensive, especially when a large number of patterns exist.
• The FP-Growth Algorithm, proposed by Han et al., is an efficient and scalable method for mining the complete set of frequent patterns by pattern fragment growth, using an extended prefix-tree structure for storing compressed and crucial information about frequent patterns, named the frequent-pattern tree (FP-tree).
Data Mining Sequential Pattern Algorithms
Sequential pattern mining is a topic of data mining concerned with finding statistically relevant patterns
between data examples where the values are delivered in a sequence.
It is usually presumed that the values are discrete, and thus time series mining is closely related, but
usually considered a different activity. Sequential pattern mining is a special case of structured data
mining.
There are several key traditional computational problems addressed within this field. These include
building efficient databases and indexes for sequence information, extracting the frequently occurring
patterns, comparing sequences for similarity, and recovering missing sequence members.
In general, sequence mining problems can be classified as string mining which is typically based on string
processing algorithms and itemset mining which is typically based on association rule learning.
Local process models extend sequential pattern mining to more complex patterns that can include
(exclusive) choices, loops, and concurrency constructs in addition to the sequential ordering construct.
Data Mining Sequential Pattern Algorithms [Cont’d]
With a great variation of products and user buying behaviors, the shelf on which products are displayed is one of the most important resources in a retail environment.
Retailers can not only increase their profit but also decrease costs by proper management of shelf-space allocation and product display. To solve this problem, George and Binu (2013) proposed an approach that mines user buying patterns using the PrefixSpan algorithm and places products on shelves based on the order of the mined purchasing patterns.
1. GSP algorithm
The GSP algorithm (Generalized Sequential Pattern algorithm) is an algorithm used for sequence mining. The algorithms for solving sequence mining problems are mostly based on the Apriori (level-wise) algorithm. One way to use the level-wise paradigm is to first discover all the frequent items in a level-wise fashion.
This simply means counting the occurrences of all singleton elements in the database.
Data Mining Sequential Pattern Algorithms [Cont’d]
Then, the transactions are filtered by removing the non-frequent items. At the end of this step,
each transaction consists of only the frequent elements it originally contained. This modified
database becomes an input to the GSP algorithm. This process requires one pass over the
whole database.
The GSP algorithm makes multiple passes over the database. In the first pass, all single items (1-sequences) are counted. From the frequent items, a set of candidate 2-sequences is formed, and another pass is made to identify their frequency. The frequent 2-sequences are used to generate the candidate 3-sequences, and this process is repeated until no more frequent sequences are found. There are two main steps in each pass of the algorithm: candidate generation and support counting.
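A minimal sketch of the first two GSP passes over a toy sequence database (minimum support count of 2, single-item elements only, which is a simplification of full GSP): frequent 1-sequences are found first, then candidate 2-sequences are generated from them and counted.

    from itertools import product

    sequences = [["a", "b", "c"], ["a", "c"], ["b", "c"], ["a", "b"]]
    min_count = 2

    def support_count(pattern):
        """Number of sequences that contain the pattern's items in order."""
        def contains(seq):
            pos = 0
            for item in pattern:
                try:
                    pos = seq.index(item, pos) + 1
                except ValueError:
                    return False
            return True
        return sum(contains(s) for s in sequences)

    items = sorted({i for s in sequences for i in s})
    freq1 = [i for i in items if support_count([i]) >= min_count]       # pass 1
    cand2 = [[x, y] for x, y in product(freq1, repeat=2)]               # join step
    freq2 = [p for p in cand2 if support_count(p) >= min_count]         # pass 2
    print(freq1, freq2)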
2. SPADE algorithm (Sequential PAttern Discovery using Equivalence classes)