DWDM Notes Unit-4
• Mean, Median, Mode, and Mid-range are the common measures of central tendency.
In data mining and data warehousing, the mean is usually the weighted arithmetic mean: the weights reflect the significance, importance, or occurrence frequency attached to their respective values, so the mean can be computed directly from the frequency counts.
The median is the middle value of the ordered set; computing it requires arranging the distribution in ascending order and applying the median formula.
The mode is the value that occurs most frequently, i.e. the value with the highest frequency in the distribution.
For moderately skewed distributions, the empirical relation mean - mode = 3 * (mean - median) holds approximately.
The mid-range is the average of the largest and the smallest values in the set.
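As a minimal sketch (not part of the original notes), the measures above can be computed for a small, made-up frequency table using Python's standard library:

    from statistics import median, mode

    # Hypothetical frequency table: each distinct value with its occurrence count.
    values = [2, 4, 5, 7, 9]
    freqs  = [1, 4, 2, 1, 1]

    # Weighted arithmetic mean: the frequencies act as the weights.
    weighted_mean = sum(v * f for v, f in zip(values, freqs)) / sum(freqs)

    # Expand the table into raw observations for the median and mode.
    observations = [v for v, f in zip(values, freqs) for _ in range(f)]

    print("weighted mean:", weighted_mean)
    print("median       :", median(observations))   # middle of the ordered set
    print("mode         :", mode(observations))     # most frequent value
    print("mid-range    :", (max(values) + min(values)) / 2)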
Measuring the dispersion of data:
The degree to which numeric data tend to spread is called the dispersion, or variance, of the data. The most common measures of dispersion are:
• Five-number summary
• Interquartile range
• Standard deviation
• Quartiles, outliers, and box plots are used to partition the data, flag extreme values, and represent the distribution graphically, respectively.
Measuring the dispersion of data:
• The kth percentile of a set of data in numerical order is the value x having the property that k percent of the data values lie at or below x. For example, in a group of 100 ordered values the 50th value can be called the 50th percentile.
Every 25th percentile is called a quartile; for example, the 25th, 50th, 75th, and 100th percentiles are the 1st, 2nd, 3rd, and 4th quartiles, respectively.
The lower bound value is the minimum and the upper bound value is the maximum.
The five-number summary is the representation of the minimum, the 1st quartile, the median, the 3rd quartile, and the maximum.
The box plot is the graphical representation of all five values of the five-number summary.
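A minimal sketch (not from the notes) of computing the five-number summary and interquartile range with NumPy; the data values below are made up:

    import numpy as np

    data = np.array([3, 7, 8, 5, 12, 14, 21, 13, 18])   # hypothetical values

    q1, med, q3 = np.percentile(data, [25, 50, 75])
    five_number_summary = (data.min(), q1, med, q3, data.max())
    iqr = q3 - q1                                        # interquartile range

    print("five-number summary:", five_number_summary)
    print("IQR:", iqr)
    # A box plot simply draws these five values (e.g. matplotlib's plt.boxplot(data)).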
Graph displays of Basic statistical class descriptions:
• Regression is a data mining technique used to predict a range of numeric values (also called continuous values), given a particular dataset. Regression is used across multiple industries for business and marketing planning, financial forecasting, environmental modeling, and analysis of trends.
• Linear regression predicts the value of a dependent variable (y) from a given independent variable (x); in other words, this regression technique finds a linear relationship between x (input) and y (output).
Linear regression
• Linear regression is a way to model the relationship between two variables. The equation has the form Y = a + bX, where Y is the dependent variable (the variable that goes on the Y axis), X is the independent variable (i.e. it is plotted on the X axis), b is the slope of the line, and a is the y-intercept.
• In statistical modeling, regression analysis is a set of statistical processes for estimating the relationships among variables. Regression analysis is also used to understand which among the independent variables are related to the dependent variable, and to explore the forms of these relationships.
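A minimal sketch (not part of the notes) of fitting Y = a + bX by least squares with NumPy; the x and y values are made up:

    import numpy as np

    x = np.array([1, 2, 3, 4, 5], dtype=float)    # hypothetical independent variable
    y = np.array([2.1, 4.3, 6.2, 8.1, 9.9])       # hypothetical dependent variable

    # np.polyfit returns the slope b and intercept a of the least-squares line.
    b, a = np.polyfit(x, y, deg=1)
    print(f"Y = {a:.2f} + {b:.2f} * X")

    # Predict y for a new x value using the fitted line.
    print("prediction at x=6:", a + b * 6)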
Multiple Regression
• Multiple linear regression is an extension of linear regression
analysis.
• It uses two or more independent variables to predict the value of a single continuous dependent variable (the outcome).
Y = a0 + a1 X1 + a2 X2 + ... + ak Xk + e
where
'Y' is the response variable,
X1, X2, ..., Xk are the independent predictors,
'e' is the random error, and
a0, a1, a2, ..., ak are the regression coefficients.
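A minimal sketch (not from the notes) of estimating the coefficients a0..ak by least squares; the small two-predictor dataset below is made up:

    import numpy as np

    # Hypothetical data: two predictors X1, X2 and a response Y.
    X = np.array([[1.0, 2.0],
                  [2.0, 1.0],
                  [3.0, 4.0],
                  [4.0, 3.0],
                  [5.0, 5.0]])
    y = np.array([6.1, 6.9, 12.2, 13.1, 17.0])

    # Prepend a column of ones so the first coefficient is the intercept a0.
    X_design = np.column_stack([np.ones(len(X)), X])
    coeffs, *_ = np.linalg.lstsq(X_design, y, rcond=None)

    a0, a1, a2 = coeffs
    print(f"Y = {a0:.2f} + {a1:.2f}*X1 + {a2:.2f}*X2")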
Bayesian classifier
Example
Distance-Based Algorithm: KNN Classification
• K nearest neighbors (KNN) is a simple algorithm that stores all available cases and classifies new cases based on a similarity measure (e.g., distance functions).
• A case is classified by a majority vote of its neighbors, with the case being assigned to the class most common among its K nearest neighbors as measured by a distance function.
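A minimal from-scratch sketch (not part of the notes) of KNN classification using Euclidean distance and a majority vote; the training points and labels are made up:

    import numpy as np
    from collections import Counter

    def knn_classify(train_X, train_y, query, k=3):
        # Euclidean distance from the query point to every stored case.
        dists = np.linalg.norm(train_X - query, axis=1)
        # Labels of the k nearest neighbours.
        nearest = [train_y[i] for i in np.argsort(dists)[:k]]
        # Majority vote among the k neighbours decides the class.
        return Counter(nearest).most_common(1)[0][0]

    train_X = np.array([[1.0, 1.0], [1.2, 0.8], [5.0, 5.0], [5.2, 4.8]])
    train_y = ["A", "A", "B", "B"]
    print(knn_classify(train_X, train_y, np.array([1.1, 0.9])))   # expected: "A"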
K-NN classifier
Algorithm
The core algorithm for building decision trees, called ID3, was developed by J. R. Quinlan; it employs a top-down, greedy search through the space of possible branches with no backtracking. ID3 uses Entropy and Information Gain to construct a decision tree.
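A minimal sketch (not from the notes) of the entropy and information-gain calculations ID3 relies on; the tiny class-label lists are made up:

    import math
    from collections import Counter

    def entropy(labels):
        # H(S) = -sum over classes of p * log2(p)
        total = len(labels)
        return -sum((c / total) * math.log2(c / total)
                    for c in Counter(labels).values())

    def information_gain(parent_labels, splits):
        # Gain = H(parent) - weighted average entropy of the child splits.
        total = len(parent_labels)
        weighted = sum(len(s) / total * entropy(s) for s in splits)
        return entropy(parent_labels) - weighted

    parent = ["yes", "yes", "no", "no", "yes"]
    left, right = ["yes", "yes", "yes"], ["no", "no"]    # split on some attribute
    print("gain:", information_gain(parent, [left, right]))   # ~0.971: a perfect split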
Decision tree
During the late 1970s and early 1980s, J. Ross Quinlan, a researcher in machine learning,
developed a decision tree algorithm known as ID3 (Iterative Dichotomiser).
• Quinlan later presented C4.5 (a successor of ID3)
• A group of statisticians (L. Breiman, J. Friedman, R. Olshen, and C. Stone) published the book
Classification and Regression Trees (CART), which described the generation of binary
decision trees
• ID3, C4.5, and CART adopt a greedy (i.e., nonbacktracking) approach in which decision trees
are constructed in a top-down recursive divide-and-conquer manner.
Decision Tree
• A decision tree is a flowchart-like tree structure, where each internal node (nonleaf node) denotes a test on an attribute, each branch represents an outcome of the test, and each leaf node (or terminal node) holds a class label. The topmost node in a tree is the root node.
Attribute to Split
CART Algo
• Decision Trees are commonly used in data mining with the objective of creating a model that predicts the value of a target (or dependent) variable based on the values of several input (or independent) variables. The CART, or Classification & Regression Trees, methodology was introduced in 1984 by Leo Breiman, Jerome Friedman, Richard Olshen, and Charles Stone as an umbrella term for the following types of decision trees:
CART
• Classification Trees: where the target variable is categorical and the tree is used to identify the "class" within which a target variable would likely fall.
• Regression Trees: where the target variable is continuous and the tree is used to predict its value.
CART( Decision Tree’s variant)
• The main elements of CART (and any decision tree algorithm) are:
• Rules for splitting data at a node based on the value of one variable;
• Stopping rules for deciding when a branch is terminal and can be split no more; and
• Finally, a prediction for the target variable in each terminal node.
• Advantages of CART
• Simple to understand, interpret, visualize.
• Decision trees implicitly perform variable screening or feature selection.
• Can handle both numerical and categorical data. Can also handle multi-output problems.
• Decision trees require relatively little effort from users for data preparation.
• Nonlinear relationships between parameters do not affect tree performance.
• Disadvantages of CART
• Decision-tree learners can create over-complex trees that do not generalize the data well. This is called overfitting.
• Decision trees can be unstable because small variations in the data might result in a completely different tree
being generated. This is called variance, which needs to be lowered by methods like bagging and boosting.
• Greedy algorithms cannot guarantee to return the globally optimal decision tree. This can be mitigated by training
multiple trees, where the features and samples are randomly sampled with replacement.
• Decision tree learners create biased trees if some classes dominate. It is therefore recommended to balance the
data set prior to fitting with the decision tree.
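A minimal sketch (not part of the notes) of fitting a CART-style binary tree with scikit-learn's DecisionTreeClassifier (which follows the CART approach of binary splits on the Gini index); the toy data and feature names are made up:

    from sklearn.tree import DecisionTreeClassifier, export_text

    # Toy data: two numeric attributes and a binary class label.
    X = [[2.0, 1.5], [1.0, 1.0], [3.5, 4.0], [4.0, 3.5], [3.0, 3.0], [1.5, 2.0]]
    y = ["no", "no", "yes", "yes", "yes", "no"]

    tree = DecisionTreeClassifier(criterion="gini", max_depth=2)  # Gini index, as in CART
    tree.fit(X, y)

    print(export_text(tree, feature_names=["x1", "x2"]))  # splitting rule at each node
    print(tree.predict([[3.2, 3.8]]))                      # predicted class for a new case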
Clustering
Introduction:
Clustering Algo:
Hierarchical Clustering & Partition Clustering
Hierarchical clustering
• Hierarchical clustering, also known as hierarchical cluster analysis, is an
algorithm that groups similar objects into groups called clusters. The endpoint is
a set of clusters, where each cluster is distinct from each other cluster, and the
objects within each cluster are broadly similar to each other.
• How hierarchical clustering works
• Hierarchical clustering starts by treating each observation as a separate cluster. Then it repeatedly executes two steps: (1) identify the two clusters that are closest together, and (2) merge those two most similar clusters. This continues until all the clusters have been merged together.
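A minimal sketch (not from the notes) of agglomerative hierarchical clustering with SciPy; the 2-D points and the choice of single linkage are made up:

    import numpy as np
    from scipy.cluster.hierarchy import linkage, fcluster

    points = np.array([[1.0, 1.0], [1.2, 1.1], [5.0, 5.0],
                       [5.1, 4.9], [9.0, 9.0]])

    # Single linkage repeatedly merges the two closest clusters.
    Z = linkage(points, method="single")

    # Cut the resulting tree (dendrogram) into 2 flat clusters.
    labels = fcluster(Z, t=2, criterion="maxclust")
    print(labels)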
• BIRCH Algo
• BIRCH (balanced iterative reducing
and clustering using hierarchies) is an
unsupervised data mining algorithm
used to perform hierarchical
clustering over particularly large
data-sets. In most cases, BIRCH only
requires a single scan of the
database.
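A minimal sketch (not part of the notes) using scikit-learn's Birch implementation; the threshold, cluster count, and data below are made-up illustration values:

    import numpy as np
    from sklearn.cluster import Birch

    X = np.array([[1.0, 1.0], [1.3, 0.9], [8.0, 8.0],
                  [8.2, 7.9], [8.1, 8.3], [0.9, 1.2]])

    # threshold controls the radius of the CF-tree subclusters built in a single scan.
    model = Birch(threshold=1.0, n_clusters=2)
    labels = model.fit_predict(X)
    print(labels)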
Model-based clustering
statistical approach: EM Model
Association Rule: What Is Frequent Pattern
Analysis?
• Frequent pattern: a pattern (a set of items, subsequences, substructures, etc.)
that occurs frequently in a data set
• First proposed by Agrawal, Imielinski, and Swami [AIS93] in the context of
frequent itemsets and association rule mining
• Motivation: Finding inherent regularities in data
– What products were often purchased together?— Beer and diapers?!
– What are the subsequent purchases after buying a PC?
– What kinds of DNA are sensitive to this new drug?
– Can we automatically classify web documents?
• Applications
– Basket data analysis, cross-marketing, catalog design, sale campaign
analysis, Web log (click stream) analysis, and DNA sequence analysis.
Why Is Freq. Pattern Mining Important?
• Freq. pattern: An intrinsic and important property of datasets
• Foundation for many essential data mining tasks
– Association, correlation, and causality analysis
– Sequential, structural (e.g., sub-graph) patterns
– Pattern analysis in spatiotemporal, multimedia,
time-series, and stream data
– Classification: discriminative, frequent pattern
analysis
– Cluster analysis: frequent pattern-based clustering
– Data warehousing: iceberg cube and cube-gradient
– Semantic data compression: fascicles
– Broad applications
Chapter 5: Mining Frequent Patterns, Association and
Correlations: Basic Concepts and Methods
• Basic Concepts
• Evaluation Methods
• Summary
The Apriori Algorithm
(Pseudo-Code)
Ck: candidate itemsets of size k
Lk: frequent itemsets of size k

L1 = {frequent items};
for (k = 1; Lk != ∅; k++) do begin
    Ck+1 = candidates generated from Lk;
    for each transaction t in the database do
        increment the count of all candidates in Ck+1 that are contained in t;
    Lk+1 = candidates in Ck+1 with count >= min_support;
end
return ∪k Lk;
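A minimal from-scratch sketch (not the notes' code) of the pseudocode above: candidates are generated from Lk by self-joining, pruned if any k-subset is infrequent, counted in one pass over the transactions, and kept if they meet min_support. The transactions and threshold are made up:

    from itertools import combinations

    transactions = [{"beer", "diapers", "bread"},
                    {"beer", "diapers"},
                    {"bread", "milk"},
                    {"beer", "bread", "milk"}]
    min_support = 2   # minimum count a candidate needs to be frequent

    def support_filter(candidates):
        # Count each candidate itemset in the database; keep those with enough support.
        counts = {c: sum(c <= t for t in transactions) for c in candidates}
        return {c for c, n in counts.items() if n >= min_support}

    # L1: frequent 1-itemsets.
    items = {i for t in transactions for i in t}
    L = support_filter({frozenset({i}) for i in items})
    frequent = set(L)

    k = 1
    while L:
        # Candidate generation: self-join Lk to build (k+1)-itemsets,
        # then prune any candidate that has an infrequent k-subset.
        joined = {a | b for a in L for b in L if len(a | b) == k + 1}
        candidates = {c for c in joined
                      if all(frozenset(s) in L for s in combinations(c, k))}
        L = support_filter(candidates)
        frequent |= L
        k += 1

    print(sorted(map(sorted, frequent)))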
Implementation of Apriori
• How to generate candidates?
– Step 1: self-joining Lk
– Step 2: pruning
• Example of Candidate-generation
Candidate Generation: An SQL Implementation
• SQL Implementation of candidate generation
– Suppose the items in Lk-1 are listed in an order
– Step 1: self-joining Lk-1
insert into Ck
select p.item1, p.item2, …, p.itemk-1, q.itemk-1
from Lk-1 p, Lk-1 q
where p.item1 = q.item1 and … and p.itemk-2 = q.itemk-2 and p.itemk-1 < q.itemk-1
– Step 2: pruning
forall itemsets c in Ck do
forall (k-1)-subsets s of c do
if (s is not in Lk-1) then delete c from Ck
Scalable Frequent Itemset
Mining Methods
• Apriori: A Candidate Generation-and-Test Approach
Further Improvement of the Apriori Method
Dependent Events