
Different forms of data processing/preprocessing (Ans 12)


Parallel DBMSs use three different
types of parallelism
• 1) Inter-query parallelism: each query runs on one processor, but different queries can be distributed among different nodes. A common use case for this is transaction processing, where each transaction can be executed on a different node.
• 2) Inter-operator parallelism: each query runs on multiple processors; the parallelism corresponds to different operators of a query running on different processors.
• 3) Intra-operator parallelism: a single operator is distributed among multiple processors. This is also commonly referred to as data parallelism.
Data Generalization
• Data Generalization is the process of creating successive layers of summary data in an evaluational database. It is a process of zooming out to get a broader view of a problem, trend or situation. It is also known as rolling up the data, i.e., generalization.
• Example: replacing low-level values of an attribute such as age with higher-level concepts such as young, middle-aged and old.
Analytical Characterization:

• Uses a materialized view of the data, which has been precomputed in the data warehouse.
• Generalization is performed by attribute removal and attribute generalization.
Concept description
• A concept usually refers to a collection of data, such as winners, frequent buyers, best sellers, and so on. As a data mining task, concept description is not a simple enumeration of the data; it produces characterization and comparison (discrimination) of the data.
Attribute relevance analysis
• Attribute relevance analysis has two important functions:
• Recognition of the most important variables, i.e., those with the greatest impact on the target variable.
• Understanding the relations and logic between the most important predictors and the target variable, and between the most important predictors themselves from the target variable's perspective.
Mining class comparison
• Identify the attribute (feature) which best distinguishes the different classes.
• Produces pure (homogeneous) subclasses.
Mining Descriptive statistical measure in large databases

• Measuring the Central Tendency


• Measuring the dispersion of data
• Graph displays of basic statistical class
descriptions
Measuring the Central Tendency:

• Mean, Median, Mode and Mid-range are the common measures of central tendency.

Mean, with respect to data mining and data warehousing, is usually the weighted arithmetic mean. The weights reflect the significance, importance, or occurrence frequency attached to their respective values. In this case, we can compute the mean using the frequency values.

Median is the middle value in the ordered set; it requires arranging the distribution in ascending order and applying the median formula.

Mode is the value that occurs most frequently, i.e., the value with the highest frequency in the distribution. For moderately skewed data, mean - mode ≈ 3 × (mean - median).

Mid-range is the average of the largest and the smallest values in a set.
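
A minimal sketch, not from the slides: the four measures above computed in plain Python from a small, made-up frequency table.

from collections import Counter
from statistics import median

values      = [20, 25, 30, 45, 60]   # hypothetical distinct values
frequencies = [1, 4, 6, 2, 1]        # how often each value occurs (the "weights")

# Weighted arithmetic mean: sum(f_i * x_i) / sum(f_i)
weighted_mean = sum(f * v for f, v in zip(frequencies, values)) / sum(frequencies)

# Expand the frequency table to compute median and mode directly.
data = [v for v, f in zip(values, frequencies) for _ in range(f)]
med  = median(data)                            # middle value of the ordered data
mode = Counter(data).most_common(1)[0][0]      # most frequent value

# Mid-range: average of the smallest and largest values.
mid_range = (min(data) + max(data)) / 2

print(weighted_mean, med, mode, mid_range)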
Measuring the dispersion of data:
The degree to which numeric data tend to spread is called the dispersion, or variance, of the data. The most common measures of dispersion are:
• Five-number summary
• Inter-quartile range
• Standard deviation
• Quartiles, outliers and box plots are used to fragment, cluster and represent the data, respectively.
Measuring the dispersion of data:
• The kth percentile of a set of data in numerical order is the value x having the property that k percent of the values lie at or below x. E.g., in a group of 100 ordered values, the 50th value can be called the 50th percentile.

Every 25th percentile is called a quartile. E.g., the 25th, 50th, 75th and 100th percentiles are the 1st, 2nd, 3rd and 4th quartiles respectively.

The inter-quartile range is the distance between the third and first quartiles: IQR = Q3 - Q1.

Outliers are values that do not belong to any cluster.

The lower bound value is the minimum and the upper bound value is the maximum.

The five-number summary is the representation of the minimum value, 1st quartile, median, 3rd quartile and the maximum value.

The box plot is the graphical representation of the five-number summary values.
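
A minimal sketch, assuming NumPy is available: the five-number summary, the inter-quartile range, and a common 1.5 × IQR rule for flagging outliers, all on made-up values.

import numpy as np

data = np.array([7, 15, 36, 39, 40, 41, 42, 43, 47, 49, 74])  # made-up values

q1, med, q3 = np.percentile(data, [25, 50, 75])
iqr = q3 - q1                       # inter-quartile range: Q3 - Q1

five_number_summary = (data.min(), q1, med, q3, data.max())

# A common convention flags values beyond 1.5 * IQR from the quartiles as outliers.
outliers = data[(data < q1 - 1.5 * iqr) | (data > q3 + 1.5 * iqr)]

print(five_number_summary, iqr, outliers)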
Graph displays of Basic statistical class descriptions:

• The data are represented by plotting histograms or frequency histograms. The various representations are:
• Quantile plot
• Histogram
• Quantile-quantile plot
• Scatter plot
• loess curve
Data Generalization
• Data Generalization is the process of creating successive layers of summary data in an evaluational database. It is a process of zooming out to get a broader view of a problem, trend or situation.
• Example: In general, data generalization summarizes data by replacing relatively low-level values (e.g., numeric values for an attribute age) with higher-level concepts (e.g., young, middle-aged, and senior), or by reducing the number of dimensions to summarize data in concept space involving fewer dimensions (e.g., removing birth_date and telephone number when summarizing the behavior of a group of students).
Analytical characterization
• Analytical characterization: data dispersion analysis. Uses information gain analysis (e.g., entropy or other measures) to identify highly relevant dimensions and levels.
• Information gain determines the classifying power of an attribute within a set of data (a sketch follows below).
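
A minimal sketch, not from the slides: entropy and information gain on invented records, showing how gain measures the classifying power of an attribute. The attribute and class names are made up for illustration.

import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(records, attribute, target):
    base = entropy([r[target] for r in records])
    remainder = 0.0
    for value in {r[attribute] for r in records}:
        subset = [r[target] for r in records if r[attribute] == value]
        remainder += len(subset) / len(records) * entropy(subset)
    return base - remainder          # higher gain = stronger classifying power

records = [
    {"age": "young",  "income": "high", "buys": "no"},
    {"age": "young",  "income": "low",  "buys": "no"},
    {"age": "senior", "income": "high", "buys": "yes"},
    {"age": "senior", "income": "low",  "buys": "yes"},
]
print(information_gain(records, "age", "buys"))     # 1.0: highly relevant here
print(information_gain(records, "income", "buys"))  # 0.0: irrelevant here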
Parametric vs Non-Parametric models
• In a parametric model, you know exactly which model you will fit to the data, e.g., a linear regression line.
• In a non-parametric model, however, the data tells you what the 'regression' should look like.
Examples
• Some more examples of parametric machine learning algorithms include: Logistic Regression, Linear Discriminant Analysis, Perceptron, Naive Bayes, Simple Neural Networks.
• Some more examples of popular nonparametric machine learning algorithms are: k-Nearest Neighbors, Decision Trees like CART and C4.5, Support Vector Machines.
Algorithms can be classified as either parametric or non-parametric
• A machine learning algorithm can be classified as either parametric or non-parametric. A parametric algorithm has a fixed number of parameters, independent of the amount of training data.
• A common example of a parametric algorithm is linear regression. In contrast, a non-parametric algorithm uses a flexible number of parameters, and the number of parameters often grows as it learns from more data.
• In the literal meaning of the terms, a parametric statistical test is one that makes assumptions about the parameters (defining properties) of the population distribution(s) from which one's data are drawn, while a non-parametric test is one that makes no such assumptions.
Classification:
• Classification is a data
mining function that assigns
items in a collection to
target categories or classes.
The goal of classification is
to accurately predict the
target class for each case in
the data. For example, a
classification model could
be used to identify loan
applicants as low, medium,
or high credit risks.
Issues in classification:
• 1) Missing data
• 2) Measuring performance: using the ROC curve for a binary classifier and/or a confusion matrix (multi-class classifier).
• (Receiver operating characteristic curve, or ROC curve, also called the relative operating characteristic curve, i.e., the relation between the False Positive rate and the True Positive rate.)
Confusion matrix: Error Matrix
• Definition of the Terms:
• Positive (P) : Observation is positive
(for example: is an apple).
• Negative (N) : Observation is not
positive (for example: is not an
apple).
• True Positive (TP) : Observation is
positive, and is predicted to be
positive.
• False Negative (FN) : Observation is
positive, but is predicted negative.
• True Negative (TN) : Observation is
negative, and is predicted to be
negative.
• False Positive (FP) : Observation is
negative, but is predicted positive.
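
A minimal sketch, not from the slides: counting TP, FN, TN and FP from made-up labels and deriving accuracy, precision and recall (the true positive rate used on the ROC curve).

actual    = ["apple", "apple", "not", "not", "apple", "not"]   # hypothetical labels
predicted = ["apple", "not",   "not", "apple", "apple", "not"]

tp = sum(a == "apple" and p == "apple" for a, p in zip(actual, predicted))
fn = sum(a == "apple" and p == "not"   for a, p in zip(actual, predicted))
tn = sum(a == "not"   and p == "not"   for a, p in zip(actual, predicted))
fp = sum(a == "not"   and p == "apple" for a, p in zip(actual, predicted))

accuracy  = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall    = tp / (tp + fn)        # true positive rate

print(tp, fn, tn, fp, accuracy, precision, recall)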
Confusion Matrix
Q8) Write short notes on: (2008-09)

Classification vs clustering

Classification is the process of classifying the data with the help of class labels. On the other hand, clustering is similar to classification but there are no predefined class labels. Classification is associated with supervised learning, whereas clustering is known as unsupervised learning.
Algo
• 1) Statistical based algorithms: Regression, Bayesian
• 2) Distance based: K-NN algorithm
• 3) Decision Tree based algorithms: ID3, C4.5, CART, SPRINT
• 4) Neural network based algorithms: Perceptron, Feed Forward, Feed Backward
Statistical based Algo: Regression: 1. Linear regression  2. Multiple regression

• Regression is a data
mining technique used to predict a
range of numeric values (also called
continuous values), given a particular
dataset. Regression is used across
multiple industries for business and
marketing planning, financial
forecasting, environmental modeling
and analysis of trends.
• Linear regression performs the task
to predict a dependent variable value
(y) based on a given independent
variable (x). So,
this regression technique finds out
a linear relationship between x
(input) and y(output).
Linear regression
• Linear regression is a way to model the relationship between two variables. The equation has the form Y = a + bX, where Y is the dependent variable (that's the variable that goes on the Y axis), X is the independent variable (i.e. it is plotted on the X axis), b is the slope of the line and a is the y-intercept.
• In statistical modeling, regression analysis is a set of statistical processes for estimating the relationships among variables.
• Regression analysis is also used to understand which among the independent variables are related to the dependent variable, and to explore the forms of these relationships.
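
A minimal sketch, not from the slides: fitting Y = a + bX by ordinary least squares on made-up points.

xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 4.0, 6.2, 8.1, 9.9]

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n

# b = covariance(X, Y) / variance(X); a = mean(Y) - b * mean(X)
b = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / \
    sum((x - mean_x) ** 2 for x in xs)
a = mean_y - b * mean_x

predict = lambda x: a + b * x
print(a, b, predict(6.0))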
Multiple Regression
• Multiple linear regression is an extension of linear regression
analysis.
• It uses two or more independent variables to predict an
outcome and a single continuous dependent variable.
Y = a0 + a1X1 + a2X2 + ... + akXk + e
where,
'Y' is the response variable.
X1, X2, ..., Xk are the independent predictors.
'e' is the random error.
a0, a1, a2, ..., ak are the regression coefficients.
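
A minimal sketch, assuming NumPy is available: estimating a0..ak of the equation above by least squares; the data values are invented for illustration.

import numpy as np

X = np.array([[2.0, 3.0],     # each row: one observation of X1, X2
              [4.0, 1.0],
              [5.0, 6.0],
              [7.0, 2.0]])
y = np.array([10.0, 11.0, 22.0, 20.0])

# Prepend a column of ones so the first coefficient acts as the intercept a0.
X1 = np.column_stack([np.ones(len(X)), X])
coeffs, *_ = np.linalg.lstsq(X1, y, rcond=None)

print(coeffs)                         # [a0, a1, a2]
print(X1 @ coeffs)                    # fitted values for each observation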
Bayesian classifier
Example
Distance Based algo: KNN classification
• K nearest neighbors is a simple algorithm that stores all available cases and classifies new cases based on a similarity measure (e.g., distance functions).
• A case is classified by a majority vote of its neighbors, with the case being assigned to the class most common amongst its K nearest neighbors measured by a distance function (a minimal sketch follows below).
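
A minimal sketch, not from the slides: majority-vote k-NN with Euclidean distance on made-up 2-D points.

from collections import Counter
import math

train = [((1.0, 1.0), "A"), ((1.5, 2.0), "A"),
         ((5.0, 5.0), "B"), ((6.0, 5.5), "B"), ((5.5, 4.5), "B")]

def classify(point, k=3):
    # Sort training cases by distance to the query point, take the k closest,
    # and return the most common class among them.
    nearest = sorted(train, key=lambda item: math.dist(point, item[0]))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

print(classify((1.2, 1.5)))   # "A"
print(classify((5.2, 5.0)))   # "B"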
K-NN classifier
Algorithm
The core algorithm for building decision trees, called ID3, was developed by J. R. Quinlan; it employs a top-down, greedy search through the space of possible branches with no backtracking. ID3 uses Entropy and Information Gain to construct a decision tree.
Decision tree
During the late 1970s and early 1980s, J. Ross Quinlan, a researcher in machine learning,
developed a decision tree algorithm known as ID3 (Iterative Dichotomiser).
• Quinlan later presented C4.5 (a successor of ID3)

• A group of statisticians (L. Breiman, J. Friedman, R. Olshen, and C. Stone) published the book
Classification and Regression Trees (CART), which described the generation of binary
decision trees

• ID3, C4.5, and CART adopt a greedy (i.e., nonbacktracking) approach in which decision trees
are constructed in a top-down recursive divide-and-conquer manner.
Decision Tree
• A decision tree is a flowchart-like tree structure, where each internal node (nonleaf node) denotes a test on an attribute, each branch represents an outcome of the test, and each leaf node (or terminal node) holds a class label. The topmost node in a tree is the root node.
Attribute to Split
CART Algo
• Decision Trees are commonly used in data mining with the objective of
creating a model that predicts the value of a target (or dependent
variable) based on the values of several input (or independent variables).
Here we discuss the CART decision tree methodology. The
CART or Classification & Regression Trees methodology was introduced in
1984 by Leo Breiman, Jerome Friedman, Richard Olshen and
Charles Stone as an umbrella term to refer to the following types of
decision trees:
CART

Classification Trees: where the target variable is categorical and the tree is used to identify the "class" within which a target variable would likely fall.

Regression Trees: where the target variable is continuous and the tree is used to predict its value.
CART( Decision Tree’s variant)
• The main elements of CART (and any decision tree algorithm) are:
• Rules for splitting data at a node based on the value of one variable;
• Stopping rules for deciding when a branch is terminal and can be split no more; and
• Finally, a prediction for the target variable in each terminal node.
• Advantages of CART
• Simple to understand, interpret, visualize.
• Decision trees implicitly perform variable screening or feature selection.
• Can handle both numerical and categorical data. Can also handle multi-output problems.
• Decision trees require relatively little effort from users for data preparation.
• Nonlinear relationships between parameters do not affect tree performance.
• Disadvantages of CART
• Decision-tree learners can create over-complex trees that do not generalize the data well. This is called overfitting.
• Decision trees can be unstable because small variations in the data might result in a completely different tree
being generated. This is called variance, which needs to be lowered by methods like bagging and boosting.
• Greedy algorithms cannot guarantee to return the globally optimal decision tree. This can be mitigated by training
multiple trees, where the features and samples are randomly sampled with replacement.
• Decision tree learners create biased trees if some classes dominate. It is therefore recommended to balance the
data set prior to fitting with the decision tree.
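
A hedged sketch, assuming scikit-learn is available: its DecisionTreeClassifier is a CART-style implementation, used here on a tiny invented dataset to show splitting rules, a stopping rule, and prediction.

from sklearn.tree import DecisionTreeClassifier, export_text

X = [[25, 30000], [45, 80000], [35, 60000], [50, 20000], [23, 25000]]  # age, income
y = ["high", "low", "low", "high", "high"]                              # credit risk (made up)

tree = DecisionTreeClassifier(max_depth=2, random_state=0)  # max_depth acts as a stopping rule
tree.fit(X, y)

print(export_text(tree, feature_names=["age", "income"]))   # the learned splitting rules
print(tree.predict([[40, 70000]]))                          # prediction for a new applicant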

Clustering
Introduction:
Clustering Algo:
Hierarchical Clustering & Partition Clustering
Hierarchical clustering
• Hierarchical clustering, also known as hierarchical cluster analysis, is an
algorithm that groups similar objects into groups called clusters. The endpoint is
a set of clusters, where each cluster is distinct from each other cluster, and the
objects within each cluster are broadly similar to each other.
• How hierarchical clustering works
• Hierarchical clustering starts by treating each observation as a separate cluster.
Then, it repeatedly executes the following two steps: (1) identify the two clusters
that are closest together, and (2) merge the two most similar clusters. This
continues until all the clusters are merged together. This is illustrated in the
diagrams below.

• The main output of Hierarchical Clustering is a dendrogram, which shows the hierarchical relationship between the clusters:
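
A minimal sketch, assuming SciPy is available: agglomerative clustering of made-up 2-D points, producing the linkage matrix that a dendrogram would display.

import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

points = np.array([[1, 1], [1.2, 1.1], [5, 5], [5.1, 4.9], [9, 9]])

Z = linkage(points, method="average")             # repeatedly merges the two closest clusters
labels = fcluster(Z, t=3, criterion="maxclust")   # cut the hierarchy into 3 clusters
print(labels)

# dendrogram(Z) would draw the merge hierarchy when a plotting backend is available.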
Hierarchical clustering: Agglomerative (Bottom Up) versus divisive (Top Down) algorithms

• Strategies for hierarchical clustering generally fall into two types. Agglomerative: this is a "bottom-up" approach: each observation starts in its own cluster, and pairs of clusters are merged as one moves up the hierarchy.
• Hierarchical clustering typically works by sequentially merging similar clusters, as shown above. This is known as agglomerative hierarchical clustering. In theory, it can also be done by initially grouping all the observations into one cluster, and then successively splitting these clusters. This is known as divisive hierarchical clustering. Divisive clustering is rarely done in practice.

CURE algo
• CURE (Clustering Using REpresentatives) is an efficient data
clustering algorithm for large databases.
• Compared with K-means clustering it is
more robust to outliers and able to identify clusters having
non-spherical shapes and size variances.
CURE Algo
• To avoid the problems with non-uniform sized or shaped clusters, CURE employs a hierarchical clustering algorithm that adopts a middle ground between the centroid-based and all-points extremes.
• In CURE, a constant number c of well-scattered points of a cluster are chosen, and they are shrunk towards the centroid of the cluster by a fraction α. The scattered points after shrinking are used as representatives of the cluster.
• The clusters with the closest pair of representatives are the clusters that are merged at each step of CURE's hierarchical clustering algorithm.
• This enables CURE to correctly identify the clusters and makes it less sensitive to outliers.
• Running time is O(n² log n), making it rather expensive, and space complexity is O(n).
Chameleon Algo:
DBSCAN
• eps: if the eps value chosen is too small, a large part of the data will not be clustered; it will be considered outliers because it doesn't satisfy the number of points needed to create a dense region.
• On the other hand, if the value chosen is too high, clusters will merge and the majority of objects will be in the same cluster.
• eps should be chosen based on the distances in the dataset (we can use a k-distance graph to find it), but in general small eps values are preferable.
• minPoints: as a general rule, a minimum minPoints can be derived from the number of dimensions (D) in the data set, as minPoints ≥ D + 1. Larger values are usually better for data sets with noise and will form more significant clusters. The minimum value for minPoints is 3, but the larger the data set, the larger the minPoints value that should be chosen (a small sketch follows below).
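
A minimal sketch, assuming scikit-learn is available, showing how the eps and min_samples (minPoints) parameters behave; the points and parameter values are made up.

import numpy as np
from sklearn.cluster import DBSCAN

points = np.array([[1, 1], [1.1, 1.2], [0.9, 1.0],
                   [5, 5], [5.1, 5.2], [4.9, 5.0],
                   [9, 9]])                      # the last point is an outlier

labels = DBSCAN(eps=0.5, min_samples=3).fit_predict(points)
print(labels)        # label -1 marks noise/outliers; other labels are cluster ids

# A much larger eps merges everything into a single cluster:
print(DBSCAN(eps=10.0, min_samples=3).fit_predict(points))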
OPTICS
• Ordering points to identify the clustering structure (OPTICS) is an algorithm for finding density-based clusters in spatial data.
• Its basic idea is similar to DBSCAN, but it addresses one of DBSCAN's major weaknesses: the problem of detecting meaningful clusters in data of varying density.
GRID Based: clustering
STING:
• STatistical Information Grid (STING) is a grid-based clustering algorithm. The dataset is recursively divided into a hierarchical structure. The whole input dataset serves as the root node of the hierarchy.
• Each cell/unit in a layer is composed of a couple of cells/units in the lower layer. An example is shown in the following diagram:
• To support the query for a dataset, the statistical information of each unit
is calculated in advance for further processing; this information is also
called statistics parameters.
• The characteristics of STING algorithms are (but not limited to) the
following:
• A query-independent structure
• Intrinsically parallelizable
• Efficiency
The STING algorithm
CLIQUE based clustering
CLIQUE: based Algo
CLARA
CLARANS (Clustering Large Applications based on RANdomized Search)
BIRCH
(balanced iterative reducing and clustering using hierarchies

• BIRCH Algo
• BIRCH (balanced iterative reducing
and clustering using hierarchies) is an
unsupervised data mining algorithm
used to perform hierarchical
clustering over particularly large
data-sets. In most cases, BIRCH only
requires a single scan of the
database.
Model-based clustering
statistical approach: EM Model
Association Rule: What Is Frequent Pattern
Analysis?
• Frequent pattern: a pattern (a set of items, subsequences, substructures, etc.)
that occurs frequently in a data set
• First proposed by Agrawal, Imielinski, and Swami [AIS93] in the context of
frequent itemsets and association rule mining
• Motivation: Finding inherent regularities in data
– What products were often purchased together?— Beer and diapers?!
– What are the subsequent purchases after buying a PC?
– What kinds of DNA are sensitive to this new drug?
– Can we automatically classify web documents?
• Applications
– Basket data analysis, cross-marketing, catalog design, sale campaign
analysis, Web log (click stream) analysis, and DNA sequence analysis.

Why Is Freq. Pattern Mining Important?
• Freq. pattern: An intrinsic and important property of datasets
• Foundation for many essential data mining tasks
– Association, correlation, and causality analysis
– Sequential, structural (e.g., sub-graph) patterns
– Pattern analysis in spatiotemporal, multimedia,
time-series, and stream data
– Classification: discriminative, frequent pattern
analysis
– Cluster analysis: frequent pattern-based clustering
– Data warehousing: iceberg cube and cube-gradient
– Semantic data compression: fascicles
– Broad applications

Chapter 5: Mining Frequent Patterns, Association and
Correlations: Basic Concepts and Methods

• Basic Concepts

• Frequent Itemset Mining Methods

• Which Patterns Are Interesting?—Pattern

Evaluation Methods

• Summary

The Apriori Algorithm
(Pseudo-Code)
Ck: Candidate itemset of size k
Lk: frequent itemset of size k

L1 = {frequent items};
for (k = 1; Lk != ∅; k++) do begin
    Ck+1 = candidates generated from Lk;
    for each transaction t in database do
        increment the count of all candidates in Ck+1 that are contained in t
    Lk+1 = candidates in Ck+1 with min_support
end
return ∪k Lk;

Implementation of Apriori
• How to generate candidates?

– Step 1: self-joining Lk
– Step 2: pruning
• Example of Candidate-generation

– L3={abc, abd, acd, ace, bcd}


– Self-joining: L3*L3
• abcd from abc and abd
• acde from acd and ace
– Pruning:
• acde is removed because ade is not in L3
– C4 = {abcd}

Candidate Generation: An SQL Implementation
• SQL Implementation of candidate generation
– Suppose the items in Lk-1 are listed in an order
– Step 1: self-joining Lk-1
insert into Ck
select p.item1, p.item2, …, p.itemk-1, q.itemk-1
from Lk-1 p, Lk-1 q
where p.item1 = q.item1 and … and p.itemk-2 = q.itemk-2 and p.itemk-1 < q.itemk-1
– Step 2: pruning
forall itemsets c in Ck do
    forall (k-1)-subsets s of c do
        if (s is not in Lk-1) then delete c from Ck
Scalable Frequent Itemset
Mining Methods
• Apriori: A Candidate Generation-and-Test Approach

• Improving the Efficiency of Apriori

• FPGrowth: A Frequent Pattern-Growth Approach

• ECLAT: Frequent Pattern Mining with Vertical Data Format

• Mining Closed Frequent Patterns and Max-patterns

Further Improvement of the Apriori Method

• Major computational challenges

– Multiple scans of transaction database


– Huge number of candidates
– Tedious workload of support counting for
candidates
• Improving Apriori: general ideas

– Reduce passes of transaction database scans


– Shrink number of candidates
– Facilitate support counting of candidates

Dependent Events

• Events can be "dependent" which means they


can be affected by previous events .
• So we have to say which one we want, and use
the symbol "|" to mean "given":
• P(B|A) means "Event B given Event A"
• In other words, event A has already happened,
now what is the chance of event B?
• P(B|A) is also called the "Conditional Probability"
of B given A.
Example:

• Example: Ice Cream


• 70% of your friends like Chocolate, and 35% like
Chocolate AND like Strawberry.
• What percent of those who like Chocolate also
like Strawberry?
• P(Strawberry|Chocolate) = P(Chocolate and
Strawberry) / P(Chocolate)
• 0.35 / 0.7 = 50%
• 50% of your friends who like Chocolate also like
Strawberry
If we know the conditional probability, we can use Bayes' rule to find out the reverse probabilities.
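
A minimal worked sketch of Bayes' rule applied to the ice-cream numbers above; the 40% overall preference for Strawberry is an invented figure for illustration.

p_chocolate = 0.70
p_choc_and_straw = 0.35
p_straw_given_choc = p_choc_and_straw / p_chocolate     # 0.5, as computed above

# Hypothetical: suppose 40% of friends like Strawberry overall.
p_strawberry = 0.40

# Bayes' rule: P(A|B) = P(B|A) * P(A) / P(B)
p_choc_given_straw = p_straw_given_choc * p_chocolate / p_strawberry
print(p_choc_given_straw)    # 0.875 = reverse probability P(Chocolate|Strawberry)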
Neural Network Model
• Neural networks are non-linear
statistical data modeling tools. They can be
used to model complex relationships between
inputs and outputs or to find patterns in data
Neural Network Model
Example
• Taking the example of bank credit approval, the attributes of the customers such as age, income, existing loans etc. are considered as input and denoted as a vector X = {x1, x2, x3, ..., xd}, and the weights of these attributes as W = {w1, w2, w3, ..., wd}.
• Note that bias is also known as threshold, which will now make more sense to you.

example
Example2
• Consider a scenario wherein your NN should answer the following question:
How likely are you to go for a movie today?
• Consider the features(inputs) as [x1,x2,x3] where
x1 = Is the weather good?
x2 = Is anyone accompanying me?
x3 = Is it near public transit? I don’t own a car.
• Since weights represent the importance of each input, my sample weights are:
w1 = 3 (How important is the weather condition for me to go to a movie ?)
w2 = 4 (How much I desire someone to accompany me? )
w3 = 7 ( How much you prefer a nearby place ?)
• Bias represents your overall willingness to go to the movie. If bias is too big, you
are tilting towards a positive result.
• Output = No if ∑ wx + bias ≤ 0
• Output = Yes if ∑ wx + bias > 0 (a minimal worked sketch follows below)
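
A minimal sketch of this decision rule; the bias value of -8 is an assumption chosen only so the example produces both outcomes.

def go_to_movie(x1, x2, x3, bias=-8):          # bias: assumed "overall willingness"
    weights = [3, 4, 7]                        # importance of weather, company, proximity
    total = sum(w * x for w, x in zip(weights, [x1, x2, x3])) + bias
    return "Yes" if total > 0 else "No"

print(go_to_movie(1, 0, 1))   # 3 + 7 - 8 = 2  -> "Yes"
print(go_to_movie(0, 1, 0))   # 4 - 8     = -4 -> "No"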
ANN
