
DATA MINING

STATISTICAL-BASED ALGORITHMS

CLASSIFICATION
INTRODUCTION:
 Prediction can be thought of as classifying an attribute value into one
set of possible classes. It is often viewed as forecasting a continuous
value, while classification forecasts a discrete value.

 All classification techniques assume some knowledge of the data.


Training data consists of sample input data as well as the
classification assignment for each data tuple. Given a database D of
tuples and a set of classes C, the classification problem is to define a
mapping f : D → C where each tuple is assigned to one class.

 The problem is implemented in two phases:

 Create a specific model by evaluating the training data.

 Apply the model to classifying tuples from the target database.

 There are three basic methods used to solve the classification problem:


1. Specifying boundaries;
2. Using probability distributions;
3. Using posterior probabilities.

 A major issue associated with classification is overfitting. If the
classification model fits the training data exactly, it may not be applicable
to a broader population.
 Statistical algorithms are based directly on the use of statistical
information.

MEASURING PERFORMANCE AND ACCURACY


 Classification accuracy is usually calculated by determining
the percentage of tuples placed in the correct class.

 Given a specific class and a database tuple, the tuple may or may not
be assigned to that class, while its actual membership may or may
not be in that class.
This gives us four quadrants, tallied in the sketch after this list:
 True positive (TP): ti predicted to be in cj and is actually in
it.
 False positive (FP): ti predicted to be in cj and is not
actually in it.
 True negative (TN): ti not predicted to be in cj and not
actually in it.
 False negative (FN): ti not predicted to be in cj but actually
in it.
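
As a rough illustration (the function name and its inputs are assumptions, not from the text), the four counts can be tallied for a single class cj by comparing predicted and actual labels:

# Minimal sketch: count the four quadrants for one class cj.
def count_quadrants(actual, predicted, cj):
    tp = fp = tn = fn = 0
    for a, p in zip(actual, predicted):
        if p == cj and a == cj:
            tp += 1      # predicted in cj and actually in it
        elif p == cj and a != cj:
            fp += 1      # predicted in cj but not actually in it
        elif p != cj and a != cj:
            tn += 1      # not predicted in cj and not in it
        else:
            fn += 1      # not predicted in cj but actually in it
    return tp, fp, tn, fn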

 An operating characteristic (OC) curve, or ROC curve, shows the
relationship between false positives and true positives. The horizontal
axis has the percentage of false positives and the vertical axis has the
percentage of true positives for a database sample.
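
A minimal sketch of how ROC points could be computed; the per-tuple membership scores and the list of thresholds are assumed inputs, not from the text:

# Illustrative sketch: each threshold yields one (FP rate, TP rate) point.
def roc_points(actual, scores, cj, thresholds):
    points = []
    for t in thresholds:
        tp = fp = tn = fn = 0
        for a, s in zip(actual, scores):
            predicted_in = s >= t      # predict membership above threshold
            actual_in = a == cj
            if predicted_in and actual_in:
                tp += 1
            elif predicted_in:
                fp += 1
            elif actual_in:
                fn += 1
            else:
                tn += 1
        tpr = tp / (tp + fn) if tp + fn else 0.0   # vertical axis
        fpr = fp / (fp + tn) if fp + tn else 0.0   # horizontal axis
        points.append((fpr, tpr))
    return points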

 A confusion matrix illustrates the accuracy of the solution to
a classification problem. Given m classes, a confusion matrix
is an m × m matrix where entry cij indicates the number of
tuples from D that were assigned to class Cj but whose
correct class is Ci.
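
The following sketch (the function name and inputs are hypothetical) builds the m × m matrix described above, with rows indexed by the correct class Ci and columns by the assigned class Cj:

# Sketch: build the m x m confusion matrix from actual and predicted labels.
def confusion_matrix(actual, predicted, classes):
    index = {c: k for k, c in enumerate(classes)}
    m = len(classes)
    matrix = [[0] * m for _ in range(m)]
    for a, p in zip(actual, predicted):
        # entry c_ij: correct class Ci selects the row, assigned class Cj the column
        matrix[index[a]][index[p]] += 1
    return matrix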

STATISTICAL-BASED ALGORITHMS

Regression

 Regression problems deal with the estimation of an output value
based on input values.
 Regression can be used to perform classification using two
different approaches:

1) Division: The data are divided into regions based on class.

2) Prediction: Formulas are generated to predict the output
class value.

 Noise is erroneous data. Outliers are data values that
are exceptions to the usual and expected data. A linear
regression model takes the form

y = c0 + c1x1 + ··· + cnxn + ε

Here ε is a random error with a mean of 0. As with point
estimation, we can estimate the accuracy of the fit of a
linear regression model to the actual data using a mean
squared error function.
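
A minimal sketch, assuming numpy is available, of fitting the linear model above by least squares and measuring the fit with mean squared error:

import numpy as np

def fit_linear(X, y):
    # prepend a column of ones so that c0 acts as the intercept
    A = np.column_stack([np.ones(len(X)), X])
    coeffs, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coeffs

def mse(X, y, coeffs):
    A = np.column_stack([np.ones(len(X)), X])
    residuals = y - A @ coeffs     # estimates of the random error term
    return float(np.mean(residuals ** 2))
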
 A commonly used regression technique is
called logistic regression. The logistic curve has the form

p = e^(c0 + c1x1) / (1 + e^(c0 + c1x1))

The logistic curve gives a value between 0 and 1, so it can be
interpreted as the probability of class membership. As with linear
regression, it can be used when classification into two classes is
desired.
To perform the regression, the logarithmic function can be applied
to obtain the logit (log-odds) function

loge(p / (1 − p)) = c0 + c1x1

Here p is the probability of being in the class and 1 − p is the
probability that it is not.
The regression process chooses values for c0 and c1 that maximize
the probability of observing the given values.
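
A small illustrative sketch of one-variable logistic regression: c0 and c1 are chosen to maximize the log-likelihood of the observed 0/1 labels by gradient ascent (the step size and iteration count here are assumptions, not from the text):

import math

def fit_logistic(xs, ys, lr=0.1, steps=5000):
    c0 = c1 = 0.0
    for _ in range(steps):
        g0 = g1 = 0.0
        for x, y in zip(xs, ys):
            p = 1.0 / (1.0 + math.exp(-(c0 + c1 * x)))   # logistic curve
            g0 += y - p          # gradient of the log-likelihood w.r.t. c0
            g1 += (y - p) * x    # gradient of the log-likelihood w.r.t. c1
        c0 += lr * g0 / len(xs)
        c1 += lr * g1 / len(xs)
    return c0, c1
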
Bayesian Classification

 Assuming that the contributions of all attributes are
independent and that each contributes equally to the
classification problem, a simple classification scheme
called naive Bayes can be used.

 Training data can be used to determine P(Cj), P(xi | Cj),
and P(xi). From these values, Bayes' theorem allows us
to estimate the posterior probability P(Cj | xi) and then
P(Cj | ti).

P(ti | Cj) = ∏k=1..n P(xik | Cj)

where the product is taken over the n attribute values of ti.

 To calculate P(ti), we can estimate the likelihood that ti
is in each class.

 The posterior probability P(Cj | ti) is then found for
each class. The class with the highest probability is the
one chosen for the tuple.

 Only one scan of the training data is required. In
simple relationships, the technique often does yield
good results.

 The technique does not handle continuous data.
Dividing the continuous values into ranges could be
used to solve this problem, as in the sketch below.
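
A hedged sketch of the naive Bayes steps above: the priors P(Cj) and conditionals P(xik | Cj) are estimated by counting over the training tuples, and the class with the highest posterior is chosen. Continuous attributes are assumed to be pre-binned into ranges as the text suggests, and no smoothing is applied:

from collections import defaultdict

def train_naive_bayes(tuples, labels):
    class_counts = defaultdict(int)
    attr_counts = defaultdict(int)   # (class, position, value) -> count
    for t, c in zip(tuples, labels):
        class_counts[c] += 1
        for k, v in enumerate(t):
            attr_counts[(c, k, v)] += 1
    return class_counts, attr_counts, len(labels)

def classify(t, model):
    class_counts, attr_counts, n = model
    best_class, best_post = None, -1.0
    for c, cc in class_counts.items():
        post = cc / n                                 # prior P(Cj)
        for k, v in enumerate(t):
            # P(xik | Cj); an unseen value yields 0 since no smoothing is used
            post *= attr_counts[(c, k, v)] / cc
        if post > best_post:
            best_class, best_post = c, post
    return best_class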
