
DATA MINING

STATISTICAL-BASED ALGORITHMS

CLASSIFICATION
INTRODUCTION:
 Prediction can be thought of as classifying an attribute value into one
set of possible classes. It is often viewed as forecasting a continuous
value, while classification forecasts a discrete value.

 All classification techniques assume some knowledge of the data.


Training data consists of sample input data as well as the
classification assignment for each data tuple. Given a database D of
tuples and a set of classes C, the classification problem is to define a
mapping f : D → C where each tuple is assigned to one class.

 The problem is implemented in two phases:

 Create a specific model by evaluating the training data.

 Apply the model to classifying tuples from the target database.

 There are three basic methods used to solve the classification problem:


1. Specifying boundaries;
2. Using probability distributions;
3. Using posterior probabilities.

 A major issue associated with classification is overfitting. If the
classification model fits the training data exactly, it may not be applicable
to a broader population.
 Statistical algorithms are based directly on the use of statistical
information.

MEASURING PERFORMANCE AND ACCURACY


 Classification accuracy is usually calculated by determining
the percentage of tuples placed in the correct class.

 Given a specific class and a database tuple, the tuple may or may not
be assigned to that class, while its actual membership may or may
not be in that class.
This gives us four quadrants, tallied in the sketch after this list:
 True positive (TP): ti predicted to be in cj and is actually in
it.
 False positive (FP): ti predicted to be in cj and is not
actually in it.
 True negative (TN): ti not predicted to be in cj and not
actually in it.
 False negative (FN): ti not predicted to be in cj but actually
in it.
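
As a rough illustration (the function name and its inputs are assumptions, not from the text), the four counts can be tallied for a single class cj by comparing predicted and actual labels:

# Minimal sketch: count the four quadrants for one class cj.
def count_quadrants(actual, predicted, cj):
    tp = fp = tn = fn = 0
    for a, p in zip(actual, predicted):
        if p == cj and a == cj:
            tp += 1      # predicted in cj and actually in it
        elif p == cj and a != cj:
            fp += 1      # predicted in cj but not actually in it
        elif p != cj and a != cj:
            tn += 1      # not predicted in cj and not in it
        else:
            fn += 1      # not predicted in cj but actually in it
    return tp, fp, tn, fn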

 An operating characteristic (OC) curve, or ROC curve, shows the
relationship between false positives and true positives. The horizontal
axis has the percentage of false positives and the vertical axis has the
percentage of true positives for a database sample.
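
A minimal sketch of how ROC points could be computed; the per-tuple membership scores and the list of thresholds are assumed inputs, not from the text:

# Illustrative sketch: each threshold yields one (FP rate, TP rate) point.
def roc_points(actual, scores, cj, thresholds):
    points = []
    for t in thresholds:
        tp = fp = tn = fn = 0
        for a, s in zip(actual, scores):
            predicted_in = s >= t      # predict membership above threshold
            actual_in = a == cj
            if predicted_in and actual_in:
                tp += 1
            elif predicted_in:
                fp += 1
            elif actual_in:
                fn += 1
            else:
                tn += 1
        tpr = tp / (tp + fn) if tp + fn else 0.0   # vertical axis
        fpr = fp / (fp + tn) if fp + tn else 0.0   # horizontal axis
        points.append((fpr, tpr))
    return points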

 A confusion matrix illustrates the accuracy of the solution to
a classification problem. Given m classes, a confusion matrix
is an m × m matrix where entry cij indicates the number of
tuples from D that were assigned to class Cj but whose
correct class is Ci.
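
The following sketch (the function name and inputs are hypothetical) builds the m × m matrix described above, with rows indexed by the correct class Ci and columns by the assigned class Cj:

# Sketch: build the m x m confusion matrix from actual and predicted labels.
def confusion_matrix(actual, predicted, classes):
    index = {c: k for k, c in enumerate(classes)}
    m = len(classes)
    matrix = [[0] * m for _ in range(m)]
    for a, p in zip(actual, predicted):
        # entry c_ij: correct class Ci selects the row, assigned class Cj the column
        matrix[index[a]][index[p]] += 1
    return matrix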

STATISTICAL-BASED ALGORITHMS

Regression

 Regression problems deal with the estimation of an output value
based on input values.
 Regression can be used to perform classification using two
different approaches:

1) Division: The data are divided into regions based on class.

2) Prediction: Formulas are generated to predict the output
class value.

 Noise is erroneous data. Outliers are data values that
are exceptions to the usual and expected data. A linear
regression model takes the form

y = c0 + c1x1 + ··· + cnxn + ε

Here ε is a random error with a mean of 0. As with point
estimation, we can estimate the accuracy of the fit of a
linear regression model to the actual data using a mean
squared error function.
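
A minimal sketch, assuming numpy is available, of fitting the linear model above by least squares and measuring the fit with mean squared error:

import numpy as np

def fit_linear(X, y):
    # prepend a column of ones so that c0 acts as the intercept
    A = np.column_stack([np.ones(len(X)), X])
    coeffs, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coeffs

def mse(X, y, coeffs):
    A = np.column_stack([np.ones(len(X)), X])
    residuals = y - A @ coeffs     # estimates of the random error term
    return float(np.mean(residuals ** 2))
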
 A commonly used regression technique is
called logistic regression. The logistic curve has the form

p = e^(c0 + c1x1) / (1 + e^(c0 + c1x1))

The logistic curve gives a value between 0 and 1, so it can be
interpreted as the probability of class membership. As with linear
regression, it can be used when classification into two classes is
desired.
To perform the regression, the logarithmic function can be applied
to obtain the logit (log-odds) function

loge(p / (1 − p)) = c0 + c1x1

Here p is the probability of being in the class and 1 − p is the
probability that it is not.
The regression process chooses values for c0 and c1 that maximize
the probability of observing the given values.
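
A small illustrative sketch of one-variable logistic regression: c0 and c1 are chosen to maximize the log-likelihood of the observed 0/1 labels by gradient ascent (the step size and iteration count here are assumptions, not from the text):

import math

def fit_logistic(xs, ys, lr=0.1, steps=5000):
    c0 = c1 = 0.0
    for _ in range(steps):
        g0 = g1 = 0.0
        for x, y in zip(xs, ys):
            p = 1.0 / (1.0 + math.exp(-(c0 + c1 * x)))   # logistic curve
            g0 += y - p          # gradient of the log-likelihood w.r.t. c0
            g1 += (y - p) * x    # gradient of the log-likelihood w.r.t. c1
        c0 += lr * g0 / len(xs)
        c1 += lr * g1 / len(xs)
    return c0, c1
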
Bayesian Classification

 Assuming that the contributions of all attributes are
independent and that each contributes equally to the
classification problem, a simple classification scheme
called naive Bayes can be used.

 Training data can be used to determine P(Cj), P(xi | Cj),
and P(xi). From these values, Bayes' theorem allows us
to estimate the posterior probability P(Cj | xi) and then
P(Cj | ti).

P(ti | Cj) = ∏k=1..n P(xik | Cj)

where the product is taken over the n attribute values of ti.

 To calculate P(ti), we can estimate the likelihood that ti
is in each class.

 The posterior probability P(Cj | ti) is then found for
each class. The class with the highest probability is the
one chosen for the tuple.

 Only one scan of the training data is required. In
simple relationships, the technique often does yield
good results.

 The technique does not handle continuous data.
Dividing the continuous values into ranges could be
used to solve this problem, as in the sketch below.
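
A hedged sketch of the naive Bayes steps above: the priors P(Cj) and conditionals P(xik | Cj) are estimated by counting over the training tuples, and the class with the highest posterior is chosen. Continuous attributes are assumed to be pre-binned into ranges as the text suggests, and no smoothing is applied:

from collections import defaultdict

def train_naive_bayes(tuples, labels):
    class_counts = defaultdict(int)
    attr_counts = defaultdict(int)   # (class, position, value) -> count
    for t, c in zip(tuples, labels):
        class_counts[c] += 1
        for k, v in enumerate(t):
            attr_counts[(c, k, v)] += 1
    return class_counts, attr_counts, len(labels)

def classify(t, model):
    class_counts, attr_counts, n = model
    best_class, best_post = None, -1.0
    for c, cc in class_counts.items():
        post = cc / n                                 # prior P(Cj)
        for k, v in enumerate(t):
            # P(xik | Cj); an unseen value yields 0 since no smoothing is used
            post *= attr_counts[(c, k, v)] / cc
        if post > best_post:
            best_class, best_post = c, post
    return best_class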
