UNIT VI: Classification and Prediction
Issues regarding classification and prediction
Classification by decision tree induction
Bayesian Classification
Rule-Based Classification
Classification by backpropagation
Support Vector Machines
Associative Classification
Lazy Learners
Other Classification Methods
Prediction
Classification accuracy
Classification—A Two-Step Process

1. Model construction: describing a set of predetermined classes
   - Each tuple/sample is assumed to belong to a predefined class, as determined by the class label attribute
   - The set of tuples used for model construction: training set
   - The model is represented as classification rules, decision trees, or mathematical formulae
2. Model usage: for classifying future or unknown objects
   - Estimate accuracy of the model
     - The known label of a test sample is compared with the classified result from the model
     - Accuracy rate is the percentage of test set samples that are correctly classified by the model
     - Test set is independent of training set, otherwise over-fitting will occur

N-fold Cross-validation

- In order to solve the over-fitting problem, n-fold cross-validation is usually used
- For example, 10-fold cross-validation:
  - Divide the whole training dataset into 10 equal parts
  - Take the first part away, train the model on the remaining 9 portions
  - After the model is trained, feed the first part in as the testing dataset and obtain the accuracy
  - Repeat steps two and three, but take the second part away, and so on (a short sketch follows below)
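The procedure above maps directly onto a short script. The following is a minimal sketch, assuming scikit-learn is available; the iris dataset and the decision tree learner are illustrative stand-ins for whatever data and classifier are being evaluated.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)                 # any labeled dataset
kf = KFold(n_splits=10, shuffle=True, random_state=0)

accuracies = []
for train_idx, test_idx in kf.split(X):
    model = DecisionTreeClassifier().fit(X[train_idx], y[train_idx])  # train on 9 folds
    accuracies.append(model.score(X[test_idx], y[test_idx]))          # test on the held-out fold

print("mean accuracy over 10 folds:", np.mean(accuracies))
```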
Classification Process (1): Model Construction

Training Data → Classification Algorithm → Classifier (Model)

Training data:
NAME     RANK            YEARS   Loan Sanc
Mike     Assistant Prof  3       no
Mary     Assistant Prof  7       yes
Bill     Professor       2       yes
Jim      Associate Prof  7       yes
Dave     Assistant Prof  6       no
Anne     Associate Prof  3       no

Learned model:
IF rank = 'professor' OR years > 6 THEN loan sanc = 'yes'

Classification Process (2): Use the Model in Prediction

Testing Data / Unseen Data → Classifier → Loan sanc?

Testing data:
NAME     RANK            YEARS   Loan Sanc
Tom      Assistant Prof  2       no
Merlisa  Associate Prof  7       no
George   Professor       5       yes
Joseph   Assistant Prof  7       yes

Unseen tuple: (Jeff, Professor, 4) → Loan sanc?
Supervised vs. Unsupervised Learning

- Supervised learning (classification)
  - Supervision: the training data (observations, measurements, etc.) are accompanied by labels indicating the class of the observations
  - New data is classified based on the training set
- Unsupervised learning (clustering)
  - The class labels of the training data are unknown
  - If we did not have the class data available for the training set, we could use clustering to try to determine "groups of like tuples"

Issues Regarding Classification and Prediction: Data Preparation

Preprocessing steps may be applied to the data to help improve the accuracy, efficiency, and scalability of the classification or prediction process.

- Data cleaning
  - Preprocess data in order to reduce noise and handle missing values; this step can help reduce confusion during learning
- Relevance analysis (feature selection)
  - Remove irrelevant attributes (attribute subset selection) or redundant attributes (those with a strong correlation with other attributes)
- Data transformation
  - Generalize (concept hierarchies may be used for this purpose) and/or normalize the data (so that values fall within a small specified range, such as -1.0 to 1.0, or 0.0 to 1.0)
Comparing Classification and Prediction Methods

- Accuracy: the ability of a given classifier/predictor to correctly predict the class label/value of new or previously unseen data
- Speed: cost to construct the model; cost to use the model
- Robustness: handling noise and missing values
- Scalability: efficiency in disk-resident databases with large amounts of data
- Interpretability: understanding and insight provided by the model

Classification by Decision Tree Induction

- Internal node denotes a test on an attribute
- Branch represents an outcome of the test
- Leaf nodes represent class labels or class distribution
- Basic algorithm (a greedy algorithm): the tree is constructed in a top-down recursive divide-and-conquer manner
- During the 1970s and early 1980s, J. Ross Quinlan, a researcher in machine learning, developed a decision tree algorithm known as ID3; it was followed by C4.5 and CART
Decision Tree Example

(Figure: an example decision tree.)
Algorithm for Decision Tree Induction (pseudocode)

Algorithm GenDecTree(Sample S, Attlist A)
1. Create a node N.
2. If all samples in S are of the same class C, then label N with C; terminate.
3. If A is empty, then label N with the most common class C in S (majority voting); terminate.
4. Select a ∈ A with the highest information gain; label N with a.
5. For each value v of a:
   a. Grow a branch from N with condition a = v.
   b. Let Sv be the subset of samples in S with a = v.
   c. If Sv is empty, then attach a leaf labeled with the most common class in S.
   d. Else attach the node generated by GenDecTree(Sv, A - {a}).

Input: training dataset (buys_computer)

age     income  student  credit_rating  buys_computer
<=30    high    no       fair           no
<=30    high    no       excellent      no
31…40   high    no       fair           yes
>40     medium  no       fair           yes
>40     low     yes      fair           yes
>40     low     yes      excellent      no
31…40   low     yes      excellent      yes
<=30    medium  no       fair           no
<=30    low     yes      fair           yes
>40     medium  yes      fair           yes
<=30    medium  yes      excellent      yes
31…40   medium  no       excellent      yes
31…40   high    yes      fair           yes
>40     medium  no       excellent      no
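The pseudocode can be turned into a small runnable sketch. The following Python version is an illustrative ID3-style implementation, not code from the slides; the function names and the dictionary-based tree representation are assumptions.

```python
import math
from collections import Counter

def entropy(rows, target="buys_computer"):
    counts = Counter(r[target] for r in rows)
    total = len(rows)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def info_gain(rows, attr, target="buys_computer"):
    total = len(rows)
    remainder = 0.0
    for v in set(r[attr] for r in rows):
        subset = [r for r in rows if r[attr] == v]
        remainder += len(subset) / total * entropy(subset, target)
    return entropy(rows, target) - remainder

def gen_dec_tree(rows, attrs, target="buys_computer"):
    classes = [r[target] for r in rows]
    if len(set(classes)) == 1:                       # step 2: all samples in one class
        return classes[0]
    if not attrs:                                    # step 3: majority voting
        return Counter(classes).most_common(1)[0][0]
    best = max(attrs, key=lambda a: info_gain(rows, a, target))   # step 4
    tree = {best: {}}
    for v in set(r[best] for r in rows):             # step 5: one branch per value
        subset = [r for r in rows if r[best] == v]
        tree[best][v] = gen_dec_tree(subset, [a for a in attrs if a != best], target)
    return tree

# rows would hold the 14 tuples of the table above, e.g.
# {"age": "<=30", "income": "high", "student": "no",
#  "credit_rating": "fair", "buys_computer": "no"}, ...
# tree = gen_dec_tree(rows, ["age", "income", "student", "credit_rating"])
```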
- Assume there exist several possible split values for each attribute
- May need other tools, such as clustering, to get the possible split values
- Can be modified for categorical attributes

Attribute Selection Measure to determine the splitting criterion

- Tells us which branches to grow from node N with respect to the outcomes of the chosen test
  (a) If A is discrete-valued, then one branch is grown for each known value of A.
  (b) If A is continuous-valued, then two branches are grown, corresponding to A <= split_point and A > split_point.
  (c) If A is discrete-valued and a binary tree must be produced, then the test is of the form A ∈ S_A, where S_A is the splitting subset for A.
Attribute Selection Measure: Information Gain (ID3/C4.5)

- Select the attribute with the highest information gain
- Let p_i be the probability that an arbitrary tuple in D belongs to class C_i, estimated by |C_i,D| / |D|
- Expected information (entropy) needed to classify a tuple in D:
  Info(D) = - Σ_{i=1..m} p_i log2(p_i)
- Information needed (after using A to split D into v partitions) to classify D:
  Info_A(D) = Σ_{j=1..v} (|D_j| / |D|) × Info(D_j)
- Information gained by branching on attribute A:
  Gain(A) = Info(D) - Info_A(D)

Two-class case:
- Assume there are two classes, P and N, and the set of examples S contains p elements of class P and n elements of class N
- The amount of information needed to decide if an arbitrary example in S belongs to P or N is defined as
  I(p, n) = - (p / (p + n)) log2(p / (p + n)) - (n / (p + n)) log2(n / (p + n))
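As a worked check of these formulas on the buys_computer data above (9 "yes" and 5 "no" tuples, split by age), the short sketch below computes Info(D) ≈ 0.940 and Gain(age) ≈ 0.246 directly from the counts in the table.

```python
from math import log2

def info(counts):
    """Expected information for a list of class counts, ignoring empty classes."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c)

info_D = info([9, 5])                                   # ~0.940 bits for 9 yes / 5 no
# partitions induced by age: <=30 -> (2 yes, 3 no), 31..40 -> (4 yes, 0 no), >40 -> (3 yes, 2 no)
info_age = (5/14) * info([2, 3]) + (4/14) * info([4, 0]) + (5/14) * info([3, 2])
gain_age = info_D - info_age                            # ~0.246 bits
print(round(info_D, 3), round(info_age, 3), round(gain_age, 3))
```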
Output: A Decision Tree for "buys_computer"

(Figure: decision tree with root node age?; the branch age <= 30 leads to a test on student?, the branch 31..40 leads directly to the leaf "yes", and the branch > 40 leads to a test on credit rating?)

Gain Ratio for Attribute Selection (C4.5)

- The information gain measure is biased towards attributes with a large number of values
- C4.5 (a successor of ID3) uses gain ratio to overcome the problem (normalization to information gain):
  SplitInfo_A(D) = - Σ_{j=1..v} (|D_j| / |D|) log2(|D_j| / |D|)
  GainRatio(A) = Gain(A) / SplitInfo_A(D)
- Example: a test on income splits the data into three partitions, namely low, medium, and high, containing 4, 6, and 4 tuples respectively:
  SplitInfo_income(D) = -(4/14) log2(4/14) - (6/14) log2(6/14) - (4/14) log2(4/14) ≈ 1.557
  (reproduced in the sketch below)
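A quick computation of SplitInfo_income(D) from the partition sizes (4, 6, 4); the helper name is illustrative.

```python
from math import log2

def split_info(sizes):
    total = sum(sizes)
    return -sum(s / total * log2(s / total) for s in sizes)

# ~1.557; GainRatio(income) = Gain(income) / SplitInfo_income(D)
print(round(split_info([4, 6, 4]), 3))
```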
Tree Pruning

1. Prepruning: halt tree construction early, i.e. do not split a node if this would result in the goodness measure falling below a threshold
   - Difficult to choose an appropriate threshold
2. Postpruning: remove branches from a "fully grown" tree; the leaf is labeled with the most frequent class among the subtree being replaced

Cost complexity pruning (CART):
- Cost complexity is a function of the number of leaves in the tree and the error rate of the tree (the percentage of tuples misclassified by the tree)
- It starts from the bottom of the tree; for each internal node N, it compares the cost complexity of the subtree at N before and after pruning
- A pruning set of class-labeled tuples is used to estimate cost complexity

Pessimistic pruning is used by C4.5. In general, a set of data different from the training data can be used to decide which is the "best pruned tree".
Scalability and Decision Tree Induction

- The efficiency of existing decision tree algorithms, such as ID3, C4.5, and CART, has been well established for relatively small data sets
- In data mining applications, very large training sets of millions of tuples are common
- Mining of very large real-world databases is the greater issue: the algorithms become inefficient due to swapping of the training tuples in and out of main and cache memories
- Scalable approaches, capable of handling training data that are too large to fit in memory, are required

Scalable Decision Tree Induction Methods in Data Mining Studies

- SLIQ (EDBT'96 — Mehta et al.): builds an index for each attribute; only the class list and the current attribute list reside in memory
- SPRINT (VLDB'96 — J. Shafer et al.): constructs an attribute list data structure
- PUBLIC (VLDB'98 — Rastogi & Shim): integrates tree splitting and tree pruning: stops growing the tree earlier
- RainForest (VLDB'98 — Gehrke, Ramakrishnan & Ganti): separates the scalability aspects from the criteria that determine the quality of the tree
SLIQ

- SLIQ employs disk-resident attribute lists and a single memory-resident class list
- Each attribute has an associated attribute list, indexed by RID (a record identifier)
- Each tuple is represented by a linkage of one entry in each attribute list to an entry in the class list
- The class list remains in memory because it is often accessed and modified in the building and pruning phases

SPRINT

- SPRINT uses a different attribute list data structure that holds the class and RID information
- When a node is split, the attribute lists are partitioned and distributed among the resulting child nodes accordingly
- When a list is partitioned, the order of the records in the list is maintained
Bayesian Classification

Bayes' Theorem: for a data sample X and class hypothesis C_i,
  P(C_i | X) = P(X | C_i) P(C_i) / P(X)
Naïve Bayesian Classification

From Bayes' theorem:
  P(C_i | X) = P(X | C_i) P(C_i) / P(X)

3. P(X) is constant for all classes, so only P(X | C_i) P(C_i) needs to be maximized.
   - If the class prior probabilities are not known, assume all classes to be equally likely and therefore maximize P(X | C_i); otherwise maximize P(X | C_i) P(C_i).
4. Naïve assumption: attribute independence (class conditional independence):
   P(X | C_i) = P(x_1, …, x_n | C_i) = Π_{k=1..n} P(x_k | C_i)
5. In order to classify an unknown sample X, evaluate P(X | C_i) P(C_i) for each class C_i. Sample X is assigned to class C_i if
   P(X | C_i) P(C_i) > P(X | C_j) P(C_j) for 1 ≤ j ≤ m, j ≠ i.
Naïve Bayesian Classification — Example

Classify X = (age <= 30, income = medium, student = yes, credit_rating = fair):

P(age<=30 | buys_comp=Y) = 2/9 = 0.222          P(age<=30 | buys_comp=N) = 3/5 = 0.600
P(income=medium | buys_comp=Y) = 4/9 = 0.444    P(income=medium | buys_comp=N) = 2/5 = 0.400
P(student=Y | buys_comp=Y) = 6/9 = 0.667        P(student=Y | buys_comp=N) = 1/5 = 0.200
P(credit_rating=FAIR | buys_comp=Y) = 6/9 = 0.667    P(credit_rating=FAIR | buys_comp=N) = 2/5 = 0.400

P(X | buys_comp=Y) = 0.222 * 0.444 * 0.667 * 0.667 = 0.044
P(X | buys_comp=N) = 0.600 * 0.400 * 0.200 * 0.400 = 0.019

P(X | buys_comp=Y) P(buys_comp=Y) = 0.044 * 0.643 = 0.028
P(X | buys_comp=N) P(buys_comp=N) = 0.019 * 0.357 = 0.007

CONCLUSION: X buys a computer.
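A small sketch reproducing the calculation above; the dictionaries simply hold the conditional probabilities already listed, and the variable names are illustrative.

```python
p_yes = {"age<=30": 2/9, "income=medium": 4/9, "student=yes": 6/9, "credit=fair": 6/9}
p_no  = {"age<=30": 3/5, "income=medium": 2/5, "student=yes": 1/5, "credit=fair": 2/5}

prior_yes, prior_no = 9/14, 5/14            # class priors from the 14 training tuples

def product(probs):
    out = 1.0
    for v in probs.values():
        out *= v
    return out

score_yes = product(p_yes) * prior_yes       # ~0.028
score_no  = product(p_no) * prior_no         # ~0.007
print("buys_computer = yes" if score_yes > score_no else "buys_computer = no")
```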
Play-Tennis Example: Training Data and Class-Conditional Probabilities

Outlook   Temperature  Humidity  Windy  Class
sunny     hot          high      false  N
sunny     hot          high      true   N
overcast  hot          high      false  P
rain      mild         high      false  P
rain      cool         normal    false  P
rain      cool         normal    true   N
overcast  cool         normal    true   P
sunny     mild         high      false  N
sunny     cool         normal    false  P
rain      mild         normal    false  P
sunny     mild         normal    true   P
overcast  mild         high      true   P
overcast  hot          normal    false  P
rain      mild         high      true   N

Class priors: P(p) = 9/14, P(n) = 5/14

outlook:      P(sunny|p) = 2/9      P(sunny|n) = 3/5
              P(overcast|p) = 4/9   P(overcast|n) = 0
              P(rain|p) = 3/9       P(rain|n) = 2/5
temperature:  P(hot|p) = 2/9        P(hot|n) = 2/5
              P(mild|p) = 4/9       P(mild|n) = 2/5
              P(cool|p) = 3/9       P(cool|n) = 1/5
humidity:     P(high|p) = 3/9       P(high|n) = 4/5
              P(normal|p) = 6/9     P(normal|n) = 2/5
windy:        P(true|p) = 3/9       P(true|n) = 3/5
              P(false|p) = 6/9      P(false|n) = 2/5
Play-Tennis Example: Classifying X

- An unseen sample X = <outlook = rain, Temperature = hot, Humidity = high, Windy = false>
- P(X|p)·P(p) = P(rain|p)·P(hot|p)·P(high|p)·P(false|p)·P(p) = 3/9 × 2/9 × 3/9 × 6/9 × 9/14 = 0.010582
- P(X|n)·P(n) = P(rain|n)·P(hot|n)·P(high|n)·P(false|n)·P(n) = 2/5 × 2/5 × 4/5 × 2/5 × 5/14 = 0.018286
- Sample X is classified in class n (don't play)

Another Example: Mammals vs. Non-mammals

Name           Give Birth  Can Fly  Live in Water  Have Legs  Class
human          yes         no       no             yes        mammals
python         no          no       no             no         non-mammals
salmon         no          no       yes            no         non-mammals
whale          yes         no       yes            no         mammals
frog           no          no       sometimes      yes        non-mammals
komodo         no          no       no             yes        non-mammals
bat            yes         yes      no             yes        mammals
pigeon         no          yes      no             yes        non-mammals
cat            yes         no       no             yes        mammals
leopard shark  yes         no       yes            no         non-mammals
turtle         no          no       sometimes      yes        non-mammals
penguin        no          no       sometimes      yes        non-mammals
porcupine      yes         no       no             yes        mammals
eel            no          no       yes            no         non-mammals
salamander     no          no       sometimes      yes        non-mammals
gila monster   no          no       no             yes        non-mammals
platypus       no          no       no             yes        mammals
owl            no          yes      no             yes        non-mammals
dolphin        yes         no       yes            no         mammals
eagle          no          yes      no             yes        non-mammals

(A test tuple described by Give Birth, Can Fly, Live in Water, and Have Legs is classified in the same way.)

Bayesian Belief Networks

Instead of requiring that all the attributes be conditionally independent given the class, a Bayesian belief network (BBN) allows us to specify which pairs of attributes are conditionally independent.
- In the example network there are 6 Boolean variables
- Arcs allow representation of causal knowledge, e.g., having lung cancer is influenced by family history and smoking
- LungCancer is conditionally independent of Emphysema, given its parents FamilyHistory and Smoker
- PositiveXRay is independent of whether the patient has a family history or is a smoker, given that we know the patient has lung cancer: once we know the outcome of LungCancer, FamilyHistory and Smoker do not provide any additional information about PositiveXRay
- A BBN has a Conditional Probability Table (CPT) for each variable in the DAG; the CPT for a variable Y specifies the conditional distribution P(Y | parents(Y))
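As an illustration of how a CPT might be represented, here is a sketch for the variable LungCancer with parents FamilyHistory and Smoker; the probability values are placeholders, not the numbers from the original figure.

```python
# Illustrative CPT: P(LungCancer = yes | FamilyHistory, Smoker).
# The entries below are placeholder probabilities, not the figure's actual values.
cpt_lung_cancer = {
    (True,  True):  0.8,
    (True,  False): 0.5,
    (False, True):  0.7,
    (False, False): 0.1,
}

def p_lung_cancer(value, family_history, smoker):
    p_yes = cpt_lung_cancer[(family_history, smoker)]
    return p_yes if value else 1.0 - p_yes

print(p_lung_cancer(True, family_history=True, smoker=False))   # 0.5
```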
Neural Network Learning: A Multilayer Feed-Forward Network

- The inputs are fed simultaneously into the units making up the input layer
- The weighted outputs of these units are fed into the hidden layer
- The weighted outputs of the last hidden layer are inputs to units making up the output layer
- The units in the hidden layers and output layer are sometimes referred to as neurodes, due to their symbolic biological basis, or as output units
- A network containing two layers of output units is called a two-layer neural network, and so on
- The network is feed-forward in that none of the weights cycles back to an input unit or to an output unit of a previous layer
- The network is fully connected, i.e., each unit provides input to each unit in the next forward layer
Defining a Network Topology

- First decide the network topology: the number of units in the input layer, the number of hidden layers (if > 1), the number of units in each hidden layer, and the number of units in the output layer
- Normalize the input values for each attribute measured in the training tuples to [0.0–1.0]
- Discrete-valued attributes may be encoded with one input unit per domain value, each initialized to 0
- Output: for classification with more than two classes, one output unit per class is used
- Once a network has been trained, if its accuracy is unacceptable, repeat the training process with a different network topology or a different set of initial weights
- There are no clear rules for the number of hidden layers; network design is a trial-and-error process

Steps in the Backpropagation Algorithm

STEP ONE: Initialize the weights and biases.
- The weights in the network are initialized to random numbers from the interval [-1, 1]
- Each unit has a bias associated with it; the biases are similarly initialized to random numbers from the interval [-1, 1]

STEP TWO: Feed the training sample.
Steps in the Backpropagation Algorithm (contd.)

STEP THREE: Propagate the inputs forward; compute the net input and output of each unit in the hidden and output layers.
STEP FOUR: Backpropagate the error.
STEP FIVE: Update weights and biases to reflect the propagated errors.
STEP SIX: Check the terminating conditions.

Propagation through a Hidden-Layer Unit (One Node)

(Figure: inputs x0, x1, …, xn enter unit j through weights w0j, w1j, …, wnj together with bias θj; the weighted sum is passed through an activation function f to produce the output y.)

- The inputs to unit j are outputs from the previous layer. These are multiplied by their corresponding weights in order to form a weighted sum, which is added to the bias associated with unit j.
- A nonlinear activation function f is applied to the net input.
Propagate the Inputs Forward

- For unit j in the input layer, its output is equal to its input, that is, O_j = I_j for input unit j.
- Given a unit j in a hidden or output layer, the net input is computed as
  I_j = Σ_i w_ij O_i + θ_j
  where w_ij is the weight of the connection from unit i in the previous layer to unit j, O_i is the output of unit i, and θ_j is the bias of unit j.
- Each unit in the hidden and output layers takes its net input and then applies an activation function. The function symbolizes the activation of the neuron represented by the unit. It is also called a logistic, sigmoid, or squashing function.
- Given a net input I_j to unit j, the output is O_j = f(I_j).
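A minimal sketch of these two formulas. The example weight values mirror those that appear in the worked backpropagation example later in this unit (w14 = 0.2, output O4 ≈ 0.332); the remaining initial values are assumed for illustration.

```python
import math

def net_input(outputs_prev, weights, bias):
    # I_j = sum_i w_ij * O_i + theta_j
    return sum(w * o for w, o in zip(weights, outputs_prev)) + bias

def sigmoid(i_j):
    # O_j = f(I_j) = 1 / (1 + e^(-I_j))
    return 1.0 / (1.0 + math.exp(-i_j))

# unit 4 of the example network: inputs (1, 0, 1), assumed weights and bias
i_4 = net_input([1.0, 0.0, 1.0], [0.2, 0.4, -0.5], bias=-0.4)   # -> -0.7
print(round(sigmoid(i_4), 3))                                    # -> 0.332
```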
Update Weights and Biases

- Weights and biases are updated by the following equations, where l is a constant between 0.0 and 1.0 reflecting the learning rate (fixed for the implementation):
  w_ij = w_ij + (l) Err_j O_i
  θ_j = θ_j + (l) Err_j
- Weights and biases are updated after the presentation of each sample.
Terminating Conditions

Training stops when:
- all Δw_ij in the previous epoch are below some threshold, or
- the percentage of samples misclassified in the previous epoch is below some threshold, or
- a prespecified number of epochs has expired.

Backpropagation Formulas

- Output layer error: Err_k = O_k (1 - O_k)(T_k - O_k)
- Hidden layer error:  Err_j = O_j (1 - O_j) Σ_k Err_k w_jk
- Unit output:         O_j = 1 / (1 + e^(-I_j))
- Input vector: x_i; Output vector: O_k
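These formulas translate directly into small helper functions. In the sketch below, the output value O6 = 0.474 is taken from the standard worked example (it is consistent with the error value Err6 = 0.1311 used in the update table that follows); the helper names are illustrative.

```python
def output_error(o_k, t_k):
    # Err_k = O_k (1 - O_k)(T_k - O_k)
    return o_k * (1 - o_k) * (t_k - o_k)

def hidden_error(o_j, downstream):
    # Err_j = O_j (1 - O_j) * sum_k Err_k * w_jk, downstream = [(Err_k, w_jk), ...]
    return o_j * (1 - o_j) * sum(err_k * w_jk for err_k, w_jk in downstream)

def updated_weight(w_ij, l, err_j, o_i):
    # w_ij := w_ij + l * Err_j * O_i
    return w_ij + l * err_j * o_i

def updated_bias(theta_j, l, err_j):
    # theta_j := theta_j + l * Err_j
    return theta_j + l * err_j

err_6 = output_error(o_k=0.474, t_k=1.0)     # ~0.1311, as used in the update table below
print(round(err_6, 4))
```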
Example of Backpropagation

- Network topology: 3 input units, 2 hidden neurons, 1 output unit
- Initialize weights: random values from -1.0 to 1.0
- A bias is added to the hidden and output layers; initialize the biases to random values from -1.0 to 1.0
Example (contd.): Calculation of Weight and Bias Updates

Update rules: w_ij = w_ij + (l) Err_j O_i and θ_j = θ_j + (l) Err_j, with learning rate l = 0.9

Weight/Bias   New value
w46           -0.3 + 0.9(0.1311)(0.332)  = -0.261
w56           -0.2 + 0.9(0.1311)(0.525)  = -0.138
w14            0.2 + 0.9(-0.0087)(1)     =  0.192
w15           -0.3 + 0.9(-0.0065)(1)     = -0.306
θ6             0.1 + 0.9(0.1311)         =  0.218
…and similarly for the remaining weights and biases.

Network Pruning and Rule Extraction

Network pruning:
- A fully connected network will be hard to articulate
- N input nodes, h hidden nodes, and m output nodes lead to h(m + N) weights
- Pruning: remove some of the links without affecting the classification accuracy of the network
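A quick numerical check of the tabulated updates (learning rate l = 0.9):

```python
l = 0.9
print(round(-0.3 + l * 0.1311 * 0.332, 3))   # w46    -> -0.261
print(round(-0.2 + l * 0.1311 * 0.525, 3))   # w56    -> -0.138
print(round( 0.1 + l * 0.1311, 3))           # theta6 ->  0.218
```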
Support Vector Machines

- Let the data set D be given as (X1, y1), (X2, y2), …, (X|D|, y|D|), where each Xi is a training tuple with associated class label yi.
- Each yi can take one of two values, either +1 or -1 (i.e., yi ∈ {+1, -1}), corresponding to the classes buys_computer = yes and buys_computer = no, respectively.
SVM—General Philosophy

- The figure shows two possible separating hyperplanes and their associated margins (small margin vs. large margin)
- The hyperplane with the larger margin is expected to be more accurate at classifying future data tuples
- The hyperplane with the largest margin is called the maximum marginal hyperplane (MMH)

SVM—Linearly Separable

- A separating hyperplane can be written as
  W · X + b = 0
  where W = {w1, w2, …, wn} is a weight vector and b is a scalar (bias)
- For 2-D data it can be written as
  w0 + w1 x1 + w2 x2 = 0
- The hyperplanes defining the sides of the margin:
  H1: w0 + w1 x1 + w2 x2 ≥ 1 for yi = +1, and
  H2: w0 + w1 x1 + w2 x2 ≤ -1 for yi = -1
- Any training tuples that fall on hyperplanes H1 or H2 (i.e., the sides defining the margin) are support vectors
- How does an SVM find the MMH and the support vectors? This becomes a constrained (convex) quadratic optimization problem: a quadratic objective function with linear constraints, solved by quadratic programming (QP) using Lagrangian multipliers and the Karush-Kuhn-Tucker (KKT) conditions
Support Vectors

- The support vectors define the maximum margin hyperplane
- All other instances can be deleted without changing its position and orientation
- SVM can also be used for classifying multiple (> 2) classes and for regression analysis (with additional user parameters); a brief sketch follows
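As a practical illustration (not part of the slides), a linear SVM can be trained with scikit-learn and its hyperplane and support vectors read back; the toy blob dataset is an assumption.

```python
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, y = make_blobs(n_samples=40, centers=2, random_state=0)   # toy two-class data
clf = SVC(kernel="linear", C=1.0).fit(X, y)

print("W =", clf.coef_[0])               # weight vector of W . X + b = 0
print("b =", clf.intercept_[0])          # bias term
print("support vectors:", clf.support_vectors_)
```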
Associative Classification

- CMAR (Classification based on Multiple Association Rules)
- CPAR (Classification based on Predictive Association Rules)
What Is Prediction?

- Prediction is different from classification:
  - Classification refers to predicting a categorical class label
  - Prediction models continuous-valued functions
- Major method for prediction: regression
  - models the relationship between one or more independent or predictor variables and a dependent or response variable
- Regression analysis:
  - Linear and multiple regression
  - Non-linear regression
  - Other regression methods: generalized linear model, Poisson regression, log-linear models, regression trees

Linear Regression

- Linear regression involves a response variable y and a single predictor variable x:
  y = w0 + w1 x
  where w0 (y-intercept) and w1 (slope) are regression coefficients
- Method of least squares estimates the best-fitting straight line:
  w1 = Σ_{i=1..|D|} (x_i - x̄)(y_i - ȳ) / Σ_{i=1..|D|} (x_i - x̄)²
  w0 = ȳ - w1 x̄
- Multiple linear regression involves more than one predictor variable
  - Training data is of the form (X1, y1), (X2, y2), …, (X|D|, y|D|)
  - Ex. for 2-D data we may have: y = w0 + w1 x1 + w2 x2
  - Solvable by an extension of the least squares method or using SAS, SPSS, S-Plus
  - Many nonlinear functions can be transformed into the above
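A minimal least-squares sketch implementing the formulas for w1 and w0 above; the (x, y) pairs here are illustrative values, not the full salary table from the example that follows.

```python
xs = [3, 8, 9, 13, 3]          # e.g., years of experience (illustrative values)
ys = [30, 57, 64, 72, 36]      # e.g., salary in $1000s (illustrative values)

x_bar, y_bar = sum(xs) / len(xs), sum(ys) / len(ys)
w1 = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / \
     sum((x - x_bar) ** 2 for x in xs)
w0 = y_bar - w1 * x_bar
print(f"y = {w0:.1f} + {w1:.1f} x")
```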
Nonlinear Regression

- "How can we model data that does not show a linear dependence?"
- Some nonlinear models can be modeled by a polynomial function
- Such models can be built by adding polynomial terms to the basic linear model

Example

- A set of paired data where x is the number of years of work experience of a college graduate and y is the corresponding salary of the graduate (in $1000's); for example, x = 3 years corresponds to y = 30
- Fitting Y = W0 + W1 X with the least-squares formula for W1 above gives W1 = 3.5
Classifier Accuracy Measures

- The accuracy of a classifier on a given test set is the percentage of test set tuples that are correctly classified by the classifier
- The confusion matrix is a useful tool for analyzing how well your classifier can recognize tuples of different classes
- accuracy = sensitivity · pos/(pos + neg) + specificity · neg/(pos + neg)
- This model can also be used for cost-benefit analysis

Predictor Error Measures

- The mean squared error exaggerates the presence of outliers
- The (square) root mean-square error is popularly used; similarly, the root relative squared error
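A short sketch of these measures with illustrative counts and predictions (the numbers are placeholders, not from the slides):

```python
from math import sqrt

tp, tn, fp, fn = 80, 50, 10, 5             # placeholder confusion-matrix counts
pos, neg = tp + fn, tn + fp

sensitivity = tp / pos                      # true positive (recognition) rate
specificity = tn / neg                      # true negative rate
accuracy = sensitivity * pos / (pos + neg) + specificity * neg / (pos + neg)
print(round(accuracy, 3))                   # same as (tp + tn) / (pos + neg)

# root mean-square error for a predictor (illustrative values)
actual, predicted = [10.0, 12.0, 15.0], [11.0, 12.5, 13.0]
rmse = sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))
print(round(rmse, 3))
```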
Evaluating the Accuracy of a Classifier or Predictor

- Holdout method
  - The given data is randomly partitioned into two independent sets:
    - Training set (e.g., 2/3) for model construction
    - Test set (e.g., 1/3) for accuracy estimation
  - Random sampling: a variation of holdout; repeat holdout k times, accuracy = average of the accuracies obtained from each iteration
- Cross-validation (k-fold, where k = 10 is most popular)
  - Randomly partition the data into k mutually exclusive subsets D1, …, Dk, each of approximately equal size
  - At the i-th iteration, use Di as the test set and the others as the training set
  - Leave-one-out: k folds where k = number of tuples, for small-sized data
Boosting and Bagging

- Boosting increases classification accuracy
- Applicable to decision trees or Bayesian classifiers
- Learn a series of classifiers, where each classifier in the series pays more attention to the examples misclassified by its predecessor
- Boosting requires only linear time and constant space

Boosting Technique (II) — Algorithm

- Assign every example an equal weight 1/N
- For t = 1, 2, …, T do:
  - Obtain a hypothesis (classifier) h(t) under the weights w(t)
  - Calculate the error of h(t) and re-weight the examples based on the error
  - Normalize w(t+1) to sum to 1
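The loop above corresponds closely to AdaBoost-style re-weighting. The following sketch uses decision stumps from scikit-learn as the base classifier h(t); the dataset, the number of rounds T, and the specific alpha formula are illustrative choices, not prescribed by the slide.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
n, T = len(y), 5
w = np.full(n, 1.0 / n)                        # every example gets equal weight 1/N
classifiers, alphas = [], []

for t in range(T):
    stump = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
    pred = stump.predict(X)
    err = np.sum(w[pred != y]) / np.sum(w)     # weighted error of h(t)
    if err == 0 or err >= 0.5:
        break
    alpha = 0.5 * np.log((1 - err) / err)      # classifier weight (AdaBoost-style)
    # increase weights of misclassified examples, decrease the others
    w *= np.exp(alpha * np.where(pred != y, 1.0, -1.0))
    w /= w.sum()                               # normalize w(t+1) to sum to 1
    classifiers.append(stump)
    alphas.append(alpha)

print(len(classifiers), "weak classifiers trained")
```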
Lazy Learners

- Typical lazy (instance-based) learners include k-nearest neighbor classifiers and case-based reasoning
The k-Nearest Neighbor Algorithm

- All instances correspond to points in the n-dimensional space
- The nearest neighbors are defined in terms of Euclidean distance
- The target function could be discrete- or real-valued
- For discrete-valued targets, k-NN returns the most common value among the k training examples nearest to the query point xq
- Voronoi diagram: the decision surface induced by 1-NN for a typical set of training examples
  (Figure: positive and negative training points around a query point xq, with the Voronoi cells induced by 1-NN.)

Discussion on the k-NN Algorithm

- The k-NN algorithm for continuous-valued target functions: calculate the mean value of the k nearest neighbors
- Distance-weighted nearest neighbor algorithm:
  - Weight the contribution of each of the k neighbors according to their distance to the query point xq, giving greater weight to closer neighbors:
    w = 1 / d(xq, xi)²
- Robust to noisy data by averaging the k nearest neighbors
- Curse of dimensionality: the distance between neighbors could be dominated by irrelevant attributes
  - To overcome it, stretch the axes or eliminate the least relevant attributes
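A compact sketch of distance-weighted k-NN for a discrete-valued target, following the weighting w = 1 / d(xq, xi)² above; the helper names and toy points are illustrative.

```python
import math
from collections import defaultdict

def euclidean(a, b):
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def knn_predict(training, xq, k=3):
    # training: list of (point, label) pairs
    neighbors = sorted(training, key=lambda item: euclidean(item[0], xq))[:k]
    votes = defaultdict(float)
    for point, label in neighbors:
        d = euclidean(point, xq)
        votes[label] += 1.0 / (d ** 2 + 1e-9)   # closer neighbors get greater weight
    return max(votes, key=votes.get)

data = [((1.0, 1.0), "+"), ((1.2, 0.8), "+"), ((4.0, 4.2), "-"), ((4.5, 3.9), "-")]
print(knn_predict(data, (1.1, 1.0), k=3))       # -> "+"
```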
Case-Based Reasoning

- Also uses lazy evaluation and analyzes similar instances
- Difference: instances are not "points in a Euclidean space"
- Example: the water faucet problem in CADET (Sycara et al. '92)
- Methodology:
  - Instances are represented by rich symbolic descriptions (e.g., function graphs)
  - Multiple retrieved cases may be combined
  - Tight coupling between case retrieval, knowledge-based reasoning, and problem solving
- Research issues:
  - Indexing based on a syntactic similarity measure and, when that fails, backtracking and adapting to additional cases

Remarks on Lazy vs. Eager Learning

- Instance-based learning: lazy evaluation
- Decision-tree and Bayesian classification: eager evaluation
- Key differences:
  - A lazy method may consider the query instance xq when deciding how to generalize beyond the training data D
  - An eager method cannot, since it has already chosen a global approximation by the time it sees the query
- Efficiency: lazy methods spend less time training but more time predicting
- Accuracy:
  - A lazy method effectively uses a richer hypothesis space, since it uses many local linear functions to form its implicit global approximation to the target function
  - An eager method must commit to a single hypothesis that covers the entire instance space
Genetic Algorithms

- GA: based on an analogy to biological evolution
- Each rule is represented by a string of bits
- An initial population is created consisting of randomly generated rules
  - e.g., IF A1 AND NOT A2 THEN C2 can be encoded as 100
- Based on the notion of survival of the fittest, a new population is formed to consist of the fittest rules and their offspring
- The fitness of a rule is represented by its classification accuracy on a set of training examples
- Offspring are generated by crossover and mutation (see the sketch below)

Rough Set Approach

- Rough sets are used to approximately or "roughly" define equivalence classes
- A rough set for a given class C is approximated by two sets: a lower approximation (certain to be in C) and an upper approximation (cannot be described as not belonging to C)
- Finding the minimal subsets (reducts) of attributes (for feature reduction) is NP-hard, but a discernibility matrix is used to reduce the computation intensity
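A tiny sketch of the bit-string rule encoding with single-point crossover and bit-flip mutation (fitness evaluation and selection are omitted); the helpers are illustrative, not a specific library's API.

```python
import random

def crossover(parent1, parent2):
    # single-point crossover of two bit-string rules
    point = random.randint(1, len(parent1) - 1)
    return parent1[:point] + parent2[point:], parent2[:point] + parent1[point:]

def mutate(rule, rate=0.1):
    # flip each bit with a small probability
    return "".join(("1" if b == "0" else "0") if random.random() < rate else b for b in rule)

random.seed(0)
child1, child2 = crossover("100", "011")   # "IF A1 AND NOT A2 THEN C2" encoded as "100"
print(child1, child2, mutate(child1))
```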
Fuzzy Set Approaches