
UNIT VI
Classification and Prediction

About The Unit
- What is classification? What is prediction?
- Issues regarding classification and prediction
- Classification by decision tree induction
- Bayesian classification
- Rule-based classification
- Classification by back propagation
- Support vector machines
- Associative classification
- Lazy learners
- Other classification methods
- Prediction
- Classification accuracy

Classification and Prediction
- Classification and prediction are two forms of data analysis that can be used to extract models describing important data classes or to predict future data trends.
- Classification:
  - Predicts categorical class labels (discrete or nominal)
  - Classifies data (constructs a model) based on the training set and the values (class labels) of a classifying attribute, and uses the model to classify new data
- Prediction:
  - Models continuous-valued functions, i.e., predicts unknown or missing values
- For example:
  - Bank loan applicants are "safe" or "risky" - credit approval
  - Will a customer buy a new computer? - target marketing
  - Analyzing cancer data to predict which one of three specific treatments should apply - treatment effectiveness analysis

Classification is a Two-Step Process
1. Learning step:
   - A classifier is built
   - A classification algorithm builds the classifier by analyzing, or "learning from", a training set made up of database tuples and their associated class labels
2. Prediction step:
   - The model is used for classification
   - First, the predictive accuracy of the classifier is estimated
Classification—A Two-Step Process
1. Model construction: describing a set of predetermined classes
   - Each tuple/sample is assumed to belong to a predefined class, as determined by the class label attribute
   - The set of tuples used for model construction: the training set
   - The model is represented as classification rules, decision trees, or mathematical formulae
2. Model usage: for classifying future or unknown objects
   - Estimate the accuracy of the model
   - The known label of each test sample is compared with the classified result from the model
   - The accuracy rate is the percentage of test set samples that are correctly classified by the model
   - The test set is independent of the training set, otherwise over-fitting will occur

N-fold Cross-validation
- In order to address the over-fitting problem, n-fold cross-validation is usually used.
- For example, 10-fold cross-validation (see the sketch below):
  - Divide the whole training dataset into 10 equal parts
  - Take the first part away and train the model on the remaining 9 portions
  - After the model is trained, feed the first part in as the testing dataset and obtain the accuracy
  - Repeat the previous two steps, taking the second part away, and so on
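
A minimal sketch of this 10-fold procedure, assuming scikit-learn is available; X (features), y (class labels), and the decision tree base learner are illustrative placeholders, not part of the original slides.

```python
# Minimal 10-fold cross-validation sketch (X, y and the base learner are placeholders).
import numpy as np
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeClassifier

def ten_fold_accuracy(X, y):
    kf = KFold(n_splits=10, shuffle=True, random_state=0)
    accuracies = []
    for train_idx, test_idx in kf.split(X):
        model = DecisionTreeClassifier().fit(X[train_idx], y[train_idx])  # train on 9 parts
        accuracies.append(model.score(X[test_idx], y[test_idx]))          # test on the held-out part
    return np.mean(accuracies)                                            # average accuracy over the 10 folds
```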
Classification Process (1): Model Construction
- The training data is fed to a classification algorithm, which produces the classifier (model).

Training data:

  NAME   RANK            YEARS   Loan Sanc
  Mike   Assistant Prof  3       no
  Mary   Assistant Prof  7       yes
  Bill   Professor       2       yes
  Jim    Associate Prof  7       yes
  Dave   Assistant Prof  6       no
  Anne   Associate Prof  3       no

Learned model:
  IF rank = 'professor' OR years > 6 THEN loan sanc = 'yes'

Classification Process (2): Use the Model in Prediction
- The classifier is first applied to testing data to estimate its accuracy, and then to unseen data.

Testing data:

  NAME     RANK            YEARS   Loan Sanc
  Tom      Assistant Prof  2       no
  Merlisa  Associate Prof  7       no
  George   Professor       5       yes
  Joseph   Assistant Prof  7       yes

Unseen data: (Jeff, Professor, 4) -> Loan sanc? -> yes
Supervised vs. Unsupervised Learning
- Supervised learning (classification)
  - Supervision: the training data (observations, measurements, etc.) are accompanied by labels indicating the class of the observations
  - New data is classified based on the training set
- Unsupervised learning (clustering)
  - The class labels of the training data are unknown
  - If we did not have the class data available for the training set, we could use clustering to try to determine "groups of like tuples"

Issues regarding classification and prediction: Data Preparation
- Preprocessing steps may be applied to the data to help improve the accuracy, efficiency, and scalability of the classification or prediction process.
- Data cleaning
  - Preprocess data in order to reduce noise and handle missing values - this step can help reduce confusion during learning
- Relevance analysis (feature selection)
  - Remove irrelevant attributes (attribute subset selection) or redundant attributes (a strong correlation between attributes)
- Data transformation
  - Generalize (concept hierarchies may be used for this purpose) and/or normalize the data (so values fall within a small specified range, such as -1.0 to 1.0, or 0.0 to 1.0)

Issues regarding classification and prediction: Evaluating Classification Methods
- Accuracy
  - The accuracy of a classifier/predictor refers to its ability to correctly predict the class label/value of new or previously unseen data
- Speed
  - Cost to construct the model
  - Cost to use the model
- Robustness
  - Handling noise and missing values
- Scalability
  - Efficiency on disk-resident databases - large amounts of data
- Interpretability
  - Understanding and insight provided by the model

Classification by Decision Tree Induction
- Decision tree
  - A flow-chart-like tree structure
  - An internal node denotes a test on an attribute
  - A branch represents an outcome of the test
  - Leaf nodes represent class labels or class distributions
- Basic algorithm (a greedy algorithm)
  - The tree is constructed in a top-down recursive divide-and-conquer manner
- In the 1970s and early 1980s, J. Ross Quinlan, a researcher in machine learning, developed a decision tree algorithm known as ID3; C4.5 and CART followed.
Decision Tree Example
- "How are decision trees used for classification?"
  - Given a tuple, X, for which the associated class label is unknown, the attribute values of the tuple are tested against the decision tree.
  - A path is traced from the root to a leaf node, which holds the class prediction for that tuple.
- Decision trees can easily be converted to classification rules, e.g.
    IF age = youth AND student = yes THEN buys_computer = yes

Decision Tree Algorithm - Phases
- Decision tree generation consists of two phases.
- Phase 1: Tree construction
  - At the start, all the training examples are at the root
  - Attributes are categorical (if continuous-valued, they are discretized in advance)
  - Test attributes are selected on the basis of a heuristic or statistical measure (e.g., information gain)
  - Tuples are partitioned recursively based on the selected attributes
  - Tree pruning: identify and remove branches that reflect noise or outliers
- Phase 2: Use of the decision tree: classifying an unknown sample
  - Test the attribute values of the sample against the decision tree

Algorithm for Decision Tree Induction
- Conditions for stopping partitioning:
  - All samples for a given node belong to the same class
  - There are no remaining attributes for further partitioning - majority voting is employed for classifying the leaf
  - There are no samples left
Input: Training Dataset

  age     income  student  credit_rating  buys_computer
  <=30    high    no       fair           no
  <=30    high    no       excellent      no
  31…40   high    no       fair           yes
  >40     medium  no       fair           yes
  >40     low     yes      fair           yes
  >40     low     yes      excellent      no
  31…40   low     yes      excellent      yes
  <=30    medium  no       fair           no
  <=30    low     yes      fair           yes
  >40     medium  yes      fair           yes
  <=30    medium  yes      excellent      yes
  31…40   medium  no       excellent      yes
  31…40   high    yes      fair           yes
  >40     medium  no       excellent      no

Algorithm for Decision Tree Induction (pseudocode)

  Algorithm GenDecTree(Sample S, Attlist A)
  1. Create a node N.
  2. If all samples are of the same class C, then label N with C; terminate.
  3. If A is empty, then label N with the most common class C in S (majority voting); terminate.
  4. Select a ∈ A with the highest information gain; label N with a.
  5. For each value v of a:
     a. Grow a branch from N with condition a = v.
     b. Let Sv be the subset of samples in S with a = v.
     c. If Sv is empty, then attach a leaf labeled with the most common class in S.
     d. Else attach the node generated by GenDecTree(Sv, A - a).
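
A minimal runnable sketch of the GenDecTree pseudocode above, ID3-style; the dict-based row format and the helper names (entropy, info_gain, gen_dec_tree) are illustrative choices rather than part of the original slides.

```python
# ID3-style sketch of GenDecTree; each row is a dict such as
# {'age': '<=30', 'income': 'high', 'student': 'no',
#  'credit_rating': 'fair', 'buys_computer': 'no'}
import math
from collections import Counter

def entropy(rows, target):
    counts = Counter(r[target] for r in rows)
    total = len(rows)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def info_gain(rows, attr, target):
    total = len(rows)
    remainder = 0.0
    for v in set(r[attr] for r in rows):
        subset = [r for r in rows if r[attr] == v]
        remainder += len(subset) / total * entropy(subset, target)
    return entropy(rows, target) - remainder

def gen_dec_tree(rows, attrs, target):
    classes = [r[target] for r in rows]
    if len(set(classes)) == 1:                          # step 2: all samples in one class
        return classes[0]
    if not attrs:                                       # step 3: no attributes left -> majority vote
        return Counter(classes).most_common(1)[0][0]
    best = max(attrs, key=lambda a: info_gain(rows, a, target))   # step 4: highest information gain
    node = {best: {}}
    for v in set(r[best] for r in rows):                # step 5: one branch per value of the attribute
        subset = [r for r in rows if r[best] == v]
        node[best][v] = gen_dec_tree(subset, [a for a in attrs if a != best], target)
    return node
```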

Attribute Selection Measure
- Information gain (ID3/C4.5)
  - All attributes are assumed to be categorical
  - Can be modified for continuous-valued attributes
- Gini index (IBM IntelligentMiner)
  - All attributes are assumed continuous-valued
  - Assume there exist several possible split values for each attribute
  - May need other tools, such as clustering, to get the possible split values
  - Can be modified for categorical attributes
- The attribute selection measure determines the splitting criterion
  - It tells us which branches to grow from node N with respect to the outcomes of the chosen test:
    (a) If A is discrete-valued, then one branch is grown for each known value of A.
    (b) If A is continuous-valued, then two branches are grown, corresponding to A <= split_point and A > split_point.
    (c) If A is discrete-valued and a binary tree must be produced, then the test is of the form A ∈ S_A.
Attribute Selection Measure: Information Gain (ID3/C4.5)
- Select the attribute with the highest information gain.
- Let p_i be the probability that an arbitrary tuple in D belongs to class C_i, estimated by |C_{i,D}| / |D|.
- Expected information (entropy) needed to classify a tuple in D:
      Info(D) = - sum_{i=1}^{m} p_i log2(p_i)
- Information needed (after using A to split D into v partitions) to classify D:
      Info_A(D) = sum_{j=1}^{v} (|D_j| / |D|) * Info(D_j)
- Information gained by branching on attribute A:
      Gain(A) = Info(D) - Info_A(D)

Information Gain for two classes
- Select the attribute with the highest information gain.
- Assume there are two classes, P and N, and let the set of examples S contain p elements of class P and n elements of class N.
- The amount of information needed to decide if an arbitrary example in S belongs to P or N is defined as
      I(p, n) = - (p/(p+n)) log2(p/(p+n)) - (n/(p+n)) log2(n/(p+n))

Attribute Selection: Information Gain
- Class P: buys_computer = "yes" (9 tuples); class N: buys_computer = "no" (5 tuples).
- Info(D) = I(9,5) = -(9/14) log2(9/14) - (5/14) log2(5/14) = 0.940
- Splitting on age gives three partitions:

    age     p_i  n_i  I(p_i, n_i)
    <=30    2    3    0.971
    31…40   4    0    0
    >40     3    2    0.971

  where, e.g., I(2,3) = -(2/5) log2(2/5) - (3/5) log2(3/5) = 0.971
- Info_age(D) = (5/14) I(2,3) + (4/14) I(4,0) + (5/14) I(3,2) = 0.694
- Gain(age) = Info(D) - Info_age(D) = 0.246
- Similarly, Gain(income) = 0.029, Gain(student) = 0.151, Gain(credit_rating) = 0.048, so age is chosen as the splitting attribute.

Splitting the samples using age
- The test age? splits the training data (the buys_computer table shown earlier) into three branches: <=30, 31…40, and >40.
- The <=30 partition contains, as (income, student, credit_rating, buys_computer): (high, no, fair, no), (high, no, excellent, no), (medium, no, fair, no), (low, yes, fair, yes), (medium, yes, excellent, yes).
- The 31…40 partition contains only "yes" tuples, so that branch becomes a leaf labeled yes.
- The >40 partition contains: (medium, no, fair, yes), (low, yes, fair, yes), (low, yes, excellent, no), (medium, yes, fair, yes), (medium, no, excellent, no).
(A sketch reproducing these numbers follows.)
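
The sketch below reproduces Info(D), Info_age(D), and Gain(age); the helper name I mirrors the notation used above.

```python
# Reproduces the information gain calculation for the age attribute.
import math

def I(p, n):
    total = p + n
    return -sum(x / total * math.log2(x / total) for x in (p, n) if x > 0)

info_D = I(9, 5)                                                 # ≈ 0.940
info_age = 5/14 * I(2, 3) + 4/14 * I(4, 0) + 5/14 * I(3, 2)      # ≈ 0.694
gain_age = info_D - info_age                                     # ≈ 0.246
print(round(info_D, 3), round(info_age, 3), round(gain_age, 3))
```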

Output: A Decision Tree for "buys_computer"

  age?
    <=30   -> student?
                no  -> no
                yes -> yes
    31..40 -> yes
    >40    -> credit_rating?
                excellent -> no
                fair      -> yes

Gain Ratio for Attribute Selection (C4.5)
- The information gain measure is biased towards attributes with a large number of values.
- C4.5 (a successor of ID3) uses the gain ratio to overcome the problem (a normalization of information gain):
      SplitInfo_A(D) = - sum_{j=1}^{v} (|D_j|/|D|) log2(|D_j|/|D|)
- Example: a test on income splits the data into three partitions, namely low, medium, and high, of sizes 4, 6, and 4:
      SplitInfo_income(D) = -(4/14) log2(4/14) - (6/14) log2(6/14) - (4/14) log2(4/14) = 1.557
- GainRatio(A) = Gain(A) / SplitInfo_A(D)
- Ex. gain_ratio(income) = 0.029 / 1.557 = 0.019
- The attribute with the maximum gain ratio is selected as the splitting attribute.
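
A short sketch of the gain ratio computation for income; the partition sizes 4, 6, and 4 and Gain(income) = 0.029 are taken from the slides above.

```python
# Gain ratio for the income attribute on the 14-tuple buys_computer data.
import math

def split_info(partition_sizes):
    total = sum(partition_sizes)
    return -sum(s / total * math.log2(s / total) for s in partition_sizes if s > 0)

split_info_income = split_info([4, 6, 4])       # ≈ 1.557
gain_ratio_income = 0.029 / split_info_income   # ≈ 0.019
print(round(split_info_income, 3), round(gain_ratio_income, 3))
```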

Avoid Overfitting in Classification (Tree Pruning)
- The generated tree may over-fit the training data
  - Too many branches, some of which may reflect anomalies due to noise or outliers
  - Use statistical measures to remove the least reliable branches
- Two approaches to avoid over-fitting:
  1. Prepruning: halt tree construction early - do not split a node if this would result in the goodness measure falling below a threshold
     - It is difficult to choose an appropriate threshold
  2. Postpruning: remove branches from a "fully grown" tree - the leaf is labeled with the most frequent class among the subtree being replaced

Examples of Post Pruning
- Cost complexity pruning - CART
  - A function of the number of leaves in the tree and the error rate of the tree (the percentage of tuples misclassified by the tree)
  - It starts from the bottom of the tree; for each internal node N, it compares the cost complexity of the subtree at N before and after pruning
  - A pruning set of class-labeled tuples, different from the training data, is used to estimate cost complexity and decide which is the "best pruned tree"
- Pessimistic pruning - C4.5
  - Uses the training set itself to estimate error rates, adjusting them with a penalty, so no separate prune set is required
Scalability and Decision Tree Induction
- The efficiency of existing decision tree algorithms, such as ID3, C4.5, and CART, has been well established for relatively small data sets.
- In data mining applications, very large training sets of millions of tuples are common; mining very large real-world databases is the greater issue.
- These algorithms become inefficient due to swapping of the training tuples in and out of main and cache memories.
- Scalable approaches, capable of handling training data that are too large to fit in memory, are required.

Scalable Decision Tree Induction Methods in Data Mining Studies
- SLIQ (EDBT'96 - Mehta et al.)
  - Builds an index for each attribute; only the class list and the current attribute list reside in memory
- SPRINT (VLDB'96 - J. Shafer et al.)
  - Constructs an attribute list data structure
- PUBLIC (VLDB'98 - Rastogi & Shim)
  - Integrates tree splitting and tree pruning: stop growing the tree earlier
- RainForest (VLDB'98 - Gehrke, Ramakrishnan & Ganti)
  - Separates the scalability aspects from the criteria that determine the quality of the tree
  - Builds an AVC-list (attribute, value, class label)

SLIQ
- SLIQ employs disk-resident attribute lists and a single memory-resident class list.
- Each attribute has an associated attribute list, indexed by RID (a record identifier).
- Each tuple is represented by a linkage between the lists.
- The class list remains in memory because it is often accessed and modified in the building and pruning phases.

SPRINT
- SPRINT uses a different attribute list data structure that holds the class and RID information.
- When a node is split, the attribute lists are partitioned and distributed among the resulting child nodes accordingly.
- When a list is partitioned, the order of the records in the list is maintained.
Bayesian Classification
- Bayesian classifiers are statistical classifiers.
- They can predict class membership probabilities, i.e., the probability that a given tuple belongs to a particular class.
- Bayesian classification is based on Bayes' theorem.
- The naive Bayes classifier assumes that the effect of an attribute value on a given class is independent of the values of the other attributes - class conditional independence.

Bayes Theorem
- Let X be a data sample whose class label is unknown.
- Let H be some hypothesis that X belongs to a class C.
- For classification, determine P(H|X), the probability that H holds given the observed data sample X.
- P(H|X) is the posterior probability.

Understanding Bayes' rule (x = data, h = hypothesis/model):
      p(h|x) = P(x|h) P(h) / P(x)
- P(h): prior belief (probability of hypothesis h before seeing any data)
- P(x|h): likelihood (probability of the data if the hypothesis h is true)
- P(h|x): posterior (probability of hypothesis h after having seen the data x)

Bayes Theorem - Example
- Sample space: all fruits; X is "round" and "red"; H = the hypothesis that X is an apple.
- P(H|X) is our confidence that X is an apple given that X is "round" and "red".
- P(H) is the prior probability of H, i.e., the probability that any given data sample is an apple, regardless of how it looks.
- P(H|X) is based on more information; note that P(H) is independent of X.
- P(X|H)? It is the probability that X is round and red given that we know X is an apple.
- Here P(X) is the prior probability that a data sample from our set of fruits is red and round.

Naive Bayesian Classification - Steps Involved
1. Each data sample is of the type X = (x_i), i = 1..n, where x_i is the value of X for attribute A_i.
2. Suppose there are m classes C_i, i = 1..m. Then X ∈ C_i if
       P(C_i|X) > P(C_j|X) for 1 <= j <= m, j ≠ i
   The class for which P(C_i|X) is maximized is called the maximum posterior hypothesis.
Naive Bayesian Classification (continued)
3. From Bayes' theorem,
       P(C_i|X) = P(X|C_i) P(C_i) / P(X)
   P(X) is constant, so only P(X|C_i) P(C_i) needs to be maximized.
   - If the class prior probabilities are not known, assume all classes to be equally likely and therefore maximize P(X|C_i)
   - Otherwise maximize P(X|C_i) P(C_i)
4. Naive assumption: attribute independence (class conditional independence):
       P(X|C_i) = P(x_1, …, x_n|C_i) = prod_{k=1}^{n} P(x_k|C_i)
5. In order to classify an unknown sample X, evaluate P(X|C_i) P(C_i) for each class C_i. Sample X is assigned to the class C_i if
       P(X|C_i) P(C_i) > P(X|C_j) P(C_j) for 1 <= j <= m, j ≠ i

Naive Bayesian Classification - Example

Training data (class: buys_comp):

  Age     Income  Student  Credit_rating  Buys_comp
  <=30    HIGH    N        FAIR           N
  <=30    HIGH    N        EXCELLENT      N
  31…40   HIGH    N        FAIR           Y
  >40     MEDIUM  N        FAIR           Y
  >40     LOW     Y        FAIR           Y
  >40     LOW     Y        EXCELLENT      N
  31…40   LOW     Y        EXCELLENT      Y
  <=30    MEDIUM  N        FAIR           N
  <=30    LOW     Y        FAIR           Y
  >40     MEDIUM  Y        FAIR           Y
  <=30    MEDIUM  Y        EXCELLENT      Y
  31…40   MEDIUM  N        EXCELLENT      Y
  31…40   HIGH    Y        FAIR           Y
  >40     MEDIUM  N        EXCELLENT      N

- Classify X = (<=30, MEDIUM, Y, FAIR, ???).
- We need to maximize P(X|C_i) P(C_i) for i = 1, 2.
- P(C_i) is computed from the training sample:
    P(buys_comp = Y) = 9/14 = 0.643
    P(buys_comp = N) = 5/14 = 0.357
- How do we calculate P(X|C_i)? P(X|C_i) = P(x1, x2, x3, x4 | C_i) = prod_k P(x_k | C_i)
Naive Bayesian Classification - Example (continued)
- Conditional probabilities estimated from the training data:
    P(age<=30 | buys_comp=Y) = 2/9 = 0.222          P(age<=30 | buys_comp=N) = 3/5 = 0.600
    P(income=medium | buys_comp=Y) = 4/9 = 0.444    P(income=medium | buys_comp=N) = 2/5 = 0.400
    P(student=Y | buys_comp=Y) = 6/9 = 0.667        P(student=Y | buys_comp=N) = 1/5 = 0.200
    P(credit_rating=FAIR | buys_comp=Y) = 6/9 = 0.667   P(credit_rating=FAIR | buys_comp=N) = 2/5 = 0.400
- Therefore:
    P(X | buys_comp=Y) = 0.222 * 0.444 * 0.667 * 0.667 = 0.044
    P(X | buys_comp=N) = 0.600 * 0.400 * 0.200 * 0.400 = 0.019
    P(X | buys_comp=Y) P(buys_comp=Y) = 0.044 * 0.643 = 0.028
    P(X | buys_comp=N) P(buys_comp=N) = 0.019 * 0.357 = 0.007
- CONCLUSION: X buys a computer.

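A minimal sketch reproducing this hand calculation; the counts are read off the 14-tuple table above.

```python
# Naive Bayes hand calculation for X = (age<=30, income=medium, student=Y, credit=fair).
p_yes, p_no = 9/14, 5/14                         # class priors
likelihood_yes = (2/9) * (4/9) * (6/9) * (6/9)   # P(X | buys_comp=Y) ≈ 0.044
likelihood_no  = (3/5) * (2/5) * (1/5) * (2/5)   # P(X | buys_comp=N) ≈ 0.019
score_yes = likelihood_yes * p_yes               # ≈ 0.028
score_no  = likelihood_no  * p_no                # ≈ 0.007
print('buys_computer = Y' if score_yes > score_no else 'buys_computer = N')
```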

Naive Bayesian Classifier - Play-tennis example
- Given a training set:

  Outlook   Temperature  Humidity  Windy  Class
  sunny     hot          high      false  N
  sunny     hot          high      true   N
  overcast  hot          high      false  P
  rain      mild         high      false  P
  rain      cool         normal    false  P
  rain      cool         normal    true   N
  overcast  cool         normal    true   P
  sunny     mild         high      false  N
  sunny     cool         normal    false  P
  rain      mild         normal    false  P
  sunny     mild         normal    true   P
  overcast  mild         high      true   P
  overcast  hot          normal    false  P
  rain      mild         high      true   N

Play-tennis example: estimating P(xi|C)
- Class priors: P(p) = 9/14, P(n) = 5/14
- outlook:     P(sunny|p) = 2/9, P(sunny|n) = 3/5; P(overcast|p) = 4/9, P(overcast|n) = 0; P(rain|p) = 3/9, P(rain|n) = 2/5
- temperature: P(hot|p) = 2/9, P(hot|n) = 2/5; P(mild|p) = 4/9, P(mild|n) = 2/5; P(cool|p) = 3/9, P(cool|n) = 1/5
- humidity:    P(high|p) = 3/9, P(high|n) = 4/5; P(normal|p) = 6/9, P(normal|n) = 2/5
- windy:       P(true|p) = 3/9, P(true|n) = 3/5; P(false|p) = 6/9, P(false|n) = 2/5
Play-tennis example: classifying X
- An unseen sample X = <outlook = rain, temperature = hot, humidity = high, windy = false>
- P(X|p)·P(p) = P(rain|p)·P(hot|p)·P(high|p)·P(false|p)·P(p) = 3/9 × 2/9 × 3/9 × 6/9 × 9/14 = 0.010582
- P(X|n)·P(n) = P(rain|n)·P(hot|n)·P(high|n)·P(false|n)·P(n) = 2/5 × 2/5 × 4/5 × 2/5 × 5/14 = 0.018286
- Sample X is classified in class n (don't play)

A second example - classifying animals as mammals vs. non-mammals:

  Name           Give Birth  Can Fly  Live in Water  Have Legs  Class
  human          yes         no       no             yes        mammals
  python         no          no       no             no         non-mammals
  salmon         no          no       yes            no         non-mammals
  whale          yes         no       yes            no         mammals
  frog           no          no       sometimes      yes        non-mammals
  komodo         no          no       no             yes        non-mammals
  bat            yes         yes      no             yes        mammals
  pigeon         no          yes      no             yes        non-mammals
  cat            yes         no       no             yes        mammals
  leopard shark  yes         no       yes            no         non-mammals
  turtle         no          no       sometimes      yes        non-mammals
  penguin        no          no       sometimes      yes        non-mammals
  porcupine      yes         no       no             yes        mammals
  eel            no          no       yes            no         non-mammals
  salamander     no          no       sometimes      yes        non-mammals
  gila monster   no          no       no             yes        non-mammals
  platypus       no          no       no             yes        mammals
  owl            no          yes      no             yes        non-mammals
  dolphin        yes         no       yes            no         mammals
  eagle          no          yes      no             yes        non-mammals

  Query tuple: Give Birth = yes, Can Fly = no, Live in Water = yes, Have Legs = no, Class = ?

Bayesian Belief Networks
- The conditional independence assumption made by the naive Bayes classifier may be too rigid, especially for classification problems in which the attributes are somewhat correlated.
- We need a more flexible approach for modeling the class conditional probabilities P(X|C_i) = P(x1, x2, …, xn | C_i).
- Instead of requiring that all the attributes be conditionally independent given the class, a Bayesian belief network (BBN) allows us to specify which pairs of attributes are conditionally independent.
- A belief network has two components:
  1. A directed acyclic graph (DAG)
  2. A conditional probability table (CPT) for each variable
- A node in a BBN is conditionally independent of its non-descendants, given its parents.
Bayesian Belief Networks - Example
- Six Boolean variables; the arcs allow the representation of causal knowledge.
- Having lung cancer is influenced by family history (FH) and smoking (S).
- LungCancer is conditionally independent of Emphysema, given its parents FH and Smoker.
- PositiveXRay is independent of whether the patient has a family history or is a smoker, given that we know the patient has lung cancer; once we know the outcome of LungCancer, FH and Smoker provide no additional information about PositiveXRay.
- A BBN has a conditional probability table (CPT) for each variable in the DAG; the CPT for a variable Y specifies the conditional distribution P(Y | parents(Y)).

CPT for LungCancer (LC):

         (FH, S)   (FH, ~S)   (~FH, S)   (~FH, ~S)
  LC     0.8       0.5        0.7        0.1
  ~LC    0.2       0.5        0.3        0.9

  e.g., P(LC = Y | FH = Y, S = Y) = 0.8 and P(LC = N | FH = N, S = N) = 0.9
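
A small sketch of how this CPT can be queried; the dictionary representation is an illustrative choice.

```python
# The LungCancer CPT as a lookup table: P(LC = yes | FH, Smoker).
cpt_lc = {
    (True, True): 0.8,     # (FH, S)
    (True, False): 0.5,    # (FH, ~S)
    (False, True): 0.7,    # (~FH, S)
    (False, False): 0.1,   # (~FH, ~S)
}

def p_lung_cancer(lc, fh, smoker):
    p_yes = cpt_lc[(fh, smoker)]
    return p_yes if lc else 1.0 - p_yes

print(p_lung_cancer(True, True, True))      # P(LC=Y | FH=Y, S=Y) = 0.8
print(p_lung_cancer(False, False, False))   # P(LC=N | FH=N, S=N) = 0.9
```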

Classification by Back Propagation
- Backpropagation: a neural network learning algorithm.
- Work on neural networks was started by psychologists and neurobiologists to develop and test computational analogues of neurons.
- A neural network is a set of connected input/output units, where each connection has a weight associated with it.
- Neural network learning is also called connectionist learning, due to the connections between units.
- It is a case of supervised, inductive, or classification learning.

Neural Network
- The neural network learns by adjusting the weights so as to be able to correctly classify the training data and hence, after the testing phase, to classify unknown data.
- Neural networks need a long time for training.
- Neural networks have a high tolerance to noisy and incomplete data.
- Backpropagation learns by iteratively processing a data set of training tuples.
Neural Network Learning
- The inputs are fed simultaneously into the input layer.
- The weighted outputs of these units are fed into the hidden layer.
- The weighted outputs of the last hidden layer are inputs to the units making up the output layer.

A Multilayer Feed-Forward Network
- The units in the hidden layers and output layer are sometimes referred to as neurodes, due to their symbolic biological basis, or as output units.
- A network containing two layers of output units is called a two-layer neural network, and so on.
- The network is feed-forward in that none of the weights cycles back to an input unit or to an output unit of a previous layer.
- The network is fully connected, i.e., each unit provides input to each unit in the next forward layer.

A Multilayer Feed-Forward Neural Network
- The input record x_i enters the input layer; weights w_ij connect the input layer to the hidden layer (outputs O_j), and weights w_jk connect the hidden layer to the output layer (outputs O_k, k = 1, 2, …, #classes). The network is fully connected.
- INPUT: records without the class attribute, with normalized attribute values.
- INPUT VECTOR: X = {x1, x2, …, xn}, where n is the number of (non-class) attributes.
- INPUT LAYER: there are as many nodes as non-class attributes, i.e., as the length of the input vector.
- HIDDEN LAYER: the number of nodes in the hidden layer and the number of hidden layers depend on the implementation.
- OUTPUT LAYER: corresponds to the class attribute; there are as many nodes as classes (values of the class attribute).
Defining a Network Topology
- First decide the network topology: the number of units in the input layer, the number of hidden layers (if > 1), the number of units in each hidden layer, and the number of units in the output layer.
- Normalize the input values for each attribute measured in the training tuples to [0.0-1.0].
- One input unit per domain value, each initialized to 0.
- Output: for classification with more than two classes, one output unit per class is used.
- Once a network has been trained, if its accuracy is unacceptable, repeat the training process with a different network topology or a different set of initial weights.
- There are no clear rules for the number of hidden layers; network design is a trial-and-error process.

Steps in the Backpropagation Algorithm
- STEP ONE: initialize the weights and biases.
  - The weights in the network are initialized to random numbers from the interval [-1, 1].
  - Each unit has a bias associated with it; the biases are similarly initialized to random numbers from the interval [-1, 1].
- STEP TWO: feed the training sample.

Steps in the Backpropagation Algorithm (continued)
- STEP THREE: propagate the inputs forward; compute the net input and output of each unit in the hidden and output layers.
- STEP FOUR: backpropagate the error.
- STEP FIVE: update the weights and biases to reflect the propagated errors.
- STEP SIX: terminating conditions.

Propagation through the Hidden Layer (One Node)
- Unit j receives the input vector x (x0, x1, …, xn) through the weight vector w (w0j, w1j, …, wnj) and has a bias θj; the weighted sum is passed through an activation function f to produce the output y.
- The inputs to unit j are outputs from the previous layer. These are multiplied by their corresponding weights to form a weighted sum, which is added to the bias associated with unit j.
- A nonlinear activation function f is applied to the net input.
Propagate the inputs forward
- For unit j in the input layer, its output is equal to its input, that is, O_j = I_j for input unit j.
- The net input to each unit in the hidden and output layers is computed as follows: given a unit j in a hidden or output layer, the net input is
      I_j = sum_i w_ij O_i + θ_j
  where w_ij is the weight of the connection from unit i in the previous layer to unit j, O_i is the output of unit i from the previous layer, and θ_j is the bias of the unit.
- Each unit in the hidden and output layers takes its net input and then applies an activation function. The function symbolizes the activation of the neuron represented by the unit; it is also called a logistic, sigmoid, or squashing function.
- Given the net input I_j to unit j, the output O_j of unit j is computed as
      O_j = 1 / (1 + e^(-I_j))
  (see the sketch below)
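
A minimal sketch of this forward step; the example call anticipates unit 4 of the worked example further below.

```python
# Forward pass through one unit: weighted sum plus bias, then the sigmoid function.
import math

def sigmoid(net_input):
    return 1.0 / (1.0 + math.exp(-net_input))

def unit_output(inputs, weights, bias):
    net = sum(o * w for o, w in zip(inputs, weights)) + bias   # I_j
    return sigmoid(net)                                        # O_j

print(round(unit_output([1, 0, 1], [0.2, 0.4, -0.5], -0.4), 3))  # unit 4 below: 0.332
```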

Backpropagate the error
- When the output layer is reached, the error is computed and propagated backwards.
- For a unit k in the output layer the error is computed by the formula
      Err_k = O_k (1 - O_k)(T_k - O_k)
  where O_k is the actual output of unit k (computed by the activation function O_k = 1 / (1 + e^(-I_k))), T_k is the target value from the given training tuple, and O_k (1 - O_k) is the derivative (rate of change) of the activation function.
- The error is propagated backwards by updating the weights and biases to reflect the error of the network's classification.
- For a unit j in the hidden layer the error is computed by the formula
      Err_j = O_j (1 - O_j) sum_k Err_k w_jk
  where w_jk is the weight of the connection from unit j to unit k in the next higher layer, and Err_k is the error of unit k.
Update weights and biases
- Weights are updated by the following equations, where l is a constant between 0.0 and 1.0 reflecting the learning rate (fixed for the implementation):
      Δw_ij = (l) Err_j O_i
      w_ij = w_ij + Δw_ij
- Biases are updated by the following equations:
      Δθ_j = (l) Err_j
      θ_j = θ_j + Δθ_j
- Case updating: the weights and biases are updated after the presentation of each sample.
- Epoch: one iteration through the training set is called an epoch.
- Epoch updating: alternatively, the weight and bias increments could be accumulated in variables, so that the weights and biases are updated after all of the samples of the training set have been presented.
- Case updating is more accurate.

Terminating Conditions
- Training stops when:
  - All Δw_ij in the previous epoch are below some threshold, or
  - The percentage of samples misclassified in the previous epoch is below some threshold, or
  - A prespecified number of epochs has expired.
- In practice, several hundreds of thousands of epochs may be required before the weights converge.

Backpropagation Formulas
- Net input (hidden and output layers):  I_j = sum_i w_ij O_i + θ_j
- Output of unit j:                      O_j = 1 / (1 + e^(-I_j))
- Output layer error:                    Err_k = O_k (1 - O_k)(T_k - O_k)
- Hidden layer error:                    Err_j = O_j (1 - O_j) sum_k Err_k w_jk
- Weight update:                         w_ij = w_ij + (l) Err_j O_i
- Bias update:                           θ_j = θ_j + (l) Err_j
- Input layer:                           input vector x_i
Example of Backpropagation
- Input units = 3, hidden units = 2, output unit = 1.
- Initialize the weights to random numbers from -1.0 to 1.0.
- A bias is added to the hidden and output layers; the biases are also initialized to random values from -1.0 to 1.0.

Initial input, weights, and biases:

  x1  x2  x3  w14  w15   w24  w25  w34   w35  w46   w56   θ4    θ5   θ6
  1   0   1   0.2  -0.3  0.4  0.1  -0.5  0.2  -0.3  -0.2  -0.4  0.2  0.1

Net Input and Output Calculation

  Unit j  Net input I_j                                    Output O_j
  4       0.2 + 0 - 0.5 - 0.4 = -0.7                       O_4 = 1/(1 + e^0.7) = 0.332
  5       -0.3 + 0 + 0.2 + 0.2 = 0.1                       O_5 = 1/(1 + e^-0.1) = 0.525
  6       (-0.3)(0.332) - (0.2)(0.525) + 0.1 = -0.105      O_6 = 1/(1 + e^0.105) = 0.474

Calculation of Error at Each Node (we assume the target T_6 = 1)

  Unit j  Err_j
  6       0.474 (1 - 0.474)(1 - 0.474) = 0.1311
  5       0.525 (1 - 0.525)(0.1311)(-0.2) = -0.0065
  4       0.332 (1 - 0.332)(0.1311)(-0.3) = -0.0087
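
The sketch below reproduces the forward pass and error terms of this worked example, plus one case update with the learning rate l = 0.9 used on the next slide.

```python
# Reproduces the backpropagation worked example (target T6 = 1).
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

x1, x2, x3 = 1, 0, 1
w14, w15, w24, w25, w34, w35 = 0.2, -0.3, 0.4, 0.1, -0.5, 0.2
w46, w56 = -0.3, -0.2
th4, th5, th6 = -0.4, 0.2, 0.1

# Forward pass
o4 = sigmoid(x1*w14 + x2*w24 + x3*w34 + th4)   # ≈ 0.332
o5 = sigmoid(x1*w15 + x2*w25 + x3*w35 + th5)   # ≈ 0.525
o6 = sigmoid(o4*w46 + o5*w56 + th6)            # ≈ 0.474

# Error terms
err6 = o6 * (1 - o6) * (1 - o6)                # ≈ 0.1311
err5 = o5 * (1 - o5) * err6 * w56              # ≈ -0.0065
err4 = o4 * (1 - o4) * err6 * w46              # ≈ -0.0087

# One case update (learning rate l = 0.9), e.g. w46 and theta6
l = 0.9
w46_new = w46 + l * err6 * o4                  # ≈ -0.261
th6_new = th6 + l * err6                       # ≈ 0.218
```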

Calculation of Weight and Bias Updating (learning rate l = 0.9)
- Applying w_ij = w_ij + (l) Err_j O_i and θ_j = θ_j + (l) Err_j gives:

  Weight/bias  New value
  w46          -0.3 + 0.9(0.1311)(0.332) = -0.261
  w56          -0.2 + 0.9(0.1311)(0.525) = -0.138
  w14          0.2 + 0.9(-0.0087)(1) = 0.192
  w15          -0.3 + 0.9(-0.0065)(1) = -0.306
  θ6           0.1 + 0.9(0.1311) = 0.218
  …            (and similarly for the remaining weights and biases)

Network Pruning and Rule Extraction
- Network pruning
  - A fully connected network will be hard to articulate
  - N input nodes, h hidden nodes and m output nodes lead to h(m + N) weights
  - Pruning: remove some of the links without affecting the classification accuracy of the network

SVM—Support Vector Machines
- A classification method for both linear and nonlinear data.
- It uses a nonlinear mapping to transform the original training data into a higher dimension.
- Within the new dimension, it searches for the linear optimal separating hyperplane (i.e., the "decision boundary").
- With an appropriate nonlinear mapping to a sufficiently high dimension, data from two classes can always be separated by a hyperplane.
- SVM finds this hyperplane using support vectors ("essential" training tuples) and margins (defined by the support vectors).

SVM—History and Applications
- Vapnik and colleagues (1992) - groundwork from Vapnik & Chervonenkis' statistical learning theory in the 1960s.
- Features: training can be slow but accuracy is high owing to their ability to model complex nonlinear decision boundaries (margin maximization).
- Used both for classification and prediction.
- Applications: handwritten digit recognition, object recognition, speaker identification, benchmarking time-series prediction tests.
Let the data set D be given as (X1, y1), (X2, y2), …, (X|D|, y|D|), where Xi is the set of training tuples with associated class labels yi. Each yi can take one of two values, either +1 or -1 (i.e., yi ∈ {+1, -1}), corresponding to the classes buys_computer = yes and buys_computer = no, respectively.

There are an infinite number of lines (hyperplanes) separating the two classes, but we want to find the best one - the one that minimizes classification error on unseen data.
SVM—General Philosophy
- The figure contrasts two possible separating hyperplanes and their associated margins (a small margin versus a large margin).
- The hyperplane with the larger margin is expected to be more accurate at classifying future data tuples.
- The hyperplane with the maximum margin is called the maximum marginal hyperplane (MMH).

SVM—Linearly Separable
- A separating hyperplane can be written as
      W · X + b = 0
  where W = {w1, w2, …, wn} is a weight vector and b is a scalar (bias).
- For 2-D data it can be written as w0 + w1 x1 + w2 x2 = 0.
- The hyperplanes defining the sides of the margin are
      H1: w0 + w1 x1 + w2 x2 >= 1 for yi = +1, and
      H2: w0 + w1 x1 + w2 x2 <= -1 for yi = -1
- Any training tuples that fall on hyperplanes H1 or H2 (i.e., the sides defining the margin) are support vectors.
- How does an SVM find the MMH and the support vectors? This becomes a constrained (convex) quadratic optimization problem: a quadratic objective function with linear constraints -> quadratic programming (QP) -> Lagrangian multipliers and the Karush-Kuhn-Tucker (KKT) conditions.

Support Vectors
- The support vectors define the maximum margin hyperplane; all other instances can be deleted without changing its position and orientation.
- Once trained, how do I use it to classify test tuples? Based on the Lagrangian formulation, the MMH can be rewritten as the decision boundary
      d(X^T) = sum_{i=1}^{l} y_i α_i X_i · X^T + b_0
  where the sum runs over the l support vectors X_i (see the sketch below).
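
A minimal linear SVM sketch with scikit-learn (assumed available); the tiny 2-D toy data set is illustrative only.

```python
# Train a linear SVM and inspect the support vectors and the hyperplane W·X + b = 0.
import numpy as np
from sklearn.svm import SVC

X = np.array([[1, 1], [2, 1], [1, 2], [4, 4], [5, 4], [4, 5]])
y = np.array([-1, -1, -1, 1, 1, 1])

clf = SVC(kernel='linear', C=1.0).fit(X, y)

print(clf.support_vectors_)        # the "essential" training tuples
print(clf.coef_, clf.intercept_)   # W and b of the maximum margin hyperplane
print(clf.predict([[3, 3]]))       # classify a new tuple
```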
SVM—Linearly Inseparable
- A two-step process:
  1. Transform the original input data into a higher dimensional space.
  2. Search for a linear separating hyperplane in the new space.
- SVMs can also be used for classifying multiple (> 2) classes and for regression analysis (with additional user parameters).

SVM—Kernel Functions
- Instead of computing the dot product on the transformed data tuples, it is mathematically equivalent to apply a kernel function K(Xi, Xj) to the original data, i.e., K(Xi, Xj) = Φ(Xi) · Φ(Xj).
- Typical kernel functions include the polynomial kernel, the Gaussian radial basis function kernel, and the sigmoid kernel (shown as a figure on the original slide; see the sketch below).
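
The slide's figure of typical kernel functions is not reproduced in this extract; the sketch below gives the usual textbook forms (polynomial, Gaussian RBF, and sigmoid), with the parameter names chosen for illustration.

```python
# Common SVM kernel functions (standard textbook forms).
import numpy as np

def polynomial_kernel(xi, xj, h=3):
    return (np.dot(xi, xj) + 1) ** h                       # degree-h polynomial

def rbf_kernel(xi, xj, sigma=1.0):
    diff = np.asarray(xi, dtype=float) - np.asarray(xj, dtype=float)
    return np.exp(-np.dot(diff, diff) / (2 * sigma ** 2))  # Gaussian radial basis function

def sigmoid_kernel(xi, xj, kappa=1.0, delta=1.0):
    return np.tanh(kappa * np.dot(xi, xj) - delta)
```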

Associative Classification
- Associative classification: major steps
  - Mine the data to find strong associations between frequent patterns (conjunctions of attribute-value pairs) and class labels.
  - Association rules are generated in the form of
        p1 ∧ p2 ∧ … ∧ pl ⇒ "Aclass = C" (conf, sup)
  - Organize the rules to form a rule-based classifier.
- Mine possible association rules in the form of: Cond-set (a set of attribute-value pairs) ⇒ class label.
- The percentage of tuples in D satisfying the rule antecedent that also have the class label C is called the confidence of R.
- The percentage of tuples in D satisfying the rule antecedent and having class label C is called the support of R.
- Build the classifier: organize the rules according to decreasing precedence based on confidence and then support.
- Typical methods: CBA, CMAR, CPAR.

Typical Associative Classification Methods
- CBA (Classification Based on Associations)
  - Uses an iterative approach to frequent itemset mining, similar to that described for Apriori - multiple passes are made over the data.
  - The derived frequent itemsets are used to generate and test longer itemsets.
  - Rules satisfying minimum confidence and minimum support thresholds are found and then analyzed for inclusion in the classifier.
  - If a set of rules has the same antecedent, then the rule with the highest confidence is selected to represent the set.
  - The classifier also contains a default rule, having lowest precedence.
- CMAR (Classification based on Multiple Association Rules)
  - CMAR adopts a variant of the FP-growth algorithm.
  - FP-growth uses a tree structure, called an FP-tree, to register all of the frequent itemset information; this requires only two scans of D.
  - CMAR employs another tree structure to store and retrieve rules efficiently and to prune rules based on confidence, correlation, and database coverage.
  - Rule pruning strategies are triggered whenever a rule is inserted into the tree.
  - For example, given two rules, R1 and R2, if the antecedent of R1 is more general than that of R2 and conf(R1) > conf(R2), then R2 is pruned.
- CPAR (Classification based on Predictive Association Rules)
  - Based on a rule generation algorithm for classification known as FOIL.
  - FOIL builds rules to distinguish positive tuples (say, having class buys_computer = yes) from negative tuples.
  - For multiclass problems, FOIL is applied to each class: for a class C, all tuples of class C are considered positive tuples, while the rest are considered negative tuples.
  - Rules are generated to distinguish C tuples from all others.
  - Each time a rule is generated, the positive samples it satisfies (or covers) are removed, until all the positive tuples in the data set are covered.

Prediction

What Is Prediction?
- (Numerical) prediction, or regression, is similar to classification:
  - Construct a model
  - Use the model to predict a continuous or ordered value for a given input
- Prediction is different from classification:
  - Classification refers to predicting a categorical class label
  - Prediction models continuous-valued functions
- For example, we may wish to predict the salary of college graduates with 10 years of work experience, or the potential sales of a new product given its price.
Regression
- The major method for prediction is regression: model the relationship between one or more independent (predictor) variables and a dependent (response) variable.
- Regression analysis includes:
  - Linear and multiple regression
  - Non-linear regression
  - Other regression methods: generalized linear models, Poisson regression, log-linear models, regression trees

Linear Regression
- Linear regression involves a response variable y and a single predictor variable x:
      y = w0 + w1 x
  where w0 (y-intercept) and w1 (slope) are the regression coefficients.
- The method of least squares estimates the best-fitting straight line:
      w1 = sum_{i=1}^{|D|} (x_i - x̄)(y_i - ȳ) / sum_{i=1}^{|D|} (x_i - x̄)^2
      w0 = ȳ - w1 x̄
- Multiple linear regression involves more than one predictor variable.
  - The training data is of the form (X1, y1), (X2, y2), …, (X|D|, y|D|)
  - Ex. for 2-D data we may have y = w0 + w1 x1 + w2 x2
  - Solvable by an extension of the least squares method or using SAS, SPSS, or S-Plus
  - Many nonlinear functions can be transformed into the above

Nonlinear Regression
- How can we model data that does not show a linear dependence?
- Some nonlinear models can be modeled by a polynomial function, obtained by adding polynomial terms to the basic linear model.
- A polynomial regression model can be transformed into a linear regression model. For example,
      y = w0 + w1 x + w2 x^2 + w3 x^3
  is convertible to linear form with the new variables x2 = x^2 and x3 = x^3:
      y = w0 + w1 x + w2 x2 + w3 x3
- It is then solved by the method of least squares using software for regression analysis.

Example
- The table shows a set of paired data where x is the number of years of work experience of a college graduate and y is the corresponding salary of the graduate.

  X (years of experience)   Y (salary, in $1000s)
  3                         30
  8                         57
  9                         64
  13                        72
  3                         36
  6                         43
  11                        59
  21                        90
  1                         20
  16                        83

- Fitting Y = W0 + W1 X with the least-squares formulas gives W1 = 3.5 and W0 = 23.6, so
      Y = 23.6 + 3.5 X
- Using this equation, we can predict that the salary of a college graduate with, say, 10 years of experience is about $58.6K.
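
A short sketch of the least-squares fit for this table; the computed coefficients round to the W1 = 3.5 and W0 = 23.6 used above.

```python
# Least-squares fit for the salary example (x = years of experience, y = salary in $1000s).
xs = [3, 8, 9, 13, 3, 6, 11, 21, 1, 16]
ys = [30, 57, 64, 72, 36, 43, 59, 90, 20, 83]

x_bar = sum(xs) / len(xs)
y_bar = sum(ys) / len(ys)

w1 = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
      / sum((x - x_bar) ** 2 for x in xs))     # ≈ 3.54 (≈ 3.5)
w0 = y_bar - w1 * x_bar                        # ≈ 23.2 (≈ 23.6 with the rounded slope)

print(round(w0 + w1 * 10, 1))                  # predicted salary for 10 years ≈ 58.6
```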

Accuracy and Error Measures
- What is accuracy? How can we estimate it? Are there strategies for increasing the accuracy of a learned model?

Classifier Accuracy Measures
- The accuracy of a classifier on a given test set is the percentage of test-set tuples that are correctly classified by the classifier.
- Accuracy is better measured on a test set consisting of class-labeled tuples that were not used to train the model.
- Accuracy of a classifier M, acc(M): the percentage of test-set tuples that are correctly classified by the model M.
- Error rate (misclassification rate) of M = 1 - acc(M).
- If we were to use the training set to estimate the error rate of a model, this quantity is known as the resubstitution error.
- The confusion matrix is a useful tool for analyzing how well your classifier can recognize tuples of different classes.

- Given m classes, CM_{i,j}, an entry in a confusion matrix, indicates the number of tuples in class i that are labeled by the classifier as class j.

              Predicted C1      Predicted C2
  Actual C1   true positive     false negative
  Actual C2   false positive    true negative

  classes               buy_computer = yes   buy_computer = no   total   recognition (%)
  buy_computer = yes    6954                 46                  7000    99.34
  buy_computer = no     412                  2588                3000    86.27
  total                 7366                 2634                10000   95.42

- Alternative accuracy measures (e.g., for cancer diagnosis):
    sensitivity = t-pos / pos          (true positive recognition rate)
    specificity = t-neg / neg          (true negative recognition rate)
    precision   = t-pos / (t-pos + f-pos)
    accuracy    = sensitivity * pos/(pos + neg) + specificity * neg/(pos + neg)
- This model can also be used for cost-benefit analysis.

Predictor Error Measures
- Measure predictor accuracy: how far off the predicted value is from the actual known value.
- Loss function: measures the error between y_i and the predicted value y_i':
    Absolute error: |y_i - y_i'|
    Squared error:  (y_i - y_i')^2
- Test error (generalization error): the average loss over the test set:
    Mean absolute error:     (1/d) sum_{i=1}^{d} |y_i - y_i'|
    Mean squared error:      (1/d) sum_{i=1}^{d} (y_i - y_i')^2
    Relative absolute error: sum_{i=1}^{d} |y_i - y_i'| / sum_{i=1}^{d} |y_i - ȳ|
    Relative squared error:  sum_{i=1}^{d} (y_i - y_i')^2 / sum_{i=1}^{d} (y_i - ȳ)^2
- The mean squared error exaggerates the presence of outliers.
- Popularly used: the (square) root mean squared error and, similarly, the root relative squared error.
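
A short sketch computing these measures from the buys_computer confusion matrix above.

```python
# Accuracy measures from the confusion matrix (6954, 46, 412, 2588).
t_pos, f_neg = 6954, 46     # actual "yes" tuples
f_pos, t_neg = 412, 2588    # actual "no" tuples
pos, neg = t_pos + f_neg, f_pos + t_neg

sensitivity = t_pos / pos                      # ≈ 0.9934
specificity = t_neg / neg                      # ≈ 0.8627
precision   = t_pos / (t_pos + f_pos)          # ≈ 0.9441
accuracy = sensitivity * pos / (pos + neg) + specificity * neg / (pos + neg)   # ≈ 0.9542
print(round(sensitivity, 4), round(specificity, 4), round(precision, 4), round(accuracy, 4))
```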

Evaluating the Accuracy of a Classifier or Predictor
- Holdout method
  - The given data is randomly partitioned into two independent sets:
    - a training set (e.g., 2/3) for model construction
    - a test set (e.g., 1/3) for accuracy estimation
  - Random sampling: a variation of holdout; repeat holdout k times, accuracy = average of the accuracies obtained from the iterations
- Cross-validation (k-fold, where k = 10 is most popular)
  - Randomly partition the data into k mutually exclusive subsets, each of approximately equal size
  - At the i-th iteration, use D_i as the test set and the others as the training set
  - Leave-one-out: k folds where k = number of tuples, for small-sized data
  - Stratified cross-validation: the folds are stratified so that the class distribution in each fold is approximately the same as that in the initial data

Evaluating the Accuracy of a Classifier or Predictor (cont.)
- Bootstrap
  - Works well with small data sets
  - Samples the given training tuples uniformly with replacement, i.e., each time a tuple is selected, it is equally likely to be selected again and re-added to the training set
- There are several bootstrap methods; a common one is the .632 bootstrap
  - Suppose we are given a data set of d tuples. The data set is sampled d times, with replacement, resulting in a training set of d samples. The data tuples that did not make it into the training set end up forming the test set. About 63.2% of the original data will end up in the bootstrap sample, and the remaining 36.8% will form the test set (since (1 - 1/d)^d ≈ e^(-1) = 0.368).
  - Repeat the sampling procedure k times; the overall accuracy of the model is
        acc(M) = sum_{i=1}^{k} (0.632 × acc(M_i)_test_set + 0.368 × acc(M_i)_train_set)

Boosting and Bagging
- Boosting increases classification accuracy.
- Applicable to decision trees or Bayesian classifiers.
- Learn a series of classifiers, where each classifier in the series pays more attention to the examples misclassified by its predecessor.
- Boosting requires only linear time and constant space.

Boosting Technique — Algorithm
- Assign every example an equal weight 1/N.
- For t = 1, 2, …, T do:
  - Obtain a hypothesis (classifier) h(t) under the weights w(t)
  - Calculate the error of h(t) and re-weight the examples based on the error
  - Normalize w(t+1) to sum to 1
- Output a weighted sum of all the hypotheses, with each hypothesis weighted according to its accuracy on the training set.

Other Classification Methods
- k-nearest neighbor classifier
- Case-based reasoning
- Genetic algorithms
- Rough set approach
- Fuzzy set approaches

Instance-Based Methods
- Instance-based learning: store the training examples and delay the processing ("lazy evaluation") until a new instance must be classified.
- Typical approaches:
  - k-nearest neighbor approach: instances are represented as points in a Euclidean space
  - Locally weighted regression: constructs a local approximation
  - Case-based reasoning: uses symbolic representations and knowledge-based inference
The k-Nearest Neighbor Algorithm
- All instances correspond to points in the n-dimensional space.
- The nearest neighbors are defined in terms of Euclidean distance.
- The target function could be discrete- or real-valued.
- For discrete-valued targets, k-NN returns the most common value among the k training examples nearest to the query point xq.
- Voronoi diagram: the decision surface induced by 1-NN for a typical set of training examples.

Discussion on the k-NN Algorithm
- The k-NN algorithm for continuous-valued target functions: calculate the mean value of the k nearest neighbors.
- Distance-weighted nearest neighbor algorithm: weight the contribution of each of the k neighbors according to their distance to the query point xq, giving greater weight to closer neighbors, e.g.
      w = 1 / d(xq, xi)^2
  (see the sketch below)
- Robust to noisy data by averaging the k nearest neighbors.
- Curse of dimensionality: the distance between neighbors could be dominated by irrelevant attributes; to overcome it, stretch the axes or eliminate the least relevant attributes.

Case-Based Reasoning
- Also uses lazy evaluation and analysis of similar instances.
- Difference: instances are not "points in a Euclidean space".
- Example: the water faucet problem in CADET (Sycara et al., 1992).
- Methodology
  - Instances are represented by rich symbolic descriptions (e.g., function graphs)
  - Multiple retrieved cases may be combined
  - Tight coupling between case retrieval, knowledge-based reasoning, and problem solving
- Research issues
  - Indexing based on syntactic similarity measures and, when this fails, backtracking and adapting to additional cases

Remarks on Lazy vs. Eager Learning
- Instance-based learning: lazy evaluation; decision-tree and Bayesian classification: eager evaluation.
- Key differences
  - A lazy method may consider the query instance xq when deciding how to generalize beyond the training data D
  - An eager method cannot, since it has already chosen its global approximation by the time it sees the query
- Efficiency: lazy methods spend less time training but more time predicting.
- Accuracy
  - A lazy method effectively uses a richer hypothesis space since it uses many local linear functions to form its implicit global approximation to the target function
  - An eager method must commit to a single hypothesis that covers the entire instance space
Genetic Algorithms
- GA: based on an analogy to biological evolution.
- Each rule is represented by a string of bits.
- An initial population is created consisting of randomly generated rules, e.g., IF A1 AND NOT A2 THEN C2 can be encoded as 100.
- Based on the notion of survival of the fittest, a new population is formed to consist of the fittest rules and their offspring.
- The fitness of a rule is represented by its classification accuracy on a set of training examples.
- Offspring are generated by crossover and mutation.

Rough Set Approach
- Rough sets are used to approximately or "roughly" define equivalence classes.
- A rough set for a given class C is approximated by two sets: a lower approximation (certain to be in C) and an upper approximation (cannot be described as not belonging to C).
- Finding the minimal subsets (reducts) of attributes (for feature reduction) is NP-hard, but a discernibility matrix is used to reduce the computational intensity.

Fuzzy Set Approaches
- Fuzzy logic uses truth values between 0.0 and 1.0 to represent the degree of membership (such as using a fuzzy membership graph).
- Attribute values are converted to fuzzy values, e.g., income is mapped into the discrete categories {low, medium, high} with fuzzy values calculated.
- For a given new sample, more than one fuzzy value may apply.
- Each applicable rule contributes a vote for membership in the categories.
- Typically, the truth values for each predicted category are summed.