Data Mining: Classification - 1
Machine Learning techniques
• Algorithms for extracting the underlying structural descriptions (or patterns) from examples
o Use the patterns to predict outcomes in new situations
o Use the patterns to understand and explain how a prediction is derived
• Although this class will not require you to implement any of these
techniques, it is good to know how each technique works so that you know
when to use it and how to extend it
Classification Examples
• Given this table, can you come up with the most efficient rules for Play/Not Play?
• If outlook = rain, humidity = high and windy = false, play or not?

Outlook   Humidity  Windy  Play
sunny     high      false  No
sunny     high      true   No
overcast  high      false  Yes
rain      high      false  Yes
rain      normal    false  Yes
rain      normal    true   No
overcast  normal    true   Yes
sunny     high      false  No
sunny     normal    false  Yes
rain      normal    false  Yes
sunny     normal    true   Yes
overcast  high      true   Yes
overcast  normal    false  Yes
rain      high      true   No
Example Tree
[Figure: decision tree with Outlook at the root; sunny branch: Humidity (high: No, normal: Yes); overcast branch: Yes; rain branch: Windy (true: No, false: Yes)]
Decision Trees
• An internal node is a test on an attribute
• A branch represents an outcome of the test
• A leaf node represents a class label or class label distribution
• At each node, one attribute is chosen to split training examples into
distinct classes as much as possible
• A new case is classified by following a matching path to a leaf node
o If outlook = rain, humidity = high and windy = false, play or not?
Yes: following the rain branch leads to the Windy test, and windy = false ends in a Yes leaf
Building Decision Trees
• Top-down tree construction
• At start, all training examples are at the root
• Partition the examples recursively by choosing one attribute at a time (a minimal sketch of this loop appears below)
o Choose the attribute that best separates the classes of the training examples (this will tend to produce the smallest tree)
o A “goodness” function measures the quality of a split
Information gain
Information gain ratio
Gini index
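
To make the top-down procedure concrete, here is a minimal sketch in Python (mine, not from the slides; all names are illustrative) of recursive tree construction with a pluggable goodness function:

from collections import Counter

def build_tree(rows, attributes, target, goodness):
    """Recursively build a decision tree.
    rows: list of dicts mapping attribute name -> value.
    goodness: function (rows, attribute, target) -> score, higher is better."""
    labels = [row[target] for row in rows]
    # Stop when the node is pure or no attributes remain: return the majority label.
    if len(set(labels)) == 1 or not attributes:
        return Counter(labels).most_common(1)[0][0]
    # Pick the attribute with the best goodness score (e.g. information gain).
    best = max(attributes, key=lambda a: goodness(rows, a, target))
    remaining = [a for a in attributes if a != best]
    subtree = {}
    for value in set(row[best] for row in rows):
        subset = [row for row in rows if row[best] == value]
        subtree[value] = build_tree(subset, remaining, target, goodness)
    return (best, subtree)

The information_gain function defined later in these notes can be passed as the goodness argument.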
Information Gain
• The amount of information required to predict an event is called entropy
o entropy(p1, …, pn) = -p1 log2(p1) - … - pn log2(pn), where pi is the proportion of examples in class i
• Information gain
o A possible way of measuring how “pure” the subsets produced by splitting on an attribute are
o Information gain (attribute A) = information before split - information after split
Information Gain - Outlook
• Before splitting the dataset based on Outlook (see the table above)
o We have 9 Yes and 5 No
o info([9, 5]) = entropy(9/14, 5/14) = 0.940 bits
Information Gain - Outlook
• After splitting the dataset based on Outlook
o Sunny has 2 Yes and 3 No: info([2, 3]) = 0.971 bits
o Overcast has 4 Yes and 0 No: info([4, 0]) = 0 bits
o Rainy has 3 Yes and 2 No: info([3, 2]) = 0.971 bits
o Information after split = (5/14) × 0.971 + (4/14) × 0 + (5/14) × 0.971 = 0.693 bits
Information Gain - Outlook
• Before splitting the dataset based on Outlook
o info([9, 5]) = 0.940 bits
• After splitting the dataset based on Outlook
o info([2, 3], [4, 0], [3, 2]) = 0.693 bits
• Information gain (verified in the sketch below)
o Gain(Outlook) = 0.940 - 0.693 = 0.247 bits
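
The arithmetic above is easy to check. Below is a minimal Python sketch (mine, not from the slides) that reproduces the 0.940, 0.693, and 0.247-bit figures, along with the gains quoted on the next slide:

from math import log2
from collections import Counter

def entropy(labels):
    """Entropy of a list of class labels, in bits."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())

def information_gain(rows, attribute, target):
    """Information gain from splitting `rows` (a list of dicts) on `attribute`."""
    before = entropy([row[target] for row in rows])
    after = 0.0
    for value in set(row[attribute] for row in rows):
        subset = [row[target] for row in rows if row[attribute] == value]
        after += (len(subset) / len(rows)) * entropy(subset)
    return before - after

# The 14-row weather table from these notes.
data = [
    ("sunny", "high", "false", "No"), ("sunny", "high", "true", "No"),
    ("overcast", "high", "false", "Yes"), ("rain", "high", "false", "Yes"),
    ("rain", "normal", "false", "Yes"), ("rain", "normal", "true", "No"),
    ("overcast", "normal", "true", "Yes"), ("sunny", "high", "false", "No"),
    ("sunny", "normal", "false", "Yes"), ("rain", "normal", "false", "Yes"),
    ("sunny", "normal", "true", "Yes"), ("overcast", "high", "true", "Yes"),
    ("overcast", "normal", "false", "Yes"), ("rain", "high", "true", "No"),
]
rows = [dict(zip(("Outlook", "Humidity", "Windy", "Play"), r)) for r in data]

for attr in ("Outlook", "Humidity", "Windy"):
    print(attr, round(information_gain(rows, attr, "Play"), 3))
# Outlook 0.247, Humidity 0.152, Windy 0.048

This information_gain can also serve as the goodness function in the build_tree sketch above.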
Building Decision Tree with Information Gain
• For each unselected attribute, calculate its information gain
o Gain(Outlook) = 0.247 bits
o Gain(Humidity) = 0.152 bits
o Gain(Windy) = 0.048 bits
• Outlook has the highest gain, so it is chosen as the root split
[Figure: partial tree with Outlook at the root and branches sunny, overcast, rain]
Building Decision Tree with Information Gain
• Branch: Overcast
o The classes are pure (only 4 Yes)
o No need to split further or look at the gain of the unselected attributes; the branch ends in a Yes leaf
[Figure: partial tree with Outlook at the root; the overcast branch ends in a Yes leaf]
Building Decision Tree with Information Gain
• Branch: Sunny
• The classes are not pure (2 Yes and 3 No), so we look at the information gain of the two unselected attributes
o Gain(Humidity) = 0.971 bits
o Gain(Windy) = 0.020 bits
• Humidity has the higher gain, and both of its branches are pure, so the sunny subtree is complete
[Figure: sunny branch split on Humidity (high: No, normal: Yes)]
Final Tree
[Figure: final decision tree. Outlook at the root; sunny branch: Humidity (high: No, normal: Yes); overcast branch: Yes; rain branch: Windy (true: No, false: Yes)]
Converting Decision Trees to Rules
• Simple way: each path from the root to a leaf becomes a separate rule, applied in order, first match wins (a sketch appears below):
o If outlook = sunny and humidity = high then play = no
o If outlook = rainy and windy = true then play = no
o If outlook = overcast then play = yes
o If humidity = normal then play = yes
o If none of the above then play = yes
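
As an illustration (mine, not from the slides), the ordered rule list can be applied directly as code; the earlier query (outlook = rain, humidity = high, windy = false) comes out Yes:

def classify(outlook, humidity, windy):
    """Apply the ordered rule list read off the final tree; first match wins."""
    if outlook == "sunny" and humidity == "high":
        return "no"
    if outlook == "rainy" and windy:
        return "no"
    if outlook == "overcast":
        return "yes"
    if humidity == "normal":
        return "yes"
    return "yes"  # the "none of the above" default rule

print(classify("rainy", "high", False))  # -> yes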
Highly-branching attributes
• Information gain is biased towards choosing attributes with a large number of values
o This may result in over-fitting
• Example: give each row of the weather table a unique ID code (A to N):

ID  Outlook   Humidity  Windy  Play
A   sunny     high      false  No
B   sunny     high      true   No
C   overcast  high      false  Yes
D   rain      high      false  Yes
E   rain      normal    false  Yes
F   rain      normal    true   No
G   overcast  normal    true   Yes
H   sunny     high      false  No
I   sunny     normal    false  Yes
J   rain      normal    false  Yes
K   sunny     normal    true   Yes
L   overcast  high      true   Yes
M   overcast  normal    false  Yes
N   rain      high      true   No
Highly-branching attributes
• Information gain is maximal for the ID code, since every branch it produces is pure (one instance per branch)
o Such a split won't work for new data: the ID of a new instance has never been seen
Gain ratio
• A modification of information gain that reduces its bias towards highly-branching attributes
• Gain ratio takes the number and size of branches into account
o It divides the gain by the split's “intrinsic information”, which is:
Large when data is evenly spread between the branches
Small when all data belongs to one branch
Gain Ratio for ID Code
• Gain ratio decreases as the intrinsic information gets larger (reproduced in the sketch below)
o Gain(ID) = 0.940 bits
o Intrinsic(ID) = log2(14) = 3.807 bits
o GainRatio(ID) = 0.940 / 3.807 = 0.247 bits
o Compare this to Outlook
Intrinsic(Outlook) = 1.577 bits
Gain(Outlook) = 0.247 bits
GainRatio(Outlook) = 0.247 / 1.577 = 0.156 bits, still smaller than GainRatio(ID), so the ID code would still be chosen!
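
These gain-ratio numbers can be reproduced with a small extension of the earlier sketch (again, my own function names, reusing entropy, information_gain, and rows from above):

def intrinsic_information(rows, attribute):
    """Entropy of how the instances are distributed across the attribute's branches."""
    return entropy([row[attribute] for row in rows])

def gain_ratio(rows, attribute, target):
    return information_gain(rows, attribute, target) / intrinsic_information(rows, attribute)

# Give each row a unique ID code, A through N.
for row, code in zip(rows, "ABCDEFGHIJKLMN"):
    row["ID"] = code

print(round(intrinsic_information(rows, "ID"), 3))       # 3.807
print(round(gain_ratio(rows, "ID", "Play"), 3))          # 0.247
print(round(intrinsic_information(rows, "Outlook"), 3))  # 1.577
print(round(gain_ratio(rows, "Outlook", "Play"), 3))     # 0.156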
Problems with Gain Ratio
• Sometimes it still cannot fix the information gain problem with highly-branching attributes
• Worse, it may overcompensate, choosing an attribute just because its intrinsic information is very low
• Fix (sketched in code below):
o When building a tree, only consider attributes with greater than average information gain
o Remove ID-code-like attributes first (one class per branch, so the information after the split is 0 and the gain is maximal)
o Then compare the remaining attributes on gain ratio
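
A sketch (mine) of how this two-part fix might look in code, reusing the functions above; on the weather table with the ID attribute included, it selects Outlook rather than ID:

def is_id_like(rows, attribute):
    """True if the attribute creates one branch per instance, like an ID code."""
    return len(set(row[attribute] for row in rows)) == len(rows)

def select_attribute(rows, attributes, target):
    """Drop ID-like attributes, keep those with above-average information gain,
    then compare the survivors on gain ratio."""
    usable = [a for a in attributes if not is_id_like(rows, a)]
    gains = {a: information_gain(rows, a, target) for a in usable}
    avg = sum(gains.values()) / len(gains)
    candidates = [a for a in usable if gains[a] >= avg]
    return max(candidates, key=lambda a: gain_ratio(rows, a, target))

print(select_attribute(rows, ["ID", "Outlook", "Humidity", "Windy"], "Play"))  # Outlook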
Industrial-strength algorithms
• For a learning algorithm to be useful in a wide range of real-world applications, it must
o Permit numeric attributes
o Allow missing values
o Be robust in the presence of noise
o Be able to approximate arbitrary concept descriptions
Numeric attributes
• Real datasets often include numeric attributes, e.g. a weather table with numeric temperature and humidity:

Outlook   Temperature  Humidity  Windy  Play
…         …            …         …      …
Sunny     85           85        False  No
Sunny     80           90        True   No
Overcast  83           86        False  Yes
…         …            …         …      …

• Sorting the examples by temperature shows the class labels do not form clean nominal groups:

Temperature  64   65  68   69   70   71  72  72   75   75   80  81   83   85
Play         Yes  No  Yes  Yes  Yes  No  No  Yes  Yes  Yes  No  Yes  Yes  No

• Remedy
o Pre-discretize numeric attributes (convert numeric to nominal attributes; see the sketch below)
o Use multi-way splits instead of binary ones
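
A sketch of the pre-discretization remedy (my own equal-width scheme; real systems often pick cut points by entropy instead) that maps the temperatures above to nominal bins:

def discretize(values, n_bins=3, labels=("low", "medium", "high")):
    """Equal-width binning: map each numeric value to a nominal label."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    def bin_of(v):
        # min() clamps the maximum value into the last bin.
        return labels[min(int((v - lo) / width), n_bins - 1)]
    return [bin_of(v) for v in values]

temps = [64, 65, 68, 69, 70, 71, 72, 72, 75, 75, 80, 81, 83, 85]
print(discretize(temps))  # 64-70 -> low, 71-77 -> medium, 78-85 -> high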
Missing values
• Simple idea: treat missing as a separate value, e.g. “?” or “unknown” (illustrated in the sketch below)
• Q: When is it not appropriate?
o When values are missing for a reason, not merely because they are unknown
Gene expression values could be missing when they are very high or very low; these cases should be treated differently
When the value of the field isPregnant is missing for:
a male patient, we can assume he is not pregnant
a female patient, it is genuinely unknown: we don't know whether she is pregnant or not
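
A tiny illustration (my own example) of the difference between inferring a missing value from domain knowledge and keeping it as a separate value:

def fill_is_pregnant(record):
    """Handle a missing isPregnant field using domain knowledge."""
    if record.get("isPregnant") is None:
        if record["sex"] == "male":
            record["isPregnant"] = False       # safe to infer: not pregnant
        else:
            record["isPregnant"] = "unknown"   # genuinely unknown: keep as its own value
    return record

print(fill_is_pregnant({"sex": "male", "isPregnant": None}))
print(fill_is_pregnant({"sex": "female", "isPregnant": None}))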
Avoid Over-fitting
• 2 strategies for “pruning” the decision tree
o Pre-pruning (early stopping): Stop growing a branch when information becomes
unreliable
o Post-pruning: Take a fully-grown decision tree and discard unreliable parts
Association Rules
• Frequent itemsets are generated first: repeat the generation steps until we have all the possible itemsets
• Given the itemset {A, B, E}, the possible association rules are (a sketch that enumerates them follows this list):
o A => {B, E}
o {A, B} => E
o {A, E} => B
o B => {A, E}
o {B, E} => A
o E => {A, B}
o ∅ => {A, B, E}
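
These seven rules can be enumerated mechanically; a minimal Python sketch (mine):

from itertools import combinations

def rules_from_itemset(itemset):
    """Yield (antecedent, consequent) pairs for every proper split of the itemset."""
    items = sorted(itemset)
    for k in range(len(items)):  # antecedent sizes 0 .. n-1
        for lhs in combinations(items, k):
            rhs = tuple(i for i in items if i not in lhs)
            yield set(lhs), set(rhs)

for lhs, rhs in rules_from_itemset({"A", "B", "E"}):
    print(lhs or "{}", "=>", rhs)  # prints all 7 rules listed above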
Rule Support and Confidence
• We are only interested in “good” rules
• Goodness measures:
o Given an association rule R: I => J
o Support(R) = support(I) + support(J) - support(I ∪ J), i.e. the number of transactions that contain I or J
o Confidence(R) = support(I ∪ J) / Support(R)
• We only keep rules whose support and confidence are at least the minimum support and minimum confidence
Example
• Given the itemset {A, B, E}, what association rules have minimum support = 2 and minimum confidence = 0.5? (transaction table below; the computation is reproduced in code after it)
o A => {B, E}
Support(R) = support({A}) + support({B, E}) - support({A, B, E}) = 6 + 2 - 2 = 6
Confidence(R) = support({B, E}) / Support(R) = 2/6 = 0.33
Not a good rule, because its confidence is below the minimum confidence

ID  Items
1   A, B, E
2   B, D
3   B, C
4   A, B, D
5   A, C
6   B, C
7   A, C
8   A, B, C, E
9   A, B, C
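
A short Python sketch (mine; it implements support and confidence exactly as computed on this slide, which may differ from other textbooks' definitions) for checking rules against this transaction table:

transactions = [
    {"A", "B", "E"}, {"B", "D"}, {"B", "C"}, {"A", "B", "D"}, {"A", "C"},
    {"B", "C"}, {"A", "C"}, {"A", "B", "C", "E"}, {"A", "B", "C"},
]

def support(itemset):
    """Number of transactions containing every item in the itemset."""
    return sum(1 for t in transactions if itemset <= t)

def rule_support(lhs, rhs):
    # Inclusion-exclusion form used on this slide: transactions containing lhs or rhs.
    return support(lhs) + support(rhs) - support(lhs | rhs)

def rule_confidence(lhs, rhs):
    return support(lhs | rhs) / rule_support(lhs, rhs)

lhs, rhs = {"A"}, {"B", "E"}
print(rule_support(lhs, rhs))               # 6
print(round(rule_confidence(lhs, rhs), 2))  # 0.33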
Example
• Given the itemset {A, B, E}, what association rules have minimum support = 2 and minimum confidence = 0.5? (using the transaction table above)
o A => {B, E} : support = 6, confidence = 0.33
o {A, B} => E : support = ?, confidence = ?
o {A, E} => B : support = ?, confidence = ?
o B => {A, E} : support = ?, confidence = ?
o {B, E} => A : support = ?, confidence = ?
o E => {A, B} : support = ?, confidence = ?
o ∅ => {A, B, E} : support = ?, confidence = ?
WEKA
• Input: Weather.nominal.arff
• Associate: weka.associations.Apriori
Filtering Association Rules
• A large dataset can lead to a very large number of association rules
o Minimum support and confidence alone are not sufficient to filter the rules down to an interesting set
o Probability of a set = support(set) / total number of transactions