
Data Mining

CLASSIFICATION - 1
Machine Learning techniques
• Algorithms for extracting the underlying structural descriptions (or patterns)
from the examples
o Use the patterns to predict the outcome in new situations
o Use the patterns to understand and explain how a prediction is derived

• Different algorithms use different methods to represent patterns


o Decision trees, rules, instance-based, planes, etc.
o Understanding the output is key to understanding why the
predictions were made

• Although this class will not require you to implement any of these
techniques, it is good to know how each technique works so that you know
when to use it and how to extend it
Classification Examples

• Given this table, can you come up with the most efficient rules for Play/Not Play?
• If outlook = rain, humidity = high and windy = false, play or not?

  Outlook   Humidity  Windy  Play
  sunny     high      false  No
  sunny     high      true   No
  overcast  high      false  Yes
  rain      high      false  Yes
  rain      normal    false  Yes
  rain      normal    true   No
  overcast  normal    true   Yes
  sunny     high      false  No
  sunny     normal    false  Yes
  rain      normal    false  Yes
  sunny     normal    true   Yes
  overcast  high      true   Yes
  overcast  normal    false  Yes
  rain      high      true   No
Example Tree

• Outlook = sunny    → Humidity: high → No, normal → Yes
• Outlook = overcast → Yes
• Outlook = rain     → Windy: true → No, false → Yes
Decision Trees
• An internal node is a test on an attribute
• A branch represents an outcome of the test
• A leaf node represents a class label or class label distribution
• At each node, one attribute is chosen to split training examples into
distinct classes as much as possible
• A new case is classified by following a matching path to a leaf node
o If outlook = rain, humidity = high and windy = false, play or not?
 Yes
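
As a concrete illustration, here is a minimal sketch (plain Python, the function and dictionary keys are my own) of how the example tree above classifies a new case by following a matching path from the root to a leaf:

```python
def classify(case):
    """Follow the example tree: test Outlook first, then Humidity or Windy."""
    if case["outlook"] == "overcast":
        return "Yes"
    if case["outlook"] == "sunny":
        return "No" if case["humidity"] == "high" else "Yes"
    # outlook == "rain": the deciding test is Windy
    return "No" if case["windy"] else "Yes"

# If outlook = rain, humidity = high and windy = false -> Yes
print(classify({"outlook": "rain", "humidity": "high", "windy": False}))
```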
Building Decision Trees
• Top-down tree construction
• At start, all training examples are at the root
• Partition the examples recursively by choosing one attribute at a time
o Choose the attributes based on which attribute can separate the classes of the
training examples best (will result in the smallest tree)
o A goodness function, e.g.
 Information gain
 Information gain ratio
 Gini index
Information Gain
•  The amount of information required to predict an event is called entropy
o entropy(p1, …, pn) = −p1 log2 p1 − p2 log2 p2 − … − pn log2 pn

• Information gain
o A possible way of measuring how “pure” an attribute is
o Information gain (attribute A) = information before split – information after split
Information Gain - Outlook

•  Before splitting the dataset based on Outlook
o We have 9 Yes and 5 No
o info([9,5]) = entropy(9/14, 5/14) = 0.940 bits

•  After splitting the dataset based on Outlook
o Sunny has 2 Yes and 3 No
o Overcast has 4 Yes and 0 No
o Rainy has 3 Yes and 2 No
o Information after split:
  info([2,3], [4,0], [3,2]) = (5/14)·entropy(2/5, 3/5) + (4/14)·entropy(4/4, 0/4) + (5/14)·entropy(3/5, 2/5)
                            = 0.693 bits

• Information gain
o Gain(Outlook) = information before split − information after split
o = 0.940 – 0.693 = 0.247 bits
Building Decision Tree with Information Gain
• For each unselected attribute, calculate its information gain
o Gain(Outlook) = 0.247 bits
o Gain(Humidity)?
 0.152 bits
o Gain(Windy)?
 0.048 bits

• Choose the attribute with the highest information gain
o Outlook

• Go through each branch and repeat all the steps
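
A short sketch (plain Python, the data layout and function names are my own) that reproduces these numbers from the 14-row weather table: it computes the entropy of the class labels, the weighted information after each split, and the resulting gain of every attribute.

```python
from math import log2
from collections import Counter

# The 14 training examples: (outlook, humidity, windy, play)
data = [
    ("sunny", "high", False, "No"),      ("sunny", "high", True, "No"),
    ("overcast", "high", False, "Yes"),  ("rain", "high", False, "Yes"),
    ("rain", "normal", False, "Yes"),    ("rain", "normal", True, "No"),
    ("overcast", "normal", True, "Yes"), ("sunny", "high", False, "No"),
    ("sunny", "normal", False, "Yes"),   ("rain", "normal", False, "Yes"),
    ("sunny", "normal", True, "Yes"),    ("overcast", "high", True, "Yes"),
    ("overcast", "normal", False, "Yes"), ("rain", "high", True, "No"),
]
ATTRS = {"outlook": 0, "humidity": 1, "windy": 2}

def entropy(rows):
    """entropy(p1, ..., pn) = -sum(p_i * log2(p_i)) over the class labels."""
    counts = Counter(row[-1] for row in rows)
    total = len(rows)
    return -sum((c / total) * log2(c / total) for c in counts.values())

def info_gain(rows, attr):
    """Information before the split minus the weighted information after it."""
    col = ATTRS[attr]
    before = entropy(rows)
    after = 0.0
    for value in set(row[col] for row in rows):
        subset = [row for row in rows if row[col] == value]
        after += len(subset) / len(rows) * entropy(subset)
    return before - after

for attr in ATTRS:   # outlook: 0.247, humidity: 0.152, windy: 0.048
    print(f"Gain({attr}) = {info_gain(data, attr):.3f} bits")
print("best:", max(ATTRS, key=lambda a: info_gain(data, a)))
```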


Building Decision Tree with Information Gain
• Branch: Sunny
• The classes are not pure (2 Yes and 3 No), so we look at the
information gain of the 2 unselected attributes
o Gain(Humidity)?
 0.971 bits
o Gain(Windy)?
 0.020 bits

• Choose the attribute with the highest information gain
o Humidity
Building Decision Tree with Information Gain
• Branch: High
o The classes are pure (only 3 No)
o Don’t need to split and look at the gain of unselected attributes

• Branch: Normal
o The classes are pure (only 2 Yes)
o Don’t need to split and look at the gain of unselected attributes

Tree so far: Outlook = sunny → Humidity (high → No, normal → Yes)
Building Decision Tree with Information Gain
• Branch: Overcast
o The classes are pure (only 4 Yes)
o Don’t need to split and look at the gain of unselected attributes

Tree so far: Outlook = overcast → Yes
Building Decision Tree with Information Gain
• Branch: Rainy
• The classes are not pure (3 Yes and 2 No), so we look at the
information gain of the 2 unselected attributes
o Gain(Humidity)?
o Gain(Windy)?

• Choose the attribute with the highest information gain
o Windy
Building Decision Tree with Information Gain
• Branch: True
o The classes are pure (only 2 No)
o Don’t need to split and look at the gain of unselected attributes

• Branch: False
o The classes are pure (only 3 Yes)
o Don’t need to split and look at the gain of unselected attributes

Tree so far: Outlook = rain → Windy (true → No, false → Yes)
Final Tree

• Outlook = sunny    → Humidity: high → No, normal → Yes
• Outlook = overcast → Yes
• Outlook = rain     → Windy: true → No, false → Yes
Converting Decision Trees to Rules
• Simple way: each path from root to a leaf is a separate rule
o If outlook = sunny and humidity = high then play = no
o If outlook = rainy and windy = true then play = no
o If outlook = overcast then play = yes
o If humidity = normal then play = yes
o If none of the above then play = yes
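
To make the path-to-rule idea concrete, here is a minimal sketch (plain Python, the nested-dict tree encoding is my own) that walks every root-to-leaf path of the final tree and prints one rule per path; it emits the unsimplified one-rule-per-path form rather than the simplified list above.

```python
# Each internal node is {"attribute": name, "branches": {value: subtree}};
# a leaf is just the class label string.
tree = {
    "attribute": "outlook",
    "branches": {
        "sunny": {"attribute": "humidity",
                  "branches": {"high": "no", "normal": "yes"}},
        "overcast": "yes",
        "rain": {"attribute": "windy",
                 "branches": {"true": "no", "false": "yes"}},
    },
}

def to_rules(node, conditions=()):
    """Recursively collect one rule per root-to-leaf path."""
    if isinstance(node, str):  # reached a leaf: emit a rule
        lhs = " and ".join(f"{a} = {v}" for a, v in conditions) or "true"
        return [f"If {lhs} then play = {node}"]
    rules = []
    for value, subtree in node["branches"].items():
        rules += to_rules(subtree, conditions + ((node["attribute"], value),))
    return rules

for rule in to_rules(tree):
    print(rule)
```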
Highly-branching attributes
• Information gain is biased towards choosing attributes with a large number of values
o This may result in over-fitting

  ID  Outlook   Humidity  Windy  Play
  A   sunny     high      false  No
  B   sunny     high      true   No
  C   overcast  high      false  Yes
  D   rain      high      false  Yes
  E   rain      normal    false  Yes
  F   rain      normal    true   No
  G   overcast  normal    true   Yes
  H   sunny     high      false  No
  I   sunny     normal    false  Yes
  J   rain      normal    false  Yes
  K   sunny     normal    true   Yes
  L   overcast  high      true   Yes
  M   overcast  normal    false  Yes
  N   rain      high      true   No
Highly-branching attributes
• Information gain is maximal for the ID code attribute, since every branch contains a single, pure instance
o A tree split on ID codes won’t work for new data
Gain ratio
• A modification of the information gain that reduces its bias towards highly-branching
attributes
• Gain ratio takes the number and size of branches into account
o GainRatio(A) = Gain(A) / IntrinsicInfo(A), where the intrinsic information is the entropy of the distribution of instances over the branches
o Intrinsic information is large when the data is evenly spread between the branches
o Intrinsic information is small when all the data belongs to one branch
Gain Ratio for ID Code
•  Gain ratio decreases as the intrinsic information gets larger
o Gain(ID) = 0.940 bits
o Intrinsic(ID) = log2(14) = 3.807 bits
o GainRatio(ID) = 0.940 / 3.807 = 0.247
o Compare this to Outlook
 Intrinsic(Outlook) = 1.577 bits
 Gain(Outlook) = 0.247 bits
 GainRatio(Outlook)?
 0.156  still smaller than ID!!!
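
A minimal sketch (plain Python, the function names are my own) of the gain-ratio calculation on these numbers; the intrinsic information is the entropy of the branch sizes, and for the ID code every branch holds exactly one of the 14 instances:

```python
from math import log2

def intrinsic_info(branch_sizes):
    """Entropy of the distribution of instances over the branches."""
    total = sum(branch_sizes)
    return -sum(n / total * log2(n / total) for n in branch_sizes)

def gain_ratio(gain, branch_sizes):
    return gain / intrinsic_info(branch_sizes)

# ID code: 14 branches with one instance each; Outlook: branches of size 5, 4, 5.
print(intrinsic_info([1] * 14))      # 3.807 bits
print(gain_ratio(0.940, [1] * 14))   # 0.247
print(intrinsic_info([5, 4, 5]))     # 1.577 bits
print(gain_ratio(0.247, [5, 4, 5]))  # 0.156 -> still below the ID code's ratio
```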
Problems with Gain Ratio
• Sometimes it still cannot fix the information gain’s problem with
highly-branching attributes
• Worse, it may overcompensate, choosing an attribute just because
its intrinsic information is very low
• Fix:
o When building the tree, only consider attributes with greater than average
information gain
o Remove ID-code-like attributes (one branch per instance, so every branch is pure)
o Then, compare the remaining attributes on gain ratio
Industrial-strength algorithms
• For a learning algorithm to be useful in a wide range of real-world
applications it must
o Permit numeric attributes
o Allow missing values
o Be robust in the presence of noise
o Be able to approximate arbitrary concept descriptions

• We need to extend the basic decision tree scheme to fulfil these
requirements
Numeric Attributes
• Unlike nominal attributes, where the value of each branch is clear,
numeric attributes have many possible split points
• Solution
o Sort the instances by the attribute’s value
o Evaluate the goodness measure for every possible split point of the attribute
o Choose the “best” split point
o The goodness measure of that split point then becomes the goodness measure
of the attribute

• This is computationally more demanding
Weather data – nominal values

  Outlook   Temperature  Humidity  Windy  Play
  Sunny     Hot          High      False  No
  Sunny     Hot          High      True   No
  Overcast  Hot          High      False  Yes
  Rainy     Mild         Normal    False  Yes
  …         …            …         …      …

If outlook = sunny and humidity = high then play = no
If outlook = rainy and windy = true then play = no
If outlook = overcast then play = yes
If humidity = normal then play = yes
If none of the above then play = yes
Weather data – numeric values

  Outlook   Temperature  Humidity  Windy  Play
  Sunny     85           85        False  No
  Sunny     80           90        True   No
  Overcast  83           86        False  Yes
  Rainy     75           80        False  Yes
  …         …            …         …      …

If outlook = sunny and humidity > 83 then play = no
If outlook = rainy and windy = true then play = no
If outlook = overcast then play = yes
If humidity < 85 then play = yes
If none of the above then play = yes
Finding the split in a numeric attribute
•  For example, on the temperature attribute:

  64   65   68   69   70   71   72   72   75   75   80   81   83   85
  Yes  No   Yes  Yes  Yes  No   No   Yes  Yes  Yes  No   Yes  Yes  No

• Place the split points halfway between adjacent values. For example, at 71.5:
o Temperature < 71.5 : 4 Yes, 2 No
o Temperature ≥ 71.5 : 5 Yes, 3 No
o Information after splitting the dataset on Temperature at 71.5:
 info([4,2], [5,3]) = (6/14)·entropy(4/6, 2/6) + (8/14)·entropy(5/8, 3/8) = 0.939 bits
Finding the split in a numeric attribute
• Sort by value
• Instead of trying every possible number as a split point
o Only choose points that are located between different classes

  64   65   68   69   70   71   72   72   75   75   80   81   83   85
  Yes  No   Yes  Yes  Yes  No   No   Yes  Yes  Yes  No   Yes  Yes  No

 E.g. 64.5, 66.5, 70.5, 72, 77.5, 80.5, 84

o Break points between values of the same class will not be optimal
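
A small sketch (plain Python, the names are my own) that enumerates the midpoints between adjacent sorted values with different class labels and scores each candidate split by the resulting weighted information:

```python
from math import log2
from collections import Counter

temps  = [64, 65, 68, 69, 70, 71, 72, 72, 75, 75, 80, 81, 83, 85]
labels = ["Yes", "No", "Yes", "Yes", "Yes", "No", "No",
          "Yes", "Yes", "Yes", "No", "Yes", "Yes", "No"]

def entropy(ys):
    total = len(ys)
    return -sum(c / total * log2(c / total) for c in Counter(ys).values())

def info_after_split(threshold):
    """Weighted entropy of the two halves attr < t and attr >= t."""
    left  = [y for x, y in zip(temps, labels) if x < threshold]
    right = [y for x, y in zip(temps, labels) if x >= threshold]
    n = len(labels)
    return len(left) / n * entropy(left) + len(right) / n * entropy(right)

# Candidate split points: midpoints between adjacent values of different classes.
pairs = sorted(zip(temps, labels))
candidates = sorted({(a + b) / 2
                     for (a, ya), (b, yb) in zip(pairs, pairs[1:]) if ya != yb})
best = min(candidates, key=info_after_split)
print(candidates)                               # 64.5, 66.5, 70.5, 72, 77.5, 80.5, 84
print(best, round(info_after_split(best), 3))
```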
Problems with numeric attributes
•  With binary splits (i.e. attr < num and attr ≥ num)
o We do not exhaust all the information in that attribute
o It may have to be tested several times along a path in the tree
o This makes the tree harder to read

• Remedy
o Pre-discretize numeric attributes (convert numeric to nominal attributes)
o Use multi-way splits instead of binary ones
Missing values
• Simple idea: treat missing as a separate value, e.g. “?” or “unknown”
• Q: When is it not appropriate?
o When values are missing not because they are unknown
 Gene expression could be missing when it is very high or very low; we should treat these
cases differently
 When the value of the field isPregnant is missing for
 A male patient  we can assume that he is not pregnant
 A female patient  it is unknown; we genuinely don’t know whether she is pregnant
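
As a small illustration of the simple strategy, here is a sketch using pandas (the column names and records are hypothetical): missing entries are turned into their own category so the tree can branch on them, and the domain-aware variant treats male and female patients differently.

```python
import pandas as pd

# Hypothetical patient records; None marks a missing value.
df = pd.DataFrame({
    "sex": ["male", "female", "female"],
    "isPregnant": [None, "yes", None],
})

# Naive approach: every missing value becomes the separate category "unknown".
df["isPregnant_naive"] = df["isPregnant"].fillna("unknown")

# Domain-aware approach: a missing value for a male patient really means "no";
# for a female patient it stays genuinely unknown.
df["isPregnant_aware"] = df["isPregnant"].fillna("unknown")
df.loc[(df["sex"] == "male") & (df["isPregnant"].isna()), "isPregnant_aware"] = "no"

print(df)
```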
Avoid Over-fitting
• 2 strategies for “pruning” the decision tree
o Pre-pruning (early stopping): Stop growing a branch when the information becomes
unreliable
o Post-pruning: Take a fully-grown decision tree and discard unreliable parts

• Pruning has 2 essential parameters:
o Confidence level (default 25%): Lower values create heavier pruning
 Use lower values when the actual error rate of pruned trees on the test set is much higher
than the estimated values
o Minimum number of instances in the two most popular branches (default 2)
 Use a higher number for noisy data
Avoid Over-fitting
• 2 strategies for “pruning” the decision tree
o Pre-pruning (early stopping)
 Evaluate each split using the chi-squared test
 If the association between the attribute and the class at a particular node is statistically
significant  split the node
 Otherwise, don’t split it
 The leaf node may not be pure
 Seems right and fast, but it can stop too early!
 It is hard to properly evaluate a split without seeing the splits that would follow it
 Some attributes are useful only in combination with other attributes
 In rare cases, no single split looks good at the root node (the XOR case)
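
As a hedged sketch of the chi-squared check (using scipy; the 5% significance threshold is my own choice), this tests whether Outlook and Play are associated in the 14-row weather table by building their contingency table:

```python
from scipy.stats import chi2_contingency

# Contingency table of Outlook vs Play from the 14 weather examples:
#             Yes  No
# sunny        2    3
# overcast     4    0
# rain         3    2
observed = [[2, 3], [4, 0], [3, 2]]

chi2, p_value, dof, expected = chi2_contingency(observed)
print(f"chi2 = {chi2:.2f}, p = {p_value:.3f}")

# Pre-pruning decision: only split if the association is significant (alpha = 0.05).
if p_value < 0.05:
    print("Split on Outlook")
else:
    print("Do not split (association not significant on this small sample)")
```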
Avoid Over-fitting
• 2 strategies for “pruning” the decision tree
o Post-pruning
 First, build the full tree
 Then, prune it by eliminating all splits that do not reduce the estimated error
significantly
 Use a validation set to evaluate the splits
 2 pruning operations:
 Subtree replacement: replace a whole subtree with a single leaf
 Subtree raising: raise a subtree to replace its parent node; instances from the other
branches are redistributed
 Slower than subtree replacement
 Slower than pre-pruning, but the fully-grown tree shows all attribute interactions
C4.5 History
• 1970s-80s  ID3, CHAID
• C4.5 by Quinlan
o Industrial-strength algorithm

• C4.8 is the latest research version; it is implemented in Weka as J4.8
• C5.0 is the latest commercial version of C4.5
Decision Tree Recap
• Build a tree
o Calculate the goodness measure of every unselected attribute
o Pick the attribute with the highest goodness measure
o Go down each possible branch
 If the branch is pure, create a leaf node and stop
 Otherwise, repeat from the first step
o Post-prune the tree using subtree replacement
 Use a lower confidence level and a higher minimum number of instances for noisy data

• Treat missing values as separate values
• Try to pre-discretize numeric attributes
• Use a combination of gain ratio and information gain as the goodness measure
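
For practical use you would normally reach for an existing implementation rather than code the recap by hand. Here is a hedged sketch with scikit-learn (note: its DecisionTreeClassifier is CART-style and uses entropy or Gini rather than gain ratio, and nominal attributes must be encoded numerically first):

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

# The 14-row weather dataset from the slides.
rows = [
    ("sunny", "high", False, "No"),      ("sunny", "high", True, "No"),
    ("overcast", "high", False, "Yes"),  ("rain", "high", False, "Yes"),
    ("rain", "normal", False, "Yes"),    ("rain", "normal", True, "No"),
    ("overcast", "normal", True, "Yes"), ("sunny", "high", False, "No"),
    ("sunny", "normal", False, "Yes"),   ("rain", "normal", False, "Yes"),
    ("sunny", "normal", True, "Yes"),    ("overcast", "high", True, "Yes"),
    ("overcast", "normal", False, "Yes"), ("rain", "high", True, "No"),
]
df = pd.DataFrame(rows, columns=["outlook", "humidity", "windy", "play"])

# One-hot encode the nominal attributes so the tree can split on them.
X = pd.get_dummies(df[["outlook", "humidity", "windy"]])
y = df["play"]

clf = DecisionTreeClassifier(criterion="entropy", min_samples_leaf=2, random_state=0)
clf.fit(X, y)
print(export_text(clf, feature_names=list(X.columns)))
```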
Association Rules
• Sometimes, you do not have a label that you would like to predict
o You would like to know the underlying patterns in your data
o Exploration

• You need unsupervised learning


o Clustering
o Association rules
 For example, to find the sets of items that people often buy together in a
supermarket
Transactions Example

The same nine transactions shown three ways: item names, abbreviated items (A = Milk, B = Bread, C = Cereal, D = Sugar, E = Eggs), and a binary matrix.

  ID  Items                         ID  Items          ID  A  B  C  D  E
  1   Milk, Bread, Eggs             1   A, B, E        1   1  1  0  0  1
  2   Bread, Sugar                  2   B, D           2   0  1  0  1  0
  3   Bread, Cereal                 3   B, C           3   0  1  1  0  0
  4   Milk, Bread, Sugar            4   A, B, D        4   1  1  0  1  0
  5   Milk, Cereal                  5   A, C           5   1  0  1  0  0
  6   Bread, Cereal                 6   B, C           6   0  1  1  0  0
  7   Milk, Cereal                  7   A, C           7   1  0  1  0  0
  8   Milk, Bread, Cereal, Eggs     8   A, B, C, E     8   1  1  1  0  1
  9   Milk, Bread, Cereal           9   A, B, C        9   1  1  1  0  0
Definitions
•  We are trying to find the most frequent patterns in the dataset
o Item: a value of an attribute
o Itemset: a subset of the possible items
 For example, I = {A, B, E} (order is unimportant)
o Transaction: a tuple (ID, itemset)
o Support of an itemset I: the total number of transactions that contain I
 sup({A, B, E}) = 2
 sup({B, C}) ?
o Frequent itemset: an itemset with support ≥ the minimum support count
Generating Frequent Itemsets
• First, we have to generate every possible itemset
o Start with itemsets that contain only 1 item
o Remove all itemsets that are not frequent
o Next, generate the 2-item sets from the surviving 1-item sets
 {A, B} can only be frequent if both {A} and {B} are frequent, so we only combine frequent itemsets
 Given the 3-item frequent itemsets {A,B,C}, {A,B,D}, {A,C,D}, {A,C,E}, {B,C,D}
 Is {A,B,C,D} a possible 4-item frequent itemset?
 What about {A,C,D,E}?

• Repeat all the steps until we have generated all the possible itemsets (see the sketch below)
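
A minimal Apriori-style sketch (plain Python, the names are my own) over the nine letter-coded transactions: it generates k-item candidates only from frequent (k−1)-item sets, prunes candidates with an infrequent subset, and keeps the candidates that meet the minimum support count.

```python
from itertools import combinations

transactions = [
    {"A", "B", "E"}, {"B", "D"}, {"B", "C"}, {"A", "B", "D"}, {"A", "C"},
    {"B", "C"}, {"A", "C"}, {"A", "B", "C", "E"}, {"A", "B", "C"},
]
MIN_SUPPORT = 2

def support(itemset):
    """Number of transactions that contain every item of the itemset."""
    return sum(1 for t in transactions if itemset <= t)

def frequent_itemsets():
    items = {item for t in transactions for item in t}
    frequent = [frozenset({i}) for i in items if support({i}) >= MIN_SUPPORT]
    all_frequent = list(frequent)
    k = 2
    while frequent:
        # Join step: merge (k-1)-item sets into size-k candidates, then keep only
        # candidates whose (k-1)-item subsets are all frequent (Apriori property).
        candidates = {a | b for a in frequent for b in frequent if len(a | b) == k}
        candidates = {c for c in candidates
                      if all(frozenset(s) in frequent for s in combinations(c, k - 1))}
        frequent = [c for c in candidates if support(c) >= MIN_SUPPORT]
        all_frequent += frequent
        k += 1
    return all_frequent

for itemset in frequent_itemsets():
    print(sorted(itemset), support(itemset))
```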
Association Rules
• Given a set {A, B, E}, the possible association rules are
o A => {B, E}
o {A, B} => E
o {A, E} => B
o B => {A, E}
o {B, E} => A
o E => {A, B}
o {} => {A, B, E}
Rule Support and Confidence
•  We are only interested in “good” rules
• Goodness measures:
o Given an association rule R: I => J
o Support(R) = sup(I) + sup(J) − sup(I ∪ J)
 (by inclusion-exclusion, the number of transactions that contain I or J)
o Confidence(R) = sup(J) / Support(R)

• We only keep rules whose support and confidence are ≥ the minimum
support and confidence
Example
• Given the set {A, B, E}, which association rules have minimum support =
2 and minimum confidence = 0.5?
o A => {B, E}
 Support(R) = support({A}) + support({B, E}) – support({A, B, E}) = 6 + 2 – 2 = 6
 Confidence(R) = support({B, E}) / Support(R) = 2/6 = 0.33
 Not a good rule, because its confidence is lower than the minimum confidence
Example
• Given the set {A, B, E}, which association rules have minimum support =
2 and minimum confidence = 0.5?
o A => {B, E} : support = 6, confidence = 0.33
o {A, B} => E : support = ?, confidence = ?
o {A, E} => B : support = ?, confidence = ?
o B => {A, E} : support = ?, confidence = ?
o {B, E} => A : support = ?, confidence = ?
o E => {A, B} : support = ?, confidence = ?
o {} => {A, B, E} : support = ?, confidence = ?
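
To check the remaining rules, here is a minimal sketch (plain Python, the names are my own) that applies the slide's own support and confidence formulas to every rule with a non-empty antecedent generated from {A, B, E} over the nine transactions:

```python
from itertools import combinations

transactions = [
    {"A", "B", "E"}, {"B", "D"}, {"B", "C"}, {"A", "B", "D"}, {"A", "C"},
    {"B", "C"}, {"A", "C"}, {"A", "B", "C", "E"}, {"A", "B", "C"},
]

def sup(itemset):
    return sum(1 for t in transactions if set(itemset) <= t)

def rule_support(lhs, rhs):
    # Slide formula: sup(I) + sup(J) - sup(I u J)
    return sup(lhs) + sup(rhs) - sup(set(lhs) | set(rhs))

def rule_confidence(lhs, rhs):
    # Slide formula: sup(J) / Support(R)
    return sup(rhs) / rule_support(lhs, rhs)

itemset = {"A", "B", "E"}
for k in range(1, len(itemset)):
    for lhs in combinations(sorted(itemset), k):
        rhs = itemset - set(lhs)
        s, c = rule_support(set(lhs), rhs), rule_confidence(set(lhs), rhs)
        print(f"{set(lhs)} => {rhs}: support = {s}, confidence = {c:.2f}")
```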
WEKA
• Input: Weather.nominal.arff
• Associate: weka.associations.Apriori
Filtering Association Rules
•  A large dataset can lead to a very large number of association rules
o Minimum support and confidence are not sufficient to filter the rules

• Use Lift to filter a rule R: I => J
o Probability of an itemset: P(I) = sup(I) / N, where N is the total number of transactions
o Lift(R) = P(I ∪ J) / (P(I) × P(J))

• If Lift > 1, I and J are positively correlated
• If Lift = 1, I and J are independent
• If Lift < 1, I and J are negatively correlated
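
A small continuation of the previous sketch (plain Python; it uses the standard lift definition reconstructed above) that computes the lift of A => {B, E} on the same nine transactions:

```python
transactions = [
    {"A", "B", "E"}, {"B", "D"}, {"B", "C"}, {"A", "B", "D"}, {"A", "C"},
    {"B", "C"}, {"A", "C"}, {"A", "B", "C", "E"}, {"A", "B", "C"},
]
N = len(transactions)

def prob(itemset):
    """P(itemset) = number of transactions containing it / total transactions."""
    return sum(1 for t in transactions if itemset <= t) / N

def lift(lhs, rhs):
    return prob(lhs | rhs) / (prob(lhs) * prob(rhs))

# Lift > 1: positively correlated, = 1: independent, < 1: negatively correlated.
print(round(lift({"A"}, {"B", "E"}), 2))  # > 1 here, so A and {B, E} co-occur more than chance
```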
Association Rules Recap
•  Find all possible frequent itemsets
o Start with 1-item itemsets and work up to N-item itemsets
 Create the N-item itemsets based on the results of the (N−1)-item itemsets
o Remove all itemsets whose support is below the minimum support count

• Generate all possible rules from all of the frequent itemsets
o Remove all rules with support and confidence lower than the minimum support
and confidence
o For a large dataset, also remove all rules with lift ≤ 1
Interesting Applications
• You might find unusual associations:
o People who buy milk usually buy bread at the same time
o People who buy soymilk usually do not buy bread at the same time
o People who buy diapers usually buy beer at the same time
o Customers who buy Barbie dolls have a 60% likelihood of buying one of three
available candy bars

• We can use association rules to


o Re-arrange the store layout, customize client offers, find unusual events
Classification vs Association Rules
• Classification Rules
o Focus on one target field
o Specify class in all cases
o Measure: Accuracy, Gain Ratio, Information Gain, etc.
• Association Rules
o Many target fields
o Applicable in some cases
o Measures: Support, Confidence, Lift
