Automatic Knowledge Acquisition


Topic 4 Automatic Knowledge Acquisition PART II

Contents
5.1 The Bottleneck of Knowledge Acquisition
5.2 Inductive Learning: Decision Trees
5.3 Converting Decision Trees into Rules
5.4 Generating Decision Trees: Information Gain

Deriving Decision Trees from Case Data

5. Automatic Knowledge Acquisition


Deriving Decision Trees: random-tree
There are various ways to derive decision trees from case data. A simple method, called Random-Tree, is described on the next few slides. We assume that all given attributes are discrete (not continuous), and that the expert classification is binary (yes or no, true or false, treat or don't-treat, etc.).

5. Automatic Knowledge Acquisition


Deriving Decision Trees: random-tree
Let's take a case study: will someone play tennis, given the weather?
Case  Outlook   Temp.  Humidity  Wind    Play?
1     sunny     hot    high      weak    no
2     sunny     hot    high      strong  no
3     overcast  hot    high      weak    yes
4     rain      mild   high      weak    yes
5     rain      cool   normal    strong  no
6     overcast  cool   normal    strong  yes
7     sunny     mild   high      weak    no
8     sunny     cool   normal    weak    yes
9     rain      mild   normal    weak    yes
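For the sketches that follow, the case table might be represented in Python as a simple list of dictionaries; this representation is an assumption of these notes, not part of the original slides:

```python
# The nine tennis cases from the table above, one dictionary per case.
CASES = [
    {"outlook": "sunny",    "temp": "hot",  "humidity": "high",   "wind": "weak",   "play": "no"},
    {"outlook": "sunny",    "temp": "hot",  "humidity": "high",   "wind": "strong", "play": "no"},
    {"outlook": "overcast", "temp": "hot",  "humidity": "high",   "wind": "weak",   "play": "yes"},
    {"outlook": "rain",     "temp": "mild", "humidity": "high",   "wind": "weak",   "play": "yes"},
    {"outlook": "rain",     "temp": "cool", "humidity": "normal", "wind": "strong", "play": "no"},
    {"outlook": "overcast", "temp": "cool", "humidity": "normal", "wind": "strong", "play": "yes"},
    {"outlook": "sunny",    "temp": "mild", "humidity": "high",   "wind": "weak",   "play": "no"},
    {"outlook": "sunny",    "temp": "cool", "humidity": "normal", "wind": "weak",   "play": "yes"},
    {"outlook": "rain",     "temp": "mild", "humidity": "normal", "wind": "weak",   "play": "yes"},
]
ATTRIBUTES = ["outlook", "temp", "humidity", "wind"]   # "play" is the expert's decision
```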



5. Automatic Knowledge Acquisition


Deriving Decision Trees: random-tree
To develop a decision tree with Random-Tree, we select an attribute at random, e.g., Humidity. We make this attribute the root node of the tree:

Humidity?
  normal
  high

5. Automatic Knowledge Acquisition


Deriving Decision Trees: random-tree
We then list, for each branch, the cases which fit that branch (the case table is repeated on each slide for reference):

Humidity?
  normal: cases 5, 6, 8, 9
  high:   cases 1, 2, 3, 4, 7

5. Automatic Knowledge Acquisition


Deriving Decision Trees: random-tree
If all of the cases on a branch share the same conclusion, we make that branch a leaf and just show the decision. That is not the case here, since the decisions are mixed on both branches:

Humidity?
  normal: cases 5, 6, 8, 9    (Play = no, yes, yes, yes)
  high:   cases 1, 2, 3, 4, 7 (Play = no, no, yes, yes, no)

5. Automatic Knowledge Acquisition


Deriving Decision Trees: random-tree
For non-terminal nodes, we then select a second attribute at random, and create its branches:

Humidity?
  normal -> Wind?
    strong: cases 5, 6
    weak:   cases 8, 9
  high -> Wind?
    weak:   cases 1, 3, 4, 7
    strong: case 2

5. Automatic Knowledge Acquisition


Deriving Decision Trees: random-tree
And then repeat the previous steps until every branch ends in a leaf:

Humidity?
  normal -> Wind?
    strong -> Outlook?
      rain:     no   (case 5)
      overcast: yes  (case 6)
    weak -> yes            (cases 8, 9)
  high -> Wind?
    weak -> Outlook?
      sunny:    no   (cases 1, 7)
      overcast: yes  (case 3)
      rain:     yes  (case 4)
    strong -> no           (case 2)

5. Automatic Knowledge Acquisition


Deriving Decision Trees: random-tree
Trees will differ depending on the order in which attributes are used. Some trees may be smaller or larger (in number of nodes, or in depth) than others. Later, we will look at a means of producing compact trees (ID3-Tree).
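The procedure just described can be summarised in a short sketch. This is a minimal illustration assuming the CASES/ATTRIBUTES representation shown earlier; the function names are illustrative, not from the slides:

```python
import random

def random_tree(cases, attributes, target="play"):
    """Build a decision tree by picking splitting attributes at random."""
    decisions = {c[target] for c in cases}
    if len(decisions) == 1:                 # all cases agree: make a leaf
        return decisions.pop()
    if not attributes:                      # no attributes left: majority vote
        values = [c[target] for c in cases]
        return max(set(values), key=values.count)
    attr = random.choice(attributes)        # Random-Tree: pick any attribute
    remaining = [a for a in attributes if a != attr]
    tree = {attr: {}}
    for value in {c[attr] for c in cases}:  # one branch per observed value
        subset = [c for c in cases if c[attr] == value]
        tree[attr][value] = random_tree(subset, remaining, target)
    return tree

# e.g. print(random_tree(CASES, ATTRIBUTES))  # CASES/ATTRIBUTES as defined above
```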


5. Automatic Knowledge Acquisition


Other Models: The Restaurant Problem
Will people wait for a table to become available? Assume that an expert provided the following decision tree. (But... is Sociology an exact science? Or is Medicine?...)


5. Automatic Knowledge Acquisition


Data from experts (their observations and diagnostics) guide us. Perfection is unreachable (even for experts; see X4 and the previous tree). Goal: an equal (or better) prediction success rate on unseen instances, compared with the human expert.


5. Automatic Knowledge Acquisition


The best tree for this data might be:

- Smaller than the expert's tree (this is an advantage: Occam's Razor)
- Both trees agree on the root and two branches
- The best (the only?) measure of quality is the prediction rate on unseen instances
- Later, we will use Information Theory to obtain trees as good as this one

5. Automatic Knowledge Acquisition


For another problem, the following tree was produced from a set of training data:

What's wrong with this tree? Experts notice it. Non-experts do not, nor would a program.


5. Automatic Knowledge Acquisition


What's wrong with this tree? Experts notice it. Non-experts do not, nor would a program.

The problem is, the training data had no cases of diabetic women on their first pregnancy who were renally insufficient. The tree DID cover all observed cases, but not all possible cases! The wrong recommendation would be given in these cases.

Deriving Rules from Decision Trees


5. Automatic Knowledge Acquisition


RULE EXTRACTION
Traversing the tree from root to leaves produces one rule per leaf.

We focus on rules for the class NO because there are fewer of them. The other class is defined using negation by default (as in Prolog).
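A sketch of the extraction step, assuming the nested-dict tree format used in the Random-Tree sketch above; each root-to-leaf path becomes one rule whose conditions are the attribute tests along the path:

```python
def extract_rules(tree, path=()):
    """Turn every root-to-leaf path of a nested-dict tree into a rule."""
    if not isinstance(tree, dict):                  # leaf: emit (conditions, decision)
        return [(list(path), tree)]
    rules = []
    for attribute, branches in tree.items():
        for value, subtree in branches.items():
            rules += extract_rules(subtree, path + ((attribute, value),))
    return rules

# Example output element: ([("humidity", "high"), ("wind", "strong")], "no"),
# which reads as: IF humidity = high AND wind = strong THEN play = no.
```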


Simplification of a Rule


5. Automatic Knowledge Acquisition


RULE SIMPLIFICATION
Sometimes, we can delete conditions from our rules without affecting the results the rules produce:

Original rules:

Simplified rules:

'pregnant' is implied by this being the patient's first pregnancy, so it can be dropped. Dropping 'renal insufficiency' actually improves the working of the rules, because those with renal insufficiency who are diabetic and on their first pregnancy should also not be treated.

5. Automatic Knowledge Acquisition


RULE SIMPLIFICATION
A rule can be simplified by dropping some of its conditions, where dropping them does not affect the decision the rule makes. There are two main ways to drop conditions:

1. The logical approach: where one condition is logically implied by another, the implied condition can be dropped:
   - pregnant & first-pregnancy: but first-pregnancy implies pregnant!
   - Age > 23 and Age > 42: but Age > 42 implies Age > 23!

2. The statistical approach: where a condition can be dropped without changing the decisions the rule makes over a set of data, drop it. Or better: when dropping the condition leaves the decisions unchanged or improves them, drop it.


5. Automatic Knowledge Acquisition


RULE SIMPLIFICATION: the logical approach
Algorithm:
  For each rule:
    For each condition:
      If another condition of this rule logically implies this one, then delete this one.

Logical implication can be derived from the training set:


A condition X is implied by a condition Y if X is true whenever Y is true. E.g., Age > 23 is true whenever Age > 52 is true. E.g., pregnant is true whenever first-pregnancy is true.
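A small sketch of checking implication against the training data, under the assumption that conditions are represented as predicates over a case record (the condition names here are hypothetical):

```python
def implies(cond_y, cond_x, cases):
    """True if cond_x holds in every training case where cond_y holds."""
    return all(cond_x(c) for c in cases if cond_y(c))

# Hypothetical conditions over a case record:
first_pregnancy = lambda c: c.get("first_pregnancy") == "yes"
pregnant        = lambda c: c.get("pregnant") == "yes"

# implies(first_pregnancy, pregnant, cases) should come out True on the data,
# so the condition 'pregnant' can be dropped from any rule that already
# contains 'first-pregnancy'.
```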


5. Automatic Knowledge Acquisition


RULE SIMPLIFICATION: the statistical approach
We need a test data set: a set of cases with the expert's classification which was NOT used to derive the rules. Thus, two sets of data:
- one to derive the rules
- another to simplify them

To test the precision of a rule set:

1. Set SCORE to 0.
2. For each case in the test set, apply the rules to the case data to produce a conclusion. If the estimated conclusion is the same as the expert's, increment SCORE.
3. PRECISION = SCORE / number of cases.
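A direct transcription of this procedure, assuming rules are (conditions, decision) pairs as in the extraction sketch and that a default class is returned when no rule fires (negation by default); the names are illustrative:

```python
def classify(rules, case, default="yes"):
    """Return the decision of the first rule whose conditions all match the case."""
    for conditions, decision in rules:
        if all(case.get(attr) == value for attr, value in conditions):
            return decision
    return default                      # no rule fired: negation by default

def precision(rules, test_cases, target="play"):
    """Fraction of test cases on which the rules agree with the expert."""
    score = sum(1 for c in test_cases if classify(rules, c) == c[target])
    return score / len(test_cases)
```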


5. Automatic Knowledge Acquisition


RULE SIMPLIFICATION: the statistical approach
To simplify rules:
1. Test the precision of the rules.
2. For each rule:
     For each condition of the rule:
       Make a copy of the rule set with this condition deleted.
       Test the precision of the new rule set on the test data.
       If the precision is equal to or better than the original precision:
         Replace the original rule set with the copy.
         Replace the original precision with this one.
         Restart Step 2.
3. We get here when no more conditions can be deleted. The rules are maximally simple.
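A sketch of this loop, reusing the precision function from the previous sketch; a candidate deletion is kept only when precision on the test set stays equal or improves:

```python
import copy

def simplify(rules, test_cases):
    """Greedily drop conditions that do not hurt precision on the test set."""
    best = precision(rules, test_cases)
    changed = True
    while changed:                                  # the "restart Step 2" loop
        changed = False
        for i, (conditions, decision) in enumerate(rules):
            for j in range(len(conditions)):
                candidate = copy.deepcopy(rules)
                del candidate[i][0][j]              # delete one condition
                p = precision(candidate, test_cases)
                if p >= best:                       # equal or better: keep the copy
                    rules, best, changed = candidate, p, True
                    break
            if changed:
                break
    return rules
```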


5. Automatic Knowledge Acquisition


RULE SIMPLIFICATION: the statistical approach
The decision tree from above made a mistake in that it does not deal with cases of renal insufficiency who are diabetic and on their first pregnancy. Assuming there are such cases in our test database, using the statistical approach would lead to the renal insufficiency condition being dropped from our first rule, as this would improve precision.


Deleting Subsumed Rules


5. Automatic Knowledge Acquisition


RULE DELETION
More complete training data would produce a better tree:

Is this tree better or worse?

- It is more complex (larger) and has redundancy.
- But for a doctor, it has better semantics.
- And for a machine, it has better predictive accuracy on the test set.

5. Automatic Knowledge Acquisition


RULE DELETION
Let's look at the rules from this case:

(renal insuff. & pregnant & diabetes & first-preg) -> no
(renal insuff. & high press. & pregnant & diabetes & first-preg) -> no
(renal insuff. & high press.) -> no

5. Automatic Knowledge Acquisition


Simplifying these rules using logic:

(renal insuff. & pregnant & diabetes & first-preg) -> no
(renal insuff. & high press. & pregnant & diabetes & first-preg) -> no
(renal insuff. & high press.) -> no

gives:

(renal insuff. & diabetes & first-preg) -> no
(renal insuff. & high press. & diabetes & first-preg) -> no
(renal insuff. & high press.) -> no

Looking at predictive accuracy, we see that deleting renal insuff. from the first rule does not change the predictions:

(diabetes & first-preg) -> no
(renal insuff. & high press. & diabetes & first-preg) -> no
(renal insuff. & high press.) -> no

Now, the second rule cannot fire unless the first does. So, we can delete the second rule (the cases it covers are a subset of those covered by the first).


5. Automatic Knowledge Acquisition


RULE DELETION

As with deleting conditions from a rule, we can apply the same methods to deleting whole rules:
1. The logical approach: where one rule is logically implied by another, the implied rule can be dropped.
2. The statistical approach: where a rule can be dropped without worsening the predictive accuracy of the rule set as a whole, delete the rule.
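A sketch of the logical check for subsumed rules: a rule can be deleted when another rule with the same conclusion uses a strict subset of its conditions, since the more general rule fires whenever the longer one would (rule representation as in the earlier sketches):

```python
def drop_subsumed(rules):
    """Remove rules whose conditions strictly contain those of another rule
    with the same conclusion."""
    kept = []
    for i, (conds_i, decision_i) in enumerate(rules):
        subsumed = any(
            j != i
            and decision_j == decision_i
            and set(conds_j) < set(conds_i)     # other rule is strictly more general
            for j, (conds_j, decision_j) in enumerate(rules)
        )
        if not subsumed:
            kept.append((conds_i, decision_i))
    return kept
```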


Producing Optimal Decision Trees: ID3-Tree


5. Automatic Knowledge Acquisition


Pseudo-code to generate a decision tree
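The slide's pseudo-code itself is not reproduced above; the usual recursive scheme it describes looks roughly like the following sketch, with the attribute-selection step left as a parameter (random choice for Random-Tree, information gain for ID3); the names are illustrative:

```python
def build_tree(cases, attributes, choose_attribute, target="play"):
    """Generic recursive decision-tree construction."""
    decisions = [c[target] for c in cases]
    if len(set(decisions)) == 1:              # pure node: return the decision
        return decisions[0]
    if not attributes:                        # nothing left to split on: majority vote
        return max(set(decisions), key=decisions.count)
    attr = choose_attribute(cases, attributes, target)
    tree = {attr: {}}
    for value in {c[attr] for c in cases}:    # one branch per observed value
        subset = [c for c in cases if c[attr] == value]
        rest = [a for a in attributes if a != attr]
        tree[attr][value] = build_tree(subset, rest, choose_attribute, target)
    return tree
```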


5. Automatic Knowledge Acquisition

Selecting the best attribute for the root
Random-Tree selects an attribute at random for the root of the tree. This approach instead tries to select the best attribute for the root: we seek the attribute which most determines the expert's decision. ID3 assesses each attribute in terms of how much it helps to make a decision. Using the attribute splits the cases into smaller subsets; the closer these subsets are to being purely one of the decision classes, the better. The formula used is called Information Gain.


5. Automatic Knowledge Acquisition

Information
Suppose we have a set of cases, and the expert judges whether to treat the patient or not. In 50% of the cases the expert proposes treatment, and in the other 50% proposes no treatment. For a given new case, without looking at attributes, the probability of treatment is 50% (we have no information to favor treatment or not). Now, assume we use an attribute to split our cases into two sets:

Set 1: treatment recommended in 75% of cases
Set 2: treatment recommended in 25% of cases

Now, in each subset, we have more information as to what decision to make -> Information Gain.

5. Automatic Knowledge Acquisition

How to Calculate the Information Gain of an Attribute
Firstly, we calculate the information contained before the split. The formula we use is the entropy H:

H(p, q) = -p * log2(p) - q * log2(q)

...where p is the probability of decision 1 and q is the probability of the opposite decision. In our previous case, initially p = 50%, q = 50%:

H(p, q) = -0.5 * log2(0.5) - 0.5 * log2(0.5) = 1 (no information)

Information = 1 - Entropy = 1 - H(p, q)

5. Automatic Knowledge Acquisition

Alternative Formulas:
Both give equal values. Values are always between 0 and 1.
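The slide's formulas themselves are not reproduced above; assuming a binary classification with p positive and n negative cases, two equivalent ways of writing the entropy are:

```latex
% In terms of probabilities (Q = 1 - P):
H(P, Q) = -P \log_2 P \;-\; Q \log_2 Q
% In terms of case counts:
H\!\left(\tfrac{p}{p+n},\, \tfrac{n}{p+n}\right)
  = -\tfrac{p}{p+n}\log_2\tfrac{p}{p+n} \;-\; \tfrac{n}{p+n}\log_2\tfrac{n}{p+n}
```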


5. Automatic Knowledge Acquisition

Special Cases:
- H(1/3, 2/3) = H(2/3, 1/3) = 0.92 bits
- H(1/2, 1/2) = 1 bit (maximum entropy: no information)
- H(1, 0) = 0 bits (minimum entropy: maximum information)


5. Automatic Knowledge Acquisition

How to Calculate the Information Gain of an Attribute
Initially: p = 50%, q = 50%, so H(p, q) = -0.5 * log2(0.5) - 0.5 * log2(0.5) = 1 (no information).

Splitting the data, we get:
H1(p, q) = -0.75 * log2(0.75) - 0.25 * log2(0.25) = 0.811
H2(p, q) = -0.25 * log2(0.25) - 0.75 * log2(0.75) = 0.811

We derive the total entropy of the two subsets by weighting each subset's entropy by the proportion of cases it contains. Let's assume the first set holds 2/3 of the cases:
Hnew(p, q) = 0.66 * H1(p, q) + 0.34 * H2(p, q) = 0.811
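A short numerical check of these figures (a minimal sketch; the variable names are illustrative):

```python
from math import log2

def entropy(p):
    """Binary entropy H(p, 1-p) in bits; H(0) = H(1) = 0 by convention."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * log2(p) - (1 - p) * log2(1 - p)

h_before = entropy(0.5)                  # 1.0 bit before the split
h1 = entropy(0.75)                       # ~0.811 (set 1: 75% treat)
h2 = entropy(0.25)                       # ~0.811 (set 2: 25% treat)
h_after = (2/3) * h1 + (1/3) * h2        # weighted by subset size
gain = h_before - h_after                # ~0.189 bits of information gain
```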

5. Automatic Knowledge Acquisition

How to Calculate the Information Gain of an Attribute
Given that the original entropy of the case data was 1.0, and the entropy of the cases divided by the attribute is 0.811, we have an information gain of 0.189. The idea is that we look at each of the attributes in turn, and choose the attribute which gives us the highest gain in information.
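Applied to the tennis cases from the table at the start of this topic, attribute selection by information gain might be sketched as follows (it reuses the entropy helper from the previous sketch and the CASES/ATTRIBUTES lists defined earlier; all names are illustrative):

```python
def info_gain(cases, attr, target="play"):
    """Entropy of the whole set minus the size-weighted entropy of the subsets."""
    def h(subset):
        p = sum(1 for c in subset if c[target] == "yes") / len(subset)
        return entropy(p)                 # entropy() as defined in the previous sketch
    total = h(cases)
    split = 0.0
    for value in {c[attr] for c in cases}:
        subset = [c for c in cases if c[attr] == value]
        split += len(subset) / len(cases) * h(subset)
    return total - split

# ID3 root selection: pick the attribute with the highest gain, e.g.
# best = max(ATTRIBUTES, key=lambda a: info_gain(CASES, a))
```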


5. Automatic Knowledge Acquisition

The Restaurant case revisited
(The remaining slides work through the restaurant example, applying information gain to choose the attributes of the tree; the figures are not reproduced here.)
