Automatic Knowledge Acquisition


Topic 4 Automatic Knowledge Acquisition PART II

Contents
5.1 The Bottleneck of Knowledge Acquisition
5.2 Inductive Learning: Decision Trees
5.3 Converting Decision Trees into Rules
5.4 Generating Decision Trees: Information Gain

Deriving Decision Trees from Case Data

5. Automatic Knowledge Acquisition


Deriving Decision Trees: random-tree
There are various ways to derive decision trees from case data. A simple method, called Random-Tree, is described on the next few slides. We assume that all given attributes are discrete (not continuous), and that the expert classification is binary (yes or no, true or false, treat or don't-treat, etc.).

5. Automatic Knowledge Acquisition


Deriving Decision Trees: random-tree
Let's take a case study: will someone play tennis, given the weather?
Case  Outlook   Temp.  Humidity  Wind    Play?
1     sunny     hot    high      weak    no
2     sunny     hot    high      strong  no
3     overcast  hot    high      weak    yes
4     rain      mild   high      weak    yes
5     rain      cool   normal    strong  no
6     overcast  cool   normal    strong  yes
7     sunny     mild   high      weak    no
8     sunny     cool   normal    weak    yes
9     rain      mild   normal    weak    yes
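For the sketches that follow, the case table might be represented in Python as a simple list of dictionaries; this representation is an assumption of these notes, not part of the original slides:

```python
# The nine tennis cases from the table above, one dictionary per case.
CASES = [
    {"outlook": "sunny",    "temp": "hot",  "humidity": "high",   "wind": "weak",   "play": "no"},
    {"outlook": "sunny",    "temp": "hot",  "humidity": "high",   "wind": "strong", "play": "no"},
    {"outlook": "overcast", "temp": "hot",  "humidity": "high",   "wind": "weak",   "play": "yes"},
    {"outlook": "rain",     "temp": "mild", "humidity": "high",   "wind": "weak",   "play": "yes"},
    {"outlook": "rain",     "temp": "cool", "humidity": "normal", "wind": "strong", "play": "no"},
    {"outlook": "overcast", "temp": "cool", "humidity": "normal", "wind": "strong", "play": "yes"},
    {"outlook": "sunny",    "temp": "mild", "humidity": "high",   "wind": "weak",   "play": "no"},
    {"outlook": "sunny",    "temp": "cool", "humidity": "normal", "wind": "weak",   "play": "yes"},
    {"outlook": "rain",     "temp": "mild", "humidity": "normal", "wind": "weak",   "play": "yes"},
]
ATTRIBUTES = ["outlook", "temp", "humidity", "wind"]   # "play" is the expert's decision
```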



5. Automatic Knowledge Acquisition


Deriving Decision Trees: random-tree
To develop a decision tree with Random-Tree, we select an attribute at random, e.g., Humidity. We make this attribute the root node of the tree:

Humidity?
  normal
  high

5. Automatic Knowledge Acquisition


Deriving Decision Trees: random-tree
We then list, for each branch, the cases which fit that branch (the case table is repeated on each slide for reference):

Humidity?
  normal: cases 5, 6, 8, 9
  high:   cases 1, 2, 3, 4, 7

5. Automatic Knowledge Acquisition


Deriving Decision Trees: random-tree
If all of the cases on a branch share the same conclusion, we make that branch a leaf and just show the decision. That is not the case here, since the decisions are mixed on both branches:

Humidity?
  normal: cases 5, 6, 8, 9    (Play = no, yes, yes, yes)
  high:   cases 1, 2, 3, 4, 7 (Play = no, no, yes, yes, no)

5. Automatic Knowledge Acquisition


Deriving Decision Trees: random-tree
For non-terminal nodes, we then select a second attribute at random, and create its branches:

Humidity?
  normal -> Wind?
    strong: cases 5, 6
    weak:   cases 8, 9
  high -> Wind?
    weak:   cases 1, 3, 4, 7
    strong: case 2

5. Automatic Knowledge Acquisition


Deriving Decision Trees: random-tree
And then repeat the previous steps until every branch ends in a leaf:

Humidity?
  normal -> Wind?
    strong -> Outlook?
      rain:     no   (case 5)
      overcast: yes  (case 6)
    weak -> yes            (cases 8, 9)
  high -> Wind?
    weak -> Outlook?
      sunny:    no   (cases 1, 7)
      overcast: yes  (case 3)
      rain:     yes  (case 4)
    strong -> no           (case 2)

5. Automatic Knowledge Acquisition


Deriving Decision Trees: random-tree
Trees will differ depending on the order in which attributes are used. Some trees may be smaller or larger (in number of nodes, or in depth) than others. Later, we will look at a means of producing compact trees (ID3-Tree).
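The procedure just described can be summarised in a short sketch. This is a minimal illustration assuming the CASES/ATTRIBUTES representation shown earlier; the function names are illustrative, not from the slides:

```python
import random

def random_tree(cases, attributes, target="play"):
    """Build a decision tree by picking splitting attributes at random."""
    decisions = {c[target] for c in cases}
    if len(decisions) == 1:                 # all cases agree: make a leaf
        return decisions.pop()
    if not attributes:                      # no attributes left: majority vote
        values = [c[target] for c in cases]
        return max(set(values), key=values.count)
    attr = random.choice(attributes)        # Random-Tree: pick any attribute
    remaining = [a for a in attributes if a != attr]
    tree = {attr: {}}
    for value in {c[attr] for c in cases}:  # one branch per observed value
        subset = [c for c in cases if c[attr] == value]
        tree[attr][value] = random_tree(subset, remaining, target)
    return tree

# e.g. print(random_tree(CASES, ATTRIBUTES))  # CASES/ATTRIBUTES as defined above
```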


5. Automatic Knowledge Acquisition


Other Models: The Restaurant Problem
Will people wait for a table to become available? Assume that an expert provided the following decision tree. (But... is Sociology an exact science? Or is Medicine?...)


5. Automatic Knowledge Acquisition


Data from experts (their observations and diagnostics) guide us. Perfection is unreachable (even for experts; see X4 and the previous tree). Goal: an equal (or better) prediction success rate on unseen instances, compared with the human expert.


5. Automatic Knowledge Acquisition


The best tree for this data might be:

- Smaller than the expert's tree (this is an advantage: Occam's Razor)
- Both trees agree on the root and two branches
- The best (the only?) measure of quality is the prediction rate on unseen instances
- Later, we will use Information Theory to obtain trees as good as this one

5. Automatic Knowledge Acquisition


For another problem, the following tree was produced from a set of training data:

What's wrong with this tree? Experts notice it. Non-experts do not, nor would a program.


5. Automatic Knowledge Acquisition


What's wrong with this tree? Experts notice it. Non-experts do not, nor would a program.

The problem is, the training data had no cases of diabetic women on their first pregnancy who were renally insufficient. The tree DID cover all observed cases, but not all possible cases! The wrong recommendation would be given in these cases.

Deriving Rules from Decision Trees


5. Automatic Knowledge Acquisition


RULE EXTRACTION
Traversing the tree from root to leaves produces one rule per leaf.

We focus on rules for the class NO because there are fewer of them. The other class is defined using negation by default (as in Prolog).
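A sketch of the extraction step, assuming the nested-dict tree format used in the Random-Tree sketch above; each root-to-leaf path becomes one rule whose conditions are the attribute tests along the path:

```python
def extract_rules(tree, path=()):
    """Turn every root-to-leaf path of a nested-dict tree into a rule."""
    if not isinstance(tree, dict):                  # leaf: emit (conditions, decision)
        return [(list(path), tree)]
    rules = []
    for attribute, branches in tree.items():
        for value, subtree in branches.items():
            rules += extract_rules(subtree, path + ((attribute, value),))
    return rules

# Example output element: ([("humidity", "high"), ("wind", "strong")], "no"),
# which reads as: IF humidity = high AND wind = strong THEN play = no.
```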


Simplification of a Rule


5. Automatic Knowledge Acquisition


RULE SIMPLIFICATION
Sometimes, we can delete conditions from our rules without affecting the results the rules produce:

Original rules:

Simplified rules:

'pregnant' is implied by this being the patient's first pregnancy, so it can be dropped. Dropping 'renal insufficiency' actually improves the working of the rules, because those with renal insufficiency who are diabetic and on their first pregnancy should also not be treated.

5. Automatic Knowledge Acquisition


RULE SIMPLIFICATION
A rule can be simplified by dropping some of its conditions, where dropping them does not affect the decision the rule makes. There are two main ways to drop conditions:

1. The logical approach: where one condition is logically implied by another, the implied condition can be dropped:
   - pregnant & first-pregnancy: but first-pregnancy implies pregnant!
   - Age > 23 and Age > 42: but Age > 42 implies Age > 23!

2. The statistical approach: where a condition can be dropped without changing the decisions the rule makes over a set of data, drop it. Or better: when dropping the condition leaves the decisions unchanged or improves them, drop it.


5. Automatic Knowledge Acquisition


RULE SIMPLIFICATION: the logical approach
Algorithm:
  For each rule:
    For each condition:
      If another condition of this rule logically implies this one, then delete this one.

Logical implication can be derived from the training set:


A condition X is implied by a condition Y if X is true whenever Y is true. E.g., Age > 23 is true whenever Age > 52 is true. E.g., pregnant is true whenever first-pregnancy is true.
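A small sketch of checking implication against the training data, under the assumption that conditions are represented as predicates over a case record (the condition names here are hypothetical):

```python
def implies(cond_y, cond_x, cases):
    """True if cond_x holds in every training case where cond_y holds."""
    return all(cond_x(c) for c in cases if cond_y(c))

# Hypothetical conditions over a case record:
first_pregnancy = lambda c: c.get("first_pregnancy") == "yes"
pregnant        = lambda c: c.get("pregnant") == "yes"

# implies(first_pregnancy, pregnant, cases) should come out True on the data,
# so the condition 'pregnant' can be dropped from any rule that already
# contains 'first-pregnancy'.
```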


5. Automatic Knowledge Acquisition


RULE SIMPLIFICATION: the statistical approach
We need a test data set: a set of cases with the expert's classification which was NOT used to derive the rules. Thus, two sets of data:
- one to derive the rules
- another to simplify them

To test the precision of a rule set:

1. Set SCORE to 0.
2. For each case in the test set, apply the rules to the case data to produce a conclusion. If the estimated conclusion is the same as the expert's, increment SCORE.
3. PRECISION = SCORE / number of cases.
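A direct transcription of this procedure, assuming rules are (conditions, decision) pairs as in the extraction sketch and that a default class is returned when no rule fires (negation by default); the names are illustrative:

```python
def classify(rules, case, default="yes"):
    """Return the decision of the first rule whose conditions all match the case."""
    for conditions, decision in rules:
        if all(case.get(attr) == value for attr, value in conditions):
            return decision
    return default                      # no rule fired: negation by default

def precision(rules, test_cases, target="play"):
    """Fraction of test cases on which the rules agree with the expert."""
    score = sum(1 for c in test_cases if classify(rules, c) == c[target])
    return score / len(test_cases)
```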


5. Automatic Knowledge Acquisition


RULE SIMPLIFICATION: the statistical approach
To simplify rules:
1. Test the precision of the rules.
2. For each rule:
     For each condition of the rule:
       Make a copy of the rule set with this condition deleted.
       Test the precision of the new rule set on the test data.
       If the precision is equal to or better than the original precision:
         Replace the original rule set with the copy.
         Replace the original precision with this one.
         Restart Step 2.
3. We get here when no more conditions can be deleted. The rules are maximally simple.
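A sketch of this loop, reusing the precision function from the previous sketch; a candidate deletion is kept only when precision on the test set stays equal or improves:

```python
import copy

def simplify(rules, test_cases):
    """Greedily drop conditions that do not hurt precision on the test set."""
    best = precision(rules, test_cases)
    changed = True
    while changed:                                  # the "restart Step 2" loop
        changed = False
        for i, (conditions, decision) in enumerate(rules):
            for j in range(len(conditions)):
                candidate = copy.deepcopy(rules)
                del candidate[i][0][j]              # delete one condition
                p = precision(candidate, test_cases)
                if p >= best:                       # equal or better: keep the copy
                    rules, best, changed = candidate, p, True
                    break
            if changed:
                break
    return rules
```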


5. Automatic Knowledge Acquisition


RULE SIMPLIFICATION: the statistical approach
The decision tree from above made a mistake in that it does not deal with cases of renal insufficiency who are diabetic and on their first pregnancy. Assuming there are such cases in our test database, using the statistical approach would lead to the renal insufficiency condition being dropped from our first rule, as this would improve precision.


Deleting Subsumed Rules


5. Automatic Knowledge Acquisition


RULE DELETION
More complete training data would produce a better tree:

Is this tree better or worse?

- It is more complex (larger) and has redundancy.
- But for a doctor, it has better semantics.
- And for a machine, it has better predictive accuracy on the test set.

5. Automatic Knowledge Acquisition


RULE DELETION
Let's look at the rules from this case:

(renal insuff. & pregnant & diabetes & first-preg) -> no
(renal insuff. & high press. & pregnant & diabetes & first-preg) -> no
(renal insuff. & high press.) -> no

5. Automatic Knowledge Acquisition


Simplifying these rules using logic:

(renal insuff. & pregnant & diabetes & first-preg) -> no
(renal insuff. & high press. & pregnant & diabetes & first-preg) -> no
(renal insuff. & high press.) -> no

gives:

(renal insuff. & diabetes & first-preg) -> no
(renal insuff. & high press. & diabetes & first-preg) -> no
(renal insuff. & high press.) -> no

Looking at predictive accuracy, we see that deleting renal insuff. from the first rule does not change the predictions:

(diabetes & first-preg) -> no
(renal insuff. & high press. & diabetes & first-preg) -> no
(renal insuff. & high press.) -> no

Now, the second rule cannot fire unless the first does. So, we can delete the second rule (the cases it covers are a subset of those covered by the first).


5. Automatic Knowledge Acquisition


RULE DELETION

As with deleting conditions from a rule, we can apply the same methods to deleting whole rules:
1. The logical approach: where one rule is logically implied by another, the implied rule can be dropped.
2. The statistical approach: where a rule can be dropped without worsening the predictive accuracy of the rule set as a whole, delete the rule.
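A sketch of the logical check for subsumed rules: a rule can be deleted when another rule with the same conclusion uses a strict subset of its conditions, since the more general rule fires whenever the longer one would (rule representation as in the earlier sketches):

```python
def drop_subsumed(rules):
    """Remove rules whose conditions strictly contain those of another rule
    with the same conclusion."""
    kept = []
    for i, (conds_i, decision_i) in enumerate(rules):
        subsumed = any(
            j != i
            and decision_j == decision_i
            and set(conds_j) < set(conds_i)     # other rule is strictly more general
            for j, (conds_j, decision_j) in enumerate(rules)
        )
        if not subsumed:
            kept.append((conds_i, decision_i))
    return kept
```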


Producing Optimal Decision Trees: ID3-Tree


5. Automatic Knowledge Acquisition


Pseudo-code to generate a decision tree
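The slide's pseudo-code itself is not reproduced above; the usual recursive scheme it describes looks roughly like the following sketch, with the attribute-selection step left as a parameter (random choice for Random-Tree, information gain for ID3); the names are illustrative:

```python
def build_tree(cases, attributes, choose_attribute, target="play"):
    """Generic recursive decision-tree construction."""
    decisions = [c[target] for c in cases]
    if len(set(decisions)) == 1:              # pure node: return the decision
        return decisions[0]
    if not attributes:                        # nothing left to split on: majority vote
        return max(set(decisions), key=decisions.count)
    attr = choose_attribute(cases, attributes, target)
    tree = {attr: {}}
    for value in {c[attr] for c in cases}:    # one branch per observed value
        subset = [c for c in cases if c[attr] == value]
        rest = [a for a in attributes if a != attr]
        tree[attr][value] = build_tree(subset, rest, choose_attribute, target)
    return tree
```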


5. Automatic Knowledge Acquisition

Selecting the best attribute for the root
Random-Tree selects an attribute at random for the root of the tree. This approach instead tries to select the best attribute for the root: we seek the attribute which most determines the expert's decision. ID3 assesses each attribute in terms of how much it helps to make a decision. Using the attribute splits the cases into smaller subsets; the closer these subsets are to being purely one of the decision classes, the better. The formula used is called Information Gain.


5. Automatic Knowledge Acquisition

Information
Suppose we have a set of cases, and the expert judges whether to treat the patient or not. In 50% of the cases the expert proposes treatment, and in the other 50% proposes no treatment. For a given new case, without looking at attributes, the probability of treatment is 50% (we have no information to favor treatment or not). Now, assume we use an attribute to split our cases into two sets:

Set 1: treatment recommended in 75% of cases
Set 2: treatment recommended in 25% of cases

Now, in each subset, we have more information as to what decision to make -> Information Gain.

5. Automatic Knowledge Acquisition

How to Calculate the Information Gain of an Attribute
Firstly, we calculate the information contained before the split. The formula we use is the entropy H:

H(p, q) = -p * log2(p) - q * log2(q)

...where p is the probability of decision 1 and q is the probability of the opposite decision. In our previous case, initially p = 50%, q = 50%:

H(p, q) = -0.5 * log2(0.5) - 0.5 * log2(0.5) = 1 (no information)

Information = 1 - Entropy = 1 - H(p, q)

5. Automatic Knowledge Acquisition

Alternative Formulas:
Both give equal values. Values are always between 0 and 1.
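The slide's formulas themselves are not reproduced above; assuming a binary classification with p positive and n negative cases, two equivalent ways of writing the entropy are:

```latex
% In terms of probabilities (Q = 1 - P):
H(P, Q) = -P \log_2 P \;-\; Q \log_2 Q
% In terms of case counts:
H\!\left(\tfrac{p}{p+n},\, \tfrac{n}{p+n}\right)
  = -\tfrac{p}{p+n}\log_2\tfrac{p}{p+n} \;-\; \tfrac{n}{p+n}\log_2\tfrac{n}{p+n}
```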


5. Automatic Knowledge Acquisition

Special Cases:
- H(1/3, 2/3) = H(2/3, 1/3) = 0.92 bits
- H(1/2, 1/2) = 1 bit (maximum entropy: no information)
- H(1, 0) = 0 bits (minimum entropy: maximum information)


5. Automatic Knowledge Acquisition

How to Calculate the Information Gain of an Attribute
Initially: p = 50%, q = 50%, so H(p, q) = -0.5 * log2(0.5) - 0.5 * log2(0.5) = 1 (no information).

Splitting the data, we get:
H1(p, q) = -0.75 * log2(0.75) - 0.25 * log2(0.25) = 0.811
H2(p, q) = -0.25 * log2(0.25) - 0.75 * log2(0.75) = 0.811

We derive the total entropy of the two subsets by weighting each subset's entropy by the proportion of cases it contains. Let's assume the first set holds 2/3 of the cases:
Hnew(p, q) = 0.66 * H1(p, q) + 0.34 * H2(p, q) = 0.811
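A short numerical check of these figures (a minimal sketch; the variable names are illustrative):

```python
from math import log2

def entropy(p):
    """Binary entropy H(p, 1-p) in bits; H(0) = H(1) = 0 by convention."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * log2(p) - (1 - p) * log2(1 - p)

h_before = entropy(0.5)                  # 1.0 bit before the split
h1 = entropy(0.75)                       # ~0.811 (set 1: 75% treat)
h2 = entropy(0.25)                       # ~0.811 (set 2: 25% treat)
h_after = (2/3) * h1 + (1/3) * h2        # weighted by subset size
gain = h_before - h_after                # ~0.189 bits of information gain
```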

5. Automatic Knowledge Acquisition

How to Calculate the Information Gain of an Attribute
Given that the original entropy of the case data was 1.0, and the entropy of the cases divided by the attribute is 0.811, we have an information gain of 0.189. The idea is that we look at each of the attributes in turn, and choose the attribute which gives us the highest gain in information.
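Applied to the tennis cases from the table at the start of this topic, attribute selection by information gain might be sketched as follows (it reuses the entropy helper from the previous sketch and the CASES/ATTRIBUTES lists defined earlier; all names are illustrative):

```python
def info_gain(cases, attr, target="play"):
    """Entropy of the whole set minus the size-weighted entropy of the subsets."""
    def h(subset):
        p = sum(1 for c in subset if c[target] == "yes") / len(subset)
        return entropy(p)                 # entropy() as defined in the previous sketch
    total = h(cases)
    split = 0.0
    for value in {c[attr] for c in cases}:
        subset = [c for c in cases if c[attr] == value]
        split += len(subset) / len(cases) * h(subset)
    return total - split

# ID3 root selection: pick the attribute with the highest gain, e.g.
# best = max(ATTRIBUTES, key=lambda a: info_gain(CASES, a))
```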


5. Automatic Knowledge Acquisition

The Restaurant case revisited
(The remaining slides work through the restaurant example, applying information gain to choose the attributes of the tree; the figures are not reproduced here.)
