
UNIT 3

MACHINE LEARNING
TREE MODELS
• Feature Tree:
A compact way of representing a number of
conjunctive concepts in the hypothesis space.

• Tree:
1. Internal nodes are labelled with features.
2. Edges are labelled with literals.
3. Split: the set of literals at a node.
4. Leaf: a logical expression, the conjunction of the literals
on the path from the root to that leaf.
TREE MODELS
• Generic Algorithm
Three functions:
1. Homogeneous(D) → true if all instances in D belong to
a single class (True/False).

2. Label(D) → returns the (majority) label of D.

3. BestSplit(D,F) → returns the feature on which the dataset D
is divided into two or more subsets.
TREE MODELS
• Divide-and-Conquer algorithm:
divides the data into subsets, builds a tree for each
of those, and then combines the subtrees into a
single tree.
• Greedy:
whenever there is a choice (such as choosing the
best split), the best alternative is selected on the basis
of the information then available, and this choice is
never reconsidered.
• Backtracking search algorithm:
can return an optimal solution, at the expense
of increased computation time and memory
requirements.
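
A minimal Python sketch of this greedy divide-and-conquer scheme, assuming the three functions above (Homogeneous, Label, BestSplit) are supplied by the caller; the name grow_tree, the tuple-based tree representation, and the shape of BestSplit's return value are illustrative assumptions, not from any particular library.

```python
# Illustrative sketch of the generic divide-and-conquer tree learner.
# homogeneous(D), label(D) and best_split(D, F) are assumed to behave as
# described above; best_split is assumed to return the chosen feature
# together with the partition of D it induces.

def grow_tree(D, F, homogeneous, label, best_split):
    if homogeneous(D):                        # all instances share one class
        return ("leaf", label(D))
    feature, partition = best_split(D, F)     # greedy choice, never revisited
    children = []
    for literal, Di in partition:             # partition: list of (literal, subset)
        if not Di:                            # empty child: use the parent's label
            children.append((literal, ("leaf", label(D))))
        else:                                 # conquer each subset recursively
            children.append((literal, grow_tree(Di, F, homogeneous,
                                                label, best_split)))
    return ("node", feature, children)        # combine subtrees into one tree
```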
DECISION TREES
• Classification Task on a dataset D:
Homogeneous(D) → return the single class as Label(D).
Not Homogeneous(D) → Label(D) returns the majority class of D.
When D is split into children D1, D2, …, a child Di = Ø (empty)
is labelled with the majority class of its parent D.
DECISION TREES
• The best split is a pure one: D1+ = D+ and D1- = Ø,
while D2- = D- and D2+ = Ø.

• Impurity is defined in terms of n+ and n- and depends only on
their relative magnitude.

• It is therefore measured as a proportion → ṗ = n+ / (n+ + n-), the empirical
probability of the positive class.

• Aim: we need an impurity function that returns

0 if ṗ = 0 or ṗ = 1, and
reaches its maximum value at ṗ = 1/2.
FUNCTIONS
1. MINORITY CLASS (Error Rate)

2. GINI INDEX (Expected Error Rate)

3. ENTROPY (Expected Information)
MINORITY CLASS
• min(ṗ, 1 − ṗ): the error rate when every instance in the leaf is
labelled with the majority class.

• The minority class is proportional to the number of misclassified examples.

• Example: Spam = 40 (majority class), Ham = 10 → the 10 ham
e-mails are misclassified (minority class).

• If the set of instances is pure, the minority class is empty → no errors.

• Written as an impurity measure: min(ṗ, 1 − ṗ) = 1/2 − |ṗ − 1/2|.

GINI INDEX
• It is an expected error rate.

• Assign labels to instances randomly, according to the class distribution
in the leaf: positive with probability ṗ, negative with probability 1 − ṗ.

• Probability of a false positive → ṗ (1 − ṗ)

• Probability of a false negative → (1 − ṗ) ṗ

• Expected error rate (Gini index) = 2 ṗ (1 − ṗ).


ENTROPY
• It is the expected information needed to determine the class of an instance.

• Formula: −ṗ log2 ṗ − (1 − ṗ) log2 (1 − ṗ)
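
As a quick illustration, here is a small Python sketch of the three impurity measures as functions of ṗ; the helper names are my own, not from any particular library.

```python
import math

def minority_class(p):
    """Error rate min(p, 1 - p) = 1/2 - |p - 1/2|."""
    return min(p, 1 - p)

def gini_index(p):
    """Expected error rate 2p(1 - p) under random labelling."""
    return 2 * p * (1 - p)

def entropy(p):
    """Expected information -p log2(p) - (1 - p) log2(1 - p)."""
    if p in (0.0, 1.0):          # define 0 log 0 = 0
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

# All three are 0 at p = 0 or p = 1 and maximal at p = 1/2:
# minority_class(0.5) = 0.5, gini_index(0.5) = 0.5, entropy(0.5) = 1.0
```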


(Figures: example decision trees and plots of the entropy and Gini index impurity functions.)
Decision Tree
• For K > 2 classes:

• One-vs-rest, or apply multi-class impurity measures directly:

• K-class entropy = −Σi ṗi log2 ṗi

• K-class Gini index = Σi ṗi (1 − ṗi) = 1 − Σi ṗi²
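
A hedged sketch of these multi-class versions, computed from a list of class counts (the function names are illustrative):

```python
import math

def k_class_entropy(counts):
    """-sum_i p_i log2 p_i over the K empirical class proportions."""
    n = sum(counts)
    return -sum((c / n) * math.log2(c / n) for c in counts if c > 0)

def k_class_gini(counts):
    """sum_i p_i (1 - p_i) = 1 - sum_i p_i^2."""
    n = sum(counts)
    return 1 - sum((c / n) ** 2 for c in counts)

# Example with three classes of sizes 2, 3 and 5:
# k_class_entropy([2, 3, 5]) ≈ 1.49 bits, k_class_gini([2, 3, 5]) = 0.62
```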


RANKING AND PROBABILITY ESTIMATION
• Grouping classifiers divide the instance space into segments
(for a decision tree, one segment per leaf).

• They can be turned into rankers by learning an ordering on those segments.

• Decision trees have access to the local class distribution in each leaf,
so this leaf ordering can be constructed directly, in an optimal way.

• Using the empirical probabilities of the leaves it is easy to calculate the
ordering: leaves with a higher proportion of positives come first.

• On the training data this ordering yields a convex ROC (coverage) curve.

• The empirical probability of a parent is a weighted average of the
empirical probabilities of its children; but this only tells us that
ṗ1 ≤ ṗ ≤ ṗ2 or ṗ2 ≤ ṗ ≤ ṗ1.
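
A small sketch of this leaf ordering and the resulting coverage curve; the leaf counts below are hypothetical, chosen only to illustrate the sweep.

```python
# Each leaf is summarised by its counts of positives and negatives (hypothetical).
leaves = [(10, 5), (2, 8), (20, 2), (1, 9)]     # (n_pos, n_neg) per leaf

# Rank leaves on empirical probability of the positive class, highest first.
ranked = sorted(leaves, key=lambda c: c[0] / (c[0] + c[1]), reverse=True)

# Sweeping through the ranked leaves traces the coverage curve (FP, TP),
# which is convex on the training data by construction.
tp, fp, curve = 0, 0, [(0, 0)]
for n_pos, n_neg in ranked:
    tp += n_pos
    fp += n_neg
    curve.append((fp, tp))
print(curve)   # [(0, 0), (2, 20), (7, 30), (15, 32), (24, 33)]
```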
• Consider a feature tree whose leaves are not yet labelled.

• In how many ways can we label the tree, and how does each
labelling perform?

• Assume we know the number of positives and negatives in each leaf.

• With L leaves and C classes there are C^L ways to label the
leaves; e.g. 2^4 = 16 labellings for a two-class tree with four leaves.

• The coverage plot of these labellings is symmetric: complementary
labellings such as +−+− and −+−+ occupy symmetric positions
(symmetry property).

• The labellings on the top-left path of the coverage plot contain the
optimal labelling for any operating condition:

• −−−−, −−+−, +−+−, +−++, ++++

• With L leaves there are L! possible leaf orderings (permutations).


• A feature tree can be turned into:

-- a Ranker: order the leaves in descending order of their empirical
probability of the positive class;

-- a Probability Estimator: predict the empirical probability in each
leaf, optionally smoothed with the Laplace correction or the
m-estimate (see the sketch after this list);

-- a Classifier: choose the operating conditions and find the
operating point that fits those conditions.
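
For the probability-estimator option, a small sketch of the smoothed estimates mentioned above; the default values of m and the prior are illustrative choices, not prescribed by the slides.

```python
def empirical(n_pos, n_neg):
    """Raw empirical probability of the positive class in a leaf."""
    return n_pos / (n_pos + n_neg)

def laplace(n_pos, n_neg):
    """Laplace correction: one pseudo-count per class."""
    return (n_pos + 1) / (n_pos + n_neg + 2)

def m_estimate(n_pos, n_neg, m=2, prior=0.5):
    """m-estimate: m pseudo-counts distributed according to the prior."""
    return (n_pos + m * prior) / (n_pos + n_neg + m)

# A pure leaf with 4 positives and 0 negatives:
# empirical(4, 0) = 1.0, laplace(4, 0) ≈ 0.83, m_estimate(4, 0) ≈ 0.83
```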
• In the example, the optimal labelling under these operating conditions
is +−++.
• We then only use the second leaf to filter out negatives.
• In other words, the right two leaves can be merged into
one – their parent.
• The operation of merging all leaves in a subtree is called
pruning the subtree.
• The advantage of pruning is that we can simplify the
tree without affecting the chosen operating point,
which is sometimes useful if we want to communicate the tree model
to somebody else.
• The disadvantage is that we lose ranking performance.
Sensitivity to Skewed Class Distribution
• Gini index of the parent = 2 (n+/n)(n-/n)

• Weighted Gini index of child 1 (with n1 = n1+ + n1-):

(n1/n) · 2 (n1+/n1)(n1-/n1)

• Using √Gini as the impurity measure instead, the relative impurity of
a child with respect to its parent is

sqrt(n1+ · n1-) / sqrt(n+ · n-),

which makes the split choice insensitive to skewed class distributions.
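
A hedged sketch contrasting the weighted Gini index of a split with the √Gini-based relative impurity just described; the counts below are hypothetical.

```python
from math import sqrt

def weighted_gini(children):
    """Weighted average Gini of a split: sum_j (n_j/n) * 2 (n_j+/n_j)(n_j-/n_j)."""
    n = sum(p + q for p, q in children)          # p = positives, q = negatives
    return sum(((p + q) / n) * 2 * (p / (p + q)) * (q / (p + q))
               for p, q in children)

def relative_sqrt_gini(n1_pos, n1_neg, n_pos, n_neg):
    """Relative impurity of a child under sqrt(Gini): sqrt(n1+ n1-) / sqrt(n+ n-)."""
    return sqrt(n1_pos * n1_neg) / sqrt(n_pos * n_neg)

# Hypothetical parent with 50 positives and 100 negatives, split into two children:
children = [(40, 10), (10, 90)]
print(round(weighted_gini(children), 3))              # 0.227
print(round(relative_sqrt_gini(40, 10, 50, 100), 3))  # 0.283
```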


How you would train decision trees for
a dataset
• Concentrate first on obtaining a good ranking estimator.
• Use a distribution-insensitive impurity measure such as √Gini.
• Disable pruning while growing the tree.
• Use the operating condition to choose an operating point on the ROC curve.
• Only then prune away subtrees whose leaves all receive the same label
at the chosen operating point.
Tree Learning as Variance Reduction
• Gini index 2ṗ(1 − ṗ) → the expected error rate when labelling
instances randomly.

• This has a variance interpretation: for a coin that comes up heads
with probability ṗ (and tails with probability 1 − ṗ), the variance of
the outcome is ṗ(1 − ṗ).

• Choosing low-impurity splits is therefore a form of variance reduction,
which carries over directly to regression trees.
REGRESSION TREE
Regression Tree
Model(A100,B3,E112,M102,T202)
• A100[1051,1770,1900]mean=1574
• B3[4513] mean=4513
• E112[77] mean=77
• M102[870] mean=870
• T202[99,270,625] mean= 331

Calculate variance:
• A100
1/9 sq(1574-1051)+sq(1574-1770)+sq(1574-1900)=
1/9(523)+(-196)+(-326)=
273529+38416+106,276=46469
• B3
1/9sq(4513-4513)=0
• E112
1/9sq(77-77)=0
• M102
1/9sq(870-870)=0
• T202
1/9sq(331-99)+sq(331-270)+sq(331-625)=1/9(232+61+(-
294))=15997
• Calculate weigthed average of Model:-
• 3/9(46469)+0+0+0+3/9(15597)=2,686.5978
• Similarly for Condition (excellent, good, fair):
excellent [1770, 4513], mean = 3142
good [270, 870, 1051, 1900], mean = 1023
fair [77, 99, 625], mean = 267
Variance contributions:
• excellent
1/9 [(3142−1770)² + (3142−4513)²] = 1/9 [1372² + 1371²] ≈ 418,002
• good
1/9 [(1023−270)² + (1023−870)² + (1023−1051)² + (1023−1900)²]
= 1/9 [753² + 153² + 28² + 877²]
= 1/9 [567,009 + 23,409 + 784 + 769,129] ≈ 151,148
• fair
1/9 [(267−77)² + (267−99)² + (267−625)²]
= 1/9 [190² + 168² + 358²]
= 1/9 [36,100 + 28,224 + 128,164] ≈ 21,388
• Weighted average variance of the Condition split:
418,002 + 151,148 + 21,388 ≈ 590,538
• Similarly for Leslie (yes, no):
yes [625, 870, 1900], mean = 1132
no [77, 99, 270, 1051, 1770, 4513], mean = 1297
Variance contributions:
• yes
1/9 [(1132−625)² + (1132−870)² + (1132−1900)²]
= 1/9 [507² + 262² + 768²]
= 1/9 [257,049 + 68,644 + 589,824] ≈ 101,724
• no
1/9 [(1297−77)² + (1297−99)² + (1297−270)² + (1297−1051)²
+ (1297−1770)² + (1297−4513)²]
= 1/9 [1220² + 1198² + 1027² + 246² + 473² + 3216²]
= 1/9 [1,488,400 + 1,435,204 + 1,054,729 + 60,516 + 223,729 + 10,342,656]
≈ 1,622,804
• Weighted average variance of the Leslie split:
101,724 + 1,622,804 ≈ 1,724,528

Weighted average variances:
1. Model ≈ 62,467
2. Condition ≈ 590,538
3. Leslie ≈ 1,724,528
• Model gives by far the lowest weighted variance, so it is chosen as the root split.
• For A100 the candidate splits are
Condition [excellent, good, fair] → [1770] [1051, 1900] [] → empty child, ignored
Leslie [yes, no] → [1900] [1051, 1770] → calculate variance
• For T202 the candidate splits are
Condition [excellent, good, fair] → [] [270] [99, 625] → empty child, ignored
Leslie [yes, no] → [625] [99, 270] → calculate variance
Regression Tree
Clustering Trees
• A regression tree finds instance-space segments in which the
target values are tightly clustered around the mean.
• The variance of a set of target values is the average
squared (Euclidean) distance to their mean.

• A clustering tree can be learned using
1. a dissimilarity matrix, or
2. Euclidean distance between feature vectors.
• For A100 the three covered instances, described by the numerical
features (price, reserve, bids), are
(11, 8, 13)
(18, 15, 15)
(19, 19, 1)
• Their mean vector is (16, 14, 9.7).
• The per-feature variances are:
1/3 [(16−11)² + (16−18)² + (16−19)²] = 1/3 [25 + 4 + 9] ≈ 12.7
1/3 [(14−8)² + (14−15)² + (14−19)²] = 1/3 [36 + 1 + 25] ≈ 20.7
1/3 [(9.7−13)² + (9.7−15)² + (9.7−1)²] ≈ 38.2
• The average squared Euclidean distance to the mean is their sum:
12.7 + 20.7 + 38.2 ≈ 71.6.
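
A small sketch of this computation for the three A100 instances: the cluster dissimilarity is the average squared Euclidean distance to the mean vector (variable names are illustrative).

```python
# The three instances covered by A100: (price, reserve, bids).
instances = [(11, 8, 13), (18, 15, 15), (19, 19, 1)]

# Mean vector, one component per numerical feature.
mean = [sum(col) / len(instances) for col in zip(*instances)]
print([round(m, 1) for m in mean])            # [16.0, 14.0, 9.7]

# Average squared Euclidean distance of the instances to the mean.
avg_sq_dist = sum(sum((x - m) ** 2 for x, m in zip(inst, mean))
                  for inst in instances) / len(instances)
print(round(avg_sq_dist, 1))                  # 71.6
```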
RULE MODELS
• Logical Models:
1. Tree models.
2. Rule models.

• Rule models consist of a collection of implications,

or if–then rules.

• The if-part defines a segment of the instance space, and the then-part

defines the behaviour of the model in that
segment.
• Two Approaches:
1. Find a combination of literals – the body of the
rule, which is called a concept – that covers a
sufficiently homogeneous set of examples, and
find a label (class) to put in the head of the rule.
→ Ordered sequence of rules: a Rule List.

2. First select a class you want to learn, and then find

rule bodies that cover (large subsets of) the
examples of that class.
→ Unordered collection of rules: a Rule Set.
Learning Ordered Rule Lists
• Grow the rule body by adding, one at a time, the literal that most
improves the homogeneity of the covered examples.
• Decision trees vs. rule lists:
a tree split is evaluated on the impurity of both children (classes C1, C2),
whereas a rule is evaluated on only one branch – the examples it covers
(the True branch).
• Separate-and-conquer: learn one rule, remove the examples it covers,
and repeat on the remaining examples until none are left
(see the sketch below).
(Figure: a rule list grown by separate-and-conquer, with the
positive/negative coverage counts at each step.)
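
A hedged sketch of the separate-and-conquer loop; learn_rule is assumed to return a rule body (a predicate over instances) and a label, grown greedily by adding literals that minimise the impurity of the covered examples. The names are illustrative.

```python
# Illustrative separate-and-conquer loop for learning an ordered rule list.

def learn_rule_list(D, learn_rule, default_label):
    rule_list = []
    while D:
        body, label = learn_rule(D)               # learn one rule on the remainder
        covered = [x for x in D if body(x)]
        if not covered:                           # no useful rule left
            break
        rule_list.append((body, label))
        D = [x for x in D if not body(x)]         # separate: drop covered examples
    rule_list.append((lambda x: True, default_label))   # default rule at the end
    return rule_list
```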
Learning Unordered Rule Sets
• An alternative approach to rule learning.

• Rules are learned for one class at a time.

• Instead of minimising the impurity min(ṗ, 1 − ṗ), as for rule lists,

• we maximise ṗ, the empirical probability of the
target class among the examples covered by the rule
(a precision-like criterion; see the sketch below).
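
A sketch of this changed search heuristic: pick the literal whose coverage has the highest empirical probability ṗ of the target class. The function names and data layout (label-tagged examples) are assumptions for illustration.

```python
def precision(covered, target_class):
    """Empirical probability of the target class among covered examples."""
    if not covered:
        return 0.0
    return sum(1 for x, y in covered if y == target_class) / len(covered)

def best_literal(D, literals, target_class):
    """Choose the literal maximising the precision of the examples it covers."""
    return max(literals,
               key=lambda lit: precision([(x, y) for x, y in D if lit(x)],
                                         target_class))
```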
Descriptive Rule Learning
• Descriptive models can be learned in either a
supervised or an unsupervised way.

• Supervised:
adapt the given rule learning
algorithms → subgroup discovery.

• Unsupervised:
frequent item sets and association rule
discovery.
Learning from Subgroup Discovery

• A subgroup is interesting if its proportion of positives deviates from
that of the overall population; the following measures quantify this
deviation.
1. Precision
|prec − pos|

2. Average recall
|avgrec − 0.5|

3. Weighted relative accuracy

= pos · neg · (tpr − fpr)
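
A small sketch of the weighted relative accuracy measure computed from a rule's contingency counts (TP, FP and the class totals); it follows the formula above, with pos and neg taken as proportions. The counts in the example are hypothetical.

```python
def weighted_relative_accuracy(tp, fp, pos_total, neg_total):
    """wracc = pos * neg * (tpr - fpr), with pos and neg as class proportions."""
    n = pos_total + neg_total
    pos, neg = pos_total / n, neg_total / n
    tpr, fpr = tp / pos_total, fp / neg_total
    return pos * neg * (tpr - fpr)

# A subgroup covering 30 of 50 positives and 10 of 100 negatives:
# tpr = 0.6, fpr = 0.1, pos = 1/3, neg = 2/3 -> wracc = 1/3 * 2/3 * 0.5 ≈ 0.11
print(round(weighted_relative_accuracy(30, 10, 50, 100), 2))   # 0.11
```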
Association Rule Mining
