
Subject: Statistics

Paper: Multivariate Analysis


Module: Discriminant Analysis and Classification
Development Team

Principal investigator: Dr. Bhaswati Ganguli, Professor, Department of Statistics, University of Calcutta

Paper co-ordinator: Dr. Sugata SenRoy, Professor, Department of Statistics, University of Calcutta

Content writer: Souvik Bandyopadhyay, Senior Lecturer, Indian Institute of Public Health, Hyderabad

Content reviewer: Dr. Kalyan Das, Professor, Department of Statistics, University of Calcutta
Discriminant Analysis

Discriminant analysis is the technique of separating distinct observations into well-defined groups or clusters.

- Its primary difference from cluster analysis is that, unlike the latter, the characteristics of the groups are known to a certain degree.
- The problem is more one of allocating the individuals to specified groups than of defining the groups themselves.
- Hence, whereas cluster analysis is exploratory in nature, with clusters formed without any prior information regarding their nature, discriminant analysis is based on the known distinctive features of the groups.
Classification

Classification is the problem of assigning new observations to one or the other of the groups or clusters.

- Thus while discriminant analysis separates the observations into specified groups, classification allocates individual observations to these groups.
- The two problems are intrinsically related, and the distinction between them is often blurred. To quote Johnson and Wichern (2002):

  "A function that separates objects may sometimes serve as an allocator, and, conversely, a rule that allocates objects may suggest a discriminatory procedure."
The distinction

- Let x = (x1, x2, ..., xm)' be the vector of the m characteristics under study.
- Let there be n individuals in the study.
- These individuals belong to one of several groups G1, G2, ..., Gr.
- The first problem is to find functions, based on the parameters of the r groups, which discriminate between the groups and hence separate the individuals into them.
- The next problem is to find a rule which classifies a new individual into one of the r groups.
Two groups

- To begin with, let us consider r = 2, i.e. we have two groups G1 and G2.
- Let f1(x) and f2(x) be the two probability density functions that characterize the groups G1 and G2 respectively.
- Also let the probability of an individual belonging to G1 be p1 and that of belonging to G2 be p2. Here p1 and p2 are known as the prior probabilities.
- Let Ω be the sample space of x.
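This setup can be made concrete with a small numerical sketch. The snippet below assumes, purely for illustration, two univariate normal groups; the means, scales, and priors are hypothetical choices, not part of the development above. Later sketches reuse these names.

```python
# A minimal sketch of the two-group setup, assuming (hypothetically)
# univariate normal densities for G1 and G2.
from scipy.stats import norm

f1 = norm(loc=0.0, scale=1.0).pdf  # density f1(x) of group G1
f2 = norm(loc=2.0, scale=1.0).pdf  # density f2(x) of group G2
p1, p2 = 0.6, 0.4                  # prior probabilities, p1 + p2 = 1
```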
Misclassification

- Subdivide Ω into two mutually exclusive and exhaustive subsets R1 and R2 = Ω − R1 such that
  - if x ∈ R1 we assign x to G1,
  - and if x ∈ R2 we assign x to G2.
- So every individual x is assigned to one and only one of the two groups.

However, the split of Ω may be such that some individuals who actually come from G2 fall in R1 and are hence classified in G1, and vice versa. These are known as misclassifications.

Thus the aim is to find a good discriminator or separator of Ω such that the probability of misclassification is minimized.
Conditional probability of misclassification

- Let P(j|k) denote the conditional probability that an individual coming from Gk is classified in Gj.
- Then the conditional probability of misclassifying an individual from G1 as coming from G2 is
  $$P(2 \mid 1) = P(\mathbf{x} \in R_2 \mid G_1) = \int_{R_2} f_1(\mathbf{x})\, d\mathbf{x}. \tag{1}$$
- Similarly, the conditional probability of misclassifying an individual from G2 as coming from G1 is
  $$P(1 \mid 2) = P(\mathbf{x} \in R_1 \mid G_2) = \int_{R_1} f_2(\mathbf{x})\, d\mathbf{x}. \tag{2}$$
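Given a concrete region, (1) and (2) can be evaluated numerically. The sketch below uses the hypothetical normal setup from earlier together with an arbitrary threshold rule R1 = {x ≤ c}, R2 = {x > c}; the cut-off c is illustrative, not yet optimal.

```python
# Numerical evaluation of P(2|1) and P(1|2) in (1)-(2) for a threshold
# rule R1 = {x <= c}, R2 = {x > c}; c is an arbitrary illustrative cut-off.
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

f1 = norm(loc=0.0, scale=1.0).pdf
f2 = norm(loc=2.0, scale=1.0).pdf
c = 1.0

P_2_given_1, _ = quad(f1, c, np.inf)   # P(2|1): mass of f1 over R2
P_1_given_2, _ = quad(f2, -np.inf, c)  # P(1|2): mass of f2 over R1
print(P_2_given_1, P_1_given_2)        # both ~0.159 for these choices
```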
Unconditional probability of misclassification

The unconditional probabilities of correctly classifying an individual from G1 in G1 and from G2 in G2 are respectively

$$P(\text{correctly classified in } G_1) = P(G_1)\, P(\mathbf{x} \in R_1 \mid G_1) = p_1 P(1 \mid 1)$$

$$P(\text{correctly classified in } G_2) = P(G_2)\, P(\mathbf{x} \in R_2 \mid G_2) = p_2 P(2 \mid 2),$$

while the unconditional probabilities of misclassifying an individual from G2 in G1 and from G1 in G2 are respectively

$$P(\text{misclassified as } G_1) = P(G_2)\, P(\mathbf{x} \in R_1 \mid G_2) = p_2 P(1 \mid 2) \tag{3}$$

$$P(\text{misclassified as } G_2) = P(G_1)\, P(\mathbf{x} \in R_2 \mid G_1) = p_1 P(2 \mid 1). \tag{4}$$
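Continuing the same hypothetical example, (3) and (4) simply weight the conditional probabilities by the priors; the numbers below carry over from the previous sketch.

```python
# Unconditional misclassification probabilities (3) and (4); the priors
# and conditional probabilities are the hypothetical values computed above.
p1, p2 = 0.6, 0.4
P_2_given_1 = P_1_given_2 = 0.1587

P_misclassified_as_G1 = p2 * P_1_given_2  # came from G2, placed in R1
P_misclassified_as_G2 = p1 * P_2_given_1  # came from G1, placed in R2
print(P_misclassified_as_G1, P_misclassified_as_G2)  # ~0.063 and ~0.095
```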
Classification Rule

A classification rule may now be developed by minimizing the misclassification probabilities (3) and (4).

However, the costs of misclassification are often not the same.

Example
When classifying individuals as healthy or ailing on the basis of pathological reports, the cost of misclassifying an ailing person as healthy is much greater than the cost of misclassifying a healthy person as ailing.

Hence, in deciding on the classification rule, the cost needs to be accounted for.
Cost of misclassification

- Let C(j|k) denote the cost of misclassifying an individual from the kth group into the jth group.
- Then the expected cost of misclassification is
  $$ECM = p_1 P(2 \mid 1)\, C(2 \mid 1) + p_2 P(1 \mid 2)\, C(1 \mid 2). \tag{5}$$
- The classification rule can then be obtained by minimizing the ECM.
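Under the same hypothetical setup, (5) is a single weighted sum. The costs below are invented to make one kind of error five times as expensive as the other.

```python
# Expected cost of misclassification (5) for the illustrative example;
# the costs are hypothetical, with C(1|2) penalized more heavily.
p1, p2 = 0.6, 0.4
P_2_given_1 = P_1_given_2 = 0.1587
C_2_given_1 = 1.0  # cost of sending a G1 individual to G2
C_1_given_2 = 5.0  # cost of sending a G2 individual to G1

ECM = p1 * P_2_given_1 * C_2_given_1 + p2 * P_1_given_2 * C_1_given_2
print(ECM)  # ~0.413 for these choices
```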
The Rule

Result 1
The subsets R1 and R2 that minimize the ECM are as follows:

$$R_1: \; \frac{f_1(\mathbf{x})}{f_2(\mathbf{x})} \ge \frac{p_2}{p_1} \cdot \frac{C(1 \mid 2)}{C(2 \mid 1)} \tag{6}$$

$$\text{and} \quad R_2: \; \frac{f_1(\mathbf{x})}{f_2(\mathbf{x})} < \frac{p_2}{p_1} \cdot \frac{C(1 \mid 2)}{C(2 \mid 1)} \tag{7}$$
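Result 1 translates directly into a density-ratio classifier. The rule itself is general; in the sketch below only the densities, priors, and costs are hypothetical carry-overs from the earlier snippets.

```python
# Minimum-ECM rule (6)-(7): assign x to G1 when the density ratio
# f1(x)/f2(x) is at least (p2/p1) * (C(1|2)/C(2|1)), else to G2.
from scipy.stats import norm

f1 = norm(loc=0.0, scale=1.0).pdf
f2 = norm(loc=2.0, scale=1.0).pdf
p1, p2 = 0.6, 0.4
C_2_given_1, C_1_given_2 = 1.0, 5.0

def classify(x):
    threshold = (p2 / p1) * (C_1_given_2 / C_2_given_1)
    return "G1" if f1(x) / f2(x) >= threshold else "G2"

print(classify(0.2), classify(1.8))  # G1 near the G1 mean, G2 near the G2 mean
```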
Proof

Since R1 ∪ R2 = Ω and R1 ∩ R2 = ∅, for j = 1 and 2,

$$\int_{R_1} f_j(\mathbf{x})\, d\mathbf{x} + \int_{R_2} f_j(\mathbf{x})\, d\mathbf{x} = \int_{\Omega} f_j(\mathbf{x})\, d\mathbf{x} = 1,$$

so that

$$\begin{aligned}
ECM &= p_1 C(2 \mid 1) \int_{R_2} f_1(\mathbf{x})\, d\mathbf{x} + p_2 C(1 \mid 2) \int_{R_1} f_2(\mathbf{x})\, d\mathbf{x} \\
    &= p_1 C(2 \mid 1) \Big[ 1 - \int_{R_1} f_1(\mathbf{x})\, d\mathbf{x} \Big] + p_2 C(1 \mid 2) \int_{R_1} f_2(\mathbf{x})\, d\mathbf{x} \\
    &= p_1 C(2 \mid 1) + \int_{R_1} \big[ p_2 C(1 \mid 2) f_2(\mathbf{x}) - p_1 C(2 \mid 1) f_1(\mathbf{x}) \big]\, d\mathbf{x}. \tag{8}
\end{aligned}$$
Proof (contd.)

Now the prior probabilities p1, p2 and the costs C(2|1), C(1|2) are all nonnegative. Also f1(x) and f2(x) are nonnegative for all x. Hence the ECM will be minimized if R1 consists of exactly those x for which the bracketed quantity under the integral sign in (8) is nonpositive, i.e.

$$R_1: \; p_2 C(1 \mid 2) f_2(\mathbf{x}) - p_1 C(2 \mid 1) f_1(\mathbf{x}) \le 0.$$

By similar logic, R2 should include all x such that

$$p_2 C(1 \mid 2) f_2(\mathbf{x}) - p_1 C(2 \mid 1) f_1(\mathbf{x}) > 0.$$

Of course, in the case of equality the classification could go either way, but to avoid ambiguity it is arbitrarily assigned to one of the two subspaces. The result thus follows. □
Corollaries

The classification rule is thus primarily based on

- the density ratio f1(x)/f2(x),
- the prior probability ratio p2/p1,
- and the cost ratio C(1|2)/C(2|1).

Corollary 1
If the misclassification costs are equal, i.e. C(1|2) = C(2|1),

$$R_1: \; \frac{f_1(\mathbf{x})}{f_2(\mathbf{x})} \ge \frac{p_2}{p_1} \quad \text{and} \quad R_2: \; \frac{f_1(\mathbf{x})}{f_2(\mathbf{x})} < \frac{p_2}{p_1} \tag{9}$$
Corollaries (contd.)

Corollary 2
If the prior probabilities are equal, i.e. p1 = p2,

$$R_1: \; \frac{f_1(\mathbf{x})}{f_2(\mathbf{x})} \ge \frac{C(1 \mid 2)}{C(2 \mid 1)} \quad \text{and} \quad R_2: \; \frac{f_1(\mathbf{x})}{f_2(\mathbf{x})} < \frac{C(1 \mid 2)}{C(2 \mid 1)} \tag{10}$$

Corollary 3
If both the misclassification costs and the prior probabilities are equal, i.e. C(1|2) = C(2|1) and p1 = p2,

$$R_1: \; \frac{f_1(\mathbf{x})}{f_2(\mathbf{x})} \ge 1 \quad \text{and} \quad R_2: \; \frac{f_1(\mathbf{x})}{f_2(\mathbf{x})} < 1 \tag{11}$$
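All three corollaries are special cases of the single threshold appearing in (6) and (7); the small sketch below (with hypothetical values) makes that explicit.

```python
# The single cut-off behind (6)-(7) and its special cases (9)-(11):
# the density ratio f1(x)/f2(x) is always compared against this value.
def threshold(p1, p2, c12, c21):
    return (p2 / p1) * (c12 / c21)

print(threshold(0.6, 0.4, 1, 1))  # Corollary 1 (equal costs): p2/p1
print(threshold(0.5, 0.5, 5, 1))  # Corollary 2 (equal priors): C(1|2)/C(2|1)
print(threshold(0.5, 0.5, 1, 1))  # Corollary 3 (both equal): 1
```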
The TPM

A criterion alternative to the ECM is the total probability of misclassification (TPM). An optimal classification is then obtained by minimizing

$$TPM = p_1 \int_{R_2} f_1(\mathbf{x})\, d\mathbf{x} + p_2 \int_{R_1} f_2(\mathbf{x})\, d\mathbf{x} \tag{12}$$

Classification Rule

$$R_1: \; \frac{f_1(\mathbf{x})}{f_2(\mathbf{x})} \ge \frac{p_2}{p_1} \quad \text{and} \quad R_2: \; \frac{f_1(\mathbf{x})}{f_2(\mathbf{x})} < \frac{p_2}{p_1} \tag{13}$$

This is readily seen to be the same rule as in Corollary 1, where the two misclassification costs are assumed equal.
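Under the hypothetical normal setup, rule (13) reduces to a single cut-off where f1(x)/f2(x) = p2/p1, and the TPM in (12) follows by integrating each density over the "wrong" region. The sketch below locates the cut-off numerically rather than in closed form.

```python
# TPM (12) under rule (13) for the hypothetical normal densities; the
# boundary between R1 and R2 is located numerically with a root-finder.
import numpy as np
from scipy.integrate import quad
from scipy.optimize import brentq
from scipy.stats import norm

f1 = norm(loc=0.0, scale=1.0).pdf
f2 = norm(loc=2.0, scale=1.0).pdf
p1, p2 = 0.6, 0.4

c = brentq(lambda x: f1(x) / f2(x) - p2 / p1, -10.0, 10.0)  # R1/R2 boundary
TPM = p1 * quad(f1, c, np.inf)[0] + p2 * quad(f2, -np.inf, c)[0]
print(c, TPM)  # c ~1.20, TPM ~0.15 for these choices
```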
Alternative Rule

Another alternative is to allocate an observation to a group based on the largest posterior probability P(Gi|x), i.e. the probability of belonging to group i given x, i = 1, 2.

By Bayes' rule,

$$P(G_1 \mid \mathbf{x}) = \frac{p_1 f_1(\mathbf{x})}{p_1 f_1(\mathbf{x}) + p_2 f_2(\mathbf{x})}$$

$$\text{and} \quad P(G_2 \mid \mathbf{x}) = \frac{p_2 f_2(\mathbf{x})}{p_1 f_1(\mathbf{x}) + p_2 f_2(\mathbf{x})}.$$
Alternative Rule (contd.)

The classification rule is then:

Allocate to G1 if P(G1|x) ≥ P(G2|x),
and to G2 if P(G1|x) < P(G2|x).

- But since the denominators of the two posterior probabilities are the same, the rule reduces to Corollary 1.
- However, the computations of the posterior probabilities are often of interest in themselves.
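A sketch of the posterior computation under the same hypothetical setup; as noted above, the resulting allocation coincides with the density-ratio rule of Corollary 1.

```python
# Posterior probabilities by Bayes' rule and the largest-posterior rule;
# densities and priors are the same hypothetical choices as before.
from scipy.stats import norm

f1 = norm(loc=0.0, scale=1.0).pdf
f2 = norm(loc=2.0, scale=1.0).pdf
p1, p2 = 0.6, 0.4

def posteriors(x):
    denom = p1 * f1(x) + p2 * f2(x)  # common denominator in Bayes' rule
    return p1 * f1(x) / denom, p2 * f2(x) / denom

post1, post2 = posteriors(1.0)
print(post1, post2)                      # the two posteriors sum to 1
print("G1" if post1 >= post2 else "G2")  # same allocation as Corollary 1
```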
Summary

- The distinction between discriminant analysis and classification is made.
- Classification rules based on misclassification probabilities are discussed.
- Rules based on the expected cost of misclassification (ECM) are described.
Thank You

