
27-07-2023

Canonical Discriminant Analysis

The Good, the Bad and the Dividing Line

[Figure: scatter plot of Return on Investment (vertical axis, about -10 to 25) against Current Ratio (horizontal axis, 0.00 to 2.50), with Good Accounts and Bad Accounts shown as separate groups.]

Why Line?

• Simple: ease of interpretation

• ‘Best discriminator’ when the populations are NORMAL with equal variance-covariance matrices

• At times, a line is not good enough

Objective criterion for choosing the ‘Best’ line

$$Z = aX + bY, \qquad X = \text{CR (Current Ratio)}, \quad Y = \text{ROI (Return on Investment)}$$

Choose a, b so that the Z-values of ‘good accounts’ are as ‘different’ from the Z-values of ‘bad accounts’ as possible:

$$\max_{a,b}\ \frac{\text{between group variation}}{\text{within group variation}} = \max_{a,b}\ \frac{(\bar Z_1 - \bar Z_2)^2}{\sum_i (Z_{1i} - \bar Z_1)^2 + \sum_i (Z_{2i} - \bar Z_2)^2}$$

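Optimal Choice of discriminant coefficients

$$\begin{pmatrix} a \\ b \end{pmatrix} = \begin{pmatrix} \sigma_x^2 & \sigma_{xy} \\ \sigma_{xy} & \sigma_y^2 \end{pmatrix}^{-1} \begin{pmatrix} \bar X_1 - \bar X_2 \\ \bar Y_1 - \bar Y_2 \end{pmatrix} = \frac{1}{\sigma_x^2 \sigma_y^2 - \sigma_{xy}\sigma_{xy}} \begin{pmatrix} \sigma_y^2 & -\sigma_{xy} \\ -\sigma_{xy} & \sigma_x^2 \end{pmatrix} \begin{pmatrix} \Delta_x \\ \Delta_y \end{pmatrix}$$

where $\Delta_x = \bar X_1 - \bar X_2$ and $\Delta_y = \bar Y_1 - \bar Y_2$.

$$a = \frac{\sigma_y^2 \Delta_x - \sigma_{xy} \Delta_y}{\sigma_x^2 \sigma_y^2 - \sigma_{xy}\sigma_{xy}}, \qquad b = \frac{\sigma_x^2 \Delta_y - \sigma_{xy} \Delta_x}{\sigma_x^2 \sigma_y^2 - \sigma_{xy}\sigma_{xy}}$$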

Numerical Illustration: Case

         Good Accounts                      Bad Accounts
Account  Current  Return on       Account  Current  Return on
Number   Ratio    Investment      Number   Ratio    Investment
1        1.10     13              11       0.70     11
2        1.50     15              12       0.90     -4
3        1.20     17              13       0.80     6
4        0.90     21              14       1.30     2
5        1.60     7               15       1.10     6
6        2.20     8               16       0.50     8
7        0.90     16              17       0.30     8
8        1.00     13              18       1.40     6
9        1.30     8               19       0.90     3
10       1.30     3               20       1.10     14

Average  1.30     12.10                    0.90     6.00
Overall  1.10     9.05

Differences between the group means (Good minus Bad), with X = Current Ratio and Y = Return on Investment:

$$\Delta_x = 1.30 - 0.90 = 0.40, \qquad \Delta_y = 12.10 - 6.00 = 6.10$$

Numerical Illustration (cont.)

Account  X: Current  Y: Return on                               (X-X_bar)*
Number   Ratio       Investment    (X-X_bar)^2   (Y-Y_bar)^2    (Y-Y_bar)
1        1.10        13            0.000         15.603         0.000
2        1.50        15            0.160         35.403         2.380
3        1.20        17            0.010         63.203         0.795
4        0.90        21            0.040         142.803        -2.390
5        1.60        7             0.250         4.203          -1.025
6        2.20        8             1.210         1.103          -1.155
7        0.90        16            0.040         48.303         -1.390
8        1.00        13            0.010         15.603         -0.395
9        1.30        8             0.040         1.103          -0.210
10       1.30        3             0.040         36.603         -1.210
11       0.70        11            0.160         3.803          -0.780
12       0.90        -4            0.040         170.303        2.610
13       0.80        6             0.090         9.303          0.915
14       1.30        2             0.040         49.703         -1.410
15       1.10        6             0.000         9.303          0.000
16       0.50        8             0.360         1.103          0.630
17       0.30        8             0.640         1.103          0.840
18       1.40        6             0.090         9.303          -0.915
19       0.90        3             0.040         36.603         1.210
20       1.10        14            0.000         24.503         0.000

Average  1.10        9.05          0.163         33.948         -0.075

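Numerical Illustration (cont.)

$$a = \frac{\sigma_y^2 \Delta_x - \sigma_{xy} \Delta_y}{\sigma_x^2 \sigma_y^2 - \sigma_{xy}\sigma_{xy}} = \frac{33.948 \times 0.4 - (-0.075) \times 6.1}{0.163 \times 33.948 - (-0.075)^2} = 2.539$$

$$b = \frac{\sigma_x^2 \Delta_y - \sigma_{xy} \Delta_x}{\sigma_x^2 \sigma_y^2 - \sigma_{xy}\sigma_{xy}} = \frac{0.163 \times 6.1 - (-0.075) \times 0.4}{0.163 \times 33.948 - (-0.075)^2} = 0.185$$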
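The calculation of a and b can be cross-checked from the raw account data. The following is a minimal Python sketch and not part of the original slides: the helper name `discriminant_coefficients` is hypothetical, and it assumes (as the table of squared deviations does) that variances and the covariance are taken about the overall means with divisor n = 20.

```python
# (Current Ratio, Return on Investment) pairs from the case data.
good = [(1.10, 13), (1.50, 15), (1.20, 17), (0.90, 21), (1.60, 7),
        (2.20, 8), (0.90, 16), (1.00, 13), (1.30, 8), (1.30, 3)]
bad = [(0.70, 11), (0.90, -4), (0.80, 6), (1.30, 2), (1.10, 6),
       (0.50, 8), (0.30, 8), (1.40, 6), (0.90, 3), (1.10, 14)]


def mean(values):
    return sum(values) / len(values)


def discriminant_coefficients(group1, group2):
    """Return (a, b) for Z = a*X + b*Y (hypothetical helper)."""
    pooled = group1 + group2
    n = len(pooled)
    x_bar = mean([x for x, _ in pooled])
    y_bar = mean([y for _, y in pooled])
    # Variances and covariance about the overall means, divisor n.
    var_x = sum((x - x_bar) ** 2 for x, _ in pooled) / n
    var_y = sum((y - y_bar) ** 2 for _, y in pooled) / n
    cov_xy = sum((x - x_bar) * (y - y_bar) for x, y in pooled) / n
    # Differences between the group means.
    dx = mean([x for x, _ in group1]) - mean([x for x, _ in group2])
    dy = mean([y for _, y in group1]) - mean([y for _, y in group2])
    det = var_x * var_y - cov_xy ** 2
    a = (var_y * dx - cov_xy * dy) / det
    b = (var_x * dy - cov_xy * dx) / det
    return a, b


a, b = discriminant_coefficients(good, bad)
print(round(a, 3), round(b, 3))  # 2.539 0.185
```

Only the direction of (a, b) matters for discrimination, so rescaling both coefficients by the same positive constant leaves every classification unchanged.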

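$$Z = aX + bY = \begin{pmatrix} a & b \end{pmatrix} \begin{pmatrix} X \\ Y \end{pmatrix}, \qquad \begin{pmatrix} a \\ b \end{pmatrix} = \begin{pmatrix} \sigma_x^2 & \sigma_{xy} \\ \sigma_{xy} & \sigma_y^2 \end{pmatrix}^{-1} \begin{pmatrix} \bar X_1 - \bar X_2 \\ \bar Y_1 - \bar Y_2 \end{pmatrix}$$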

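General form of the discriminant function

If there are p independent variables,

$$X = (X_1, X_2, \ldots, X_p)'$$

the discriminant function is:

$$Z = (\bar X_1 - \bar X_2)'\, \Sigma^{-1} X$$

where $\bar X_1$ and $\bar X_2$ denote the mean vectors of the two groups and $\Sigma$ is the common variance-covariance matrix.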

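Classification Rule based on discriminator function

[Figure: the group means $(\bar X_2, \bar Y_2)$ and $(\bar X_1, \bar Y_1)$ and a new observation (x, y) are projected onto the axis of Z-values as $\bar Z_2$, z and $\bar Z_1$.]

Classify the new observation to population 1 if z is closer to $\bar Z_1$ than to $\bar Z_2$.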
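The rule can be sketched in a few lines of Python using the coefficients a = 2.539 and b = 0.185 and the group means from the numerical illustration. This is only an illustration: the function names and the new account (Current Ratio 1.2, ROI 10) are hypothetical and do not appear in the slides.

```python
# Coefficients and group means taken from the numerical illustration.
A, B = 2.539, 0.185
GOOD_MEAN = (1.30, 12.10)  # (Current Ratio, ROI) of good accounts
BAD_MEAN = (0.90, 6.00)    # (Current Ratio, ROI) of bad accounts


def z_score(current_ratio, roi):
    """Discriminant score Z = a*X + b*Y."""
    return A * current_ratio + B * roi


def classify(current_ratio, roi):
    """Assign the account to the group whose mean Z-value is closer to z."""
    z = z_score(current_ratio, roi)
    z_good = z_score(*GOOD_MEAN)
    z_bad = z_score(*BAD_MEAN)
    return "good" if abs(z - z_good) < abs(z - z_bad) else "bad"


# Hypothetical new account, not from the slides.
print(classify(1.2, 10))  # good
```

Comparing distances to the two mean Z-values is the same as comparing z with the midpoint of the two means, which acts as the cut-off on the Z-axis.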


Multiple Discriminant Analysis

• Used when you have to discriminate between MORE than two groups

• More than one discriminant function may be used [as many as min(g-1, p), where g is the number of groups and p the number of predictor variables]

Canonical Correlation in Discriminant analysis

(X1, X2, …, Xp) and (U1, U2, …, Ug-1), where the U's are indicators of group membership

• Find the best linear combination of the X's that predicts group membership

• Then find the best linear combination among all those which are independent of the first

Correlation Multiple correlation


 canonical correlation

Correlation: between two variables. What is 2


R: between Y and (X1,X2…Xp) What is R2
CC: between (Y1,Y2,…Yq) and (X1,X2…Xp) What is CC2

13

13

Example: Multiple discriminant analysis

Family   Attitude   Importance   HH     Age of       Amount spent
income   travel     fam vac      size   head of HH   on holiday
50.2     5          8            3      43           2
70.3     6          7            4      61           3
62.9     7          5            6      52           3
48.5     7          5            5      36           1
52.7     6          6            4      55           3
75       8          7            5      68           3
46.2     5          3            3      62           2
57       2          4            6      51           2
64.1     7          5            4      57           3
68.1     7          6            5      45           3
73.4     6          7            5      44           3
71.9     5          8            4      64           3
56.2     1          8            6      54           2
49.3     4          2            3      56           3
62       5          6            2      58           3

Resort visit 1: visited the resort

Amount spent on vacation: 1 (Low), 2 (Medium), 3 (High)

Data for those who did not visit the resort

Resort   Family   Attitude   Importance   HH     Age of       Amount
visit    income   travel     fam vac      size   head of HH   spent
2        32.1     5          4            3      58           1
2        36.2     4          3            2      55           1
2        43.2     2          5            2      57           2
2        50.4     5          2            4      37           2
2        44.1     6          6            3      42           2
2        38.3     6          6            2      45           1
2        55       1          2            2      57           2
2        46.1     3          5            3      51           1
2        35       6          4            5      64           1
2        37.3     2          7            4      54           1
2        41.8     5          1            3      56           2
2        57       8          3            2      36           2
2        33.4     6          8            2      50           1
2        37.5     3          2            3      48           1

Resort visit 2: Did not visit the resort

Amount spent on vacation: 1 (Low), 2 (Medium), 3 (High)

Objective
• Predict/explain the different categories of amount spent on the basis of
  • Annual family income
  • Attitude towards travel
  • Importance given to family vacation
  • Household size
  • Age of the head of the household

• Which of the above variables are ‘good’ discriminators?

• Predict the expense category of families for whom only the predictor variables are available

Group Statistics

Amount spent
on vacation    Variable                      Mean    Std. Deviation
1              family income                 38.57   5.30
               importance family vacation    4.70    1.89
               travel attitude               4.50    1.72
               household size                3.10    1.20
               age of household head         50.30   8.10
2              family income                 50.11   6.00
               importance family vacation    4.20    2.49
               travel attitude               4.00    2.36
               household size                3.40    1.51
               age of household head         49.50   9.25
3              family income                 64.97   8.61
               importance family vacation    5.90    1.66
               travel attitude               6.10    1.20
               household size                4.20    1.14
               age of household head         56.00   7.60
Total          family income                 51.22   12.80
               importance family vacation    4.93    2.10
               travel attitude               4.87    1.98
               household size                3.57    1.33
               age of household head         51.93   8.57

Within Group Correlation matrix

                             income   travel     Im. fam.   HH size   age head
                                      attitude   vac.                 HH
family income                1.00     0.05       0.31       0.38      -0.21
travel attitude              0.05     1.00       0.04       0.00      -0.34
importance family vacation   0.31     0.04       1.00       0.22      -0.01
household size               0.38     0.00       0.22       1.00      -0.03
age of household head        -0.21    -0.34      -0.01      -0.03     1.00

Discrimination power of variables individually (amount spent)

Tests of Equality of Group Means
                             Wilks' Lambda   F       Sig.
family income                0.26            38.00   0.0000
importance family vacation   0.88            1.83    0.1797
travel attitude              0.79            3.63    0.0400
household size               0.87            1.94    0.1626
age of household head        0.88            1.80    0.1840

Wilks' Lambda = Within group SS / Total SS

Good discrimination between groups ⇔ Small Lambda
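For a single variable, Wilks' Lambda and the one-way ANOVA F statistic carry the same information, because Lambda = SSW/SST implies F = [(1 - Lambda)/(g - 1)] / [Lambda/(n - g)]. The Python sketch below illustrates this conversion; it is not from the slides, the function name is hypothetical, and n = 30 cases in g = 3 groups is an assumption based on the classification matrix of the example.

```python
def f_from_wilks(wilks_lambda, n, g):
    """One-way ANOVA F statistic implied by a univariate Wilks' Lambda."""
    between_df = g - 1
    within_df = n - g
    return ((1 - wilks_lambda) / between_df) / (wilks_lambda / within_df)


# Lambdas as printed on the slide (rounded to two decimals).
lambdas = {
    "family income": 0.26,
    "importance family vacation": 0.88,
    "travel attitude": 0.79,
    "household size": 0.87,
    "age of household head": 0.88,
}

# n = 30 and g = 3 are assumptions, see the lead-in.
for name, lam in lambdas.items():
    print(f"{name}: F = {f_from_wilks(lam, n=30, g=3):.2f}")
```

Because the Lambdas are rounded, the F values agree with the table only approximately; the smaller the Lambda, the larger the F.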



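Results from MDA

Wilks' Lambda W after removing the first k discriminant functions, where $\lambda_i$ are the eigenvalues of $W^{-1}B$ and q is the number of discriminant functions:

$$W = \prod_{i=k+1}^{q} \frac{1}{1 + \lambda_i}$$

Test statistic, approximately chi-square with (p-k)(g-k-1) degrees of freedom:

$$\chi^2 = -\left(n - \frac{p+g}{2} - 1\right) \ln W$$

k   Test of Function(s)   Wilks' Lambda   Chi-square   df   Sig.
0   1 through 2           0.166           44.831       10   0.0000
1   2                     0.802           5.517        4    0.2383

Eigenvalues
Function   Eigenvalue   % of Variance   Cumulative %   Canonical Correlation
1          3.82         93.93           93.93          0.89
2          0.25         6.07            100.00         0.44
(a) First 2 canonical discriminant functions were used in the analysis.

$$\lambda_i = \frac{SSB}{SSW}, \qquad \text{Canonical Correlation}_i = \sqrt{\frac{\lambda_i}{1 + \lambda_i}}$$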
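The entries of both tables can be approximately recovered from the two eigenvalues alone. The following Python sketch is an illustration rather than the slides' own computation: the function names are hypothetical, n = 30 is assumed from the classification matrix of the example, and the eigenvalues are the rounded values printed above, so the results match the tables only up to rounding.

```python
import math

eigenvalues = [3.82, 0.25]  # from the Eigenvalues table (rounded)
n, p, g = 30, 5, 3          # cases (assumed), predictors, groups


def wilks_lambda(eigs, k):
    """Wilks' Lambda left after removing the first k functions."""
    result = 1.0
    for lam in eigs[k:]:
        result *= 1.0 / (1.0 + lam)
    return result


def chi_square_test(eigs, k, n, p, g):
    """Chi-square statistic and degrees of freedom for functions k+1..q."""
    stat = -(n - (p + g) / 2 - 1) * math.log(wilks_lambda(eigs, k))
    df = (p - k) * (g - k - 1)
    return stat, df


for k in range(len(eigenvalues)):
    w = wilks_lambda(eigenvalues, k)
    chi2, df = chi_square_test(eigenvalues, k, n, p, g)
    cc = math.sqrt(eigenvalues[k] / (1 + eigenvalues[k]))
    print(f"k={k}: W={w:.3f}, chi-square={chi2:.2f}, df={df}, "
          f"canonical correlation of function {k + 1} = {cc:.2f}")
```

The significance column would additionally need the chi-square distribution function, which is left out here to keep the sketch dependency-free.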


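Significance of discriminant functions: Justification through W and CC

W = SSW/SST

$$\begin{array}{c|c|c|c|c}
\text{Discrim. fn.} & \text{Eigenvalue} & \text{CC} & \text{Prop. explained} & \text{Prop. unexplained} \\ \hline
1 & \lambda_1 & \sqrt{\dfrac{\lambda_1}{1+\lambda_1}} & \dfrac{\lambda_1}{1+\lambda_1} & \dfrac{1}{1+\lambda_1} \\
2 & \lambda_2 & \sqrt{\dfrac{\lambda_2}{1+\lambda_2}} & \dfrac{\lambda_2}{1+\lambda_2} \cdot \dfrac{1}{1+\lambda_1} & \dfrac{1}{1+\lambda_1} \cdot \dfrac{1}{1+\lambda_2}
\end{array}$$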

Un-standardized Discriminant Function Coefficients

                             Function 1   Function 2
family income                0.1543       -0.0620
importance family vacation   -0.0695      0.2613
travel attitude              0.1868       0.4223
household size               -0.1265      0.1003
age of household head        0.0593       0.0628
(Constant)                   -11.0944     -3.7916

Use this to classify future observations

Standardized Discriminant Function Coefficients

                             Function 1   Function 2
family income                1.0474       -0.4208
importance family vacation   -0.1420      0.5335
travel attitude              0.3399       0.7685
household size               -0.1632      0.1293
age of household head        0.4947       0.5245
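The standardized coefficients are, to rounding, the unstandardized coefficients multiplied by the pooled within-group standard deviation of each predictor. The Python sketch below checks this for Function 1 using the group standard deviations from the Group Statistics table. It is an illustration only: the names are hypothetical, and it assumes equal group sizes (10 families per group, as in the classification matrix), so that the pooled variance is the plain average of the three group variances.

```python
import math

# Group standard deviations (groups 1, 2, 3) from the Group Statistics table.
group_sd = {
    "family income": (5.30, 6.00, 8.61),
    "importance family vacation": (1.89, 2.49, 1.66),
    "travel attitude": (1.72, 2.36, 1.20),
    "household size": (1.20, 1.51, 1.14),
    "age of household head": (8.10, 9.25, 7.60),
}

# Unstandardized Function 1 coefficients from the slides.
unstandardized_f1 = {
    "family income": 0.1543,
    "importance family vacation": -0.0695,
    "travel attitude": 0.1868,
    "household size": -0.1265,
    "age of household head": 0.0593,
}


def pooled_sd(sds):
    """Pooled within-group SD, assuming equal group sizes."""
    return math.sqrt(sum(s ** 2 for s in sds) / len(sds))


standardized_f1 = {
    name: coef * pooled_sd(group_sd[name])
    for name, coef in unstandardized_f1.items()
}

for name, value in standardized_f1.items():
    print(f"{name}: {value:.3f}")
```

Because the standardized coefficients are free of measurement units, they are the ones to compare when judging the relative contribution of each predictor to a function.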


Structure Matrix: Discriminant Loadings

Correlation between discriminant functions and predictor variables

Functions at Group Centroids

Amount spent on vacation   Function 1   Function 2
1 (Low)                    -2.0410      0.4185
2 (Medium)                 -0.4048      -0.6587
3 (High)                   2.4458       0.2402

Function 1 separates group 1 (L) from group 3 (H): INCOME

Function 2 separates group 1 from group 2: Travel, vacation & age
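One simple way to use the centroids is to compute both discriminant scores for a family from the unstandardized coefficients and assign the family to the group with the nearest centroid. The Python sketch below illustrates this with the first hold-out case. It is not the slides' exact procedure: statistical packages usually classify via posterior probabilities with prior weights, whereas this sketch assumes equal priors and plain Euclidean distance. The dictionary keys and function names are hypothetical, and the mapping of the data columns to variables follows the column headers of the data tables.

```python
# Unstandardized coefficients as (Function 1, Function 2), from the slides.
COEF = {
    "income": (0.1543, -0.0620),
    "importance": (-0.0695, 0.2613),
    "travel": (0.1868, 0.4223),
    "hh_size": (-0.1265, 0.1003),
    "age": (0.0593, 0.0628),
}
CONSTANT = (-11.0944, -3.7916)
CENTROIDS = {1: (-2.0410, 0.4185), 2: (-0.4048, -0.6587), 3: (2.4458, 0.2402)}


def scores(case):
    """Discriminant scores (Function 1, Function 2) for one family."""
    return tuple(
        CONSTANT[j] + sum(COEF[name][j] * case[name] for name in COEF)
        for j in range(2)
    )


def nearest_centroid(case):
    """Group whose centroid is closest to the case in discriminant space."""
    z = scores(case)
    return min(
        CENTROIDS,
        key=lambda grp: sum((zi - ci) ** 2 for zi, ci in zip(z, CENTROIDS[grp])),
    )


# First hold-out case: income 50.8, travel attitude 4, importance 7,
# household size 3, age 45 (actual spending group: 2).
case = {"income": 50.8, "travel": 4, "importance": 7, "hh_size": 3, "age": 45}
print(scores(case), nearest_centroid(case))
```

As a sanity check, plugging the group 1 means from the Group Statistics table into `scores` gives values close to the group 1 centroid (-2.0410, 0.4185).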


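# travel.can1 is assumed to be a candisc object fitted earlier (not shown in the slides)
library(candisc)
library(heplots)
plot(travel.can1)                          # canonical discriminant plot
heplot(travel.can1, scale=6, fill=TRUE)    # HE plot in canonical space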


Territorial Map

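[Figure: territorial map in the plane of the two discriminant functions, showing the regions assigned to groups 1, 2 and 3, with the group centroids marked by *.]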

prepared by S. Das


Hold-out sample

Resort   Family   Attitude   Importance   HH     Age of       Amount
visit    income   travel     fam vac      size   head of HH   spent
1        50.8     4          7            3      45           2
1        63.6     7          4            7      55           3
1        54.0     6          7            4      58           2
1        45.0     5          4            3      60           2
1        68.0     6          6            6      46           3
1        62.1     5          6            3      56           3
2        35.0     4          3            4      54           1
2        49.6     5          3            5      39           1
2        39.4     6          5            3      44           3
2        37.0     2          6            5      51           1
2        54.5     7          3            3      37           2
2        38.2     2          2            3      49           1

A random part of the data is set aside for validation



Estimating Misclassification probabilities / Classification matrix
• Default (re-substitution): use estimates from the entire data set to predict the classification of the same cases

• Performance in the hold-out sample

• Principle of cross-validation
  • While classifying a specific case, use all observations except that one to estimate the discriminant function
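The difference between re-substitution and cross-validation can be illustrated on the two-group accounts data from the numerical illustration. The Python sketch below is an assumption-laden illustration, not the slides' procedure: the function names are hypothetical, the discriminant line is refitted with the same formulas for a and b each time one account is left out, and a case is assigned to the group whose mean Z-value is closer (equivalently, by comparing z with the midpoint of the two mean Z-values).

```python
# (Current Ratio, Return on Investment) pairs from the accounts case.
good = [(1.10, 13), (1.50, 15), (1.20, 17), (0.90, 21), (1.60, 7),
        (2.20, 8), (0.90, 16), (1.00, 13), (1.30, 8), (1.30, 3)]
bad = [(0.70, 11), (0.90, -4), (0.80, 6), (1.30, 2), (1.10, 6),
       (0.50, 8), (0.30, 8), (1.40, 6), (0.90, 3), (1.10, 14)]


def mean(values):
    return sum(values) / len(values)


def fit(group1, group2):
    """Return (a, b, cutoff) for Z = a*X + b*Y from two training groups."""
    pooled = group1 + group2
    n = len(pooled)
    x_bar = mean([x for x, _ in pooled])
    y_bar = mean([y for _, y in pooled])
    var_x = sum((x - x_bar) ** 2 for x, _ in pooled) / n
    var_y = sum((y - y_bar) ** 2 for _, y in pooled) / n
    cov_xy = sum((x - x_bar) * (y - y_bar) for x, y in pooled) / n
    m1 = (mean([x for x, _ in group1]), mean([y for _, y in group1]))
    m2 = (mean([x for x, _ in group2]), mean([y for _, y in group2]))
    dx, dy = m1[0] - m2[0], m1[1] - m2[1]
    det = var_x * var_y - cov_xy ** 2
    a = (var_y * dx - cov_xy * dy) / det
    b = (var_x * dy - cov_xy * dx) / det
    # Midpoint of the two mean Z-values; group 1 always lies above it.
    cutoff = (a * (m1[0] + m2[0]) + b * (m1[1] + m2[1])) / 2
    return a, b, cutoff


def predict(model, x, y):
    """Return 1 if z falls on group 1's side of the cutoff, else 2."""
    a, b, cutoff = model
    return 1 if a * x + b * y > cutoff else 2


def resubstitution_hits(group1, group2):
    """Fit on all cases, then classify those same cases."""
    model = fit(group1, group2)
    return (sum(predict(model, x, y) == 1 for x, y in group1)
            + sum(predict(model, x, y) == 2 for x, y in group2))


def leave_one_out_hits(group1, group2):
    """Classify each case with a model fitted without that case."""
    hits = 0
    for i, (x, y) in enumerate(group1):
        model = fit(group1[:i] + group1[i + 1:], group2)
        hits += predict(model, x, y) == 1
    for i, (x, y) in enumerate(group2):
        model = fit(group1, group2[:i] + group2[i + 1:])
        hits += predict(model, x, y) == 2
    return hits


total = len(good) + len(bad)
print("re-substitution hit ratio:", resubstitution_hits(good, bad) / total)
print("leave-one-out hit ratio:  ", leave_one_out_hits(good, bad) / total)
```

Re-substitution tends to be optimistic because each case helps to place the line that then classifies it; the leave-one-out figure is usually the more honest estimate of performance on new cases.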


Classification matrix: Hit-Ratio

Can the correct classifications be attributed to chance?

Hit ratio: proportion of correct classifications

Analysis sample (Hit Ratio = 86.67%)
Actual group   Predicted group membership
               1   2   3
1              9   1   0
2              1   9   0
3              0   2   8

Hold-out sample (Hit Ratio = 75%)
Actual group   Predicted group membership
               1   2   3
1              3   1   0
2              0   3   1
3              1   0   3
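The hit ratio is the share of cases on the diagonal of the classification matrix. The Python sketch below computes it for the 30-case matrix (9, 1, 0 / 1, 9, 0 / 0, 2, 8) and compares it with one common chance benchmark, the proportional chance criterion (the sum of squared group proportions). That criterion is not named in the slides and is included here only as an assumption about how "attributed to chance" might be assessed; the function names are hypothetical.

```python
# Rows are actual groups, columns are predicted groups (30-case matrix).
confusion = [
    [9, 1, 0],
    [1, 9, 0],
    [0, 2, 8],
]


def hit_ratio(matrix):
    """Proportion of cases on the diagonal of the classification matrix."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total


def proportional_chance(matrix):
    """Expected hit ratio if cases were assigned at random in proportion
    to the actual group sizes (sum of squared group proportions)."""
    total = sum(sum(row) for row in matrix)
    return sum((sum(row) / total) ** 2 for row in matrix)


print(f"hit ratio = {hit_ratio(confusion):.2%}")            # 86.67%
print(f"chance    = {proportional_chance(confusion):.2%}")  # 33.33%
```

With three groups of equal size, random assignment would be right about one time in three, so a hit ratio of 86.67% is far above this benchmark.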


https://www.rdocumentation.org/packages/candisc/versions/0.8-6/topics/candisc


Linear Discriminant Analysis (LDA) using R


Peter Nistrup


Data

The ‘Breast Cancer Wisconsin (Diagnostic) Data Set’ was used for the analysis.
There are 569 observations on 32 variables.
Features are computed from a digitized image of a fine needle aspirate (FNA) of a breast mass. They describe characteristics of the cell nuclei present in the image.
The diagnostic classification of the breast mass is given as either Benign or Malignant.
This dataset is suitable for understanding how characteristics of the FNA image of the breast mass relate to the diagnosis of whether the mass is benign or malignant.


Data

• ID number
• Diagnosis (M = malignant, B = benign)
• Ten real-valued features are computed for each cell nucleus; the mean, standard error and worst value are given for each feature, bringing the number of features to 30:
  • radius (mean of distances from center to points on the perimeter)
  • texture (standard deviation of gray-scale values)
  • perimeter
  • area
  • smoothness (local variation in radius lengths)
  • compactness (perimeter^2 / area - 1.0)
  • concavity (severity of concave portions of the contour)
  • concave points (number of concave portions of the contour)
  • symmetry
  • fractal dimension ("coastline approximation" - 1)


Why LDA?

One of the objectives is to understand which qualities of a tumor contribute to whether or not it is malignant.
From the PC1-PC2 plot, it is evident that there is a clear separation of the two categories.


Breast Cancer Diagnostic: Wisconsin Data

• Is LDA after PCA better than LDA on the raw data?
• ROC (receiver operating characteristic) and AUC (area under the curve)

• Try with different seeds


Why LDA?

LDA will try to find the decision boundary at which the classification is most
successful.
For example, consider only two dimensions and two distinct clusters; LDA will
project these clusters down to one dimension.
