
Multivariate analysis (using R)

Copyright by

Stephane Champely

October 16, 2014
Contents

1 Using the R software 5


1.1 What is R? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.2 Installing R . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.3 Installing the ade4 package . . . . . . . . . . . . . . . . . . . . . 6
1.4 A short introduction to the R software . . . . . . . . . . . . . . . 6
2 Exploratory univariate statistics 11
2.1 Categorical data . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.2 Numeric data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
3 Exploratory bivariate statistics 17
3.1 Relating two numeric measurements . . . . . . . . . . . . . . . . 17
3.2 Relating a numeric measurement and a categorical one . . . . . . 20
3.3 Relating two categorical measurements . . . . . . . . . . . . . . . 23
4 An introduction to exploratory multivariate analysis 27
4.1 Principal components analysis as a correlation problem . . . . . . 27
4.2 Computing a PCA using R . . . . . . . . . . . . . . . . . . . . . 28
4.3 PCA as a projection problem . . . . . . . . . . . . . . . . . . . . 29
5 Principal Components Analysis 35
5.1 PCA: first order solution . . . . . . . . . . . . . . . . . . . . . 35
5.2 PCA: second order solution . . . . . . . . . . . . . . . . . . . . . 39
5.3 PCA: diagnostics . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
5.3.1 Decomposing information . . . . . . . . . . . . . . . . . . 43
5.3.2 Contributions to inertia . . . . . . . . . . . . . . . . . . . 43
5.3.3 Squared cosines . . . . . . . . . . . . . . . . . . . . . . . . 44
6 Beyond the basics: the statistical triple 45
6.1 The statistical triple . . . . . . . . . . . . . . . . . . . . . . . . . 45
6.1.1 The three elements of the statistical triple . . . . . . . . . 45
6.1.2 Properties of the duality diagram . . . . . . . . . . . . . . 46
6.2 The dudi class . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
6.2.1 Arguments of a dudi object . . . . . . . . . . . . . . . . . 48
6.2.2 Values of a dudi object . . . . . . . . . . . . . . . . . . 48

6.2.3 Benefits of using the dudi . . . . . . . . . . . . . . . . . . 48
6.3 PCA on the covariance matrix . . . . . . . . . . . . . . . . . . . 49
6.4 Diagnostics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
6.4.1 A decomposition of the trace . . . . . . . . . . . . . . . . 50
6.4.2 Contribution to inertia . . . . . . . . . . . . . . . . . . . . 52
6.4.3 Squared cosines . . . . . . . . . . . . . . . . . . . . . . . . 52
7 Multiple Correspondence Analysis 53
7.1 Multivariate categorical data . . . . . . . . . . . . . . . . . . . . 53
7.2 Multiple Correspondence analysis . . . . . . . . . . . . . . . . . . 53
7.3 The MCA as a statistical triple . . . . . . . . . . . . . . . . . . . 54
7.3.1 The disjunctive table . . . . . . . . . . . . . . . . . . . . . 54
7.3.2 The MCA statistical triple . . . . . . . . . . . . . . . . . 56
7.4 MCA diagnostics (See exercises) . . . . . . . . . . . . . . . . . . 57
7.4.1 Correlation ratios . . . . . . . . . . . . . . . . . . . . . . . 57
7.5 Miscellaneous . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
7.5.1 Correspondence analysis . . . . . . . . . . . . . . . . . . . 57
7.5.2 Fuzzy correspondence analysis . . . . . . . . . . . . . . . 57
7.6 Exercise . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
8 Hill and Smith's analysis 59
8.1 Multivariate mixed data . . . . . . . . . . . . . . . . . . . . . . . 59
8.2 Hill and Smith analysis . . . . . . . . . . . . . . . . . . . . . . . 59
8.2.1 Property . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
8.3 H&S analysis as a duality diagram . . . . . . . . . . . . . . . . . 60
9 Correspondence analysis 65
9.1 Contingency table . . . . . . . . . . . . . . . . . . . . . . . . . . 65
9.2 Analyzes of a contingency table . . . . . . . . . . . . . . . . . . . 65
9.3 Correspondence analysis . . . . . . . . . . . . . . . . . . . . . . . 67
9.3.1 Correspondence analysis . . . . . . . . . . . . . . . . . . . 67
9.3.2 Correspondence analysis as a duality diagram . . . . . . . 69
9.3.3 Different approaches for CA . . . . . . . . . . . . . . . . . 69
9.3.4 CA diagnostics . . . . . . . . . . . . . . . . . . . . . . . . 70
9.4 Non symmetric Correspondence Analysis . . . . . . . . . . . . . . 70
9.4.1 NSCA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
9.4.2 NSCA as a duality diagram . . . . . . . . . . . . . . . . . 71
9.4.3 NSCA plots . . . . . . . . . . . . . . . . . . . . . . . . . . 71
9.5 Belson's Correspondence Analysis . . . . . . . . . . . . . . . . . . 73
Chapter 1

Using the R software

1.1 What is R?
R is a system for statistical computation and graphics. It is a public domain
project running on Windows, Unix or Mac. It consists of a language plus a run-
time environment with graphics, a debugger, access to certain system functions,
and the ability to run programs stored in script files.
The core of R is an interpreted computer language which allows branching
and looping as well as modular programming using functions. Most of the user-
visible functions in R are written in R. It is possible for the user to interface to
procedures written in the C, C++, or FORTRAN languages for efficiency.
The R distribution contains functionality for a large number of statistical
procedures. Among these are: linear and generalized linear models, nonlinear
regression models, time series analysis, classical parametric and nonparametric
tests, clustering and smoothing. There is also a large set of functions which provide a flexible graphical environment for creating various kinds of data presentations. Additional modules (add-on packages) are available for a variety of specific purposes.

1.2 Installing R
The bin/windows directory of a CRAN site contains binaries for a base distri-
bution and a large number of add-on packages from CRAN to run on Windows
2000, XP, Vista, Windows 7, Windows 8 on Intel and clones (but not on other
platforms). The current version is R 3.1.1. One can find the executable file of the R software on the website: http://CRAN.R-project.org. Download the file
R-3.1.1-win.exe on your computer and execute it.
The Comprehensive R Archive Network (CRAN) is a collection of sites which
carry identical material, consisting of the R distribution(s), the contributed
extensions, documentation for R, and binaries.

1.3 Installing the ade4 package
In order to carry out some multivariate analyses, a specific package is needed.
The ade4 package is based on a powerful theoretical model that allows many
variations. A lot of graphics are available.
You can install and automatically update packages from within R if you have
access to repositories such as CRAN. Double click on the R icon to begin. Use
the "Package" menu and the option: "Install R packages". Choose a suitable
mirror site (e.g. Vietnam) and select "ade4" from the list of available packages.
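Alternatively, the same installation can be done from the R console; a minimal sketch, assuming your machine can reach a CRAN mirror:

### Installing and loading ade4 from the console (instead of the menus)
install.packages("ade4")   # choose a CRAN mirror when prompted
library(ade4)              # load the package for the current session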

1.4 A short introduction to the R software


(Double click on the R blue icon to begin.) Here is a sample session to give a glimpse of the R software. This script is available on the Spiral website:
### Working directory
getwd()

### Library of packages


library()

### R as a calculator
25+56*78+5^3+exp(25)-sin(180)+log(125)+sqrt(4)

### Naming a calculation (assigning)


result1 <- 25+56*78+5^3+exp(25)-sin(180)+log(125)
print(result1)
result1

### Calculations using names


result2<-result1+1
print(result2)

### Data structure: vectors


x <- c(1,25,4,12,44)
print(x)
x[4]
x[2:3]
x[c(1,3)]

### Vectorized computations


x+5
x*2
log(x)

### Functions for generating data



1:10
seq(0,20,by=2)
seq(0,20,l=5)

### Random simulation of the normal distribution


rnorm(50)

### Finding some help


help(rnorm)
example(rnorm)

### Assignation of a name to a random simulation


x<-rnorm(50)

### Statistics and graphics for numeric data


sort(x)
min(x)
max(x)
mean(x)
var(x)
sd(x)
summary(x)
hist(x)
hist(x,col="red",xlab="Simulated normal data",main="")
help(hist)
boxplot(x)
plot(x)

### Class of a vector


class(x)

### Categorical data (factor)


c(1,2,3)
rep(c(1,2,3),c(10,20,20))
factor(rep(c(1,2,3),c(10,20,20)))
f<-factor(rep(c(1,2,3),c(10,20,20)))
class(f)

### Statistics and graphics for categorical data


summary(f)
plot(f)

### Character vector


n<-c("hello","the","world")
print(n)
n[3]

### Missing data


m<-c(12,15,NA,16,14)
print(m)
mean(m)
mean(m,na.rm=TRUE)

### Data structure: matrix


w<-1:10
W<-matrix(w,ncol=2,byrow=TRUE)
print(W)
W+1
log(W)
class(W)
sapply(W,class)
attributes(W)

### Data structure: dataframes


y<-rnorm(50)
z<-rnorm(50)
X<-data.frame(x,y,z)
class(X)
sapply(X,class)
attributes(X)
plot(X)

### Data structure: lists


a<-rnorm(10)
b<-rnorm(5)
d<-factor(c(1,1,2,2,3))
e<-"miscellaneous"
l<-list(arg1=a,arg2=b,arg3=d,arg4=e)
print(l)
class(l)
attributes(l)

### Bivariate analysis


plot(x,y)
plot(f,x)

### Models in R
lm(y~x)
lm1<-lm(y~x)
class(lm1)
attributes(lm1)
summary(lm1)
plot(lm1)
lm2<-aov(y~f)
class(lm2)
summary(lm2)

### Hypotheses tests and confidence intervals in R


test1<-t.test(x)
class(test1)
attributes(test1)
print(test1)

### Loading datasets from packages


data()
data(cars)
print(cars)
print(speed)
attach(cars)
print(speed)
plot(speed,dist)
lm(dist~speed)
abline(lm(dist~speed))
scatter.smooth(speed, dist)
detach(cars)

### Loading datasets from text files: txtfile.txt or csv files: csvfile.csv
data1<-read.table("txtfile.txt",header=TRUE)
data2<-read.csv("csvfile.csv",header=TRUE)

### Programming (functions)


fun1<-function(arg1,arg2){
plot(arg1,arg2)
abline(lm(arg2~arg1),col="red")
result<-coef(lm(arg2~arg1))
return(result)
}
x<-1:10
y<-4*x+5+rnorm(10)
fun1(x,y)

### Programming tricks


help("for")
help("while")
help("if")
Chapter 2

Exploratory univariate statistics

One does not begin by directly dealing with multivariate statistics. It is wiser to start with a sequence of univariate analyses describing each variable included in a multivariate data file. First, it gives a quick view of the main features of the datafile. Second, it may help to detect the presence of outliers, the need for transformations or the need for serious data cleaning before carrying out any multivariate examination. This chapter is a summary of the main ideas of exploratory univariate statistics. The value of graphing the data will in particular be stressed. Modern statistics is heavily based on visualizing data.
The first important point is to distinguish categorical data from numeric data. They indeed need a different statistical approach.

2.1 Categorical data


The datafile survey from the MASS package contains the responses of 237 Statistics students at the University of Adelaide to a number of questions. We begin with the variable describing the student's sex. This is clearly a categorical measurement: a student is a male or a female. The analysis of categorical data is mainly based on counting the number of units in each category. It is also possible to compute percentages. With only two categories to describe, statistical graphics are not really useful in this first example...
### Univariate analysis of categorical data
require(MASS)
?survey
data(survey)
attach(survey)
names(survey)
print(Sex)

class(Sex)
levels(Sex)
table(Sex)
summary(Sex)
sum(table(Sex))
table(Sex)/sum(table(Sex))
pie(table(Sex))
table(is.na(Sex))

Another interesting example is the "smoking" variable. One can note that the order of the categories is not a very logical one here (due to the alphabetical ordering), so it is interesting to change this before a more careful examination. Actually, this kind of data is called an ordered categorical variable (or ordinal data). Look at the differences between the results of the raw "Smoke" variable and the ordered "Smoke2" variable.
### An ordinal variable
print(Smoke)
levels(Smoke)
Smoke2<-ordered(Smoke,levels=c("Never","Occas","Regul","Heavy"))
levels(Smoke2)
print(Smoke)
print(Smoke2)
class(Smoke)
class(Smoke2)
summary(Smoke)
summary(Smoke2)
pie(summary(Smoke))
pie(summary(Smoke2))
barplot(summary(Smoke))
barplot(summary(Smoke2))
dotchart(summary(Smoke))
dotchart(summary(Smoke2))

The only difficulty in studying categorical data is to have good categories... For instance, if the variable has too many levels, it is difficult to summarize the data. Maybe in this case it is better to change the categories and to aggregate them... Here is an example using the Cars93 datafile, again from the MASS package.
### A categorical variable with many categories
?Cars93
data(Cars93)
attach(Cars93)
print(Manufacturer)
summary(Manufacturer)
pie(summary(Manufacturer))

dotchart(summary(Manufacturer))
dotchart(sort(summary(Manufacturer)))
require(car)
Manufacturer2<-recode(Manufacturer, '"Ford"="Ford"; "Chevrolet"="Chevrolet"; "Dodge"="Dodge";
NA=NA; else="Others"', as.factor.result=TRUE)
pie(summary(Manufacturer2))

2.2 Numeric data


The analysis of numeric data is more sophisticated. Those data are less crude: they form a continuum along a scale. We need a description of the density of the data. The main ideas are to describe:
1. the location of the sample,
2. the dispersion of the sample,
3. the presence of outlying data,
4. the symmetry of the data and
5. the existence of some kind of grouping.
Here, graphing the data must be emphasized. First one needs to visualize the distribution of the data and second to choose the best statistical summaries for the situation at hand. In particular the well-known mean and standard deviation statistics are not always the best choice.
Let us begin with the "Wr.Hnd" measurement in the survey dataset, which describes the span (distance from tip of thumb to tip of little finger of the spread hand) of the writing hand, in centimeters, for the sample of students.
### Univariate analysis of a numeric variable
par(mfrow=c(2,2))
stripchart(Wr.Hnd,main="Stripchart")
hist(Wr.Hnd,freq=FALSE,main="Histogram")
lines(density(Wr.Hnd,na.rm = TRUE),col="red")
boxplot(Wr.Hnd,main="Boxplot")
qqnorm(Wr.Hnd,main="QQplot")
qqline(Wr.Hnd,col="red")

The message conveyed by the plots in figure 2.1 is rather clear. These data are quite symmetric, there is no outlier and the distribution is not far from being normal. In this case the best summaries are the mean and the standard deviation. The definitions are, for the mean:
$$\bar{y} = \frac{1}{n} \sum_{i=1}^{n} y_i$$

Figure 2.1: Statistical graphics for univariate numeric data (stripchart, histogram with density estimate, boxplot and normal QQ-plot of Wr.Hnd)

and for the standard deviation:
$$s = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (y_i - \bar{y})^2}.$$

There is also another version of the standard deviation, denoted σ̂, which is the maximum likelihood estimate in a normal setting and divides the sum of squared deviations by n (instead of n − 1). This definition is used in the French school of multivariate analysis.

mean(Wr.Hnd)
sd(Wr.Hnd)
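As a small check, the n-divisor version σ̂ can be computed directly from its definition; a minimal sketch, assuming the survey data are attached as above:

### sigma hat: standard deviation with divisor n (instead of n-1)
y <- na.omit(Wr.Hnd)
sd(y)                          # usual (n-1) definition
sqrt(mean((y - mean(y))^2))    # n-divisor definition (sigma hat)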

Now consider the "Age" measurement. The plots are rather different (figure 2.2). The data are not at all symmetric and two clear outliers can be detected. In this case one has to use robust statistical summaries to describe the data. Quantiles, and especially the median and the quartiles, give some interesting information. The median breaks the sample into two equal parts and the quartiles into four equal parts. The minimum and the maximum of the data are also important quantities to consider. One can also use trimmed or winsorized statistics.
Another possibility is to transform the data in order to normalize them. For the "Age" measurement the transformation 1/(Age − 15) gives good results, but from an interpretative viewpoint, we then have to deal with. . .
par(mfrow=c(2,2))
hist(Age,freq=FALSE,main="Histogram")
lines(density(Age), col ="red")
lines(density(Age,bw=2), col ="blue")
qqnorm(Age,main="QQplot")
qqline(Age,col="red")
box.cox.powers(Age-15)
hist(1/(Age-15),freq=FALSE,main="Transformed data")
lines(density(1/(Age-15)), col ="red")
qqnorm(1/(Age-15),main="Transformed data")
qqline(1/(Age-15),col="red")
summary(Age)
mean(Age,trim=0.1,na.rm=TRUE)
dev.off()
require(vioplot)
vioplot(1/(Age-15))

Another, intermediate, type of data is the Likert scale. It corresponds to questions leading to responses from strongly disagree to strongly agree (using an even or odd number of levels). These ordinal data are usually recoded as 1, 2, ..., l and considered as numeric measurements, but because they are fairly raw numbers, the recommended statistics are the straightforward mean and standard deviation. Several of these measurements are habitually collected in psychology or marketing, and those collections are mostly analyzed by multivariate techniques.
Figure 2.2: Statistical graphics for univariate numeric data and transformed data (histogram and QQ-plot of Age, then of 1/(Age − 15))
Chapter 3

Exploratory bivariate statistics

The first step towards multivariate statistics is to simultaneously consider two measurements. Actually, the multivariate operations are usually mere extensions of bivariate ideas. It is thus important to have a clear view of the different settings in bivariate analysis. Three different analyses are possible depending on the nature of the two variables.

3.1 Relating two numeric measurements


Is there a relationship between the span of the writing hand and that of the non-writing hand? The scatterplot is the graphical tool to use. Figure 3.1 is clear: there is (not unexpectedly) a very good relationship. How to summarize this relationship? One can verify that a straight line through the scatterplot is a good summary. Several methods exist to determine such a line, but the classical one is the least-squares line.
Consider the data (xᵢ, yᵢ), i = 1, ..., n, and the equation for the line
$$y = \alpha x + \beta + \varepsilon$$
where ε is the error. The parameters α and β are estimated by minimizing the least-squares criterion:
$$\sum_{i=1}^{n} \big( y_i - (\alpha x_i + \beta) \big)^2 .$$
The well-known solutions are
$$\hat{\alpha} = \frac{\mathrm{cov}(x, y)}{\mathrm{var}(x)}$$
where cov(x, y) is the covariance:
$$\mathrm{cov}(x, y) = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}),$$
var(x) is the variance, that is to say the square of the standard deviation:
$$\mathrm{var}(x) = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2$$
and
$$\hat{\beta} = \bar{y} - \hat{\alpha} \bar{x}.$$
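These formulas can be applied by hand and compared with the lm() output shown below; a minimal sketch, assuming the survey data frame (MASS package) is attached as in the following code:

### Least-squares slope and intercept computed from the formulas above
ok <- complete.cases(NW.Hnd, Wr.Hnd)                        # drop missing spans
alpha.hat <- cov(NW.Hnd[ok], Wr.Hnd[ok]) / var(NW.Hnd[ok])
beta.hat <- mean(Wr.Hnd[ok]) - alpha.hat * mean(NW.Hnd[ok])
c(beta.hat, alpha.hat)                                      # compare with the lm() coefficients below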

### bivariate analysis: numeric*numeric


require(MASS)
data(survey)
attach(survey)

### scatterplot and least-squares line


plot(NW.Hnd,Wr.Hnd)
lm1<-lm(Wr.Hnd~NW.Hnd)
abline(lm1,col="red")

### Linear regression modelling


summary(lm1)

The resulting output of the linear regression modelling is:


Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.83610 0.37029 4.959 1.36e-06 ***
NW.Hnd 0.90584 0.01982 45.712 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.5976 on 234 degrees of freedom


(1 observation deleted due to missingness)
Multiple R-squared: 0.8993, Adjusted R-squared: 0.8989
F-statistic: 2090 on 1 and 234 DF, p-value: < 2.2e-16
The strength of the relationship is estimated by the linear correlation coefficient:
$$r = \frac{\mathrm{cov}(x, y)}{\sqrt{\mathrm{var}(x)}\,\sqrt{\mathrm{var}(y)}}.$$
This quantity lies in the interval (−1, +1): r = −1 corresponds to a perfect negative linear relationship, r = 0 to no linear relationship and r = 1 to a perfect positive linear relationship. We find here r = 0.948. Directly linked to r is the coefficient of determination r², which gives the percentage of (the variability of) Y explained by the linear relationship with X. Here r² = 90%.

Figure 3.1: Scatterplot for the spans and least-squares line
### linear correlation coefficient
cor(NW.Hnd,Wr.Hnd,use="complete.obs")
But of course we face here an uncharacteristic example of what one can encounter as relationships in real-life data. Usually, the relationship is less strong and is frequently non-linear. It is very important to understand that in a non-linear case we must not use these linear tools! Except, maybe, after a suitable transformation. . .
Let us now examine an example from the monde84 dataset in the ade4 package. Here we try to study the gross national product and the schooling rate in 48 countries. One can see on the scatterplot that the relationship between these two measurements is definitely a non-linear one. So fitting a line, even the best one, is not good practice! A non-linear function must be used, for instance a linear smoother, or we must think of a data transformation in order to linearize the problem.
### Second example of numeric*numeric (transformation)
require(ade4)
data(monde84)
attach(monde84)
plot(pib,scol)
lm2<-lm(scol~pib)
abline(lm2,col="red")
scatter.smooth(pib,scol,col="blue")
scatter.smooth(log(pib),scol)
Exercise 1 Cross the variables Height and Wr.Hnd from the student dataset.

Exercise 2 Use the commands

data(anscombe)
anscombe
attach(anscombe)

in order to deal with a special datafile created by the statistician Frank Anscombe. Study the relationships: (1) x1 and y1, (2) x2 and y2, (3) x3 and y3 and finally (4) x4 and y4.

3.2 Relating a numeric measurement and a categorical one

From the dataset survey, we can study the relationship between the span of the writing hand and the sex of the student. It is easier to consider that we want to explain the numeric variable (span) by the categorical variable (sex). The method is called analysis of variance and is a golden classic in statistics. The converse, that is to explain sex from span, is a more difficult approach, logistic regression, leading to generalized linear modeling. This is beyond the scope of this course¹.

¹ But this is one of the most useful tools in statistics.
It is necessary to understand that, in the present situation, studying the relationship amounts to comparing groups of numeric data. These groups correspond to the different levels of the categorical measurement. In the chosen example, one wants to compare the spans of males and females. The first thing to do (as always) is to plot the data. In this case, one has to display a graphic for each group and to compare them along the same scale.
# plots
par(mfrow=c(1,2))
stripchart(Wr.Hnd~Sex,vertical=TRUE,method="stack")
boxplot(Wr.Hnd~Sex)
# or plot(Sex,Wr.Hnd)

The resulting plots (figure 3.2) show that the spans are higher for male students and that there is also more variability in the male sample. One can quantify these observations using the mean and the standard deviation in both samples.
### summary statistics
R> tapply(Wr.Hnd,Sex,mean,na.rm=TRUE)
Female Male
17.59576 19.74188
R> tapply(Wr.Hnd,Sex,sd,na.rm=TRUE)
Female Male
1.314768 1.750775
R> tapply(Wr.Hnd,Sex,summary)

A modeling approach describes the relationship between span and sex. This is called the analysis of variance technique. Here we only try to compute a quantity estimating the strength of the relationship. The classical statistic for doing this is the correlation ratio, thereafter denoted CR (or sometimes η²). It measures the differences between the means of the groups and compares them to the general variability.
Let's call y the numeric variable and n the size of the sample. One considers g groups of data and nⱼ data in the j-th group: yᵢⱼ with i = 1, ..., nⱼ; j = 1, ..., g. If the general mean is denoted ȳ and the mean in group j is ȳⱼ, the correlation ratio is:
$$CR = \frac{\sum_{j=1}^{g} n_j (\bar{y}_j - \bar{y})^2}{\sum_{j=1}^{g} \sum_{i=1}^{n_j} (y_{ij} - \bar{y})^2}.$$

This quantity lies between 0 and 1. If CR = 0 then all means are equal and there is no relationship between the two variables. On the contrary, if CR is near one, the relationship is a very good one: the means are rather different and the variability inside the groups is small. We obtain here CR = 33%. This means that 33% of the variability in the span measurement can be explained by Sex².

² Is it important? It is difficult to answer such a question without using statistical inference. Quick and dirty: you can look at the quantity called the p-value. If it is below 5%, one traditionally considers that the relation is statistically significant. Here the p-value appears to be below 2.2 × 10⁻¹⁶, so the relationship is really important.

Figure 3.2: Collections of lineplots and boxplots
### Analysis of variance
aov1<-aov(Wr.Hnd~Sex)
summary(aov1)
summary.lm(aov1)

The second output of the analysis of variance is:


Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 17.5958 0.1424 123.53 <2e-16 ***
SexMale 2.1461 0.2019 10.63 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.547 on 233 degrees of freedom


(2 observations deleted due to missingness)
Multiple R-squared: 0.3266, Adjusted R-squared: 0.3237
F-statistic: 113 on 1 and 233 DF, p-value: < 2.2e-16

Remark 1 You can note that the relationship is only approached in terms of variation in the means. We do not examine the possibility of variation in terms of dispersion. More than this, the analysis of variance modeling (and the correlation ratio) is based on a restrictive hypothesis: the dispersion is more or less similar from sample to sample. And if it is not the case? The usual solutions are to transform the data (the easy way) or to use a generalized linear modeling approach (the hard way).

Exercise 3 Try to relate Smoking and Height in the student dataset.

3.3 Relating two categorical measurements


The last case for bivariate measurements is to relate two categorical variables.
As an example we are now trying to relate the sex of the student and how often
he (or she) exercises ("Exer" variable).
The first step is to compute the crosstable or contingency table (cf. table 3.1). One can see that male students seem to exercise more frequently. We have here the same number of males and females, so we can directly compare the rows, but this is not usually the case and it is easier for interpretative purposes if one computes (cf. table 3.2) the relative frequencies in each row (also called the profiles).

Table 3.1: Crosstable: Exercise by Sex


Sex/Exercise None Some Freq
Female 11 58 49
Male 13 40 65

Table 3.2: Proles table: Exercise by Sex


Sex/Exercise None Some Freq Sum
Female 0.093 0.492 0.415 1
Male 0.110 0.339 0.551 1

Plotting the data is not very interesting in this case because we have few categories. Anyway, figure 3.3 confirms that males exercise more frequently than females.
### bivariate analysis: category*category
table(Sex,Exer)
Exer2<-ordered(Exer,levels=c("None","Some","Freq"))
table(Sex,Exer2)
cross<-table(Sex,Exer2)
sweep(cross,1,apply(cross,1,sum),"/")
round(100*sweep(cross,1,apply(cross,1,sum),"/"))

# plots
par(mfrow=c(1,2))
mosaicplot(cross,color=TRUE,main="")
barplot(t(cross),beside=TRUE,legend.text=TRUE)
assocplot(cross)
The usual statistic for estimating the strength of the relationship in this case is Pearson's chi-squared statistic for independence. Let nᵢⱼ correspond to row i and column j in the crosstable. If nᵢ₊ is the sum in the crosstable for row i, n₊ⱼ the sum for column j and n₊₊ the general sum, then the chi-squared statistic is
$$X^2 = \sum_{ij} \frac{(n_{ij} - \hat{n}_{ij})^2}{\hat{n}_{ij}}$$
where
$$\hat{n}_{ij} = \frac{n_{i+} \times n_{+j}}{n_{++}}.$$
Figure 3.3: A mosaic plot to describe the relationship between sex and exercise
If X² is zero then the profiles (row relative frequencies) are equal, which means that there is no relationship. If the statistic is large then the profiles are different and we have a relationship to explain. What does "large" mean? Here, we obtain X² = 5.7184. The problem is that the chi-squared statistic is not a percentage of variability; it can be greater than one³. There is no clear agreement in the statistical world on how to "standardize" this statistic (despite a lot of proposals: Cohen's w, Cramer's V). In fact, one has to use a p-value approach. In this case, the p-value is near 5%, so the relationship is close to being statistically significant.
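To see where X² comes from, the expected counts and the statistic can be computed by hand; a small sketch using the table cross built above:

### Chi-squared statistic computed from the formulas above
n.expected <- outer(rowSums(cross), colSums(cross)) / sum(cross)  # n_i+ * n_+j / n_++
sum((cross - n.expected)^2 / n.expected)                          # compare with chisq.test(cross)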
### Chi-squared test
R> chisq.test(cross)
Pearson's Chi-squared test

data: cross
X-squared = 5.7184, df = 2, p-value = 0.05731

# effect sizes
Xsquared<-chisq.test(cross)$statistic
N<-sum(cross)
wCohen<-sqrt(Xsquared/N)
wCohen
k<-min(dim(cross))
VCramer<-sqrt(Xsquared/(N*(k-1)))
VCramer

Exercise 4 Cross the variables Sex and Smoking from the student dataset.

³ Even w = √(X²/n) can be > 1.
Chapter 4

An introduction to exploratory multivariate analysis

4.1 Principal components analysis as a correlation problem

The (standardized) Principal Components Analysis (PCA) makes it possible to summarize a set of numeric variables. To understand this approach, we will use a very simple problem. The DECATHLON datafile from the workspace SCvn2014.RData describes the results of 33 decathletes. As a first step, we only study the first two variables: sprint (100 meters) and long jump.
The strategy in PCA is to determine a new variable, called a principal component, which is the one most related to the set of original variables. Because these two variables are numeric ones, related means in this context linearly correlated.
Denition 1 The PCA looks for a variable which maximizes the sum of the
squares of its linear correlation coecients with the set of original variables.

In order to obtain a single solution z, one will suppose that the principal component is centered and standardized (z̄ = 0 and σ̂(z) = 1; this is because any linear transformation would do the same job, but note that −z is also a solution). Let's consider p original variables xⱼ. One wants to maximize
$$q = \sum_{j=1}^{p} r^2(x_j, z).$$
If one calls yⱼ the centered and standardized variable corresponding to the original variable xⱼ (yⱼ = (xⱼ − x̄ⱼ)/σ̂(xⱼ)), and D the diagonal matrix with 1/n on the diagonal, one obtains
$$q = \sum_{j=1}^{p} (y_j' D z)^2.$$
Using the canonical inner product in Rᵖ, one can write
q = ‖Y′Dz‖² = z′DYY′Dz = z′DWDz.
The maximum of this quadratic form q, under the constraint z′Dz = 1, is given by the first eigenvector of the matrix WD¹.

Remark 2 The matrix WD is a big one (n × n). On the contrary, the matrix V = Y′DY is only p × p. It can be demonstrated that V = R is indeed the correlation matrix².

¹ It can easily be proved by using a Lagrangian. Note that 1ₙ is also an eigenvector of the matrix WD (corresponding to the eigenvalue 0), so z is centered (try to prove this). Because of the unity constraint, it is also standardized (try to prove this as well).
² The standardized PCA is sometimes called PCA on the correlation matrix.

If v is an eigenvector of R, then Yv is an eigenvector of WD with the same eigenvalue. It can be proved that z = (1/√λ) Yv.
Another interesting fact, from a computational viewpoint, is that the correlations of the principal component with the original variables may be obtained directly as √λ v.

Definition 2 If the principal component does not give enough information, one can produce a second principal component, maximizing the same criterion under a constraint of null correlation with the first principal component. This process can be iterated.

The following principal components are obtained as the subsequent eigenvectors of the WD matrix.

4.2 Computing a PCA using R


We will now compute the different elements of the PCA and show that the ade4 package proposes specific functions to do this more easily. One will also see the first plots to produce from PCA results.
### PCA computations using matrical operations
require(ade4)
data(olympic)
X<-olympic$tab[,1:2]
X
plot(X)

# correlation matrix

R<-cor(X)
R

# eigendecomposition
eig.R<-eigen(R)
lambd<-eig.R$values
lambd
v<-eig.R$vectors
v

# principal components
Y<-scalewt(X)
li<-Y%*%eig.R$vectors
li
Z<-sweep(li,2,sqrt(lambd),"/")
Z

# variable correlations with principal component


co<-cor(X,Z)
co
sweep(v,2,sqrt(lambd),"*")
apply(cor(X,Z)^2,2,sum)

Now the same thing, in a more efficient way:


### Same computations using ade4 functions
acp1<-dudi.pca(X,scannf = FALSE,nf=2)
acp1
attributes(acp1)
acp1$eig
acp1$co
acp1$c1
acp1$li
acp1$l1
score(acp1)
s.corcircle(acp1$co)

Figure 4.1 gives a graphical illustration of the property of the first principal component: it maximizes the sum of the squares of its linear correlation coefficients with the set of original variables. Figure 4.2 gives a plot of the linear correlations with the first two principal components.

Figure 4.1: Scatterplots of the two original variables with the first principal component

Figure 4.2: Linear correlations of the two original variables with the first two principal components

4.3 PCA as a projection problem


Because it is not easy to compare times (sprint variable) and distances (long jump), the original data must be standardized. Then every decathlete may be considered as a vector in a two-dimensional space. These 33 people give a scatterplot (cf. figure 4.3).
To depict these data, one can fit a line to the scatterplot. But the two variables are here considered in a symmetric way, so the idea is different from the one used in the least-squares line approach. One has to look for a line close to the data in the sense of orthogonal projections. Using the Pythagorean theorem, the problem may be thought of as maximizing the variability of the orthogonal projections of the points onto the line.

Definition 3 The PCA problem may be viewed as finding a line in a p-dimensional space close to the data. It is equivalent to say that one wants to maximize the variance of the orthogonal projections of the points onto the line.

The solution of this old problem (Pearson 1901) is given by the preceding eigendecomposition of the correlation matrix R. The eigenvector v gives the direction of projection in the p-dimensional space. The n-vector Yv = √λ z gives the coordinates of the projections and the eigenvalue λ is the maximum variability obtained from these projections.

Remark 3 One can also look for an orthogonal direction in the p-dimensional space maximizing the same criterion. The solution is obtained from the subsequent eigenvectors of the correlation matrix eigendecomposition.

From a graphical viewpoint, the projections onto the first two directions usually give a very good idea of the closeness between the n statistical units³. Figure 4.3 gives the scatterplot of the two (standardized) original measurements and the directions for projecting the points. Figure 4.4 presents the 33 projected points. This is only a rotated view of the preceding plot.
require(MASS)
eqscplot(Y,xlab="x100 (standardized)",ylab="Long jump (standardized)",type="n")
text(Y[,1],Y[,2],rownames(Y))
arrows(0,0,acp1$c1[1,1],acp1$c1[2,1],col="red")
arrows(0,0,acp1$c1[1,2],acp1$c1[2,2],col="blue")
dev.new()
s.label(acp1$li)

³ Considering only two measurements, as here, it indeed gives an exact idea.


Figure 4.3: Scatterplot of the two original variables (standardized) and the two directions of projection

Figure 4.4: Scatterplot of the two sets of projections


Chapter 5

Principal Components Analysis

5.1 PCA: first order solution

The datafile WORLD1984 may be obtained from the workspace SCvn2014.RData. It corresponds to five numeric measurements made on 48 countries: GNP (log), population growth, death rate (log), illiteracy (log(x+1)) and school attendance. The goals of the PCA are (1) to understand the relationships between the five measurements and (2) to observe the closeness between the 48 countries.
One can see that the first eigenvalue is very large compared to the following ones (figure 5.1). So it is probably interesting to select only one principal component.
The relationships of the original measurements with this principal component are visualized in figure 5.2. One can see that all variables are very well linked to the principal component, which may be interpreted as a (negative) development index. The scores for the projections of a set of representative countries clearly confirm this (figure 5.3).
### Principal components analysis
# the workspace SCvn2014.RData must be loaded

print(WORLD1984)
plot(WORLD1984)
pca.WORLD<-dudi.pca(WORLD1984)

# plotting the first order solution


score(pca.WORLD)
row.scores<-pca.WORLD$li[,1]
names(row.scores)<-rownames(pca.WORLD$li)
dotchart(sort(row.scores)[c(1:5,21:25,41:45)])

Figure 5.1: Plot of the eigenvalues from the PCA of the WORLD1984 datafile

Figure 5.2: Scatterplots of the five original variables from the WORLD1984 datafile with the first principal component

Figure 5.3: Dotplot for the projections of the countries; only 15 of them are given (the 5 lowest scores, 5 medium scores, the 5 highest scores and of course. . . Vietnam!)

5.2 PCA: second order solution


The dataset DECATHLON from the workspace SCvn2014.RData gives the per-
formances of the Olympic 1988 men's decathlon. Our aim is to describe the
relationships between the 10 events and the closeness between the 33 athletes.

# plotting the second order solution


head(DECATHLON)
pca.decathlon<-dudi.pca(DECATHLON)
s.corcircle(pca.decathlon$co)
s.label(pca.decathlon$li)
scatter(pca.decathlon)

The correlation circle (figure 5.4) indicates that

1. the results of sprint (100m) and hurdles are positively correlated;
2. the results of sprint (100m) and long jump are negatively correlated. It may seem at first glance quite a surprising result. But negative correlation means: high values in jump (long distances) are probably associated with low values in running (short times). That is to say, good performances are linked!
3. there is no correlation between the sprint results and the throwing results.

One can summarize these findings by saying that two components emerge in the performances: a speediness component and a strength component¹.
The projections of the 33 points give the closeness between the athletes (figure 5.5). One can see that the best athletes (the number gives the final ranking in the Olympics) are on the right. But how can we link this with the correlation circle? The biplot (figure 5.6) gives a simultaneous picture of the variables and the statistical units.
When a statistical unit is located "in the direction of a variable" (see how number 6 is in the direction of LongJump), it means that in the dataset the individual has (probably) a high value for this variable (yes, 7.72 meters for number 6). On the contrary, if an individual is in the opposite direction of a variable (say 31, again with LongJump), it must have a low value on the corresponding variable (yes, 6.22 meters for unit 31). The (0,0) location corresponds to a mean performance in every event. If the individual is away from the direction, one considers its orthogonal projection onto the line defined by the variable.

Exercise 5 The datafile RUNNING (in SCvn2014.RData) corresponds to the best results of 51 countries in 8 running events at the Olympic games from 1896 to 1984. Perform a PCA of this dataset and try to interpret the results.

¹ Sometimes a rotation procedure, such as varimax, is used in order to identify these two interesting directions with real components, especially in psychology.
Figure 5.4: Correlation circle of the 10 events with the first two principal components from the PCA of the DECATHLON dataset

Figure 5.5: Projections of the 33 athletes onto the first two principal directions

Figure 5.6: Biplot of the DECATHLON PCA

5.3 PCA: diagnostics


5.3.1 Decomposing information

The first important piece of information is the ability of the principal components to summarize the overall information. How many components should we select? The eigenvalues λ decompose the trace of the symmetric matrix WD (or V = R). The trace of this matrix is called the inertia. In the standardized PCA setting, the trace of the matrix is p (the number of measurements). So in order to give the percentage of the information summarized by the corresponding principal component, one can compute λ / tr(V) = λ / tr(WD) (here reduced to λ/p). The function inertia.dudi does that, and one obtains in the $tot component that the first principal component corresponds to 34% of the information, and 60% cumulated with the second one.

inertia.dudi(pca.decathlon)
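These percentages can also be checked directly from the eigenvalues of the dudi object; a small sketch:

### Percentages of inertia computed from the eigenvalues
100 * pca.decathlon$eig / sum(pca.decathlon$eig)            # per-axis percentages
100 * cumsum(pca.decathlon$eig) / sum(pca.decathlon$eig)    # cumulated percentages (34%, 60%, ...)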

5.3.2 Contributions to inertia

It is crucial to understand how the principal components are constructed from the original variables (and from the statistical units). Because the eigenvalue is equal to the sum of the squared correlations of the original variables with the principal component, one can estimate the importance of each variable as a percentage. These percentages are called (column) contributions to inertia:
$$\text{Column } j \text{ contribution to inertia of axis } k = \frac{r^2(x_j, z_k)}{\lambda_k}.$$
The same inertia.dudi function, used with the argument col=TRUE, gives the results in the $col.abs component. The most important contributions to axis 2, for instance, are PuttingShot (2338), Discus (2533), Javelin (1384) and Run1500 (1772).

inertia.dudi(pca.decathlon,col=TRUE)

Exercise 6 Prove that
$$\text{Column } j \text{ contribution to inertia of axis } k = v_{j(k)}^2$$
where v_{j(k)} is the j-th coordinate (corresponding to the variable xⱼ) of the k-th eigenvector vₖ of the correlation matrix.

Remark 4 We can also compute contributions to inertia for the statistical units (using the argument row=TRUE), but it is usually not very informative.
5.3.3 Squared cosines

How well can a variable lying in an n-dimensional space be represented in a two-dimensional plot? Beyond the general information given by the eigenvalues, one has to estimate the quality of the representation of each variable. This is done by computing squared cosines².
In the standardized PCA setting, the squared cosine of variable xⱼ with axis zₖ may be computed as r²(xⱼ, zₖ).
The function inertia.dudi gives these squared cosines in the $col.cum component. For instance "PuttingShot" is a well represented variable: 25% on axis 1 and 86% cumulated with axis 2. Only 14% remains unexplained (outside the plot). On the contrary, the quality of representation of "HighJump" is very bad (16%) on the two axes! A third axis is vital to interpret this event.
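Since, in the standardized PCA, $co contains the correlations of the variables with the kept axes, these squared cosines can be checked directly; a minimal sketch:

### Squared cosines of the variables, from the squared correlations
cos2 <- pca.decathlon$co^2               # r^2(x_j, z_k) for the kept axes
round(100 * cos2)                        # per-axis quality of representation (in %)
round(100 * t(apply(cos2, 1, cumsum)))   # cumulated over the kept axes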
Exercise 7 Look at the squared cosines for the rows to see which very important athlete is nevertheless badly represented. Why?

² This will be explained later in detail.


Chapter 6

Beyond the basics: the statistical triple

We now need to generalize the method of principal components to different types of data. The theory is deeply linked to classical results in linear algebra. The interest of this theory is to provide a geometrical and unifying way to handle multivariate data: the duality diagram. We will also see an immediate and very classical application: the centered PCA¹.

¹ Also called PCA on the covariance matrix.

6.1 The statistical triple


The statistical triple is based on three components: an n × p matrix of data Y, a way to compute inner products (or distances) between the rows using a p × p symmetric positive matrix Q, and an n × n symmetric positive matrix D to compute inner products between the columns. Let's now examine these three components.

6.1.1 The three elements of the statistical triple


Depending on the nature of the data, the original variables X are suitably transformed, for instance using standardization as in the last chapter, but one can imagine any interesting way of dealing with the data: Y = f(X).
Each statistical unit (row of the matrix Y) is a vector in Rᵖ and a particular inner product is used to compute the distances between them in a proper way. If we denote Q the corresponding p × p (positive symmetric) matrix, the inner product between two p-vectors p₁ and p₂ is (p₁ | p₂)_Q = p₁′Qp₂.
In the same way, we use an n × n positive symmetric matrix D in the space of columns (Rⁿ). For two n-vectors n₁ and n₂, the inner product is thus (n₁ | n₂)_D = n₁′Dn₂.
From the statistical triple, two matrices can be computed:
$$WD = YQY'D$$
and
$$VQ = Y'DYQ.$$
The eigendecompositions of these matrices are deeply linked:

Remark 5 The two matrices share the same eigenvalues. The eigenvectors of WD may be obtained from the eigenvectors of VQ.

All these results may be summarized in the duality diagram:

              Q
     R^p  ------->  R^p
      ^              |
   Y' |              | Y
      |              v
     R^n  <-------  R^n
              D
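A small numeric check of this construction, using the standardized-PCA triple of chapter 4 (Y standardized, Q = I_p, D = I_n/n); the objects X, Y, Q, D, WD and VQ below are only local illustrations:

### Building a statistical triple by hand and checking the shared eigenvalues
require(ade4)
data(olympic)
X <- olympic$tab[, 1:2]
Y <- scalewt(X)                       # centered and standardized (n divisor)
n <- nrow(Y); p <- ncol(Y)
Q <- diag(p)                          # inner product between rows
D <- diag(n) / n                      # inner product between columns (uniform weights)
WD <- Y %*% Q %*% t(Y) %*% D
VQ <- t(Y) %*% D %*% Y %*% Q
round(eigen(VQ, only.values = TRUE)$values, 6)
round(sort(Re(eigen(WD, only.values = TRUE)$values), decreasing = TRUE)[1:p], 6)  # same non-zero eigenvalues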

6.1.2 Properties of the duality diagram


A function of a triple (Y,Q,D)
The ideas underlying PCA are thus generalized, and we will consider a function of the triple (Y = f(X), Q, D), which is called the duality diagram (dudi). It produces three sets of outputs after the eigendecompositions of the Q-symmetric matrix Y′DYQ (whose eigenvectors are in A_f) and of the D-symmetric matrix YQY′D (whose eigenvectors are in B_f), which share the same set² of eigenvalues (Λ_f):

• a set of n-vectors for representing the rows, b₁, b₂, ... up to the order f, which may be bound into an n × f matrix B_f where B_f′DB_f = I_f (they are D-orthogonal and D-scaled to unity). They will be called row eigenvectors.

• a set of p-vectors for representing the columns, a₁, a₂, ... up to the order f, which may be bound into a p × f matrix A_f where A_f′QA_f = I_f (they are Q-orthogonal and Q-scaled to unity). They will be called column eigenvectors.

• a set of eigenvalues λ₁, λ₂, ... which may be given in decreasing order in a diagonal f × f matrix Λ_f.

² Except some zero eigenvalues.

Main optimization properties


Maximization (and minimization) For any p-vector a, Q-scaled to unity, one can project the n points corresponding to the rows of the transformed matrix Y. One obtains an n-vector YQa. It will be called the row projections vector. The squared norm of these projections in the Euclidean space Rⁿ with inner product D is ‖YQa‖²_D.

Theorem 1 The first column eigenvector a₁ maximizes the quantity ‖YQa‖²_D under the constraint ‖a₁‖²_Q = 1. This maximum is equal to λ₁. The second column eigenvector a₂ maximizes the same criterion under a constraint of orthonormality ((a₁ | a₂)_Q = 0 and ‖a₂‖²_Q = 1), with a maximum of λ₂, and so on.

In the same way, onto any n-vector b, D-scaled to unity, one can project the p points corresponding to the columns of the transformed matrix Y′ (or compute a kind of non-centered covariance). One obtains a p-vector Y′Db called the column projections vector. The squared norm of these projections in the Euclidean space Rᵖ with inner product Q is ‖Y′Db‖²_Q.

Theorem 2 The first row eigenvector b₁ maximizes the quantity ‖Y′Db‖²_Q under the constraint ‖b₁‖²_D = 1. This maximum is equal to λ₁. The second row eigenvector b₂ maximizes the same criterion under the constraint of orthonormality ((b₁ | b₂)_D = 0 and ‖b₂‖²_D = 1), with a maximum of λ₂, etc.

It is worth noticing the following relationships between the eigenvectors and the projections vectors.

Theorem 3 The row eigenvectors are proportional to the corresponding row projections vectors: bₖ = (1/√λₖ) YQaₖ. This is also true for the columns: aₖ = (1/√λₖ) Y′Dbₖ. These properties are called the transition relationships.

Theorem 4 The row and column eigenvectors also maximize the dual quantities (YQa | b)_D and (Y′Db | a)_Q. The maximum obtained is √λ₁ for the first scores. The following eigenvectors maximize the same quantities under orthonormality constraints.

Theorem 5 The matrices A_f and B_f provide the best reconstitution of the data: Ŷ = B_f Λ_f^(1/2) A_f′, over all Q- and D-orthonormal matrices of rank f, in the sense of minimizing Tr((Y − Ŷ) Q (Y − Ŷ)′ D). This theorem is called the Eckart-Young decomposition.

6.2 The dudi class


The dudi class of R objects is the exact translation of the duality diagram into
the R environment.
6.2.1 Arguments of a dudi object
The arguments are a statistical triple whose components are:
• $tab a n × p data frame (suitably transformed) : Y=f (X)
• $cw a p-vector of column weights (the Q metric is supposed to be diagonal)
• $lw a n-vector of row weights (the D metric is also diagonal)

This triple is built by a particular method (dudi.pca, dudi.coa, dudi.mix, ...): a call to dudi.pca() with the data frame X produces by default the triple (X_s, I_p, (1/n) I_n), where X_s is the standardized matrix.

6.2.2 Values of a dudi object

The values produced by a duality diagram object are:
• $eig the sequence of eigenvalues (λ₁, λ₂, ..., λ_min(p,n))
• $nf the number of chosen dimensions (f)
• $c1 the matrix of column eigenvectors in Rᵖ (A_f)
• $l1 the matrix of row eigenvectors in Rⁿ (B_f)
• $co the matrix of column projections vectors (Y′DB_f = A_f Λ^(1/2))
• $li the matrix of row projections vectors (YQA_f = B_f Λ^(1/2))

The print.dudi() function gives a quick summary of a dudi object. The score()
and scatter() functions are generic functions for plotting dudi objects.
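These relations between eigenvectors and projections can be checked on any dudi object; a small sketch reusing acp1 from chapter 4 (two kept axes):

### Checking $co = A_f Lambda^(1/2) and $li = B_f Lambda^(1/2) on a dudi object
acp1$co
sweep(acp1$c1, 2, sqrt(acp1$eig[1:acp1$nf]), "*")   # should reproduce acp1$co
acp1$li
sweep(acp1$l1, 2, sqrt(acp1$eig[1:acp1$nf]), "*")   # should reproduce acp1$li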

6.2.3 Benefits of using the dudi

Unifying a multiplicity of practices

The first interest of the duality diagram is to understand the link between quite different techniques of multivariate analysis (PCA, Correspondence Analysis, Discriminant Analysis...). The choices are made explicit and the same maximization theorems are shared by all analyses because the mathematical core is actually the same one.
The standardized PCA, for instance, is a duality diagram using the transformation Y = f(X) where the columns are centered and standardized, Q is the identity matrix and D the identity divided by the number of units n. The properties of this analysis are then obtained as particular instances of the general maximization theorems.
We will see in a later chapter that the multiple approaches of Correspondence Analysis can be organized through a duality diagram and that these approaches differ only in the maximization theorem which is presented as the basis of the respective analyses.

A road to other possible choices

It is also conceivable to use the duality diagram to consider new situations not handled by classical techniques such as the standardized PCA. In the deug data from the ade4 package, for example, one wants to study students' marks with respect to the midpoint and not to the mean of the marks. So it is interesting to consider the transformation Y = f(X) where the columns are centered with respect to the midpoint, Q = I_p is the identity matrix because we have homogeneous data, and D = (1/n) I_n if we want to give the same weight to each student.

Mixing categorical and numeric data


Another very powerful possibility is to generate techniques which can mix categorical and numeric data. This is not very commonplace, especially in an introduction to multivariate analysis. But we will see that a mix of PCA, the common method for numeric data, and Multiple Correspondence Analysis, the most standard one for categorical data, is easily built by using a duality diagram.

Computer operations
The last but not least advantage of this general procedure is to allow the use of a single general algorithm to carry out different analyses, even unexpected ones. This is similar to the case of the general algorithm used for estimating generalized linear models: the choices are made explicit, we can discuss these choices, but we have no surprises with the computer operations (except in very special conditions).

6.3 PCA on the covariance matrix


An interesting and useful application of this duality diagram theory may be seen by studying the SWIMMINGPOOL dataset from the workspace SCvn2014.RData. It contains 13 measurements giving the wishes for transforming the municipal swimming pools in Lyon (France). The variables correspond to installing a sauna, a jacuzzi, a space for talking, a day nursery... People can answer that they really wish a given investment (score 5) or do not want it at all (score 1). This kind of measurement is called a Likert scale.
So the 615 × 13 matrix is a homogeneous one: the same kind of variable is collected throughout. In this case, it is not clear whether one ought to standardize those variables before processing a PCA. In fact, in this case, it seems better to only center the data, because the standard deviations of the measurements give some interesting information. If the standard deviation is null, all people share the same opinion. On the contrary, if the standard deviation is large, people do not agree on the transformations. It is thus interesting not to standardize because this gives more weight to measurements showing larger dispersion.
If the original variables are only centered, then the eigendecomposition is performed not on the correlation matrix but on the covariance matrix. The principal components are then related to the original variables in terms of covariance rather than correlation. The computations are done by using the argument scale=FALSE in the dudi.pca function.

### PCA on a covariance matrix


head(SWIMMINGPOOL)
dotchart(sort(sapply(SWIMMINGPOOL,mean)))
dotchart(sort(sapply(SWIMMINGPOOL,sd)))
pca.swimming<-dudi.pca(SWIMMINGPOOL,center=TRUE,scale=FALSE)
s.arrow(pca.swimming$co)
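As a quick check of this choice of analysis, the total inertia of the centered, unstandardized PCA should equal the sum of the column variances computed with divisor n (the divisor implicitly used by dudi.pca through its uniform row weights). A minimal sketch:

# Sketch check: total inertia of the covariance PCA vs the sum of variances
n<-nrow(SWIMMINGPOOL)
sum(pca.swimming$eig)
sum(sapply(SWIMMINGPOOL,var)*(n-1)/n)   # variances with divisor n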

One does not plot here the 615 projections of the statistical units because, in a survey sample, the individual opinions are of no interest. Figure 6.1 gives a terse view of the wishes: a first set of transformations related to water equipment (and body care), another, unrelated, set of wishes directed toward sociability, and a middle group oriented toward gym activities (which are physical and sociable activities at the same time).

Exercise 8 The datafile TOOTHPASTE from the workspace SCvn2014.RData gives the importance (1 = not important at all to 5 = very important) of eight criteria concerning toothpaste as judged by 100 people. Investigate the use of a centered PCA on this dataset. Interpret the results.

6.4 Diagnostics
6.4.1 A decomposition of the trace
The total inertia of a triple (Y, Q, D) is
$$ I_T = \operatorname{Trace}(Y' D Y Q). $$
Note that if the inner products are diagonal ones, the inertia can be written
$$ I_T = \sum_{i=1}^{n} \sum_{j=1}^{p} d_i \, q_j \, y_{ij}^2. $$
The eigenvalues give a decomposition of the trace:
$$ I_T = \sum_k \lambda_k, $$
so the quantity $\lambda_k / I_T$ gives the importance of axis k.
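In practice these shares are read directly from the eig component of the dudi object; a minimal sketch on the swimming pool analysis of the previous section:

# Sketch: percentage of inertia carried by each axis
round(100*pca.swimming$eig/sum(pca.swimming$eig),1)
barplot(pca.swimming$eig)   # scree plot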

Figure 6.1: Plot of the covariances of the original variables with the first two principal components from the centered PCA of the SWIMMINGPOOL dataset
6.4.2 Contribution to inertia
Let us denote by $l_k$ (resp. $c_k$) the row (resp. column) projection vector onto axis k and by $b_k$ (resp. $a_k$) the corresponding row (resp. column) eigenvector. Considering axis k, one can write
$$ \lambda_k = \| l_k \|_D^2 = \sum_i d_i \, l_k(i)^2. $$
The contribution to inertia of row i on axis k is $d_i \, l_k(i)^2 / \lambda_k$. In the same way, the contribution of column j on axis k is $q_j \, c_k(j)^2 / \lambda_k$.
6.4.3 Squared cosines


If one decomposes the i-th row vector $y_i$ in the orthonormal basis given by the column eigenvectors $a_j$, the result is $y_i = \sum_j l_j(i) \, a_j$.
Considering the squared norm, one has
$$ \| y_i \|_Q^2 = \Big\| \sum_j l_j(i) \, a_j \Big\|_Q^2 = \sum_j l_j(i)^2. $$
The squared cosine of row i with axis k is
$$ \frac{l_k(i)^2}{\| y_i \|_Q^2}. $$

Remark 6 One is usually more interested in the squared cosines of column (variable) j with axis k:
$$ \frac{c_k(j)^2}{\| y_j \|_D^2}. $$
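These diagnostics are implemented in ade4: the inertia.dudi function returns the contributions and squared cosines of rows and columns in one call. A minimal sketch on the swimming pool analysis (the names of the returned components depend on the ade4 version, hence the call to names):

# Sketch: contributions and squared cosines with inertia.dudi (ade4)
diag.swimming<-inertia.dudi(pca.swimming,row.inertia=TRUE,col.inertia=TRUE)
names(diag.swimming)   # component names depend on the ade4 version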

Exercise 9 Generate and read the column diagnostics for the centered PCA of the TOOTHPASTE dataset.
Chapter 7

Multiple Correspondence Analysis
7.1 Multivariate categorical data


We now face the problem of multivariate categorical data, that is to say a dataset where all measurements are categorical. The dataset CATS from the workspace SCvn2014.RData corresponds to 134 female cats and describes their age (5 categories), the number of kittens they had in one year (5 categories) and the number of litters (2 categories). Note that these measurements were initially numeric but were subsequently transformed into categories.
### Multiple Correspondence analysis
# the workspace SCvn2014.RData must be loaded

head(CATS)
sapply(CATS,class)
summary(CATS)

The goals are similar to those of a PCA: determining if any variables are re-
lated, how they are related and discovering the closeness between the statistical
units.

7.2 Multiple Correspondence analysis


Remember that in principal components analysis, one tries to find a (continuous) score of maximum correlation with the set of continuous variables. Multiple Correspondence Analysis (MCA) proceeds in the same way, that is to say it finds a (continuous) score of maximum correlation with the set of categorical variables. The correlation ratio is simply used instead of the squared linear correlation coefficient.

Definition 4 The first order solution of the multiple correspondence analysis of the p categorical variables $X_j$ provides a vector of row scores which maximises the following criterion:
$$ \frac{1}{p} \sum_{j=1}^{p} CR(b_1, X_j). $$
This vector $b_1$ of scores is standardized ($\bar{b}_1 = 0$ and $\mathrm{var}(b_1) = 1$). The solution is obtained by a particular eigendecomposition, so it is called the first row eigenvector. The corresponding maximum is called the first eigenvalue.

Using this solution one can plot separately, for each variable $X_j$, the means of the corresponding categories and interpret the links between the variables using the locations of the categories (see figure 7.1). On this plot, the youngest cats (category 1) share the same location as the low-fecundity ones and the one-litter ones.
The function dudi.acm carries out the necessary computations. One-dimensional plots are obtained using the score function and two-dimensional ones using the scatter function. For the CATS dataset, only one axis is important.

# MCA: first order solution


mca.cats<-dudi.acm(CATS)
score(mca.cats)

Exercise 10 The DOGS dataset (in workspace SCvn2014.RData) gives for 27 types of dogs their measurements of size, weight, intelligence... as categorical variables. Carry out a MCA of this dataset. Plot the two-dimensional solution. Try to interpret it.

7.3 The MCA as a statistical triple


7.3.1 The disjunctive table
From the original dataset CATS a new table is created: the disjunctive table. If $X_j$ is a categorical variable corresponding to n measurements in a set of $m_j$ categories, the corresponding coding is a $n \times m_j$ matrix whose value in row i is one or zero depending on whether unit i belongs to the corresponding category or not (see table 7.1). Let m be the total number of categories: $m = \sum_{j=1}^{p} m_j$. The disjunctive table is a $n \times m$ matrix (here n = 134 and $m_1 = 5$, $m_2 = 5$, $m_3 = 2$, so m = 12).

# The disjunctive data table


X<-acm.disjonctif(CATS)
head(X)
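A simple check of this coding is that every row of the disjunctive table sums to the number of variables (here 3):

# Sketch check: each row of the disjunctive table sums to p
table(rowSums(X))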

Figure 7.1: Results of the MCA of the CATS data. The first row eigenvector is represented on the x-axis. Each plot corresponds to a separate variable. The categories are plotted as means.

Table 7.1: An excerpt of the disjunctive table for the CATS data
Age.1 Age.2-3 Age.4-5 Age.6-7 Age.8 Fec.1-2 Fec.13-14
1 1 0 0 0 0 1 0
2 1 0 0 0 0 1 0
3 1 0 0 0 0 1 0
4 1 0 0 0 0 1 0
5 1 0 0 0 0 1 0
6 1 0 0 0 0 1 0

7.3.2 The MCA statistical triple


Let X be the $n \times m$ disjunctive table. $D_n$ is the diagonal matrix giving the row weights (the sum of its diagonal is one). One generally uses the uniform weighting $D_n = \frac{1}{n} I_n$.
An m-vector of column weights is computed as $d_m = X' D_n 1_n$, where $1_n$ is an n-vector of ones. This vector gives the relative frequencies corresponding to each category (they sum to unity for each variable). $D_m$ is the corresponding diagonal matrix. Because the sum of its diagonal is thus equal to p, one uses the matrix $\frac{1}{p} D_m$ as the weighting system for the columns.
We finally consider the $n \times m$ matrix $Y = X D_m^{-1} - 1_{nm}$, where $1_{nm}$ is a $n \times m$ matrix of ones.
The MCA triple is $(Y, \frac{1}{p} D_m, D_n)$. Let us now prove that the corresponding duality diagram maximizes the MCA property described in theorem 4.

Exercise 11 Prove that $1_n$ is an eigenvector of the $n \times n$ matrix to diagonalise from the statistical triple of the MCA.

Proof 1 First, let us show that the row eigenvectors are $D_n$-centered. Because $1_n$ is also an eigenvector (corresponding to a zero eigenvalue), and because the eigenvectors are $D_n$-orthogonal (the matrix to diagonalize is symmetric), the row eigenvectors are centered.
Then, as stated in theorem 2, the first row eigenvector $b_1$ maximizes the quadratic form $\| Y' D_n b \|^2_{\frac{1}{p} D_m}$. The coordinates of the vector $Y' D_n b = (X D_m^{-1} - 1_{nm})' D_n b$ are exactly the mean values of the score b for each category, minus the general mean. When we compute the squared norm $\| \cdot \|^2_{\frac{1}{p} D_m}$ of this vector of means, we obtain exactly the mean of the between variances¹ of the score b with respect to each variable. Because this score is scaled to unity and because we have shown that it is centered, the total variance of the score is one, so one indeed maximises the mean of the correlation ratios with the first vector of scores.

1 as they are called in the analysis of variance method
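As an illustration (and not the official ade4 implementation), the MCA triple can be built by hand and analyzed with dudi.pca using explicit row and column weights; the first eigenvalue is then expected to match the one returned by dudi.acm, and this numerical check is worth running:

# Sketch: building the MCA triple (Y, (1/p)Dm, Dn) by hand
X<-acm.disjonctif(CATS)
n<-nrow(X); p<-ncol(CATS)
dm<-colMeans(X)                                     # category frequencies X'Dn1n
Y<-as.data.frame(sweep(as.matrix(X),2,dm,"/")-1)    # Y = X Dm^{-1} - 1nm
mca.hand<-dudi.pca(Y,center=FALSE,scale=FALSE,row.w=rep(1/n,n),col.w=dm/p,
                   scannf=FALSE,nf=2)
mca.hand$eig[1]   # expected to match the first eigenvalue of dudi.acm
mca.cats$eig[1]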



7.4 MCA diagnostics (See exercises)


Let us compute, in the case of the MCA triple, the percentage of total inertia for the j-th category. It is proportional to $q_j \sum_{i=1}^{n} d_i \, y_{ij}^2$, that is to say, for the MCA triple,
$$ \frac{1}{p} d_m(j) \sum_{i=1}^{n} d_i \left( \frac{x_{ij}}{d_m(j)} - 1 \right)^2 = \frac{1}{p} d_m(j) \left( \frac{1}{d_m(j)} - 1 \right) = \frac{1}{p} \left( 1 - d_m(j) \right). $$
So if a category is a rare one, it has an important contribution to the total inertia. This may be a potential pitfall for the analysis.

Remark 7 One has to balance the number of units in each category before carrying out a MCA.

In the same way, if we compute the inertia for the j-th variable with $m_j$ categories, it is proportional to $\sum_{k=1}^{m_j} \frac{1}{p} \left( 1 - d_m(k) \right) = \frac{1}{p} (m_j - 1)$. The importance of a variable is thus proportional to the number of its categories.

Remark 8 One has also to balance the number of categories per variable before carrying out a MCA.
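These quantities are easy to compute from the disjunctive table. A minimal sketch, which also checks that the total MCA inertia equals (m − p)/p:

# Sketch: contribution of each category to the MCA total inertia, (1/p)(1 - dm(j))
dm<-colMeans(acm.disjonctif(CATS))
p<-ncol(CATS)
sort((1-dm)/p,decreasing=TRUE)   # rare categories contribute the most
sum((1-dm)/p)                    # total inertia, equal to (m - p)/p
sum(mca.cats$eig)                # should give the same value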

7.4.1 Correlation ratios


The correlation ratios are the most important diagnostics for understanding the importance of each variable in the interpretation of the MCA plots. They can be read in the $cr component of the dudi.acm object.
# Correlation ratios
print(mca.cats$cr)

7.5 Miscellaneous
7.5.1 Correspondence analysis
The application of multiple correspondence analysis to two categorical variables
is exactly equivalent to a very popular multivariate technique: correspondence
analysis of the corresponding contingency table.

7.5.2 Fuzzy correspondence analysis


Beyond the scope of this course (see the function dudi.fca in the ade4 package).

7.6 Exercise
The datafile ROXY in the workspace SCvn2014.RData corresponds to 308 females practicing ski or snowboard. They were asked about their goals when doing the activity: to take risks, to enjoy the landscape, to be in great shape... The answer can only be yes or no, leading to categorical (binary) variables. Use a MCA in order to study the ROXY data.
# the workspace SCvn2014.RData must be loaded
# Converting numeric data into factors
sapply(ROXY,class)
for(i in 1:10){
ROXY[,i]<-factor(ROXY[,i],labels=c("NO","YES"))
}
sapply(ROXY,class)
sapply(ROXY,summary)
Chapter 8

Hill and Smith's analysis

8.1 Multivariate mixed data


It is quite rare to measure only continuous data or only categorical data. Usually both are obtained during a study, and it is a pity to carry out separate analyses because of technical problems, or to transform the numeric variables into categorical ones in order to perform a MCA. Hill and Smith's analysis allows both types of data to be mixed, so that variables can be chosen for their intrinsic interest and not for their scale of measurement.
The LIFE dataset is extracted from a survey of French people. The questionnaire included very different questions about sex, age, TV, money and health problems... Some of these measurements are numeric and some are categorical.
### Hill & Smith's analysis
# the workspace SCvn2014.RData must be loaded

head(LIFE)
sapply(LIFE,class)
summary(LIFE)

The goal remains the same one: trying to understand the relationships be-
tween the measurements.

8.2 Hill and Smith analysis


8.2.1 Property
On the one side, standardized PCA maximizes the sum of the squared linear correlation coefficients of the (continuous) variables with a row score (principal component). On the other side, multiple correspondence analysis maximizes the mean of the correlation ratios of the (categorical) variables with a row score, and thus also their sum. Because squared linear correlation coefficients and correlation ratios are of the same mathematical type (squared cosines), we can mix these two analyses provided they have the same row weights.

Definition 5 The Hill and Smith's analysis (H&S analysis) maximizes the sum of the squared cosines of the (categorical or continuous) variables with a row score.

Of course, the procedure may be iterated, looking for a second uncorrelated score optimizing the same criterion.

# HSA: first order solution
mix.life<-dudi.mix(LIFE)
score(mix.life)

The plot of one score is obtained using the usual score function. But for two dimensions, the scatter function is not efficient. Two functions are provided in the workspace SCvn2014.RData in order to obtain a correlation circle for the numeric variables and the classical MCA separate plots for the categorical variables.

# HSA: second order solution
scatter(mix.life)
scatter.mix.numeric(mix.life)
scatter.mix.categorical(mix.life)

8.3 H&S analysis as a duality diagram


Starting from two triples, the first one $(X_s, I_p, D_n)$ corresponds to a standardized PCA of a $n \times p$ matrix X and the second one $(Z D_m^{-1} - 1_{nm}, \frac{1}{q} D_m, D_n)$ to a multiple correspondence analysis¹ of a set of q categorical measurements of the n units, coded by a $n \times m$ disjunctive table Z.

Remark 9 The weighting systems for the rows $D_n$ must be identical for the two triples.

The Hill and Smith's analysis is the analysis of the triple $(Y, Q, D_n)$ where Y is the $n \times (p + m)$ matrix built by binding the two datasets from the triples, $Y = [\, X_s \mid Z D_m^{-1} - 1_{nm} \,]$, and the weighting system for the columns is the $(p + m) \times (p + m)$ block-diagonal matrix
$$ Q = \begin{pmatrix} I_p & 0 \\ 0 & D_m \end{pmatrix}. $$
From the general property of a duality diagram applied to this triple, one can see that a PCA and a MCA are simultaneously performed, so one maximizes
• the sum of the squared linear correlation coefficients for the continuous variables

¹ The constant 1/q is indeed omitted.

Figure 8.1: Results of the H&S analysis of the LIFE data. The numeric variables are represented by their correlations with the first two solutions.

Figure 8.2: Results of the H&S analysis of the LIFE data. The categories of the variables are located at the means of the first two solutions.

• and the sum of the correlation ratios for the categorical variables,

both with the first row score of the analysis.
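As a computational sketch of the triple just described (an illustration, not ade4's own implementation), one may bind the standardized numeric block and the MCA block by hand and analyze the result with dudi.pca using the block-diagonal column weights. The sketch assumes that LIFE contains only numeric and unordered factor columns; dudi.mix also handles ordered factors and its normalisation conventions may differ, so only a rough agreement of the leading eigenvalues should be expected:

# Sketch: the H&S triple built by hand (assumes numeric and factor columns only)
num<-sapply(LIFE,is.numeric)
n<-nrow(LIFE)
Xs<-scale(LIFE[,num])*sqrt(n/(n-1))            # standardization with divisor n
Z<-acm.disjonctif(LIFE[,!num])
dm<-colMeans(Z)
Ymca<-sweep(as.matrix(Z),2,dm,"/")-1           # MCA block: Z Dm^{-1} - 1nm
Y<-as.data.frame(cbind(Xs,Ymca))
colw<-c(rep(1,sum(num)),dm)                    # block-diagonal weights (Ip, Dm)
hs.hand<-dudi.pca(Y,center=FALSE,scale=FALSE,row.w=rep(1/n,n),col.w=colw,
                  scannf=FALSE,nf=2)
hs.hand$eig[1:2]
mix.life$eig[1:2]                              # rough comparison with dudi.mix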


The squared cosines are the most important diagnostics for the H&S analysis. They can be obtained from the $cr component or plotted with the command scatter.mix.cr(mix.life).

Exercise 12 The dataset ASVEL from the workspace SCvn2014.RData corresponds to fans of a famous French basketball team. They were asked about their sex, (log) age, (log) distance of the stadium from their home, the importance of the team in their life, with whom they come to the stadium and whether they have bought a commercial product sold by the team. Perform a H&S analysis of this dataset.

Exercise 13 Test yourself with the 2010-2011 examination (on the website).
Chapter 9

Correspondence analysis

9.1 Contingency table


The housetasks data from the ade4 package gives thirteen housetasks and their distribution in a sample of married couples: wife, alternating, husband and jointly. This table is a contingency table, that is to say it corresponds to 1744 measurements of two categorical variables: (1) the nature of the task and (2) who does it.
We have previously seen that the classical test for studying such a relationship is the chi-squared test. Here the relationship is strong. How can it be described?

require(ade4)
data(housetasks)
chisq.test(housetasks)
mosaicplot(housetasks,las=2,main="Housetasks",color=TRUE)

9.2 Analyzes of a contingency table


The first approach for analyzing a contingency table is to compute the profiles table (row conditional frequencies) and to perform a simple centered PCA of these profiles.
Though this analysis is easy to carry out in practice, it is not clearly related to a well-known statistical criterion (such as, for instance, the chi-squared statistic). We thus begin by describing three of these criteria.
Instead of the $I \times J$ contingency table $N = (n_{ij})$, we will consider the relative frequencies table $P = N / n_{++}$. The row frequencies will be denoted $p_{i+} = \sum_j p_{ij}$, the column frequencies $p_{+j} = \sum_i p_{ij}$, and the corresponding diagonal matrices are $D_I$ and $D_J$.


Figure 9.1: Mosaic plot of the housetasks data



The chi-squared statistic studies the hypothesis of independence, that is to say the difference between $p_{ij}$ and $p_{i+} \times p_{+j}$, in the sense of
$$ X^2 = n_{++} \sum_{i,j} \frac{(p_{ij} - p_{i+} \times p_{+j})^2}{p_{i+} \times p_{+j}}. $$
For studying independence, two other criteria exist which compete with the chi-squared statistic:
• the Belson's criterion:
$$ B^2 = \sum_{i,j} (p_{ij} - p_{i+} \times p_{+j})^2 $$
and
• the Goodman-Kruskal's criterion:
$$ K^2 = \sum_{i,j} \frac{(p_{ij} - p_{i+} \times p_{+j})^2}{p_{i+}}. $$
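These three criteria are easy to compute directly from the relative frequencies table; the following sketch does so for the housetasks data:

# Sketch: computing the three independence criteria for the housetasks table
P<-as.matrix(housetasks)/sum(housetasks)
ri<-rowSums(P); cj<-colSums(P)
E<-outer(ri,cj)                      # pi+ x p+j, the prediction under independence
X2<-sum(housetasks)*sum((P-E)^2/E)   # chi-squared statistic
B2<-sum((P-E)^2)                     # Belson's criterion
K2<-sum((P-E)^2/ri)                  # Goodman-Kruskal's criterion
c(X2=X2,B2=B2,K2=K2)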

9.3 Correspondence analysis


9.3.1 Correspondence analysis
The idea is to begin with a system of column scores and then to compute, for each row profile, a corresponding row mean.
Let a be a column score of length one ($a' D_J a = 1$). One can compute for the i-th row profile $(p_{j/i} = n_{ij} / n_{i+})_{j=1 \dots J}$ the corresponding mean $a_i = \sum_j p_{j/i} \, a_j$.
Let $\bar{a} = \sum_i p_{i+} a_i \, (= \sum_j p_{+j} a_j)$ be the mean of the row means. One can compute the variance $\sum_i p_{i+} (a_i - \bar{a})^2$, which quantifies the discrimination of the rows.

Theorem 6 The correspondence analysis column scores maximize the variance of the corresponding row conditional means. The column scores are constrained to be $D_J$-standardized.

The correspondence analysis is a symmetric method. The row scores, obtained in the same way, are the conditional row means up to a constant.
The function dudi.coa computes the row and column scores. Plots can be obtained using score and scatter. One can observe a clear distribution of the housetasks by sex.
# CA
coa.housetasks<-dudi.coa(housetasks)
scatter(coa.housetasks)

Figure 9.2: Two-dimensional plot from the correspondence analysis of the housetasks data

Exercise 14 The dataset BIRDS crosses time and space information about birds (winter teals) for which the rings were retrieved in area i during month j. Perform a correspondence analysis of this datafile. Interpret the result. Why would it be interesting to have a map of the locations?

9.3.2 Correspondence analysis as a duality diagram


Let us consider the table of the differences between the frequencies and the predictions under the hypothesis of independence, $y_{ij} = p_{ij} - p_{i+} \times p_{+j}$. The corresponding matrix operation is $Y = P - D_I 1_{I \times J} D_J$, where $1_{I \times J}$ is the usual $I \times J$ matrix of ones.
The correspondence analysis statistical triple is $(D_I^{-1} Y D_J^{-1}, D_J, D_I)$. Theorem 6 is a direct consequence of the general optimization property 1 of a duality diagram. It is obvious from the triple that the analysis is totally symmetric.

Remark 10 The constraint chosen for the column scores has a profound impact on the stability of the analysis. It can be disrupted by the presence of low-frequency columns (or low-frequency rows), in contrast with the Non Symmetric Correspondence Analysis (see below).

One of the difficulties with correspondence analysis is that very different approaches are possible. We can organize these theoretical ideas with the help of the duality diagram.

Exercise 15 Prove that $1_J$ is a column eigenvector. What is the consequence for the other column eigenvectors?

9.3.3 Different approaches for CA


Reciprocal averaging
The correspondence analysis is sometimes presented as a method of reciprocal averaging: using a vector of column scores a, one can compute the corresponding vector of row conditional means b. This vector can be considered as a system of row scores, and it can thus be used to compute the corresponding column conditional means. Can we find row and column scores such that the corresponding means are proportional, up to the smallest possible constant?
The solutions are the first row and column scores of the correspondence analysis, and the constant is $1 / \sqrt{\lambda_1}$. This reciprocal averaging property is used in the common correspondence analysis plot, which displays simultaneously the row and column conditional means (figure 9.2).
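This property can be checked numerically on the housetasks analysis; the sketch below assumes ade4's usual dudi conventions ($c1 holds the normed column scores, $li the row coordinates and $l1 the normed row scores):

# Sketch: reciprocal averaging with the CA of the housetasks data
Pr<-sweep(as.matrix(housetasks),1,rowSums(housetasks),"/")   # row profiles
b<-Pr%*%as.matrix(coa.housetasks$c1)[,1]   # row conditional means of column scores
head(cbind(b,coa.housetasks$li[,1]))       # expected to coincide
b[1]/coa.housetasks$l1[1,1]                # ratio close to sqrt(lambda1)
sqrt(coa.housetasks$eig[1])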

PCA using the chi-squared metric


The correspondence analysis may also be seen as a centered PCA of the row profiles ($D_I^{-1} P$) using the diagonal matrix $D_I$ as row weights and a particular metric called the chi-squared metric: for any pair of profiles $z_1$ and $z_2$, the corresponding scalar product is $z_1' D_J^{-1} z_2$. This inner product emphasizes the rare categories.
The corresponding triple $(D_I^{-1} P - 1_{I \times J} D_J, D_J^{-1}, D_I)$ is equivalent to the classical triple of correspondence analysis.


Remark 11 Many other approaches exist. In particular, correspondence analysis may be viewed as a discriminant analysis or as a canonical analysis (a way to maximize and linearize the correlation between two categorical variables linked by a contingency table).

Links between MCA and correspondence analysis


The $I \times J$ contingency table N corresponding to two categorical measurements on n units is the cross product of two disjunctive tables $Z_I$ ($n \times I$) and $Z_J$ ($n \times J$):
$$ N = Z_I' Z_J. $$
It can be shown that
• the column scores of the multiple correspondence analysis of the $n \times (I + J)$ disjunctive table $[\, Z_I \mid Z_J \,]$ are proportional to the scores of the correspondence analysis of the table N (the multiplicative constant is $\sqrt{2}$), and there is a direct link between the eigenvalues of the two analyses;
• the correspondence analysis of the table $[\, Z_I \mid Z_J \,]$ is equivalent to the MCA of the same table (and this is true also for 3 or more variables).
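A numeric check of this link can be sketched with the first two variables of the CATS data (age and number of kittens), using the classical relation between the eigenvalues, $\mu = (1 \pm \sqrt{\lambda})/2$ (stated here as an assumption, not proved in this course):

# Sketch: MCA of two variables versus CA of their contingency table
N<-as.data.frame.matrix(table(CATS[,1],CATS[,2]))
ca2<-dudi.coa(N,scannf=FALSE,nf=2)
mca2<-dudi.acm(CATS[,1:2],scannf=FALSE,nf=2)
(2*mca2$eig[1]-1)^2   # expected to be close to the first CA eigenvalue
ca2$eig[1]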

9.3.4 CA diagnostics
In correspondence analysis, due to the weighting of rows and columns, the plots may sometimes be misleading. It is thus important to scrutinize this information carefully.

Exercise 16 Prove that $\mathrm{Inertia} = X^2 / n_{++}$.

The correspondence analysis is thus an eigendecomposition of the chi-squared criterion.
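A quick numeric verification of this equality on the housetasks data (it does not replace the proof asked for in exercise 16):

# Sketch check: total CA inertia equals X^2 / n++
sum(coa.housetasks$eig)
unname(chisq.test(housetasks)$statistic)/sum(housetasks)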

9.4 Non symmetric Correspondence Analysis


The Non Symmetric Correspondence Analysis (NSCA) is an eigendecomposition of the Goodman-Kruskal's criterion.

9.4.1 NSCA
The only difference with correspondence analysis is that the column scores are scaled to unity using the canonical metric: $a'a = 1$.

Theorem 7 The first solution of the NSCA, $a_1$, maximizes the variance of the corresponding row means. The second solution does the same thing under orthogonality constraints.

# Non symmetric correspondence analysis

nsc1 <- dudi.nsc(housetasks)

9.4.2 NSCA as a duality diagram


The corresponding triple is $(D_I^{-1} Y, I_J, D_I)$ where $Y = P - D_I 1_{I \times J} D_J$. The matrix $D_I^{-1} Y = D_I^{-1} P - 1_{I \times J} D_J$ is the (centered) matrix of row profiles.
Theorem 7 is a direct consequence of the general optimisation property 1 of a duality diagram.
The column scores are centered in the usual sense and the row scores of this analysis, $b_k$, are $D_I$-centered. The corresponding column coordinates, $(D_I^{-1} Y)' D_I b_k = Y' b_k$, are not the conditional means of $b_k$ using the column profiles as frequency distribution, but the conditional means multiplied by the weight of the corresponding column ($p_{+j}$). The sum of squares of these products is maximized by this analysis (see theorem 2). A column with little importance is thus discarded by the analysis, so the analysis should not be disturbed by a contingency table showing some low-frequency columns.
This is a very different approach from that of correspondence analysis. The analysis is not symmetric at all: the analysis of the transposed contingency table is usually quite different. It is therefore important to choose which variable plays the role of the rows and which plays the role of the columns.

9.4.3 NSCA plots


The canonical plot of the NSCA is the translation of theorem 7. One plots the first column scores $a_1$ (and $a_2$ for a second-order plot) to represent the columns, and the rows are then displayed as the conditional means $a_{1i}$ (and $a_{2i}$).
# plotting NSCA
weights<-t(sweep(housetasks,1,apply(housetasks,1,sum),"/"))
s.label(nsc1$c1)
s.label(t(as.matrix(weights))%*%as.matrix(nsc1$c1[,1:2]),
lab=dimnames(weights)[[2]],add.plot=T,clabel=0.75)

It must be noted (1) that the row coordinates are not centered because
their mean is a1 (and a2 ), so the scatterplot is decentred and (2) that the row
coordinates are not the row scores of the triple of the NSCA, which are a1i − a1
(and a2i − a2 ). These row scores are DI -centered. So be careful in reading the
classical plot 9.3 from the duality diagram because the row scores are not the
means of the column scores.

scatter(nsc1)

Figure 9.3: Two-dimensional plot from the non symmetric correspondence analysis of the housetasks data

9.5 Belson's Correspondence Analysis


The duality diagram of the triple $(Y, I_J, I_I)$ provides an eigendecomposition of the Belson's criterion, which happens to be exactly the total inertia. This is not a classical multivariate analysis of a contingency table, but it has the advantage of providing representations in the usual Euclidean spaces (the metrics are the identities of $R^I$ and $R^J$).
This possibility is not well known. A function dudi.belson is provided in the workspace SCvn2014.RData to compute it. It respects the framework of the statistical triple, so one can use the classical scatter function to plot the results.
# A specific CA
belson.housetasks<-dudi.belson(housetasks)
scatter(belson.housetasks)

