Multivariate Analysis (Using R)

Copyright Stephane Champely

October 16, 2014
Contents
6.2.3 Benefits of using the dudi
6.3 PCA on the covariance matrix
6.4 Diagnostics
6.4.1 A decomposition of the trace
6.4.2 Contribution to inertia
6.4.3 Squared cosines

7 Multiple Correspondence Analysis
7.1 Multivariate categorical data
7.2 Multiple Correspondence analysis
7.3 The MCA as a statistical triple
7.3.1 The disjunctive table
7.3.2 The MCA statistical triple
7.4 MCA diagnostics (See exercises)
7.4.1 Correlation ratios
7.5 Miscellaneous
7.5.1 Correspondence analysis
7.5.2 Fuzzy correspondence analysis
7.6 Exercise

8 Hill and Smith's analysis
8.1 Multivariate mixed data
8.2 Hill and Smith analysis
8.2.1 Property
8.3 H&S analysis as a duality diagram

9 Correspondence analysis
9.1 Contingency table
9.2 Analyses of a contingency table
9.3 Correspondence analysis
9.3.1 Correspondence analysis
9.3.2 Correspondence analysis as a duality diagram
9.3.3 Different approaches for CA
9.3.4 CA diagnostics
9.4 Non symmetric Correspondence Analysis
9.4.1 NSCA
9.4.2 NSCA as a duality diagram
9.4.3 NSCA plots
9.5 Belson's Correspondence Analysis
Chapter 1

Using the R software
1.1 What is R?
R is a system for statistical computation and graphics. It is a free, open
source project running on Windows, Unix or Mac. It consists of a language plus a run-
time environment with graphics, a debugger, access to certain system functions,
and the ability to run programs stored in script files.
The core of R is an interpreted computer language which allows branching
and looping as well as modular programming using functions. Most of the user-
visible functions in R are written in R. It is possible for the user to interface to
procedures written in the C, C++, or FORTRAN languages for efficiency.
The R distribution contains functionality for a large number of statistical
procedures. Among these are: linear and generalized linear models, nonlinear
regression models, time series analysis, classical parametric and nonparametric
tests, clustering and smoothing. There is also a large set of functions which
provide a flexible graphical environment for creating various kinds of data pre-
sentations. Additional modules (add-on packages) are available for a variety of
specific purposes.
1.2 Installing R
The bin/windows directory of a CRAN site contains binaries for a base distri-
bution and a large number of add-on packages from CRAN to run on Windows
2000, XP, Vista, Windows 7, Windows 8 on Intel and clones (but not on other
platforms). The current version is R 3.1.1. One can find the executable file of
the R software on the website: http://CRAN.R-project.org. Download the file
R-3.1.1-win.exe on your computer and execute it.
The Comprehensive R Archive Network (CRAN) is a collection of sites which
carry identical material, consisting of the R distribution(s), the contributed
extensions, documentation for R, and binaries.
1.3 Installing the ade4 package
In order to carry out some multivariate analyses, a specific package is needed.
The ade4 package is based on a powerful theoretical model that allows many
variations. A lot of graphics are available.
You can install and automatically update packages from within R if you have
access to repositories such as CRAN. Double click on the R icon to begin. Use
the "Package" menu and the option "Install R packages". Choose a suitable
mirror site (e.g. Vietnam) and select "ade4" from the list of available packages.
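Packages can also be installed from the R console; a minimal sketch using the standard install.packages function (an alternative to the menus, assuming an internet connection to a CRAN mirror):
### Installing and loading the ade4 package from the console
install.packages("ade4")
require(ade4)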
### R as a calculator
25+56*78+5^3+exp(25)-sin(180)+log(125)+sqrt(4)
1:10
seq(0,20,by=2)
seq(0,20,l=5)
### Models in R
lm(y~x)
lm1<-lm(y~x)
class(lm1)
attributes(lm1)
summary(lm1)
plot(lm1)
lm2<-aov(y~f)
class(lm2)
summary(lm2)
### Loading datasets from text files: txtfile.txt or csv files: csvfile.csv
data1<-read.table("txtfile.txt",header=TRUE)
data2<-read.csv("csvfile.csv",header=TRUE)
Chapter 2

Exploratory univariate
statistics
One does not begin by directly dealing with multivariate statistics. It is wiser to
start with a sequence of univariate analyses describing each variable included in
a multivariate data file. First, it gives a quick overview of the general features
of the datafile. Second, it may help to detect the presence of outliers, the need
for transformations or the need for serious data cleaning before carrying out
any multivariate examination. This chapter is a summary of the main ideas of
exploratory univariate statistics. The point of graphing data will in particular
be stressed. Modern statistics is heavily based on visualizing data.
The first important point is to distinguish categorical data from numeric
data. They indeed need different statistical approaches.
class(Sex)
levels(Sex)
table(Sex)
summary(Sex)
sum(table(Sex))
table(Sex)/sum(table(Sex))
pie(table(Sex))
table(is.na(Sex))
Another interesting example is the "Smoke" variable. One can note that
the order of the categories is not a very logical one here (due to the alphabetical
ordering). So it is interesting to change this before a more careful examination.
Actually, this kind of data is called an ordered categorical variable (or ordinal
data). Look at the differences between the results of the raw "Smoke" variable
and the ordered "Smoke2" variable.
### An ordinal variable
print(Smoke)
levels(Smoke)
Smoke2<-ordered(Smoke,levels=c("Never","Occas","Regul","Heavy"))
levels(Smoke2)
print(Smoke)
print(Smoke2)
class(Smoke)
class(Smoke2)
summary(Smoke)
summary(Smoke2)
pie(summary(Smoke))
pie(summary(Smoke2))
barplot(summary(Smoke))
barplot(summary(Smoke2))
dotchart(summary(Smoke))
dotchart(summary(Smoke2))
dotchart(summary(Manufacturer))
dotchart(sort(summary(Manufacturer)))
require(car)
Manufacturer2<-recode(Manufacturer, '"Ford"="Ford"; "Chevrolet"="Chevrolet"; "Dodge"="Dodge";
NA=NA; else="Others";', as.factor.result=TRUE)
pie(table(Manufacturer2)) # pie() needs counts, not a factor
The message conveyed by the plots in figure 2.1 is rather clear. Those data
are quite symmetric, there is no outlier and the distribution is not far from
being normal. In this case the best summaries are the mean and the standard
deviation. The definition of the mean is:

\[ \bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i \]
[Figure 2.1: stripchart, histogram (with density), boxplot and normal QQ-plot of the Wr.Hnd measurement]
mean(Wr.Hnd)
sd(Wr.Hnd)
Now consider the "Age" measurement. The plots are rather different (figure
2.2). The data are not at all symmetric and two clear outliers can be detected.
In this case one has to use robust statistical summaries to describe the data.
Quantiles, and especially the median and the quartiles, give some interesting
information. The median breaks the sample into two equal parts and the quar-
tiles into four equal parts. The minimum and the maximum of the data are
also important quantities to consider. One can also use trimmed or winsorized
statistics.
Another possibility is to transform the data in order to normalize them. For
the "Age" measurement the transformation $\frac{1}{\mathrm{Age}-15}$ gives good results but from
an interpretative viewpoint, we have to do with...
par(mfrow=c(2,2))
hist(Age,freq=FALSE,main="Histogram")
lines(density(Age), col ="red")
lines(density(Age,bw=2), col ="blue")
qqnorm(Age,main="QQplot")
qqline(Age,col="red")
box.cox.powers(Age-15)
hist(1/(Age-15),freq=FALSE,main="Transformed data")
lines(density(1/(Age-15)), col ="red")
qqnorm(1/(Age-15),main="Transformed data")
qqline(1/(Age-15),col="red")
summary(Age)
mean(Age,trim=0.1)
dev.off()
require(vioplot)
vioplot(1/(Age-15))
Figure 2.2: Statistical graphics for univariate numeric data and transformed data (histograms with density estimates and normal QQ-plots of Age and 1/(Age-15))
Chapter 3
Exploratory bivariate
statistics
where cov(x, y) is the covariance:

\[ \mathrm{cov}(x,y) = \frac{1}{n-1}\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y}), \]

var(x) is the variance, that is to say the square of the standard deviation:

\[ \mathrm{var}(x) = \frac{1}{n-1}\sum_{i=1}^{n}(x_i-\bar{x})^2 \]

and

\[ \hat{\beta} = \bar{y} - \hat{\alpha}\bar{x}. \]
[Figure: scatterplot of Wr.Hnd against NW.Hnd]
data(anscombe)
anscombe
attach(anscombe)
in order to deal with a special datafile created by the statistician
Frank Anscombe. Study the relationships: (1) x1 and y1, (2) x2 and y2, (3)
x3 and y3 and finally (4) x4 and y4.
The resulting plots (figure 3.2) show that the spans are higher for male
students and that there is also more variability in the male sample. One can
quantify these observations using the mean and the standard deviation in both
samples.
### summary statistics
R> tapply(Wr.Hnd,Sex,mean,na.rm=TRUE)
Female Male
17.59576 19.74188
R> tapply(Wr.Hnd,Sex,sd,na.rm=TRUE)
Female Male
1.314768 1.750775
R> tapply(Wr.Hnd,Sex,summary)
A modeling approach describes the relationship between span and sex. This
is called the analysis of variance technique. Here we only try to compute a quantity
to estimate the relationship. The classical statistic for doing this is the correla-
tion ratio, thereafter denoted CR (or sometimes $\eta^2$). It estimates the difference
between the means of the groups and compares it to the general variability.
Let's call y the numeric variable and n the size of the sample. One considers g
groups of data and $n_j$ data in the j-th group: $y_{ij}$ with $i = 1, \dots, n_j$; $j = 1, \dots, g$.
If the general mean is denoted $\bar{y}$ and the mean in each group $\bar{y}_j$, the correlation
ratio is:

\[ CR = \frac{\sum_{j=1}^{g} n_j (\bar{y}_j - \bar{y})^2}{\sum_{j=1}^{g}\sum_{i=1}^{n_j} (y_{ij} - \bar{y})^2}. \]
This quantity lies between 0 and 1. If CR = 0 then all means are equal
and there is no relationship between the two variables. On the contrary, if CR
is near one, the relationship is a very good one: the means are rather different
and the variability inside the groups is small. We obtain here CR = 33%. This
means that 33% of the variability in the span measurement can be explained by
Sex².

¹ But this is one of the most useful tools in statistics.

[Figure 3.2: parallel boxplots of the hand span (Wr.Hnd) by sex]
### Analysis of variance
aov1<-aov(Wr.Hnd~Sex)
summary(aov1)
summary.lm(aov1)
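As a check on the formula above, here is a minimal sketch computing the correlation ratio directly from its definition (it assumes the Wr.Hnd and Sex variables used earlier are available):
### Correlation ratio from its definition (sketch)
ok<-!is.na(Wr.Hnd)&!is.na(Sex)
y<-Wr.Hnd[ok]
g<-Sex[ok]
between<-sum(table(g)*(tapply(y,g,mean)-mean(y))^2)
total<-sum((y-mean(y))^2)
between/total # should be close to the 33% reported above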
Remark 1 You can note that the relationship is only approached in terms of
variation in the means. We do not examine the possibility of variation in terms
of dispersion. More than this, the analysis of variance modeling (and the corre-
lation ratio) is based on a restrictive hypothesis: the dispersion is more or less
similar from sample to sample. And if it is not the case? The usual solutions
are to transform the data (the easy way) or to use a generalized linear modeling
approach (the hard way).
Plotting the data is not very interesting in this case because we have few
categories. Anyway, figure 3.3 confirms that males exercise more frequently
than females.
### bivariate analysis: category*category
table(Sex,Exer)
Exer2<-ordered(Exer,levels=c("None","Some","Freq"))
table(Sex,Exer2)
cross<-table(Sex,Exer2)
sweep(cross,1,apply(cross,1,sum),"/")
round(100*sweep(cross,1,apply(cross,1,sum),"/"))
# plots
par(mfrow=c(1,2))
mosaicplot(cross,color=TRUE,main="")
barplot(t(cross),beside=TRUE,legend.text=TRUE)
assocplot(cross)
The usual statistic for estimating the quantity of relationship in this case is
Pearson's chi-squared statistic for independence. Let $n_{ij}$ correspond to row
i and column j in the crosstable. If $n_{i+}$ is the sum in the crosstable for row
i, $n_{+j}$ the sum for column j and $n_{++}$ the general sum, then the chi-squared
statistic is

\[ X^2 = \sum_{ij} \frac{(n_{ij} - \hat{n}_{ij})^2}{\hat{n}_{ij}} \]

where

\[ \hat{n}_{ij} = \frac{n_{i+} \times n_{+j}}{n_{++}}. \]
Figure 3.3: A mosaic plot to describe the relationship between sex and exercise
If $X^2$ is zero then the profiles (row relative frequencies) are equal, which means
that there is no relationship. If the statistic is important then the profiles are
different and we have a relationship to explain. What does important mean? Here,
we obtain $X^2 = 5.7184$. The problem is that the chi-squared statistic is not a
percentage of variability; it can be superior to one³. There is no clear agreement
in the statistical world on how to "standardize" this statistic (despite a lot of
proposals: Cohen's w, Cramér's V). In fact, one has to use a p-value approach.
In this case, the p-value is near 5%, so we face quite a significant relationship.
### Chi-squared test
R> chisq.test(cross)
Pearson's Chi-squared test
data: cross
X-squared = 5.7184, df = 2, p-value = 0.05731
# effect sizes
Xsquared<-chisq.test(cross)$statistic
N<-sum(cross)
wCohen<-sqrt(Xsquared/N)
wCohen
k<-min(dim(cross))
VCramer<-sqrt(Xsquared/(N*(k-1)))
VCramer
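The expected counts under independence can also be computed by hand from the margins; a small sketch reproducing the statistic:
### Expected counts and chi-squared statistic by hand (sketch)
expected<-outer(rowSums(cross),colSums(cross))/sum(cross)
sum((cross-expected)^2/expected) # reproduces X-squared = 5.7184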
Exercise 4 Cross the variables Sex and Smoking from the student dataset.

³ Even $w = \sqrt{\frac{X^2}{n}}$ can be > 1.
Chapter 4
An introduction to
exploratory multivariate
analysis
In order to obtain a single solution z, one will suppose that the principal
component is centered and standardised ($\bar{z} = 0$ and $\hat{\sigma}(z) = 1$; this is because any
linear transformation would do the same job, but note that $-z$ is also a solution).
Let's consider p original variables $x_j$. One wants to maximize

\[ q = \sum_{j=1}^{p} r^2(x_j, z). \]
diagonal, one obtains

\[ q = \sum_{j=1}^{p} (y_j' D z)^2. \]
Definition 2 If the principal component does not give enough information, one
can produce a second principal component, maximizing the same criterion under
a constraint of null correlation with the first principal component. This process
can be iterated.
¹ It can be easily proved by using a Lagrangian. Let's note that $\mathbf{1}_n$ is also an eigenvector
of the matrix WD (corresponding to the eigenvalue 0), so z is centered (try to prove this).
Because of the unity constraint, it is also standardized (try also to prove this).
² The standardised PCA is sometimes called PCA on the correlation matrix.

# correlation matrix
R<-cor(X)
R
# eigendecomposition
eig.R<-eigen(R)
lambd<-eig.R$values
lambd
v<-eig.R$vectors
v
# principal components
Y<-scalewt(X)
li<-Y%*%eig.R$vectors
li
Z<-sweep(li,2,sqrt(lambd),"/")
Z
Figure 4.1 gives a graphical illustration of the property of the first principal
component: it maximises the sum of the squares of the linear correlation coef-
ficients with the set of original variables. Figure 4.2 gives a plot of the linear
correlations with the first two principal components.
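This property can be checked numerically; a small sketch reusing the objects X, Z and lambd computed above (the sum of the squared correlations with the first principal component should equal the first eigenvalue):
### Checking the maximized criterion (sketch)
sum(cor(X,Z[,1])^2) # equals lambd[1]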
Figure 4.1: Scatterplots of the two original variables (X100 and long) with the first
principal component
Figure 4.2: Linear correlations of the two original variables with the first two
principal components
be considered as a vector in a two dimensional space. These 33 people give a
scatterplot (cf. figure 4.3).
To depict these data, one can fit a line to the scatterplot. But the two
variables are here considered in a symmetric way, so the idea is a different one
from the one used in the least-squares line approach. One has to look for a line
close to the data considering orthogonal projections. Using Pythagoras'
theorem, the problem may be thought of as maximizing the variability of the
orthogonal projections of the points onto the line.
Definition 3 The PCA problem may be viewed as finding a line in a p-dimensional
space close to the data. It is equivalent to say that one wants to maximize the
variance of the orthogonal projections of the points onto the line.
The solution of this old problem (Pearson 1901) is given by the preceding
eigendecomposition of the correlation matrix R. The eigenvector v gives the
direction of projection in the p-dimensional space. The n-vector $Yv = \sqrt{\lambda}\, z$
gives the coordinates of the projections and the eigenvalue $\lambda$ is the maximum
variability obtained from these projections.
Remark 3 One can also look for an orthogonal direction in the p-dimensional space
maximizing the same criterion. The solution is obtained by the sequence of the
correlation matrix eigendecomposition.
From a graphical viewpoint, the projections onto the first two directions
usually give a very good idea of the closeness between the n statistical units³.
Figure 4.3 gives the scatterplot of the two (standardized) original measurements
and the directions for projecting the points. Figure 4.4 presents the 33 projected
points. This is only a rotating view of the preceding plot.
require(MASS)
eqscplot(Y,xlab="x100 (standardized)",ylab="Long jump (standardized)",type="n")
text(Y[,1],Y[,2],rownames(Y))
arrows(0,0,acp1$c1[1,1],acp1$c1[2,1],col="red")
arrows(0,0,acp1$c1[1,2],acp1$c1[2,2],col="blue")
dev.new()
s.label(acp1$li)
Figure 4.3: Scatterplot of the two original variables (standardized) and the two
directions of projection
[Figure 4.4: projections of the 33 statistical units onto the first principal direction]
Chapter 5

Principal Components
Analysis
print(WORLD1984)
plot(WORLD1984)
pca.WORLD<-dudi.pca(WORLD1984)
Figure 5.1: Plot of the eigenvalues from the PCA of the WORLD1984 datafile
Figure 5.2: Scatterplots of the five original variables (logGNP, PopGrowth, logDeath,
logIlliteracy, School) from the WORLD1984 datafile with the first principal component
Figure 5.3: Dotplot for the projections of the countries; only 15 of them
are given (the 5 lowest scores: Niger, Mozambique, Ethiopie, Haute.Volta, Sénégal;
5 medium scores: Mexique, Chili, Venezuela, Corée.Sud, Argentine; the 5 highest
scores: France, RFA, Suisse, Finlande, Suède; and of course... Vietnam!)
¹ Sometimes a procedure called rotation, such as varimax, is used in order to identify these
two interesting directions with real components, especially in psychology.
Figure 5.4: Correlation circle of the 10 events with the first two principal com-
ponents from the PCA of the DECATHLON dataset
Figure 5.5: Projections of the 33 athletes onto the first two principal directions
[Figure: simultaneous plot of the 33 athletes and the 10 events on the first two principal axes]
inertia.dudi(pca.decathlon)
\[ \text{Column } j \text{ contribution to inertia of axis } k = \frac{r^2(x_j, z_k)}{\lambda_k}. \]

The same inertia.dudi function using argument col=TRUE gives the results in
the $col.abs component. The most important contributions to axis 2 for instance
are PuttingShot (2338), Discus (2533), Javelin (1384) and Run1500 (1772).
inertia.dudi(pca.decathlon,col=TRUE)
\[ \text{Column } j \text{ contribution to inertia of axis } k = v_{j(k)}^2, \]

where $v_{j(k)}$ is the j-th coordinate (corresponding to the variable $x_j$) of the k-th
eigenvector $v_k$ of the correlation matrix.

Remark 4 We can also compute contributions to inertia for the statistical units
(using argument row=TRUE), but it is usually not very informative.
5.3.3 Squared cosines
How well can a variable lying in an n-dimensional space be represented in a
two-dimensional plot? Beyond the general information given by the eigenvalue,
one has to estimate the quality of the representation for any variable. This is
done by computing squared cosines².
In the standardised PCA setting, the squared cosine of variable $x_j$ with
axis $z_k$ may be computed as $r^2(x_j, z_k)$.
The function inertia.dudi gives these squared cosines in the $col.cum compo-
nent. For instance "PuttingShot" is a well represented variable: 25% on axis 1
and 86% cumulating with axis 2. Only 14% remains to be explained (outside the
plot). On the contrary, the quality of representation of "HighJump" is very bad
(16%) onto the two axes! A third axis is vital to interpret this event.

Exercise 7 Look at the squared cosines for the rows to see which very important
athlete is nevertheless badly represented. Why?
Chapter 6

Beyond the basics: the
statistical triple
From the statistical triple, two matrices can be computed:

\[ WD = YQY'D \]

and

\[ VQ = Y'DYQ. \]

Remark 5 The two matrices share the same eigenvalues. The eigenvectors for
WD may be obtained from the eigenvectors of VQ.
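A small numerical illustration of this remark; a sketch on simulated data, with uniform row weights and the identity column metric as illustrative assumptions:
### WD and VQ share their nonzero eigenvalues (sketch)
set.seed(1)
n<-10
p<-3
Y<-scale(matrix(rnorm(n*p),n,p),scale=FALSE) # a small centered table
D<-diag(rep(1/n,n)) # uniform row weights
Q<-diag(p) # identity inner product
round(sort(Re(eigen(Y%*%Q%*%t(Y)%*%D)$values),decreasing=TRUE)[1:p],6)
round(sort(Re(eigen(t(Y)%*%D%*%Y%*%Q)$values),decreasing=TRUE),6)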
symmetric matrix $YQY'D$ (whose eigenvectors are in $B_f$) which shares the
same set² of eigenvalues ($\Lambda_f$).
In the same way, onto any n-vector b, D-scaled to unity, one can project
the p points corresponding to the columns of the transposed matrix $Y'$ (or
compute a kind of non centered covariance). One obtains a p-vector $Y'Db$
called the column projections vector. The squared norm of these projections in
the Euclidean space $\mathbb{R}^p$ with inner product Q is $\left\| Y'Db \right\|_Q^2$.
Theorem 2 The first row eigenvector $b_1$ maximizes the quantity $\left\| Y'Db \right\|_Q^2$
under the constraint $\|b_1\|_D^2 = 1$. This maximum is equal to $\lambda_1$. The second row
eigenvector $b_2$ maximizes the same criterion under the constraint of orthonor-
mality ($(b_1 \mid b_2)_D = 0$ and $\|b_2\|_D^2 = 1$) with a maximum in $\lambda_2$, etc.
It is worth noticing the following relationships between the eigenvectors and
the projections vectors.

Theorem 3 The row eigenvectors are proportional to the corresponding row
projections vectors: $b_k = \frac{1}{\sqrt{\lambda_k}} YQa_k$. This is also true for the columns:
$a_k = \frac{1}{\sqrt{\lambda_k}} Y'Db_k$. These properties are called the transition relationships.

Eckart-Young decomposition
The print.dudi() function gives a quick summary of a dudi object. The score()
and scatter() functions are generic functions for plotting dudi objects.
Computer operations
The last but not least advantage of this general procedure is to allow the
use of a general algorithm to carry out different analyses, even unexpected
ones. This is similar to the case of the general algorithm used for estimating
generalized linear models: the choices are made explicit, we can discuss these
choices, but we have no surprise with the computer operations (except in very
special conditions).
One does not plot here the 615 projections of the statistical units because
in a survey sampling the individual opinions are of no interest. Figure 6.1
gives a terse view of the wishes: a first set of transformations related to water
equipment (and body care), another unrelated set of wishes directed toward
sociability links and a medium group oriented toward gym activities (which are
physical and sociable activities at the same time).
6.4 Diagnostics
6.4.1 A decomposition of the trace
The total inertia of a triple (Y, Q, D) is

\[ I_T = \mathrm{trace}(YQY'D) = \mathrm{trace}(Y'DYQ) = \sum_k \lambda_k. \]

Note that if the inner products are diagonal ones, the inertia can be written

\[ I_T = \sum_{i=1}^{n}\sum_{j=1}^{p} d_i q_j y_{ij}^2, \]

so the quantity $\frac{\lambda_k}{I_T}$ gives the importance of axis k.
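In practice these quantities can be read directly off the eigenvalues stored in any dudi object; a sketch using for instance the pca.WORLD object created earlier:
### Importance of each axis (sketch)
pca.WORLD$eig/sum(pca.WORLD$eig) # share of the total inertia
cumsum(pca.WORLD$eig)/sum(pca.WORLD$eig) # cumulated shares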
Figure 6.1: Plot of the covariances of the original variables (Jacuzzi, Sauna,
Hydrotherapy, Gymnastic, Aquagym, Ultraviolet, Events, RelaxingPlace, Snack,
Bar, DayNursery, Garden, DiscussionSpace) with the first two principal components
from the centered PCA of the SWIMMINGPOOL dataset
6.4.2 Contribution to inertia
Let's note $l_k$ (resp. $c_k$) the row (resp. column) projection vector onto axis k and
$b_k$ (resp. $a_k$) the corresponding row (resp. column) eigenvector. Considering
axis k, one can write $\lambda_k = \|l_k\|_D^2 = \sum_i d_i l_k^2(i)$.

The contribution to inertia of row i to axis k is $\frac{d_i l_k^2(i)}{\lambda_k}$. In the same way, the
contribution of column j to axis k is $\frac{q_j c_k^2(j)}{\lambda_k}$.

Remark 6 One is usually more interested in the squared cosine of column
(variable) j with axis k:

\[ \frac{c_k^2(j)}{\|y_j\|_D^2}. \]
Exercise 9 Generate and read the diagnostics (column) for the centered PCA
of the TOOTHPASTE dataset.
Chapter 7
Multiple Correspondence
Analysis
head(CATS)
sapply(CATS,class)
summary(CATS)
The goals are similar to those of a PCA: determining if any variables are re-
lated, how they are related and discovering the closeness between the statistical
units.
Definition 4 The first order solution of the multiple correspondence analysis
of the p categorical variables $X_j$ provides a vector of row scores which maximises
the following criterion:

\[ \frac{1}{p}\sum_{j=1}^{p} CR(b_1, X_j). \]
Using this solution one can separately plot for each variable $X_j$ the means
of the corresponding categories and interpret the link between the variables
using the locations of the categories (see figure 7.1). On this plot, the youngest
cats (category 1) share the same location as the low fecundity ones and the
one-litter ones.
The function dudi.acm carries out the necessary computations. One-dimensional
plots are obtained by using the score function and two-dimensional ones by using
the scatter function. For the CATS dataset, only one axis is important.
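A minimal sketch of these calls (scannf=FALSE skips the interactive choice of the number of axes, which nf then fixes):
### MCA of the CATS data (sketch)
mca.cats<-dudi.acm(CATS,scannf=FALSE,nf=1)
score(mca.cats) # one panel per variable, categories plotted as means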
Figure 7.1: Results of the MCA of the CATS data. The first row eigenvector is
represented on the X-axis. Each plot corresponds to a separate variable (Age,
Fecundity, Litter). The categories are plotted as means.
Table 7.1: An excerpt of the disjunctive table for the CATS data
Age.1 Age.2-3 Age.4-5 Age.6-7 Age.8 Fec.1-2 Fec.13-14
1 1 0 0 0 0 1 0
2 1 0 0 0 0 1 0
3 1 0 0 0 0 1 0
4 1 0 0 0 0 1 0
5 1 0 0 0 0 1 0
6 1 0 0 0 0 1 0
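The disjunctive table itself can be produced with the ade4 helper acm.disjonctif; a short sketch:
### Building the disjunctive table (sketch)
Z<-acm.disjonctif(CATS)
head(Z)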
Proof 1 First, let's show that the row eigenvectors are $D_n$-centered. Because
$\mathbf{1}_n$ is also an eigenvector (corresponding to a zero eigenvalue), and because the
eigenvectors are $D_n$-orthogonal (the matrix to diagonalize is symmetric), the row
eigenvectors are centered.
Then, as stated in theorem 2, the first row eigenvector $b_1$ maximizes the
quadratic form $\left\| Y'D_n b_1 \right\|_{\frac{1}{p}D_m}^2$. The coordinates of the vector
$Y'D_n b = (XD_m^{-1} - \mathbf{1}_{nm})'D_n b$
are exactly the mean values of the scores b corresponding to each category minus
the general mean. When we compute the squared norm $\|\cdot\|_{\frac{1}{p}D_m}^2$ of this vector
of means, one obtains exactly the mean of the between variances¹ of score b
with respect to each variable. Because this score is scaled to unity and because
we have shown that it is centered, the total variance of the score is one, so
one maximises indeed the mean of the correlation ratios with the first vector of
scores!
7.5 Miscellaneous
7.5.1 Correspondence analysis
The application of multiple correspondence analysis to two categorical variables
is exactly equivalent to a very popular multivariate technique: correspondence
analysis of the corresponding contingency table.
7.6 Exercise
The datafile ROXY in the workspace SCvn2014.RData corresponds to 308 females
practicing ski or snowboard. They were asked about their goals when doing the
activity: to take risks, to enjoy the landscape, to be in great form? The answers
can only be yes or no, leading to categorical (binary) variables. Use a MCA in
order to study the ROXY data.
# the workspace SCvn2014.RData must be loaded
# Converting numeric data into factors
sapply(ROXY,class)
for(i in 1:10){
ROXY[,i]<-factor(ROXY[,i],labels=c("NO","YES"))
}
sapply(ROXY,class)
sapply(ROXY,summary)
Chapter 8

Hill and Smith's analysis
head(LIFE)
sapply(LIFE,class)
summary(LIFE)
The goal remains the same: trying to understand the relationships be-
tween the measurements.
score, thus also the sum. Because squared linear correlation coefficients and
correlation ratios are of the same mathematical type (squared cosines), we can
mix these two analyses if they have the same row weights.
Definition 5 The Hill and Smith's analysis (H&S analysis) maximizes the sum
of the squared cosines of the (categorical or continuous) variables with a row
score.
The plot of one score is obtained using the usual function score. But for two
dimensions, the function scatter is not efficient. Two functions are proposed in
the workspace SCvn2014.RData in order to obtain a correlation circle for the
numeric variables and the classical MCA separate plots for categorical variables.
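The mix.life object used below can be obtained with the ade4 function dudi.mix; a sketch (dudi.hillsmith is a close variant):
### Hill and Smith's analysis of the LIFE data (sketch)
mix.life<-dudi.mix(LIFE,scannf=FALSE,nf=2)
score(mix.life)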
# HSA: second order solution
scatter(mix.life)
scatter.mix.numeric(mix.life)
scatter.mix.categorical(mix.life)
The Hill and Smith's analysis is the analysis of the triple $(Y, Q, D_n)$ where
Y is the $n \times (p+m)$ matrix built by binding the two datasets from the triples:
$Y = \left[ X_s \mid ZD_m^{-1} - \mathbf{1}_{nm} \right]$, and the weighting system for the columns
is the $(p+m) \times (p+m)$ diagonal matrix

\[ \begin{pmatrix} I_p & 0 \\ 0 & D_m \end{pmatrix}. \]

From the general property of a duality diagram applied to this triple, one
can see that a PCA and a MCA are simultaneously performed, so one maximizes
• the sum of the squared linear correlation coefficients for the continuous
variables,
• and the sum of the correlation ratios for the categorical variables.

¹ The constant 1/q is indeed omitted.
Figure 8.1: Results of the H&S analysis of the LIFE data. The numeric variables
(Housing, TV, Age) are represented by their correlations with the first two solutions
Figure 8.2: Results of the H&S analysis of the LIFE data. The categories of the
variables (Sex, Family, DVD, Headache, Backache, FoodLimitations) are located
by the means of the first two solutions
Exercise 13 Test yourself with the 2010-2011 examination (on the website).
Chapter 9
Correspondence analysis
require(ade4)
data(housetasks)
chisq.test(housetasks)
mosaicplot(housetasks,las=2,main="Housetasks",color=TRUE)
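The correspondence analysis itself can then be carried out with the ade4 function dudi.coa; a sketch:
### CA of the housetasks data (sketch)
ca.housetasks<-dudi.coa(housetasks,scannf=FALSE,nf=2)
scatter(ca.housetasks)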
[Figure: mosaic plot of the housetasks data; 13 tasks (Breakfeast, Holidays, Insurance,
Main_meal, Dinner, Tidying, Dishes, Shopping, Official, Finances, Repairs, Driving,
Laundry) crossed with Wife, Alternating, Husband, Jointly]
For studying independence, two other criteria exist which compete with the chi-
squared statistic:

• Belson's criterion:

\[ B^2 = \sum_{i,j} (p_{ij} - p_{i+} \times p_{+j})^2 \]

• and the Goodman-Kruskal's criterion:

\[ K^2 = \sum_{i,j} \frac{(p_{ij} - p_{i+} \times p_{+j})^2}{p_{i+}}. \]

the variance $\sum_i p_{i+}(a_i - \bar{a})^2$, which quantifies the discrimination of the rows.
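Both criteria are easy to compute from the table of relative frequencies; a minimal sketch for the housetasks data:
### Belson's and Goodman-Kruskal's criteria (sketch)
P<-as.matrix(housetasks)/sum(housetasks)
E<-outer(rowSums(P),colSums(P)) # products of the margins
sum((P-E)^2) # Belson's B2
sum((P-E)^2/rowSums(P)) # Goodman-Kruskal's K2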
Figure 9.2: Two-dimensional plot from the correspondence analysis of the house-
tasks data
Exercise 14 The dataset BIRDS crosses time and space information about birds
(winter teals) for which the rings were retrieved in area i during month j.
Perform a correspondence analysis of this datafile. Interpret the result. Why
would it be interesting to have a map of the locations?
9.3.4 CA diagnostics
In correspondence analysis, due to the weighting of rows and columns, the plots
may sometimes be misleading. It is thus important to scrutinize this
information carefully.
Exercise 16 Prove that $\mathrm{Inertia} = \frac{X^2}{n_{++}}$.
9.4.1 NSCA
The only difference with Correspondence analysis is that the column scores are
scaled to unity using the canonical metric: $a'a = 1$.
Theorem 7 The first solution of the NSCA $a_1$ maximizes the variance of the
corresponding row means. The second solution does the same thing under or-
thogonality constraints.
distribution, but the conditional mean multiplied by the weight of the corre-
sponding column ($p_{+j}$). The sum of squares of these products is maximized by
this analysis (see theorem 2). A column with little importance is thus discarded
by the analysis. So this analysis may not be biased by a contingency table
showing some low-frequency columns.
This is a very different approach from that of correspondence analysis. This
analysis is not symmetric at all: the analysis of the transposed contingency table
is usually quite different. It is important to choose the rows and the columns.
It must be noted (1) that the row coordinates are not centered because
their mean is $\bar{a}_1$ (and $\bar{a}_2$), so the scatterplot is decentered, and (2) that the row
coordinates are not the row scores of the triple of the NSCA, which are $a_{1i} - \bar{a}_1$
(and $a_{2i} - \bar{a}_2$). These row scores are $D_I$-centered. So be careful in reading the
classical plot 9.3 from the duality diagram because the row scores are not the
means of the column scores.
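The nsc1 object plotted below comes from the ade4 function dudi.nsc; a sketch:
### NSCA of the housetasks data (sketch)
nsc1<-dudi.nsc(housetasks,scannf=FALSE,nf=2)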
scatter(nsc1)
72 CHAPTER 9. CORRESPONDENCE ANALYSIS
Figure 9.3: Two-dimensional plot from the non symmetric correspondence anal-
ysis of the housetasks data