Principal Components Analysis (Part II) : Predictive Modeling & Statistical Learning
CC BY-SA 4.0
Applying PCA
Painstaking PCA
You should know that there are more things to look at in a PCA.
Exhaustive PCA
Roadmap
• Determine the active individuals and variables.
• Determine the supplementary individuals and variables.
• Transform the data (usually by standardizing it).
• Compute the EVD or SVD of a data-based matrix.
• Compute the eigenvalues, loadings, and PCs.
• Decide how many dimensions (i.e. PCs) to retain.
• Use interpretation tools: plots, quality measures, contributions, stability, etc.
PCA of NBA Team Stats
NBA Team Stats
# variables
dat <- read.csv('data/nba-teams-2017.csv')
names(dat)
Active and Supplementary Elements
Active vs Supplementary
Active
Selected individuals (rows) and variables (columns) that are
used to compute eigenvalues, loadings and PCs.
Supplementary
Additional individuals and variables not used in the
computation steps, but taken into account for interpretation
purposes, and/or further data exploration.
Which Variables?
• wins
• losses
• points
• field goals
• assists
• turnovers
• steals
• blocks
“Active” means these are the variables used to compute PCs.
Which Variables?
Supplementary Variables
Among the rest of the variables, we are going to consider three
supplementary variables:
• points3
• rebounds
• personal fouls
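One way to encode this choice in R is to keep the active and supplementary column names in two vectors. This is only a sketch: the column names are taken from the plot labels in these slides and are assumed to match the columns of `data/nba-teams-2017.csv`, and the file is read only if it is present.

```r
# Active variables: used to compute eigenvalues, loadings and PCs
active_vars <- c("wins", "losses", "points", "field_goals",
                 "assists", "turnovers", "steals", "blocks")

# Supplementary variables: not used in the computations,
# only projected afterwards for interpretation
supp_vars <- c("points3", "rebounds", "personal_fouls")

# Assumed path (same as in the slides); skipped when the file is missing
path <- "data/nba-teams-2017.csv"
if (file.exists(path)) {
  dat <- read.csv(path)
  X_active <- dat[, active_vars]  # assumes these column names exist
  X_supp <- dat[, supp_vars]
}
```

Keeping the two sets as named vectors makes it easy to change which variables are active without touching the rest of the analysis.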
Which Individuals?
Active Individuals
All teams in the 2016-2017 season
Supplementary Individuals
Warriors and Cavaliers from the 2015-2016 season
Transformation of Variables
Importance of Variables
To standardize or not?
• A key issue is the scale of the variables.
• If the variables have different units of measurement, we should standardize them so that variables with larger scales do not dominate the analysis.
• If the variables have the same units:
  – you could leave them unstandardized
  – or you could standardize them (strongly suggested)
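The effect of scale can be checked numerically. The sketch below uses six columns of R's built-in `mtcars` data as a stand-in (an assumption made so the example runs without the NBA file) and compares each variable's share of the total variance before and after standardizing with `scale()`.

```r
# Stand-in data: six numeric columns from the built-in mtcars data set
X_raw <- as.matrix(mtcars[, c("mpg", "disp", "hp", "drat", "wt", "qsec")])

# Standardize: subtract column means, divide by column standard deviations
X <- scale(X_raw, center = TRUE, scale = TRUE)

# Share of the total variance carried by each variable
raw_var <- apply(X_raw, 2, var)
std_var <- apply(X, 2, var)
share_raw <- raw_var / sum(raw_var)
share_std <- std_var / sum(std_var)

# On the raw scale a few variables carry almost all the variance;
# after standardization every variable has the same share
round(rbind(raw = share_raw, standardized = share_std), 3)
```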
Raw values
[Figure: raw values of wins, losses, points, field_goals, assists, turnovers, steals, and blocks; vertical axis from 0 to 120]
If you use the raw scales, wins and losses will dominate the analysis due to their larger scales.
Standardized values
[Figure: values of the variables after standardization]
PCA Basic Results
(via EVD or SVD)
PCA via EVD
$R = V \Lambda V^\mathsf{T}$
where:
• $V$ is the matrix of eigenvectors
• $\Lambda$ is the diagonal matrix of eigenvalues
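A minimal sketch of the decomposition in base R, assuming that $R$ denotes the correlation matrix of the standardized active variables. The built-in `mtcars` data is used as a stand-in for the NBA file. The last lines show that the SVD of the standardized data gives the same eigenvalues.

```r
# Stand-in data, standardized
X <- scale(as.matrix(mtcars[, c("mpg", "disp", "hp", "drat", "wt", "qsec")]))

# Correlation matrix and its eigenvalue decomposition
R <- cor(X)
evd <- eigen(R, symmetric = TRUE)
V <- evd$vectors        # eigenvectors (loadings), one per column
lambda <- evd$values    # eigenvalues, in decreasing order

# Rebuild R from its decomposition: R = V Lambda V^T
R_rebuilt <- V %*% diag(lambda) %*% t(V)
max(abs(R - R_rebuilt))  # numerically zero

# Same eigenvalues via the SVD of the standardized data
sv <- svd(X)
lambda_svd <- sv$d^2 / (nrow(X) - 1)
```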
PCA Essential Results
Essential Results: Eigenvalues
Essential Results: PCs or Scores
Z = XV
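The scores can be computed directly from this formula. The sketch below assumes $X$ is the standardized data and $V$ holds the eigenvectors of its correlation matrix, again with `mtcars` as stand-in data. It then checks two properties: each PC has variance equal to its eigenvalue, and the result agrees with `prcomp()` up to the sign of each column.

```r
# Stand-in data, standardized
X <- scale(as.matrix(mtcars[, c("mpg", "disp", "hp", "drat", "wt", "qsec")]))

evd <- eigen(cor(X), symmetric = TRUE)
V <- evd$vectors

# Principal components (scores): Z = X V
Z <- X %*% V
colnames(Z) <- paste0("PC", seq_len(ncol(Z)))

# The variance of each PC equals the corresponding eigenvalue
pc_var <- apply(Z, 2, var)

# PCs are mutually uncorrelated
max_offdiag <- max(abs(cor(Z) - diag(ncol(Z))))

# Same scores as prcomp(), up to the sign of each column
pca <- prcomp(X)
max_diff <- max(abs(abs(Z) - abs(pca$x)))
```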
How many PCs to retain?
Various criteria
Table of Eigenvalues
Screeplot of eigenvalues
[Figure: screeplot; eigenvalues (vertical axis) against the number of components, 1 to 8 (horizontal axis)]
Kaiser’s Rule
Jolliffe’s Rule
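The criteria above can be sketched in a few lines of R. The content of these slides did not survive extraction, so the thresholds here are assumptions based on how the rules are commonly stated for standardized data: Kaiser's rule keeps eigenvalues greater than 1, and Jolliffe's rule keeps eigenvalues greater than 0.7. The data is the built-in `mtcars` set, used as a stand-in.

```r
# Stand-in data, standardized
X <- scale(as.matrix(mtcars[, c("mpg", "disp", "hp", "drat", "wt", "qsec")]))
lambda <- eigen(cor(X), symmetric = TRUE)$values

# Table of eigenvalues with percentage and cumulative percentage of variance
eig_table <- data.frame(
  eigenvalue = lambda,
  percentage = 100 * lambda / sum(lambda),
  cumulative = 100 * cumsum(lambda) / sum(lambda)
)
round(eig_table, 3)

# Number of PCs retained under each rule (assumed thresholds)
keep_kaiser <- sum(lambda > 1)      # Kaiser: eigenvalues above 1
keep_jolliffe <- sum(lambda > 0.7)  # Jolliffe: eigenvalues above 0.7
```

A screeplot of the same eigenvalues can be drawn with `plot(lambda, type = "b")`.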
Studying the Individuals
Principal Components
PCs are typically given by Z = XV, although it is possible to rescale them (e.g. to unit variance or to unit norm).
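Both rescalings amount to dividing each column of Z by a constant. A short sketch, with `mtcars` as stand-in data and the names `Z_var1` and `Z_unit` chosen here for illustration:

```r
# Stand-in data, standardized
X <- scale(as.matrix(mtcars[, c("mpg", "disp", "hp", "drat", "wt", "qsec")]))
evd <- eigen(cor(X), symmetric = TRUE)
Z <- X %*% evd$vectors

# Rescale each PC to variance 1: divide by the square root of its eigenvalue
Z_var1 <- sweep(Z, 2, sqrt(evd$values), "/")

# Rescale each PC to unit norm: divide by its Euclidean length
Z_unit <- sweep(Z, 2, sqrt(colSums(Z^2)), "/")
```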
Scatterplot of individuals on PC1 and PC2
[Figure: teams plotted on PC1 (horizontal axis) and PC2 (vertical axis)]
Scatterplot of individuals on PC1 and PC3
[Figure: teams plotted on PC1 (horizontal axis) and PC3 (vertical axis)]
Scatterplot of individuals on PC2 and PC3
[Figure: teams plotted on PC2 (horizontal axis) and PC3 (vertical axis)]
Quality of Representation
[Figure: individual $x_i$ projected onto axis $k$ with coordinate $z_{ik}$; $d(x_i, g)$ is the distance from $x_i$ to the centroid $g$]
$\cos^2(i, k) = \dfrac{z_{ik}^2}{d^2(x_i, g)}$
where:
• $z_{ik}$ is the value of the k-th PC for individual $i$
Adding the squared cosines over all principal axes for a given individual, we get:
$\sum_{k=1}^{r} \cos^2(i, k) = 1$
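The squared cosines can be computed from the scores and the squared distances to the centroid. In this sketch the data is centered, so the centroid $g$ is the origin and $d^2(x_i, g)$ is the sum of squares of row $i$. The built-in `mtcars` data is a stand-in, and all $r$ axes are kept so that the row sums can be checked.

```r
# Stand-in data, standardized (the centroid g is then the origin)
X <- scale(as.matrix(mtcars[, c("mpg", "disp", "hp", "drat", "wt", "qsec")]))
Z <- X %*% eigen(cor(X), symmetric = TRUE)$vectors
colnames(Z) <- paste0("PC", seq_len(ncol(Z)))

# Squared distance of each individual to the centroid
d2 <- rowSums(X^2)

# cos2(i, k) = z_ik^2 / d^2(x_i, g); row i of Z^2 is divided by d2[i]
cos2 <- Z^2 / d2

# Over all axes, the squared cosines of each individual add up to 1
total <- rowSums(cos2)

# Quality of representation on the first factorial plane (PC1, PC2)
quality_plane <- rowSums(cos2[, 1:2])
```

Individuals with a low `quality_plane` value are poorly represented on the PC1-PC2 scatterplot, so their position there should be read with caution.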
Individual’s Contributions
Contributions
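The formula on these slides did not survive extraction. The sketch below assumes a common definition with equal weights for all individuals: the contribution of individual $i$ to PC $k$ is $z_{ik}^2$ divided by the sum of squares of that PC, expressed as a percentage. The built-in `mtcars` data is a stand-in.

```r
# Stand-in data, standardized
X <- scale(as.matrix(mtcars[, c("mpg", "disp", "hp", "drat", "wt", "qsec")]))
evd <- eigen(cor(X), symmetric = TRUE)
Z <- X %*% evd$vectors
colnames(Z) <- paste0("PC", seq_len(ncol(Z)))
n <- nrow(X)

# Contribution (in percent) of individual i to PC k: z_ik^2 / sum_i z_ik^2
ctr <- 100 * sweep(Z^2, 2, colSums(Z^2), "/")

# Equivalent form: the sum of squares of PC k equals (n - 1) * lambda_k
ctr_alt <- 100 * sweep(Z^2, 2, (n - 1) * evd$values, "/")

# Individuals contributing the most to PC1
head(sort(ctr[, "PC1"], decreasing = TRUE), 3)
```

By construction the contributions to a given PC add up to 100 across individuals.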
Contributions of individuals to PC1
[Figure: contribution of each team to PC1, one point per team; contribution axis from 0 to 50]
Contributions of individuals to PC2
[Figure: contribution of each team to PC2, one point per team; contribution axis from 0 to 15]
More about Contributions
Representing Supplementary Individuals
Supplementary Individuals?
# supplementary individuals
datsup <- read.csv('data/nba-teams-2016.csv')
Representing Supplementary Individuals
$\hat{x}_{i^*} = x_{i^*}^\mathsf{T} v_k$
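A sketch of this projection in R. It assumes the supplementary rows must first be standardized with the means and standard deviations of the active individuals, so that they live on the same scale as the active data. Since the 2016 file is not available here, the last two rows of the built-in `mtcars` data play the role of supplementary individuals.

```r
vars <- c("mpg", "disp", "hp", "drat", "wt", "qsec")
active <- as.matrix(mtcars[1:30, vars])   # stand-in active individuals
supp <- as.matrix(mtcars[31:32, vars])    # stand-in supplementary individuals

# PCA on the active individuals only
X <- scale(active)
ctr <- attr(X, "scaled:center")  # means of the active data
sds <- attr(X, "scaled:scale")   # standard deviations of the active data
V <- eigen(cor(X), symmetric = TRUE)$vectors
Z <- X %*% V

# Standardize the supplementary rows with the ACTIVE means and sds,
# then project them onto the eigenvectors
X_sup <- scale(supp, center = ctr, scale = sds)
Z_sup <- X_sup %*% V

# Sanity check: projecting an active row this way reproduces its score
check <- scale(active[1, , drop = FALSE], center = ctr, scale = sds) %*% V
```

The supplementary scores in `Z_sup` can then be added to the scatterplot of the active individuals without affecting the axes.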
PCA plot with supplementary teams
[Figure: teams plotted on PC1 (horizontal axis) and PC2 (vertical axis), with the supplementary teams Warriors16 and Cavaliers16 added]
Study of the Cloud of Variables
Studying the Variables
Loadings (or eigenvectors)
v1 v2 v3 v4 v5 v6 v7 v8
wins -0.412 -0.4375 0.0541 -0.187 0.13813 0.2555 0.1291 7.07e-01
losses 0.412 0.4375 -0.0541 0.187 -0.13813 -0.2555 -0.1291 7.07e-01
points -0.425 0.1378 -0.4490 0.160 0.16307 0.0476 -0.7379 0.00e+00
field_goals -0.405 0.1642 -0.3302 0.412 0.20321 -0.3998 0.5734 1.67e-16
assists -0.398 0.1270 -0.0301 -0.127 -0.89704 -0.0468 0.0422 -2.78e-17
turnovers -0.102 0.6692 -0.0487 -0.191 0.14633 0.6489 0.2461 6.94e-17
steals -0.297 0.3128 0.4178 -0.544 0.26045 -0.5119 -0.1176 0.00e+00
blocks -0.243 0.0972 0.7111 0.622 -0.00466 0.1487 -0.1315 0.00e+00
Interpreting PCs
Scatterplot of loadings
[Figure: loadings of the eight active variables, with v2 on the vertical axis]
Correlations between Variables and PCs
Circle of correlations
[Figure: circle of correlations of the eight active variables; vertical axis PC2, from −1.0 to 1.0]
Squared Correlations
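The correlations between variables and PCs, and their squares, can be sketched as follows. For standardized data the correlation between variable $j$ and PC $k$ equals the loading $v_{jk}$ times $\sqrt{\lambda_k}$, which the code checks against `cor()`. The built-in `mtcars` data is a stand-in.

```r
# Stand-in data, standardized
X <- scale(as.matrix(mtcars[, c("mpg", "disp", "hp", "drat", "wt", "qsec")]))
evd <- eigen(cor(X), symmetric = TRUE)
Z <- X %*% evd$vectors
colnames(Z) <- paste0("PC", seq_len(ncol(Z)))

# Correlations between variables (rows) and PCs (columns)
cors <- cor(X, Z)

# Same values from the loadings: v_jk * sqrt(lambda_k)
cors_alt <- sweep(evd$vectors, 2, sqrt(evd$values), "*")

# Squared correlations
sq <- cors^2
rowSums(sq)  # each variable: adds up to 1 over all PCs
colSums(sq)  # each PC: adds up to its eigenvalue
```

The first two columns of `cors` are the coordinates used to draw the circle of correlations.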
Representing Supplementary Variables
Correlations between all Variables and PCs
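Supplementary variables can be placed on the circle by correlating them with the PCs of the active variables. This is a sketch under assumptions: `mtcars` stands in for the NBA data, and the split into `active_vars` and `supp_vars` is chosen here only for illustration.

```r
active_vars <- c("mpg", "disp", "hp", "drat", "wt", "qsec")
supp_vars <- c("cyl", "gear", "carb")  # illustrative supplementary variables

# PCs computed from the active variables only
X <- scale(as.matrix(mtcars[, active_vars]))
Z <- X %*% eigen(cor(X), symmetric = TRUE)$vectors
colnames(Z) <- paste0("PC", seq_len(ncol(Z)))

# Correlations of active and supplementary variables with the PCs
cor_active <- cor(X, Z)
cor_supp <- cor(as.matrix(mtcars[, supp_vars]), Z)
all_cors <- rbind(cor_active, cor_supp)

# Coordinates for the circle of correlations, labelled by type
circle <- data.frame(
  variable = rownames(all_cors),
  type = rep(c("active", "supp"),
             times = c(length(active_vars), length(supp_vars))),
  PC1 = all_cors[, "PC1"],
  PC2 = all_cors[, "PC2"],
  row.names = NULL
)
circle
```

Every point lies inside the unit circle, and the closer a variable is to the circle, the better it is represented on the PC1-PC2 plane.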
Circle of correlations
[Figure: circle of correlations showing the active variables and the supplementary variables (points3, rebounds, personal_fouls), distinguished by type (active or supp); vertical axis PC2, from −1.0 to 1.0]