Lecture 7

Principal Component Analysis (PCA)

• Innocent Ndoh Mbue, PhD

• E-mail: dndoh2009@gmail.com
• Tel: 653754070

Prof Dr.Ndoh Mbue 1 11/8/2022


Outline
In today's lecture

• Definition and meaning of the principal component
• Objectives of principal components analysis
• Data types
• Communalities, extraction, variance
• Information content and the distribution of the principal component
• Omission of variables with insufficient communalities
• Practical example using SPSS



Motivating Questions

• How can we explore structure in our dataset?

• How can we reduce complexity and see the pattern?



PCA, Definition
 PCA is a mathematical procedure that transforms a number of (possibly)
correlated variables into a (smaller) number of uncorrelated variables called
principal components. The first principal component accounts for as much of the
variability in the data as possible, and each succeeding component accounts for
as much of the remaining variability as possible.

Objectives of principal component analysis


 To discover structure in, or reduce the dimensionality of, the data set.
 To identify new, meaningful underlying variables.



Data Reduction
 summarization of data with many (p) variables by a smaller set of (k) derived
(synthetic, composite) variables: an n × p data matrix is reduced to an n × k
matrix of component scores.

In other words, we wish to reduce a set of p variables to a set of k underlying
superordinate dimensions.

These underlying factors are inferred from the correlations among the p
variables. Each factor is estimated as a weighted sum of the p variables.



Step by step explanation

Step 1: Standardization
The aim of this step is to standardize the range of the continuous initial variables so that
each one of them contributes equally to the analysis.

If there are large differences between the ranges of initial variables, those variables with
larger ranges will dominate over those with small ranges (For example, a variable that
ranges between 0 and 100 will dominate over a variable that ranges between 0 and 1),
which will lead to biased results. So, transforming the data to comparable scales can
prevent this problem.

Mathematically, this can be done by subtracting the mean and dividing by the standard
deviation for each value of each variable.

Once the standardization is done, all the variables will be transformed to the same scale.
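The standardization step can be sketched in Python/NumPy (a minimal illustration; the data values below are hypothetical, not from the lecture's dataset):

```python
import numpy as np

# Hypothetical data: 4 objects, two variables on very different scales.
X = np.array([[0.5, 120.0],
              [0.9,  80.0],
              [0.2, 150.0],
              [0.7, 100.0]])

# Z-score standardization: subtract each variable's mean and divide by its
# standard deviation, so every variable contributes on the same scale.
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

print(X_std.mean(axis=0))  # each column mean is (numerically) 0
print(X_std.std(axis=0))   # each column standard deviation is 1
```

After standardization, both variables have mean 0 and standard deviation 1, so neither dominates the covariance computation that follows.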



Step 2: Covariance Matrix computation
The aim of this step is to understand how the variables of the input data set are
varying from the mean with respect to each other, or in other words, to see if there
is any relationship between them. Because sometimes, variables are highly
correlated in such a way that they contain redundant information. So, in order to
identify these correlations, we compute the covariance matrix.

For a 3-dimensional data set with three variables x, y, and z, the covariance
matrix is a 3×3 matrix of this form:

    | Cov(x,x)  Cov(x,y)  Cov(x,z) |
    | Cov(y,x)  Cov(y,y)  Cov(y,z) |
    | Cov(z,x)  Cov(z,y)  Cov(z,z) |

Since the covariance of a variable with itself is its variance (Cov(a,a) = Var(a)),
the main diagonal (top left to bottom right) actually holds the variances of each
initial variable. And since covariance is commutative (Cov(a,b) = Cov(b,a)), the
entries of the covariance matrix are symmetric with respect to the main diagonal,
which means that the upper and lower triangular portions are equal.
 The degree to which the variables are linearly correlated is represented by
their covariances:

    C_ij = (1 / (n − 1)) · Σ_{m=1}^{n} (X_im − X̄_i) (X_jm − X̄_j)

where C_ij is the covariance of variables i and j, X_im is the value of
variable i in object m, X̄_i is the mean of variable i, and the sum runs over
all n objects.

 If positive: the two variables increase or decrease together (correlated).

 If negative: one increases when the other decreases (inversely correlated).
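As a quick check of the formula above (a sketch on synthetic data, not the lecture's dataset), the covariance matrix can be computed with NumPy:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))   # n = 50 objects, p = 3 variables (synthetic)

# Sample covariance: C_ij = (1/(n-1)) * sum_m (X_im - mean_i)(X_jm - mean_j).
# rowvar=False tells np.cov that variables are in columns.
C = np.cov(X, rowvar=False)

# The matrix is symmetric (Cov(a,b) = Cov(b,a)) and its main diagonal holds
# each variable's variance (Cov(a,a) = Var(a)).
print(np.allclose(C, C.T))
print(np.allclose(np.diag(C), X.var(axis=0, ddof=1)))
```

Both checks print True, matching the symmetry and diagonal properties described above.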
Step 3: Compute the eigenvectors and eigenvalues of the covariance matrix to
identify the principal components

 The eigenvectors and eigenvalues are computed from the covariance matrix in
order to determine the principal components of the data.
 Principal components are new variables that are constructed as linear combinations or
mixtures of the initial variables. These combinations are done in such a way that the new
variables (i.e., principal components) are uncorrelated and most of the information
within the initial variables is squeezed or compressed into the first components, then
maximum remaining information in the second and so on, until having something like
shown in the scree plot below.

Organizing information in principal components this way allows you to reduce
dimensionality without losing much information, by discarding the components
with low information and considering the remaining components as your new
variables.
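Putting Steps 1 to 3 together, here is a minimal NumPy sketch (on synthetic data with a hypothetical number of variables) that extracts principal components from the covariance matrix:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 4))          # synthetic data: 100 objects, 4 variables
X = (X - X.mean(axis=0)) / X.std(axis=0)   # Step 1: standardization

C = np.cov(X, rowvar=False)            # Step 2: covariance matrix

# Step 3: eigen-decomposition of the symmetric covariance matrix.
eigvals, eigvecs = np.linalg.eigh(C)   # eigh returns ascending order
order = np.argsort(eigvals)[::-1]      # sort descending so PC 1 comes first
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

scores = X @ eigvecs                   # principal component scores

# The PCs are uncorrelated and their variances equal the eigenvalues.
S_pc = np.cov(scores, rowvar=False)
print(np.allclose(S_pc, np.diag(eigvals), atol=1e-8))   # True
```

Discarding all but the first few columns of `scores` is exactly the dimensionality reduction described above.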
Geometric Rationale of PCA

 Geometrically speaking, principal components represent the directions of the
data that explain a maximal amount of variance, that is to say, the lines that
capture most of the information in the data.

The relationship between variance and information here is that the larger the
variance carried by a line, the larger the dispersion of the data points along
it; and the larger the dispersion along a line, the more information it
carries.

To put all this simply, just think of principal components as new axes that
provide the best angle to see and evaluate the data, so that the differences
between the observations are better visible.



 PC axes are a rigid rotation of the original variables

 PC 1 is simultaneously the direction of maximum variance and a least-squares
“line of best fit” (squared distances of points away from PC 1 are minimized).

(Figure: PC 1 and PC 2 axes drawn through the data cloud.)


Generalization to p-dimensions

 In practice nobody uses PCA with only 2 variables

 The algebra for finding principal axes readily generalizes to p variables

 PC 1 is the direction of maximum variance in the p-dimensional cloud of
points.

 PC 2 is in the direction of the next highest variance, subject to the
constraint that it has zero covariance with PC 1.

 PC 3 is in the direction of the next highest variance, subject to the
constraint that it has zero covariance with both PC 1 and PC 2, and so on...
up to PC p.



The Algebra of PCA

 finding the principal axes involves eigenanalysis of the cross-products
matrix (S)

 the eigenvalues (latent roots) of S are solutions (λ) to the characteristic
equation

    |S − λI| = 0



…The Algebra of PCA
 the eigenvalues λ1, λ2, ..., λp are the variances of the coordinates on each
principal component axis

 the sum of all p eigenvalues equals the trace of S (the sum of the variances
of the original variables).

          X1       X2
    X1    6.6707   3.4170        λ1 = 9.8783
    X2    3.4170   6.2384        λ2 = 3.0308

    Note: λ1 + λ2 = 12.9091 = Trace(S)
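Using the worked numbers on this slide, the trace identity can be verified numerically (a sketch; the matrix entries are the slide's rounded figures):

```python
import numpy as np

# Cross-products matrix S from the table above.
S = np.array([[6.6707, 3.4170],
              [3.4170, 6.2384]])

eigvals = np.linalg.eigvalsh(S)[::-1]   # descending: lambda_1, lambda_2
print(eigvals)                          # approx [9.8783, 3.0308]

# The sum of the eigenvalues equals the trace of S.
print(float(eigvals.sum()), float(np.trace(S)))   # both approx 12.9091

# Proportion of total variance carried by PC 1 (used on a later slide).
print(float(eigvals[0] / np.trace(S)))  # approx 0.765, i.e. 76.5%
```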



…The Algebra of PCA
 each eigenvector consists of p values which represent the
“contribution” of each variable to the principal component
axis
 eigenvectors are uncorrelated (orthogonal)
 their cross-products are zero.

Eigenvectors
u1 u2

X1 0.7291 -0.6844

X2 0.6844 0.7291

0.7291*(-0.6844) + 0.6844*0.7291 = 0
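The orthogonality claim can be checked directly with the eigenvectors and the cross-products matrix from these slides (values rounded to four decimals, so the eigen-equation holds only approximately):

```python
import numpy as np

S = np.array([[6.6707, 3.4170],
              [3.4170, 6.2384]])   # cross-products matrix from the slides

u1 = np.array([0.7291, 0.6844])    # eigenvector for lambda_1 = 9.8783
u2 = np.array([-0.6844, 0.7291])   # eigenvector for lambda_2 = 3.0308

# The cross-product (dot product) of distinct eigenvectors is zero.
print(float(u1 @ u2))              # 0.0

# Each eigenvector satisfies S u = lambda u (up to the slide's rounding).
print(np.allclose(S @ u1, 9.8783 * u1, atol=1e-2))   # True
```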



…The Algebra of PCA
 variance of the scores on each PC axis is equal to the corresponding
eigenvalue for that axis

 the eigenvalue λk represents the variance displayed (“explained” or
“extracted”) by the kth axis

 the sum of the first k eigenvalues is the variance explained by the
k-dimensional ordination.



1 = 9.8783 2 = 3.0308 Trace = 12.9091
PC 1 displays (“explains”):
9.8783/12.9091 = 76.5% of the total variance
6

2
PC 2

0
-8 -6 -4 -2 0 2 4 6 8 10 12

-2

-4

-6
Prof Dr.Ndoh Mbue 17 11/8/2022
PC 1
…The Algebra of PCA
 The cross-products matrix computed among the p principal axes has a
simple form:
 all off-diagonal values are zero (the principal axes are uncorrelated)
 the diagonal values are the eigenvalues.

PC1 PC2

PC1 9.8783 0.0000

PC2 0.0000 3.0308

Variance-covariance Matrix
of the PC axes



Exercises for next class
1. What are the objectives of Principal Components Analysis (PCA)?
2. What type of data should be used for PCA? (Standardized or mean-corrected?)
3. What is the difference between a variance-covariance matrix and a correlation
matrix?
4. How many principal components should be retained?
5. Given the data below:

Compute the principal factors


6. Given the correlation matrix

(a) Compute the eigenvalues λ1 and λ2 of R and the corresponding eigenvectors v1
and v2 of R.
(b) Show that λ1+λ2=tr(R) where the trace of a matrix equals the sum of its diagonal
components.
(c) Show that λ1·λ2 = det(R), where det(R) is the determinant of the matrix.
(d) Compute the weights of the principal components w1 and w2 that sets the scales of the
components and ensures that they are orthogonal.
(e) Compute the loadings of the variables.
(f) What proportion of the total variance in the data does the first principal component
account for?



Deciding How Many Components to Retain

Another device for deciding on the number of components to retain is Cattell's
scree test. This is a plot with eigenvalues on the ordinate and component
number on the abscissa.

 Scree is the rubble at the base of a sloping cliff. In a scree plot, the
scree is those components that are at the bottom of the sloping plot of
eigenvalues versus component number.

The plot provides a visual aid for deciding at what point including additional
components no longer increases the amount of variance accounted for by a
nontrivial amount.

(Figure: scree plot. The base of the cliff is composed of components 1 and 2;
together they account for most of the total variance, so we retain only the
first two components.)



What are the assumptions of PCA?
 assumes relationships among variables are LINEAR
 cloud of points in p-dimensional space has linear
dimensions that can be effectively summarized by the
principal axes

 if the structure in the data is NONLINEAR (the cloud of points twists and
curves its way through p-dimensional space), the principal axes will not be an
efficient and informative summary of the data.



Extracting an Initial Solution

A variety of methods have been developed to extract factors from an intercorrelation matrix.
SPSS offers the following methods …
1) Principal components method (probably the most commonly used method)
2) Maximum likelihood method (a commonly used method)
3) Principal axis method, also known as common factor analysis
4) Unweighted least-squares method
5) Generalized least squares method
6) Alpha method
7) Image factoring



Explained Variance

Which variables load (correlate) highest on Factor I and low on the other two
factors?

Ans = All of them

The total proportion of the variance in "sentence" explained by the factors is
simply the sum of its squared factor loadings:

    (0.76)² + (−0.576)² = 0.910

This is called the communality of the variable "sentence".
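The communality arithmetic above can be reproduced in a couple of lines of Python (loadings read from the slide, rounded):

```python
import numpy as np

# Loadings of the variable "sentence" on the two factors (slide values).
loadings = np.array([0.760, -0.576])

# Communality = sum of squared factor loadings.
communality = float(np.sum(loadings ** 2))
print(round(communality, 2))   # 0.91, matching the slide's .910
```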
What is a Factor Loading?
A factor loading is the correlation between a variable and a factor
that has been extracted from the data.

In an experiment, the factor (also called an independent variable) is an
explanatory variable manipulated by the experimenter. Each factor has two or
more levels (i.e., different values of the factor). Combinations of factor
levels are called treatments.
Communalities

A communality refers to the percent of variance in an observed variable that is
accounted for by the retained components (or factors).

A given variable will display a large communality if it loads heavily on at
least one of the study's retained components. Although communalities are
computed in both procedures, the concept of variable communality is more
relevant in a factor analysis than in principal component analysis.

              Initial   Extraction
    COST      1.000     .842
    SIZE      1.000     .901
    ALCOHOL   1.000     .889
    REPUTAT   1.000     .546
    COLOR     1.000     .910
    AROMA     1.000     .918
    TASTE     1.000     .922
    Extraction Method: Principal Component Analysis.


Strategy for solving problems
• A principal component factor analysis requires:
1) The variables included must be metric level or dichotomous (dummy-
coded) nominal level
2) The sample size must be greater than 50 (preferably 100)
3) The ratio of cases to variables must be 5 to 1 or larger
4) The correlation matrix for the variables must contain 2 or more
correlations of 0.30 or greater
5) Variables with measures of sampling adequacy (MSA) less than 0.50
must be removed
6) The overall measure of sampling adequacy is 0.50 or higher
7) The Bartlett test of sphericity is statistically significant.

• The first phase of a principal component analysis is devoted to verifying
that we meet these requirements. If we do not meet these requirements, factor
analysis is not appropriate.
Notes - 1

• When evaluating measures of sampling adequacy, communalities, or factor
loadings, we ignore the sign of the numeric value and base our decision on the
size or magnitude of the value.

• The sign of the number indicates the direction of the relationship.

• A loading of -0.732 is just as strong as a loading of 0.732. The minus sign
indicates an inverse or negative relationship; the absence of a sign is meant
to imply a plus sign indicating a direct or positive relationship.
Notes - 2

• If there are two or more components in the component matrix, the pattern of
loadings is based on the SPSS Rotated Component Matrix. If there is only one
component in the solution, the Rotated Component Matrix is not computed, and
the pattern of loadings is based on the Component Matrix.

• It is possible that the analysis will break down and we will have too few
variables in the analysis to support the use of principal component analysis.



Step-By-Step Tutorials

Two sets of data to be used in this tutorial:

gs2000.sav

Make sure these are already in your working folder



Tutorials I: gs2000.sav datafile

 How To Open the datafile:

 [File > Open > Data > choose the C/D/E: drive, the SPSS folder, then
gs2000.sav]



Exercise
In the dataset gs2000.sav, is the following statement true, false, or an
incorrect application of a statistic? Answer the question based on the
results of a principal component analysis prior to testing for outliers,
split sample validation, and a test of reliability. Assume that there is no
problematic pattern of missing data. Use a level of significance of 0.05.

A principal component analysis of the variables "father's highest
academic degree" [padeg], "mother's highest academic degree" [madeg],
"spouse's highest degree" [spdeg], "general happiness" [happy],
"happiness of marriage" [hapmar], "condition of health" [health], and
"is life exciting or dull" [life], which are ordinal level variables, does not
result in any usable components that reduce the number of variables needed to
represent the information in the original variables.
1. True
2. True with caution
3. False
4. Inappropriate application of a statistic
Computing a principal component analysis

To compute a principal
component analysis in SPSS,
select the Data Reduction |
Factor… command from the
Analyze menu.



Add the variables to the analysis

First, move the variables listed in the problem to the Variables list box.

Second, click on the Descriptives… button to specify statistics to include in
the output.



Complete the descriptives dialog box

First, mark the Univariate descriptives checkbox to get a tally of valid cases.

Second, keep the Initial solution checkbox to get the statistics needed to
determine the number of factors to extract.

Third, mark the Coefficients checkbox to get a correlation matrix, one of the
outputs needed to assess the appropriateness of factor analysis for the
variables.

Fourth, mark the KMO and Bartlett's test of sphericity checkbox to get more
outputs used to assess the appropriateness of factor analysis for the
variables.

Fifth, mark the Anti-image checkbox to get more outputs used to assess the
appropriateness of factor analysis for the variables.

Sixth, click on the Continue button.
Select the extraction method

First, click on the Extraction… button to specify statistics to include in the
output. The extraction method refers to the mathematical method that SPSS uses
to compute the factors or components.



Complete the extraction dialog box

First, retain the default method, Principal components.

Second, click on the Continue button.


Select the rotation method

First, click on the Rotation… button to specify statistics to include in the
output.

The rotation method refers to the mathematical method that SPSS uses to rotate
the axes in geometric space. This makes it easier to determine which variables
load on which components.
Complete the rotation dialog box

First, mark the Varimax method as the type of rotation to be used in the
analysis.

Second, click on the Continue button.


Complete the request for the
analysis

First, click on the


OK button to
request the output.



Level of measurement requirement
"Highest academic degree" [degree], "father's highest
academic degree" [padeg], "mother's highest academic degree"
[madeg], "spouse's highest academic degree" [spdeg], "general
happiness" [happy], "happiness of marriage" [hapmar],
"condition of health" [health], and "attitude toward life" [life]
are ordinal level variables.

If we follow the convention of treating ordinal level variables as metric
variables, the level of measurement requirement for principal component
analysis is satisfied. Since some data analysts do not agree with this
convention, a note of caution should be included in our interpretation.



Sample size requirement: minimum number of cases

    Descriptive Statistics
                               Mean   Std. Deviation   Analysis N
    RS HIGHEST DEGREE          1.68   1.085            68
    FATHERS HIGHEST DEGREE      .96    .984            68
    MOTHERS HIGHEST DEGREE      .85    .797            68
    SPOUSES HIGHEST DEGREE     1.97   1.233            68
    GENERAL HAPPINESS          1.65    .617            68
    HAPPINESS OF MARRIAGE      1.47    .532            68
    CONDITION OF HEALTH        1.76    .848            68
    IS LIFE EXCITING OR DULL   1.53    .532            68

The number of valid cases for this set of variables is 68.

While principal component analysis can be conducted on a sample that has fewer
than 100 cases but more than 50 cases, we should be cautious about its
interpretation.


Sample size requirement: ratio of cases to variables

(The Descriptive Statistics table is shown above; Analysis N = 68 for all eight
variables.)

The ratio of cases to variables in a principal component analysis should be at
least 5 to 1.

With 68 cases and 8 variables, the ratio of cases to variables is 8.5 to 1,
which exceeds the requirement for the ratio of cases to variables.


Appropriateness of factor analysis:
Presence of substantial correlations

 Principal components analysis requires that there be some correlations
greater than 0.30 between the variables included in the analysis.

 For this set of variables, there are 7 correlations in the matrix greater
than 0.30 in magnitude (.490, .410, .595, .677, .319, .514, and −.392),
satisfying this requirement.

    Correlation Matrix
              DEGREE  PADEG   MADEG   SPDEG   HAPPY   HAPMAR  HEALTH  LIFE
    DEGREE    1.000    .490    .410    .595   -.017   -.172   -.246   -.138
    PADEG      .490   1.000    .677    .319   -.100   -.131   -.174   -.012
    MADEG      .410    .677   1.000    .208    .105   -.046   -.008    .151
    SPDEG      .595    .319    .208   1.000   -.053   -.138   -.392   -.090
    HAPPY     -.017   -.100    .105   -.053   1.000    .514    .267    .214
    HAPMAR    -.172   -.131   -.046   -.138    .514   1.000    .282    .161
    HEALTH    -.246   -.174   -.008   -.392    .267    .282   1.000    .214
    LIFE      -.138   -.012    .151   -.090    .214    .161    .214   1.000


Removing the variable from
the list of variables

First, highlight
the life variable.

Second, click on the left


arrow button to remove
the variable from the
Variables list box.



Replicating the factor
analysis

The dialog recall command opens


the dialog box with all of the
settings that we had selected the
last time we used factor analysis.

To replicate the analysis without


the variable that we just removed,
click on the OK button.



Appropriateness of factor analysis:
Sampling adequacy of individual variables

There are two anti-image matrices: the anti-image covariance matrix and the
anti-image correlation matrix. We are interested in the anti-image correlation
matrix.

Principal component analysis requires that the Kaiser-Meyer-Olkin Measure of
Sampling Adequacy be greater than 0.50 for each individual variable as well as
for the set of variables.

On iteration 1, the MSA for all of the individual variables included in the
analysis was greater than 0.5, supporting their retention in the analysis.

(Table: Anti-image Matrices. The values on the main diagonal of the anti-image
correlation matrix are the measures of sampling adequacy (MSA) for the
individual variables: .701, .640, .586, .656, .549, .619, .734, and .638; all
greater than 0.50.)
Appropriateness of factor analysis:
Sampling adequacy for the set of variables

    KMO and Bartlett's Test
    Kaiser-Meyer-Olkin Measure of Sampling Adequacy:   .640
    Bartlett's Test of Sphericity:  Approx. Chi-Square 137.823, df 28, Sig. .000

In addition, the overall MSA for the set of variables included in the analysis
was 0.640, which exceeds the minimum requirement of 0.50 for overall MSA.


Appropriateness of factor analysis:
Bartlett test of sphericity

    KMO and Bartlett's Test
    Kaiser-Meyer-Olkin Measure of Sampling Adequacy:   .640
    Bartlett's Test of Sphericity:  Approx. Chi-Square 137.823, df 28, Sig. .000

 Principal component analysis requires that the probability associated with
Bartlett's Test of Sphericity be less than the level of significance.

 The probability associated with the Bartlett test is <0.001, which satisfies
this requirement.


Number of factors to extract:
Latent root criterion
Total Variance Explained

Initial Eigenvalues Extraction Sums of Squared Loadings


Component Total % of Variance Cumulative % Total % of Variance Cumulative %
1 2.600 32.502 32.502 2.600 32.502 32.502
2 1.772 22.149 54.651 1.772 22.149 54.651
3 1.079 13.486 68.137 1.079 13.486 68.137
4 .827 10.332 78.469
5 .631 7.888 86.358
6 .487 6.087 92.445
7 .333 4.161 96.606
8 .272 3.394 100.000
Extraction Method: Principal Component Analysis.
Using the output from iteration 1, there were 3 eigenvalues greater than 1.0
(see the scree plot). The latent root criterion for the number of factors to
derive would indicate that there were 3 components to be extracted for these
variables.
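The latent root criterion applied to this table can be sketched as (eigenvalues copied from the SPSS output above):

```python
# Eigenvalues from the "Total Variance Explained" table above.
eigenvalues = [2.600, 1.772, 1.079, 0.827, 0.631, 0.487, 0.333, 0.272]

# Latent root (Kaiser) criterion: retain components with eigenvalue > 1.0.
n_components = sum(1 for ev in eigenvalues if ev > 1.0)
print(n_components)   # 3

# Cumulative % of variance for the retained components (approx 68.1%,
# matching the table's 68.137 up to rounding of the listed eigenvalues).
cumulative = 100 * sum(eigenvalues[:n_components]) / sum(eigenvalues)
print(cumulative)
```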



Number of factors to extract:
Percentage of variance criterion
(The Total Variance Explained table is shown above.)

In addition, the cumulative proportion of variance criterion can be met with 3
components, satisfying the criterion of explaining 60% or more of the total
variance: a 3-component solution would explain 68.137% of the total variance.

Since the SPSS default is to extract the number of components indicated by the
latent root criterion, our initial factor solution was based on the extraction
of 3 components.



Communality requiring variable removal

    Communalities
                              Initial  Extraction
    RS HIGHEST DEGREE         1.000    .642
    FATHERS HIGHEST DEGREE    1.000    .623
    MOTHERS HIGHEST DEGREE    1.000    .592
    SPOUSES HIGHEST DEGREE    1.000    .516
    GENERAL HAPPINESS         1.000    .638
    HAPPINESS OF MARRIAGE     1.000    .594
    CONDITION OF HEALTH       1.000    .477
    Extraction Method: Principal Component Analysis.

On iteration 2, the communality for the variable "condition of health" [health]
was 0.477. Since this is less than 0.50, the variable should be removed from
the next iteration of the principal component analysis. The variable was
removed and the principal component analysis was computed again.


Repeating the factor analysis

In the drop down menu,


select Factor Analysis to
reopen the factor analysis
dialog box.



Removing the variable from
the list of variables

First, highlight
the health
variable.

Second, click on the left


arrow button to remove
the variable from the
Variables list box.



Replicating the factor
analysis

The dialog recall command opens


the dialog box with all of the
settings that we had selected the
last time we used factor analysis.

To replicate the analysis without


the variable that we just removed,
click on the OK button.



Communality requiring variable removal

    Communalities
                              Initial  Extraction
    RS HIGHEST DEGREE         1.000    .674
    FATHERS HIGHEST DEGREE    1.000    .640
    MOTHERS HIGHEST DEGREE    1.000    .577
    SPOUSES HIGHEST DEGREE    1.000    .491
    GENERAL HAPPINESS         1.000    .719
    HAPPINESS OF MARRIAGE     1.000    .741
    Extraction Method: Principal Component Analysis.

On iteration 3, the communality for the variable "spouse's highest academic
degree" [spdeg] was 0.491. Since this is less than 0.50, the variable should be
removed from the next iteration of the principal component analysis. The
variable was removed and the principal component analysis was computed again.


Repeating the factor analysis

In the drop down menu,


select Factor Analysis to
reopen the factor analysis
dialog box.



Removing the variable from
the list of variables

First, highlight the


spdeg variable.

Second, click on the left


arrow button to remove
the variable from the
Variables list box.



Replicating the factor
analysis

The dialog recall command opens


the dialog box with all of the
settings that we had selected the
last time we used factor analysis.

To replicate the analysis without


the variable that we just removed,
click on the OK button.



Communality satisfactory for all variables

    Communalities
                              Initial  Extraction
    RS HIGHEST DEGREE         1.000    .577
    FATHERS HIGHEST DEGREE    1.000    .720
    MOTHERS HIGHEST DEGREE    1.000    .684
    GENERAL HAPPINESS         1.000    .745
    HAPPINESS OF MARRIAGE     1.000    .782
    Extraction Method: Principal Component Analysis.

Once any variables with communalities less than 0.50 have been removed from the
analysis, the pattern of factor loadings should be examined to identify
variables that have complex structure.

 Complex structure occurs when one variable has high loadings or correlations
(0.40 or greater) on more than one component. If a variable has complex
structure, it should be removed from the analysis.

 Variables are only checked for complex structure if there is more than one
component in the solution. Variables that load on only one component are
described as having simple structure.
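The complex-structure check described above can be sketched in Python, using the rotated loadings reported in the SPSS Rotated Component Matrix on the following slides:

```python
import numpy as np

# Rotated component loadings (5 variables x 2 components) from the SPSS output.
loadings = np.array([
    [ 0.732, -0.202],   # RS highest degree
    [ 0.848,  0.031],   # father's highest degree
    [ 0.810,  0.169],   # mother's highest degree
    [ 0.145,  0.851],   # general happiness
    [-0.145,  0.872],   # happiness of marriage
])

# Complex structure: |loading| >= 0.40 on more than one component.
high = np.abs(loadings) >= 0.40
n_complex = int((high.sum(axis=1) > 1).sum())
print(n_complex)   # 0: every variable has simple structure
```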



Identifying complex structure

If only one component has been extracted, each variable can only load on that
one factor, so complex structure is not an issue.

    Rotated Component Matrix (a)
                              Component 1   Component 2
    RS HIGHEST DEGREE          .732         -.202
    FATHERS HIGHEST DEGREE     .848          .031
    MOTHERS HIGHEST DEGREE     .810          .169
    GENERAL HAPPINESS          .145          .851
    HAPPINESS OF MARRIAGE     -.145          .872
    Extraction Method: Principal Component Analysis.
    Rotation Method: Varimax with Kaiser Normalization.
    a. Rotation converged in 3 iterations.

On iteration 4, none of the variables demonstrated complex structure. It is not
necessary to remove any additional variables because of complex structure.



Variable loadings on components

On iteration 4, the 2 components in the analysis had more than one variable
loading on each of them. No variables need to be removed because they are the
only variable loading on a component.

(The Rotated Component Matrix is shown above.)



Final check of communalities

Once we have resolved any problems with complex structure, we check the
communalities one last time to make certain that we are explaining a sufficient
portion of the variance of all of the original variables.

    Communalities
                              Initial  Extraction
    RS HIGHEST DEGREE         1.000    .577
    FATHERS HIGHEST DEGREE    1.000    .720
    MOTHERS HIGHEST DEGREE    1.000    .684
    GENERAL HAPPINESS         1.000    .745
    HAPPINESS OF MARRIAGE     1.000    .782
    Extraction Method: Principal Component Analysis.

The communalities for all of the variables included on the components were
greater than 0.50 and all variables had simple structure. The principal
component analysis has been completed.



Interpreting the principal components

The information in 5 of the variables can be represented by 2 components.

Component 1 includes the variables:
• "highest academic degree" [degree],
• "father's highest academic degree" [padeg], and
• "mother's highest academic degree" [madeg].

Component 2 includes the variables:
• "general happiness" [happy] and
• "happiness of marriage" [hapmar].

(The Rotated Component Matrix is shown above.)



Total variance explained

    Total Variance Explained
                    Initial Eigenvalues             Extraction Sums of Squared Loadings
    Component  Total  % of Variance  Cumulative %   Total  % of Variance  Cumulative %
    1          1.953  39.061         39.061         1.953  39.061         39.061
    2          1.555  31.109         70.169         1.555  31.109         70.169
    3           .649  12.989         83.158
    4           .441   8.820         91.977
    5           .401   8.023        100.000
    Extraction Method: Principal Component Analysis.

The 2 components explain 70.169% of the total variance in the variables which
are included on the components.



General Steps in principal component analysis - 1

The following is a guide to the decision process for answering problems about
principal component analysis:

Are the variables included in the analysis metric?
  No  -> Incorrect application of a statistic
  Yes -> continue

Is the number of valid cases at least 50?
  No  -> Incorrect application of a statistic
  Yes -> continue

Is the ratio of cases to variables at least 5 to 1?
  No  -> Incorrect application of a statistic
  Yes -> continue



Steps in principal component analysis - 2

Are there two or more correlations that are 0.30 or greater?
  No  -> Incorrect application of a statistic
  Yes -> continue

Is the measure of sampling adequacy larger than 0.50 for each variable?
  No  -> Remove variable with lowest MSA and repeat analysis
  Yes -> continue

Is the overall measure of sampling adequacy greater than 0.50?
  No  -> Incorrect application of a statistic
  Yes -> continue
Steps in principal component analysis - 3

Is the probability for the Bartlett test of sphericity less than the level of
significance?
  No  -> Incorrect application of a statistic
  Yes -> continue

Is the communality for each variable greater than 0.50?
  No  -> Remove variable with lowest communality and repeat analysis
  Yes -> continue



Steps in principal component analysis - 4

Do all variables show simple structure (only one loading > 0.40)?
  No  -> Remove variable with complex structure and repeat analysis
  Yes -> continue

Do all of the components have more than one variable loading on them?
  No  -> Remove single variable loading on component and repeat analysis
  Yes -> continue

Are the number of components and pattern of loadings correct?
  No  -> False
  Yes -> continue



Steps in principal component analysis - 5

Are all of the metric variables included in the final analysis interval level?
  No  -> True with caution
  Yes -> continue

Does the sample size for the final analysis include 100 cases or more?
  No  -> True with caution
  Yes -> continue

Is the cumulative proportion of variance for the variables 60% or higher?
  No  -> True with caution
  Yes -> True


