Lecture 7

Principal Component Analysis (PCA)

• Innocent Ndoh Mbue, PhD

• E-mail: dndoh2009@gmail.com
• Tel: 653754070

Prof Dr.Ndoh Mbue 1 11/8/2022


Outline
In today's lecture

• Definition and meaning of the principal component
• Objectives of principal components analysis
• Data types
• Communalities, extraction, variance
• Information content and the distribution of the principal component
• Omission of variables with insufficient communalities
• Practical example using SPSS



Motivating Questions

• How can we explore structure in our dataset?

• How can we reduce complexity and see the pattern?



PCA, Definition
 PCA is a mathematical procedure that transforms a number of (possibly)
correlated variables into a (smaller) number of uncorrelated variables called
principal components. The first principal component accounts for as much of the
variability in the data as possible, and each succeeding component accounts for
as much of the remaining variability as possible.

Objectives of principal component analysis


 To discover structure in, or reduce the dimensionality of, the data set.
 To identify new, meaningful underlying variables.



Data Reduction
 summarization of data with many (p) variables by a smaller set of (k) derived
(synthetic, composite) variables: an n × p data matrix is reduced to an n × k
matrix of component scores.

In other words, we wish to reduce a set of p variables to a set of k underlying
superordinate dimensions.

These underlying factors are inferred from the correlations among the p
variables. Each factor is estimated as a weighted sum of the p variables.



Step by step explanation

Step 1: Standardization
The aim of this step is to standardize the range of the continuous initial variables so that
each one of them contributes equally to the analysis.

If there are large differences between the ranges of initial variables, those variables with
larger ranges will dominate over those with small ranges (For example, a variable that
ranges between 0 and 100 will dominate over a variable that ranges between 0 and 1),
which will lead to biased results. So, transforming the data to comparable scales can
prevent this problem.

Mathematically, this can be done by subtracting the mean and dividing by the standard
deviation for each value of each variable.

Once the standardization is done, all the variables will be transformed to the same scale.
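The standardization step can be sketched in Python/NumPy (a minimal illustration; the data values below are hypothetical, not from the lecture's dataset):

```python
import numpy as np

# Hypothetical data: 4 objects, two variables on very different scales.
X = np.array([[0.5, 120.0],
              [0.9,  80.0],
              [0.2, 150.0],
              [0.7, 100.0]])

# Z-score standardization: subtract each variable's mean and divide by its
# standard deviation, so every variable contributes on the same scale.
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

print(X_std.mean(axis=0))  # each column mean is (numerically) 0
print(X_std.std(axis=0))   # each column standard deviation is 1
```

After standardization, both variables have mean 0 and standard deviation 1, so neither dominates the covariance computation that follows.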



Step 2: Covariance Matrix computation
The aim of this step is to understand how the variables of the input data set are
varying from the mean with respect to each other, or in other words, to see if there
is any relationship between them. Because sometimes, variables are highly
correlated in such a way that they contain redundant information. So, in order to
identify these correlations, we compute the covariance matrix.

For a 3-dimensional data set with three variables x, y, and z, the covariance
matrix is a 3×3 matrix of this form:

    | Cov(x,x)  Cov(x,y)  Cov(x,z) |
    | Cov(y,x)  Cov(y,y)  Cov(y,z) |
    | Cov(z,x)  Cov(z,y)  Cov(z,z) |

Since the covariance of a variable with itself is its variance (Cov(a,a) = Var(a)),
the main diagonal (top left to bottom right) actually holds the variances of each
initial variable. And since covariance is commutative (Cov(a,b) = Cov(b,a)), the
entries of the covariance matrix are symmetric with respect to the main diagonal,
which means that the upper and lower triangular portions are equal.
 The degree to which the variables are linearly correlated is represented by
their covariances:

    C_ij = (1 / (n − 1)) · Σ_{m=1}^{n} (X_im − X̄_i) (X_jm − X̄_j)

where C_ij is the covariance of variables i and j, X_im is the value of
variable i in object m, X̄_i is the mean of variable i, and the sum runs over
all n objects.

 If positive: the two variables increase or decrease together (correlated).

 If negative: one increases when the other decreases (inversely correlated).
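As a quick check of the formula above (a sketch on synthetic data, not the lecture's dataset), the covariance matrix can be computed with NumPy:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))   # n = 50 objects, p = 3 variables (synthetic)

# Sample covariance: C_ij = (1/(n-1)) * sum_m (X_im - mean_i)(X_jm - mean_j).
# rowvar=False tells np.cov that variables are in columns.
C = np.cov(X, rowvar=False)

# The matrix is symmetric (Cov(a,b) = Cov(b,a)) and its main diagonal holds
# each variable's variance (Cov(a,a) = Var(a)).
print(np.allclose(C, C.T))
print(np.allclose(np.diag(C), X.var(axis=0, ddof=1)))
```

Both checks print True, matching the symmetry and diagonal properties described above.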
Step 3: Compute the eigenvectors and eigenvalues of the covariance matrix to
identify the principal components

 The eigenvectors and eigenvalues are computed from the covariance matrix in
order to determine the principal components of the data.
 Principal components are new variables that are constructed as linear combinations or
mixtures of the initial variables. These combinations are done in such a way that the new
variables (i.e., principal components) are uncorrelated and most of the information
within the initial variables is squeezed or compressed into the first components, then
maximum remaining information in the second and so on, until having something like
shown in the scree plot below.

Organizing information in principal components this way allows you to reduce
dimensionality without losing much information, by discarding the components
with low information and considering the remaining components as your new
variables.
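Putting Steps 1 to 3 together, here is a minimal NumPy sketch (on synthetic data with a hypothetical number of variables) that extracts principal components from the covariance matrix:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 4))          # synthetic data: 100 objects, 4 variables
X = (X - X.mean(axis=0)) / X.std(axis=0)   # Step 1: standardization

C = np.cov(X, rowvar=False)            # Step 2: covariance matrix

# Step 3: eigen-decomposition of the symmetric covariance matrix.
eigvals, eigvecs = np.linalg.eigh(C)   # eigh returns ascending order
order = np.argsort(eigvals)[::-1]      # sort descending so PC 1 comes first
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

scores = X @ eigvecs                   # principal component scores

# The PCs are uncorrelated and their variances equal the eigenvalues.
S_pc = np.cov(scores, rowvar=False)
print(np.allclose(S_pc, np.diag(eigvals), atol=1e-8))   # True
```

Discarding all but the first few columns of `scores` is exactly the dimensionality reduction described above.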
Geometric Rationale of PCA

 Geometrically speaking, principal components represent the directions of the
data that explain a maximal amount of variance, that is to say, the lines that
capture most of the information in the data.

The relationship between variance and information here is that the larger the
variance carried by a line, the larger the dispersion of the data points along
it; and the larger the dispersion along a line, the more information it
carries.

To put all this simply, just think of principal components as new axes that
provide the best angle to see and evaluate the data, so that the differences
between the observations are better visible.



 PC axes are a rigid rotation of the original variables

 PC 1 is simultaneously the direction of maximum variance and a least-squares
“line of best fit” (squared distances of points away from PC 1 are minimized).

(Figure: PC 1 and PC 2 axes drawn through the data cloud.)


Generalization to p-dimensions

 In practice nobody uses PCA with only 2 variables

 The algebra for finding principal axes readily generalizes to p variables

 PC 1 is the direction of maximum variance in the p-dimensional cloud of
points.

 PC 2 is in the direction of the next highest variance, subject to the
constraint that it has zero covariance with PC 1.

 PC 3 is in the direction of the next highest variance, subject to the
constraint that it has zero covariance with both PC 1 and PC 2, and so on...
up to PC p.



The Algebra of PCA

 finding the principal axes involves eigenanalysis of the cross-products
matrix (S)

 the eigenvalues (latent roots) of S are solutions (λ) to the characteristic
equation

    |S − λI| = 0



…The Algebra of PCA
 the eigenvalues λ1, λ2, ..., λp are the variances of the coordinates on each
principal component axis

 the sum of all p eigenvalues equals the trace of S (the sum of the variances
of the original variables).

          X1       X2
    X1    6.6707   3.4170        λ1 = 9.8783
    X2    3.4170   6.2384        λ2 = 3.0308

    Note: λ1 + λ2 = 12.9091 = Trace(S)
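Using the worked numbers on this slide, the trace identity can be verified numerically (a sketch; the matrix entries are the slide's rounded figures):

```python
import numpy as np

# Cross-products matrix S from the table above.
S = np.array([[6.6707, 3.4170],
              [3.4170, 6.2384]])

eigvals = np.linalg.eigvalsh(S)[::-1]   # descending: lambda_1, lambda_2
print(eigvals)                          # approx [9.8783, 3.0308]

# The sum of the eigenvalues equals the trace of S.
print(float(eigvals.sum()), float(np.trace(S)))   # both approx 12.9091

# Proportion of total variance carried by PC 1 (used on a later slide).
print(float(eigvals[0] / np.trace(S)))  # approx 0.765, i.e. 76.5%
```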



…The Algebra of PCA
 each eigenvector consists of p values which represent the
“contribution” of each variable to the principal component
axis
 eigenvectors are uncorrelated (orthogonal)
 their cross-products are zero.

Eigenvectors
u1 u2

X1 0.7291 -0.6844

X2 0.6844 0.7291

0.7291*(-0.6844) + 0.6844*0.7291 = 0
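The orthogonality claim can be checked directly with the eigenvectors and the cross-products matrix from these slides (values rounded to four decimals, so the eigen-equation holds only approximately):

```python
import numpy as np

S = np.array([[6.6707, 3.4170],
              [3.4170, 6.2384]])   # cross-products matrix from the slides

u1 = np.array([0.7291, 0.6844])    # eigenvector for lambda_1 = 9.8783
u2 = np.array([-0.6844, 0.7291])   # eigenvector for lambda_2 = 3.0308

# The cross-product (dot product) of distinct eigenvectors is zero.
print(float(u1 @ u2))              # 0.0

# Each eigenvector satisfies S u = lambda u (up to the slide's rounding).
print(np.allclose(S @ u1, 9.8783 * u1, atol=1e-2))   # True
```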



…The Algebra of PCA
 variance of the scores on each PC axis is equal to the corresponding
eigenvalue for that axis

 the eigenvalue λk represents the variance displayed (“explained” or
“extracted”) by the kth axis

 the sum of the first k eigenvalues is the variance explained by the
k-dimensional ordination.



1 = 9.8783 2 = 3.0308 Trace = 12.9091
PC 1 displays (“explains”):
9.8783/12.9091 = 76.5% of the total variance
6

2
PC 2

0
-8 -6 -4 -2 0 2 4 6 8 10 12

-2

-4

-6
Prof Dr.Ndoh Mbue 17 11/8/2022
PC 1
…The Algebra of PCA
 The cross-products matrix computed among the p principal axes has a
simple form:
 all off-diagonal values are zero (the principal axes are uncorrelated)
 the diagonal values are the eigenvalues.

PC1 PC2

PC1 9.8783 0.0000

PC2 0.0000 3.0308

Variance-covariance Matrix
of the PC axes



Exercises for next class
1. What are the objectives of Principal Components Analysis (PCA)?
2. What type of data should be used for PCA? (Standardized or mean-corrected?)
3. What is the difference between a variance-covariance matrix and a correlation
matrix?
4. How many principal components should be retained?
5. Given the data below:

Compute the principal factors


6. Given the correlation matrix

(a) Compute the eigenvalues λ1 and λ2 of R and the corresponding eigenvectors v1
and v2 of R.
(b) Show that λ1+λ2=tr(R) where the trace of a matrix equals the sum of its diagonal
components.
(c) Show that λ1·λ2 = det(R), where det(R) is the determinant of the matrix.
(d) Compute the weights of the principal components w1 and w2 that sets the scales of the
components and ensures that they are orthogonal.
(e) Compute the loadings of the variables.
(f) What proportion of the total variance in the data does the first principal component
account for?



Deciding How Many Components to Retain

Another device for deciding on the number of components to retain is Cattell's
scree test. This is a plot with eigenvalues on the ordinate and component
number on the abscissa.

 Scree is the rubble at the base of a sloping cliff. In a scree plot, the
scree is those components that are at the bottom of the sloping plot of
eigenvalues versus component number.

The plot provides a visual aid for deciding at what point including additional
components no longer increases the amount of variance accounted for by a
nontrivial amount.

(Figure: scree plot. The base of the cliff is composed of components 1 and 2;
together they account for most of the total variance, so we retain only the
first two components.)



What are the assumptions of PCA?
 assumes relationships among variables are LINEAR
 cloud of points in p-dimensional space has linear
dimensions that can be effectively summarized by the
principal axes

 if the structure in the data is NONLINEAR (the cloud of points twists and
curves its way through p-dimensional space), the principal axes will not be an
efficient and informative summary of the data.



Extracting an Initial Solution

A variety of methods have been developed to extract factors from an intercorrelation matrix.
SPSS offers the following methods …
1) Principal components method (probably the most commonly used method)
2) Maximum likelihood method (a commonly used method)
3) Principal axis method, also known as common factor analysis
4) Unweighted least-squares method
5) Generalized least squares method
6) Alpha method
7) Image factoring



Explained Variance

Which variables load (correlate) highest on Factor I and low on the other two
factors?

Ans = All of them

The total proportion of the variance in "sentence" explained by the factors is
simply the sum of its squared factor loadings:

    (0.76)² + (−0.576)² = 0.910

This is called the communality of the variable "sentence".
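The communality arithmetic above can be reproduced in a couple of lines of Python (loadings read from the slide, rounded):

```python
import numpy as np

# Loadings of the variable "sentence" on the two factors (slide values).
loadings = np.array([0.760, -0.576])

# Communality = sum of squared factor loadings.
communality = float(np.sum(loadings ** 2))
print(round(communality, 2))   # 0.91, matching the slide's .910
```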
What is a Factor Loading?
A factor loading is the correlation between a variable and a factor
that has been extracted from the data.

In an experiment, the factor (also called an independent variable) is an
explanatory variable manipulated by the experimenter. Each factor has two or
more levels (i.e., different values of the factor). Combinations of factor
levels are called treatments.
Communalities

A communality refers to the percent of variance in an observed variable that is
accounted for by the retained components (or factors).

A given variable will display a large communality if it loads heavily on at
least one of the study's retained components. Although communalities are
computed in both procedures, the concept of variable communality is more
relevant in a factor analysis than in principal component analysis.

              Initial   Extraction
    COST      1.000     .842
    SIZE      1.000     .901
    ALCOHOL   1.000     .889
    REPUTAT   1.000     .546
    COLOR     1.000     .910
    AROMA     1.000     .918
    TASTE     1.000     .922
    Extraction Method: Principal Component Analysis.


Strategy for solving problems
• A principal component factor analysis requires:
1) The variables included must be metric level or dichotomous (dummy-
coded) nominal level
2) The sample size must be greater than 50 (preferably 100)
3) The ratio of cases to variables must be 5 to 1 or larger
4) The correlation matrix for the variables must contain 2 or more
correlations of 0.30 or greater
5) Variables with measures of sampling adequacy (MSA) less than 0.50
must be removed
6) The overall measure of sampling adequacy is 0.50 or higher
7) The Bartlett test of sphericity is statistically significant.

• The first phase of a principal component analysis is devoted to verifying
that we meet these requirements. If we do not meet these requirements, factor
analysis is not appropriate.
Notes - 1

• When evaluating measures of sampling adequacy, communalities, or factor
loadings, we ignore the sign of the numeric value and base our decision on the
size or magnitude of the value.

• The sign of the number indicates the direction of the relationship.

• A loading of -0.732 is just as strong as a loading of 0.732. The minus sign
indicates an inverse or negative relationship; the absence of a sign is meant
to imply a plus sign indicating a direct or positive relationship.
Notes - 2

• If there are two or more components in the component matrix, the pattern of
loadings is based on the SPSS Rotated Component Matrix. If there is only one
component in the solution, the Rotated Component Matrix is not computed, and
the pattern of loadings is based on the Component Matrix.

• It is possible that the analysis will break down and we will have too few
variables in the analysis to support the use of principal component analysis.



Step-By-Step Tutorials

Two sets of data to be used in this tutorial:

gs2000.sav

Make sure these are already in your working folder



Tutorials I: gs2000.sav datafile

 How To Open the datafile:

 [File > Open > Data > choose the C/D/E: drive, the SPSS folder, then
gs2000.sav]



Exercise
In the dataset gs2000.sav, is the following statement true, false, or an
incorrect application of a statistic? Answer the question based on the
results of a principal component analysis prior to testing for outliers,
split sample validation, and a test of reliability. Assume that there is no
problematic pattern of missing data. Use a level of significance of 0.05.

A principal component analysis of the variables "father's highest
academic degree" [padeg], "mother's highest academic degree" [madeg],
"spouse's highest degree" [spdeg], "general happiness" [happy],
"happiness of marriage" [hapmar], "condition of health" [health], and
"is life exciting or dull" [life], which are ordinal level variables, does not
result in any usable components that reduce the number of variables needed to
represent the information in the original variables.
1. True
2. True with caution
3. False
4. Inappropriate application of a statistic
Computing a principal component analysis

To compute a principal
component analysis in SPSS,
select the Data Reduction |
Factor… command from the
Analyze menu.



Add the variables to the analysis

First, move the variables listed in the problem to the Variables list box.

Second, click on the Descriptives… button to specify statistics to include in
the output.



Complete the descriptives dialog box

First, mark the Univariate descriptives checkbox to get a tally of valid cases.

Second, keep the Initial solution checkbox to get the statistics needed to
determine the number of factors to extract.

Third, mark the Coefficients checkbox to get a correlation matrix, one of the
outputs needed to assess the appropriateness of factor analysis for the
variables.

Fourth, mark the KMO and Bartlett's test of sphericity checkbox to get more
outputs used to assess the appropriateness of factor analysis for the
variables.

Fifth, mark the Anti-image checkbox to get more outputs used to assess the
appropriateness of factor analysis for the variables.

Sixth, click on the Continue button.
Select the extraction method

First, click on the Extraction… button to specify statistics to include in the
output. The extraction method refers to the mathematical method that SPSS uses
to compute the factors or components.



Complete the extraction dialog box

First, retain the default method, Principal components.

Second, click on the Continue button.


Select the rotation method

First, click on the Rotation… button to specify statistics to include in the
output.

The rotation method refers to the mathematical method that SPSS uses to rotate
the axes in geometric space. This makes it easier to determine which variables
load on which components.
Complete the rotation dialog box

First, mark the Varimax method as the type of rotation to be used in the
analysis.

Second, click on the Continue button.


Complete the request for the
analysis

First, click on the


OK button to
request the output.



Level of measurement requirement
"Highest academic degree" [degree], "father's highest
academic degree" [padeg], "mother's highest academic degree"
[madeg], "spouse's highest academic degree" [spdeg], "general
happiness" [happy], "happiness of marriage" [hapmar],
"condition of health" [health], and "attitude toward life" [life]
are ordinal level variables.

If we follow the convention of treating ordinal level variables as metric
variables, the level of measurement requirement for principal component
analysis is satisfied. Since some data analysts do not agree with this
convention, a note of caution should be included in our interpretation.



Sample size requirement: minimum number of cases

    Descriptive Statistics
                               Mean   Std. Deviation   Analysis N
    RS HIGHEST DEGREE          1.68   1.085            68
    FATHERS HIGHEST DEGREE      .96    .984            68
    MOTHERS HIGHEST DEGREE      .85    .797            68
    SPOUSES HIGHEST DEGREE     1.97   1.233            68
    GENERAL HAPPINESS          1.65    .617            68
    HAPPINESS OF MARRIAGE      1.47    .532            68
    CONDITION OF HEALTH        1.76    .848            68
    IS LIFE EXCITING OR DULL   1.53    .532            68

The number of valid cases for this set of variables is 68.

While principal component analysis can be conducted on a sample that has fewer
than 100 cases but more than 50 cases, we should be cautious about its
interpretation.


Sample size requirement: ratio of cases to variables

(The Descriptive Statistics table is shown above; Analysis N = 68 for all eight
variables.)

The ratio of cases to variables in a principal component analysis should be at
least 5 to 1.

With 68 cases and 8 variables, the ratio of cases to variables is 8.5 to 1,
which exceeds the requirement for the ratio of cases to variables.


Appropriateness of factor analysis:
Presence of substantial correlations

 Principal components analysis requires that there be some correlations
greater than 0.30 between the variables included in the analysis.

 For this set of variables, there are 7 correlations in the matrix greater
than 0.30 in magnitude (.490, .410, .595, .677, .319, .514, and −.392),
satisfying this requirement.

    Correlation Matrix
              DEGREE  PADEG   MADEG   SPDEG   HAPPY   HAPMAR  HEALTH  LIFE
    DEGREE    1.000    .490    .410    .595   -.017   -.172   -.246   -.138
    PADEG      .490   1.000    .677    .319   -.100   -.131   -.174   -.012
    MADEG      .410    .677   1.000    .208    .105   -.046   -.008    .151
    SPDEG      .595    .319    .208   1.000   -.053   -.138   -.392   -.090
    HAPPY     -.017   -.100    .105   -.053   1.000    .514    .267    .214
    HAPMAR    -.172   -.131   -.046   -.138    .514   1.000    .282    .161
    HEALTH    -.246   -.174   -.008   -.392    .267    .282   1.000    .214
    LIFE      -.138   -.012    .151   -.090    .214    .161    .214   1.000


Removing the variable from
the list of variables

First, highlight
the life variable.

Second, click on the left


arrow button to remove
the variable from the
Variables list box.



Replicating the factor
analysis

The dialog recall command opens


the dialog box with all of the
settings that we had selected the
last time we used factor analysis.

To replicate the analysis without


the variable that we just removed,
click on the OK button.



Appropriateness of factor analysis:
Sampling adequacy of individual variables

There are two anti-image matrices: the anti-image covariance matrix and the
anti-image correlation matrix. We are interested in the anti-image correlation
matrix.

Principal component analysis requires that the Kaiser-Meyer-Olkin Measure of
Sampling Adequacy be greater than 0.50 for each individual variable as well as
for the set of variables.

On iteration 1, the MSA for all of the individual variables included in the
analysis was greater than 0.5, supporting their retention in the analysis.

(Table: Anti-image Matrices. The values on the main diagonal of the anti-image
correlation matrix are the measures of sampling adequacy (MSA) for the
individual variables: .701, .640, .586, .656, .549, .619, .734, and .638; all
greater than 0.50.)
Appropriateness of factor analysis:
Sampling adequacy for the set of variables

    KMO and Bartlett's Test
    Kaiser-Meyer-Olkin Measure of Sampling Adequacy:   .640
    Bartlett's Test of Sphericity:  Approx. Chi-Square 137.823, df 28, Sig. .000

In addition, the overall MSA for the set of variables included in the analysis
was 0.640, which exceeds the minimum requirement of 0.50 for overall MSA.


Appropriateness of factor analysis:
Bartlett test of sphericity

    KMO and Bartlett's Test
    Kaiser-Meyer-Olkin Measure of Sampling Adequacy:   .640
    Bartlett's Test of Sphericity:  Approx. Chi-Square 137.823, df 28, Sig. .000

 Principal component analysis requires that the probability associated with
Bartlett's Test of Sphericity be less than the level of significance.

 The probability associated with the Bartlett test is <0.001, which satisfies
this requirement.


Number of factors to extract:
Latent root criterion
Total Variance Explained

Initial Eigenvalues Extraction Sums of Squared Loadings


Component Total % of Variance Cumulative % Total % of Variance Cumulative %
1 2.600 32.502 32.502 2.600 32.502 32.502
2 1.772 22.149 54.651 1.772 22.149 54.651
3 1.079 13.486 68.137 1.079 13.486 68.137
4 .827 10.332 78.469
5 .631 7.888 86.358
6 .487 6.087 92.445
7 .333 4.161 96.606
8 .272 3.394 100.000
Extraction Method: Principal Component Analysis.
Using the output from iteration 1, there were 3 eigenvalues greater than 1.0
(see the scree plot). The latent root criterion for the number of factors to
derive would indicate that there were 3 components to be extracted for these
variables.
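The latent root criterion applied to this table can be sketched as (eigenvalues copied from the SPSS output above):

```python
# Eigenvalues from the "Total Variance Explained" table above.
eigenvalues = [2.600, 1.772, 1.079, 0.827, 0.631, 0.487, 0.333, 0.272]

# Latent root (Kaiser) criterion: retain components with eigenvalue > 1.0.
n_components = sum(1 for ev in eigenvalues if ev > 1.0)
print(n_components)   # 3

# Cumulative % of variance for the retained components (approx 68.1%,
# matching the table's 68.137 up to rounding of the listed eigenvalues).
cumulative = 100 * sum(eigenvalues[:n_components]) / sum(eigenvalues)
print(cumulative)
```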



Number of factors to extract:
Percentage of variance criterion
(The Total Variance Explained table is shown above.)

In addition, the cumulative proportion of variance criterion can be met with 3
components, satisfying the criterion of explaining 60% or more of the total
variance: a 3-component solution would explain 68.137% of the total variance.

Since the SPSS default is to extract the number of components indicated by the
latent root criterion, our initial factor solution was based on the extraction
of 3 components.



Communality requiring variable removal

    Communalities
                              Initial  Extraction
    RS HIGHEST DEGREE         1.000    .642
    FATHERS HIGHEST DEGREE    1.000    .623
    MOTHERS HIGHEST DEGREE    1.000    .592
    SPOUSES HIGHEST DEGREE    1.000    .516
    GENERAL HAPPINESS         1.000    .638
    HAPPINESS OF MARRIAGE     1.000    .594
    CONDITION OF HEALTH       1.000    .477
    Extraction Method: Principal Component Analysis.

On iteration 2, the communality for the variable "condition of health" [health]
was 0.477. Since this is less than 0.50, the variable should be removed from
the next iteration of the principal component analysis. The variable was
removed and the principal component analysis was computed again.


Repeating the factor analysis

In the drop down menu,


select Factor Analysis to
reopen the factor analysis
dialog box.



Removing the variable from
the list of variables

First, highlight
the health
variable.

Second, click on the left


arrow button to remove
the variable from the
Variables list box.



Replicating the factor
analysis

The dialog recall command opens


the dialog box with all of the
settings that we had selected the
last time we used factor analysis.

To replicate the analysis without


the variable that we just removed,
click on the OK button.



Communality requiring variable removal

    Communalities
                              Initial  Extraction
    RS HIGHEST DEGREE         1.000    .674
    FATHERS HIGHEST DEGREE    1.000    .640
    MOTHERS HIGHEST DEGREE    1.000    .577
    SPOUSES HIGHEST DEGREE    1.000    .491
    GENERAL HAPPINESS         1.000    .719
    HAPPINESS OF MARRIAGE     1.000    .741
    Extraction Method: Principal Component Analysis.

On iteration 3, the communality for the variable "spouse's highest academic
degree" [spdeg] was 0.491. Since this is less than 0.50, the variable should be
removed from the next iteration of the principal component analysis. The
variable was removed and the principal component analysis was computed again.


Repeating the factor analysis

In the drop down menu,


select Factor Analysis to
reopen the factor analysis
dialog box.



Removing the variable from
the list of variables

First, highlight the


spdeg variable.

Second, click on the left


arrow button to remove
the variable from the
Variables list box.



Replicating the factor
analysis

The dialog recall command opens


the dialog box with all of the
settings that we had selected the
last time we used factor analysis.

To replicate the analysis without


the variable that we just removed,
click on the OK button.



Communality satisfactory for all variables

    Communalities
                              Initial  Extraction
    RS HIGHEST DEGREE         1.000    .577
    FATHERS HIGHEST DEGREE    1.000    .720
    MOTHERS HIGHEST DEGREE    1.000    .684
    GENERAL HAPPINESS         1.000    .745
    HAPPINESS OF MARRIAGE     1.000    .782
    Extraction Method: Principal Component Analysis.

Once any variables with communalities less than 0.50 have been removed from the
analysis, the pattern of factor loadings should be examined to identify
variables that have complex structure.

 Complex structure occurs when one variable has high loadings or correlations
(0.40 or greater) on more than one component. If a variable has complex
structure, it should be removed from the analysis.

 Variables are only checked for complex structure if there is more than one
component in the solution. Variables that load on only one component are
described as having simple structure.
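The complex-structure check described above can be sketched in Python, using the rotated loadings reported in the SPSS Rotated Component Matrix on the following slides:

```python
import numpy as np

# Rotated component loadings (5 variables x 2 components) from the SPSS output.
loadings = np.array([
    [ 0.732, -0.202],   # RS highest degree
    [ 0.848,  0.031],   # father's highest degree
    [ 0.810,  0.169],   # mother's highest degree
    [ 0.145,  0.851],   # general happiness
    [-0.145,  0.872],   # happiness of marriage
])

# Complex structure: |loading| >= 0.40 on more than one component.
high = np.abs(loadings) >= 0.40
n_complex = int((high.sum(axis=1) > 1).sum())
print(n_complex)   # 0: every variable has simple structure
```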



Identifying complex structure

If only one component has been extracted, each variable can only load on that
one factor, so complex structure is not an issue.

    Rotated Component Matrix (a)
                              Component 1   Component 2
    RS HIGHEST DEGREE          .732         -.202
    FATHERS HIGHEST DEGREE     .848          .031
    MOTHERS HIGHEST DEGREE     .810          .169
    GENERAL HAPPINESS          .145          .851
    HAPPINESS OF MARRIAGE     -.145          .872
    Extraction Method: Principal Component Analysis.
    Rotation Method: Varimax with Kaiser Normalization.
    a. Rotation converged in 3 iterations.

On iteration 4, none of the variables demonstrated complex structure. It is not
necessary to remove any additional variables because of complex structure.



Variable loadings on components

On iteration 4, the 2 components in the analysis had more than one variable
loading on each of them. No variables need to be removed because they are the
only variable loading on a component.

(The Rotated Component Matrix is shown above.)



Final check of communalities

Once we have resolved any problems with complex structure, we check the
communalities one last time to make certain that we are explaining a sufficient
portion of the variance of all of the original variables.

    Communalities
                              Initial  Extraction
    RS HIGHEST DEGREE         1.000    .577
    FATHERS HIGHEST DEGREE    1.000    .720
    MOTHERS HIGHEST DEGREE    1.000    .684
    GENERAL HAPPINESS         1.000    .745
    HAPPINESS OF MARRIAGE     1.000    .782
    Extraction Method: Principal Component Analysis.

The communalities for all of the variables included on the components were
greater than 0.50 and all variables had simple structure. The principal
component analysis has been completed.



Interpreting the principal components

The information in 5 of the variables can be represented by 2 components.

Component 1 includes the variables:
• "highest academic degree" [degree],
• "father's highest academic degree" [padeg], and
• "mother's highest academic degree" [madeg].

Component 2 includes the variables:
• "general happiness" [happy] and
• "happiness of marriage" [hapmar].

(The Rotated Component Matrix is shown above.)



Total variance explained

    Total Variance Explained
                    Initial Eigenvalues             Extraction Sums of Squared Loadings
    Component  Total  % of Variance  Cumulative %   Total  % of Variance  Cumulative %
    1          1.953  39.061         39.061         1.953  39.061         39.061
    2          1.555  31.109         70.169         1.555  31.109         70.169
    3           .649  12.989         83.158
    4           .441   8.820         91.977
    5           .401   8.023        100.000
    Extraction Method: Principal Component Analysis.

The 2 components explain 70.169% of the total variance in the variables which
are included on the components.



General Steps in principal component analysis - 1

The following is a guide to the decision process for answering problems about
principal component analysis:

Are the variables included in the analysis metric?
  No  -> Incorrect application of a statistic
  Yes -> continue

Is the number of valid cases at least 50?
  No  -> Incorrect application of a statistic
  Yes -> continue

Is the ratio of cases to variables at least 5 to 1?
  No  -> Incorrect application of a statistic
  Yes -> continue



Steps in principal component analysis - 2

Are there two or more correlations that are 0.30 or greater?
  No  -> Incorrect application of a statistic
  Yes -> continue

Is the measure of sampling adequacy larger than 0.50 for each variable?
  No  -> Remove variable with lowest MSA and repeat analysis
  Yes -> continue

Is the overall measure of sampling adequacy greater than 0.50?
  No  -> Incorrect application of a statistic
  Yes -> continue
Steps in principal component analysis - 3

Is the probability for the Bartlett test of sphericity less than the level of
significance?
  No  -> Incorrect application of a statistic
  Yes -> continue

Is the communality for each variable greater than 0.50?
  No  -> Remove variable with lowest communality and repeat analysis
  Yes -> continue



Steps in principal component analysis - 4

Do all variables show simple structure (only one loading > 0.40)?
  No  -> Remove variable with complex structure and repeat analysis
  Yes -> continue

Do all of the components have more than one variable loading on them?
  No  -> Remove single variable loading on component and repeat analysis
  Yes -> continue

Are the number of components and pattern of loadings correct?
  No  -> False
  Yes -> continue



Steps in principal component analysis - 5

Are all of the metric variables included in the final analysis interval level?
  No  -> True with caution
  Yes -> continue

Does the sample size for the final analysis include 100 cases or more?
  No  -> True with caution
  Yes -> continue

Is the cumulative proportion of variance for the variables 60% or higher?
  No  -> True with caution
  Yes -> True


