Paper No and Title: Paper no. 2, Quantitative Methods

Module No and Title: Module no. 32, Multivariate techniques: PCA and MDS

Module Tag: PSY_P2_M32

1. Learning Outcomes
2. Introduction
3. Principal Component Analysis (PCA)
3.1 Goals
3.2 Applications of PCA
3.3 PCA vs. Factor Analysis
3.4 When to Use PCA
3.5 Dataset in PCA
3.6 Issues with PCA
4. Multidimensional scaling
4.1 The concept of MDS
4.2 Spatial representation of MDS
4.3 The steps in MDS application to data
4.4 Application of MDS
5. Summary

1. Learning Outcomes
After studying this module, you shall be able to:

• Know about Principal Component Analysis and its data set.
• Understand the goals and applications of PCA.
• Learn about multidimensional scaling.
• Understand the applications of MDS.

2. Introduction
This module introduces two important multivariate techniques: Principal Component Analysis (PCA)
and multidimensional scaling (MDS). Both are powerful statistical techniques that are widely used
in contemporary social science research.

Principal Component Analysis (PCA) is a method for reducing the number of variables while
retaining as much information as possible. The aim of PCA is to transform a set of
correlated variables into a new set of uncorrelated variables. It reduces the original
variables to a smaller number of new components that are orthogonal to each other, and it
summarizes the underlying variance-covariance structure of a large set of variables through a few
linear combinations of these variables.

Multidimensional scaling (MDS) is a popular multivariate technique that seeks latent variables by
exploring the observed data. MDS investigates the similarities among the observed
entities and then identifies whether the data set is best described as lying on two or more
dimensions. The focus in MDS is to scale stimuli on various psychophysical attributes and then
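explore their similarities. Like factor analysis, MDS seeks a small number of underlying
dimensions, but it works from proximities (similarities or dissimilarities) between entities
rather than from correlations between variables.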

3. Principal Components Analysis (PCA)

Principal Component Analysis (PCA) is a method for studying, visualizing and analyzing large
data sets. It is one of the most popular multivariate techniques belonging to the general linear
model (GLM). The goal of PCA is to simplify the data and extract the relevant and useful
information from it. The extracted information is expressed as a set of new orthogonal variables
called “principal components”. The large data set is thus reduced to a small number of
manageable components, each representing a common theme or construct. The components
capture as much of the original variance in the data as possible and are uncorrelated with one
another. PCA also represents the pattern of similarity among the observations and the variables by
displaying them as points in maps.
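
The idea of extracting uncorrelated components that capture as much variance as possible can be illustrated with a minimal sketch. It assumes PCA is carried out by an eigen-decomposition of the covariance matrix, which is one common way of computing it; the function name `pca` and the simulated data are hypothetical and serve only as an illustration.

```python
import numpy as np

def pca(data, n_components):
    """PCA via eigen-decomposition of the covariance matrix (illustrative sketch).

    data: array of shape (n_observations, n_variables).
    Returns the component scores, the component directions and the
    proportion of total variance explained by each retained component.
    """
    centered = data - data.mean(axis=0)           # remove each variable's mean
    cov = np.cov(centered, rowvar=False)          # variance-covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)        # eigh returns ascending eigenvalues
    order = np.argsort(eigvals)[::-1]             # sort from largest to smallest
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    components = eigvecs[:, :n_components]        # orthogonal directions
    scores = centered @ components                # new uncorrelated variables
    explained_ratio = eigvals[:n_components] / eigvals.sum()
    return scores, components, explained_ratio

# Simulated data (an assumption for the example): three variables share a
# common source of variation, a fourth variable is unrelated noise.
rng = np.random.default_rng(0)
common = rng.normal(size=(200, 1))
data = np.hstack([common + 0.3 * rng.normal(size=(200, 3)),
                  rng.normal(size=(200, 1))])

scores, components, explained_ratio = pca(data, n_components=2)
print(explained_ratio)
```

Here the first component is expected to absorb most of the variance shared by the three related variables, and the two columns of `scores` are uncorrelated by construction.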

3.1 Goals of PCA:

The goals of PCA are

1. Extract the most important information from the data table.
2. Compress or reduce the size of the data set by keeping only the relevant information.
3. Simplify the description of the data set.
4. Analyze the structure of the observations and the variables.
5. Map the data into a space of low dimensionality.

3.2 Applications of PCA:

In the social sciences, especially in psychology, PCA is used for:

1. Visualization of the data
2. Reduction of the data
3. Classification of the data
4. Finding the relationships between the variables
5. Trend analysis
6. Psychometric test construction
7. Reduction of noise

3.3 Principal Component Analysis Vs. Factor Analysis:

Both factor analysis and Principal Component Analysis are methods of extracting factors
or components from the correlation matrix. The two methods are quite similar in their
theoretical considerations and often yield similar results, but they differ in how they analyze
the variance in the correlation matrix.

Factor analysis separates the common variance from the specific variance and the error variance,
whereas PCA ignores the distinction among the different sources of variance. PCA analyzes the
total variance in the correlation matrix and assumes that the components derived from the data
set can explain all of it. As a result, the components extracted by PCA include a certain amount
of error and specific variance.
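
The difference in the variance being analyzed can be seen numerically. The sketch below is only an illustration: it assumes simulated data, and for the factor-analytic side it assumes one common convention (squared multiple correlations placed on the diagonal as initial communality estimates, as in principal axis factoring), which is not necessarily the procedure a given software package uses.

```python
import numpy as np

# Simulated data (assumption): four variables that share one common factor.
rng = np.random.default_rng(1)
common = rng.normal(size=(300, 1))
data = common + rng.normal(size=(300, 4))

R = np.corrcoef(data, rowvar=False)

# PCA analyzes the full correlation matrix: the diagonal holds 1s, so the
# eigenvalues add up to the number of variables (the total variance).
pca_eigenvalues = np.linalg.eigvalsh(R)
total_variance = np.trace(R)

# A factor-analytic approach analyzes only the common variance. Assumption:
# communalities are estimated by squared multiple correlations (SMC).
smc = 1.0 - 1.0 / np.diag(np.linalg.inv(R))
R_reduced = R.copy()
np.fill_diagonal(R_reduced, smc)
common_variance = np.trace(R_reduced)

print(total_variance, pca_eigenvalues.sum(), common_variance)
```

The total variance equals the number of variables, while the common variance is smaller because the specific and error variance have been set aside.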

3.4 When to use PCA?

1. PCA is applied to a data set measured on an interval or ratio scale.
2. The variables should be normally distributed and linearly related to each other, and the
observations should be independent of each other.
3. There should be at least 5 observations per variable.
4. The total number of observations should be at least 100.
5. Stated as a participant-to-variable ratio, the minimum is 5:1, though the ideal ratio
documented in the literature is 10:1.
6. It is used when the aim is to reduce the number of variables to be used in another GLM
technique such as regression, MANOVA, etc.
7. It is used when attempting to identify latent constructs that are being measured by observed
variables in the absence of a priori theory.
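
The sample-size guidelines in this list can be turned into a quick check before running an analysis. The helper `check_pca_sample_size` is hypothetical and simply encodes the rules of thumb stated above (at least 100 observations, a minimum ratio of 5:1 and an ideal ratio of 10:1).

```python
def check_pca_sample_size(n_observations, n_variables):
    """Compare a data set with the rule-of-thumb sample sizes for PCA."""
    ratio = n_observations / n_variables
    return {
        "ratio": ratio,                              # participants per variable
        "meets_minimum_n": n_observations >= 100,    # at least 100 observations
        "meets_minimum_ratio": ratio >= 5,           # at least 5:1
        "meets_ideal_ratio": ratio >= 10,            # ideally 10:1
    }

# Example: 120 participants measured on 15 variables.
report = check_pca_sample_size(n_observations=120, n_variables=15)
print(report)
```

In this example the data set satisfies the minimum requirements but falls short of the ideal 10:1 ratio.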

3.5 Data Set in PCA:

PCA is applied to data tables in which the rows represent individuals/participants and the
columns represent quantitative variables. Such tables are very common in many fields, such as
sensory analysis, genetics, ecology, biology, psychology and sociology. Thus, the technique has
widespread applications in various disciplines. The aim of PCA is to characterize the individuals
according to the quantitative variables.

The data set in Principal Component Analysis can be studied in different ways:

1. The data table can be visualized as a set of rows to see differences from one row to the
other, or similarities from one row to another.

2. The data table can also be seen as a set of columns to understand the similarities or the
links between the columns.

Study of Individuals (Rows): Characterization of the group of individuals/participants by the
variables.

3. For the study of individuals, that is, the study of rows, the data table shows when
two individuals are close and when they are different from the point of view of all the
variables.

4. When there are many individuals, it can also be used to propose a typology of the
individuals, that is, which individuals are the most similar and which are the most
different.

5. It can also tell whether there exist any groups of individuals which are homogeneous in
terms of their similarities.

6. In addition, it can reveal the common dimensions of variability that oppose extreme
and intermediate individuals.

Study of Variables (Columns): Using the individuals/participants to better understand the links
between variables.

7. As with the individuals, PCA can describe the similarities between the
quantitative variables. In PCA, the variables are generally discussed in terms of
relationship rather than similarity. In other words, the technique describes the
relationships between the quantitative variables. As the best-known relationship between
variables is the linear relationship, PCA focuses on the linear relationships
between the variables. More complex relationships, such as quadratic, logarithmic or
exponential ones, may also exist, but PCA is generally used for
studying linear relationships only.

8. Similar to the grouping of individuals, PCA can be used to create groups of variables.

9. The aim of PCA is to draw conclusions from the linear relationships between variables by
detecting the principal dimensions of variability.

Thus, PCA is a method for jointly studying the individuals and the variables so that each
reinforces the interpretation of the other. When studying individuals, groups of individuals are
built and then characterized on the basis of the variables. Similarly, when there are groups of
variables, it may not be easy to interpret the relationships between many variables. In such a
case, individuals who are extreme from the point of view of these relationships can be used to
interpret them.

3.6 Issues with PCA:

1. It is a descriptive method to explore data.
2. The data is visualized with simple graphics.
3. PCA leads to data compression. It summarizes and synthesizes the information contained in a
data table by visualizing it.

4. Multidimensional scaling

4.1 The concept of MDS

In unidimensional scaling it is assumed that the data vary along a single dimension only. That
dimension is specified in advance and the responses are generated
accordingly. In reality, however, stimuli vary on a number of dimensions simultaneously. In MDS
the subjects are not informed in advance about the dimension on which the response is to
be generated; rather, the overall similarities or dissimilarities among the stimuli are the basis
for responding. The concept of MDS is useful when:

• The aim is to know which possible dimensions the respondents may use in
arriving at a response. So, no preliminary instructions about the response criterion are
given to the subjects. The subjects decide the response using their ingenuity and
• The decisive dimensions are known, but
their respective contributions to arriving at a response are not clear. It is not clear how
a subject uses these dimensions to differentiate the stimuli and arrive at a response. For
example, a subject may use two dimensions to arrive at a response, but which one
was more important in the judgment needs to be known. Alternatively, it is possible
that although two dimensions were given as criteria, the subject actually used only one.

Consider an example: some teachers are asked to compare their 20 students in pairs and rate how
similar the students in each pair are. With 20 students there are 20 × 19 / 2 = 190 distinct
pairs, so each teacher has 190 comparisons to make. The data are obtained from all the teachers
and an MDS program is run on them. After the analysis, two dimensions might emerge: academic
achievement (excellent to below average) and regularity in attending lectures (very regular to
very irregular).
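
The count of comparisons in this example follows from the number of distinct pairs among 20 students, n(n - 1)/2 = 190. A small sketch, with hypothetical student labels, enumerates the pairs a teacher would have to rate:

```python
from itertools import combinations
from math import comb

# Hypothetical labels for the 20 students in the example.
students = [f"S{i}" for i in range(1, 21)]

# Every unordered pair of distinct students is one similarity judgment.
pairs = list(combinations(students, 2))
n_pairs = len(pairs)

print(n_pairs, comb(20, 2))
```

Each rated pair fills one cell of a 20 × 20 symmetric similarity matrix, which is the input an MDS program works on.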

Thus, the implicit model of how similarity judgments are produced by the brain is that items have
attributes (such as size, viciousness, intelligence, furriness, etc.) in varying degrees, and the
similarity between items is a function of their similarity/dissimilarity in scores across all
attributes. This function is often conceived of as a weighted sum of the similarity across each
attribute, where the weights reflect the importance or saliency of the attribute.
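
The weighted-sum model of similarity can be sketched as follows. Everything specific here is an assumption made for illustration: attributes are scored between 0 and 1, the similarity on one attribute is taken as 1 minus the absolute difference in scores, and the items, scores and weights are invented.

```python
import numpy as np

def weighted_similarity(item_a, item_b, weights):
    """Weighted sum of per-attribute similarities, normalized by the total weight."""
    a = np.asarray(item_a, dtype=float)
    b = np.asarray(item_b, dtype=float)
    w = np.asarray(weights, dtype=float)
    per_attribute = 1.0 - np.abs(a - b)    # assumes scores lie in [0, 1]
    return float(np.dot(w, per_attribute) / w.sum())

# Hypothetical scores on (size, viciousness, intelligence, furriness).
cat = [0.2, 0.4, 0.6, 0.9]
dog = [0.4, 0.3, 0.6, 0.8]
shark = [0.8, 0.9, 0.4, 0.0]

# Hypothetical weights: viciousness is twice as salient as the other attributes.
weights = [1, 2, 1, 1]

sim_cat_dog = weighted_similarity(cat, dog, weights)
sim_cat_shark = weighted_similarity(cat, shark, weights)
print(sim_cat_dog, sim_cat_shark)
```

MDS works in the opposite direction: it starts from the overall similarity judgments and tries to recover the attributes (dimensions) that could have produced them.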

So, once the important and decisive attributes are identified the data interpretation becomes more

4.2 Spatial representation of MDS

Multidimensional scaling is a statistical procedure that aims to visualize the similarities
among the entities in a data set. The basic assumption of MDS is that every entity in the data
set can be represented as a point in a “p”-dimensional Euclidean geometric space (other metric and
non-metric spatial representations also exist). A data set can be represented using an
(N - 1)-dimensional space, where N is the number of items scaled. Euclidean space mapping is the
most popular method of representing the points. Choosing the number of dimensions for
representing a data set poses some problems: four or more dimensions cannot be drawn on paper,
and increasing the number of dimensions complicates the stress function.

Representing and visualizing the data set is a complex mathematical proposition. Statistically
speaking, MDS attempts to find a set of vectors in a p-dimensional space such that the Euclidean
distances between the points in this space are as close as possible to the input data matrix.
This is achieved by minimizing a criterion called the “stress function”. The process makes use of
monotonic regression (also known as isotonic regression). The stress can be found using the
following formula:

$$\text{Stress} = \frac{\sqrt{\sum_{i}\sum_{j} \left( f(x_{ij}) - d_{ij} \right)^{2}}}{\text{scale}}$$

where $d_{ij}$ is the Euclidean distance between items i and j across all the dimensions,

$f(x_{ij})$ is some function of the input data, and

scale is the constant scaling factor used to keep the stress value between 0 and 1.

Conventionally, the acceptable range of stress is 0.1 to 0.15. The ideal case is a stress of zero.
A stress value greater than 0 indicates distortion of the input data in the distance map.
The causes of a high stress coefficient can be random measurement error and insufficient
dimensionality.
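
The stress formula above can be evaluated directly for a given configuration of points. The sketch below makes two assumptions that the text leaves open: f is taken to be the identity function (the input proximities are used as they are), and the scaling factor is the square root of the sum of squared distances, as in Kruskal's stress-1. The function names and the example points are hypothetical.

```python
import numpy as np

def pairwise_distances(points):
    """Euclidean distance between every pair of rows in `points`."""
    diff = points[:, None, :] - points[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

def stress(proximities, points):
    """Stress of a configuration, counting each pair of items once."""
    d = pairwise_distances(points)
    upper = np.triu_indices_from(d, k=1)       # pairs with i < j
    residual = proximities[upper] - d[upper]   # f(x_ij) - d_ij, with f = identity
    scale = np.sqrt((d[upper] ** 2).sum())     # assumed stress-1 normalization
    return float(np.sqrt((residual ** 2).sum()) / scale)

# Input proximities: the true distances among three points (a 3-4-5 triangle).
true_points = np.array([[0.0, 0.0], [3.0, 0.0], [0.0, 4.0]])
proximities = pairwise_distances(true_points)

# A perfect configuration reproduces the input exactly; a distorted one does not.
distorted_points = np.array([[0.0, 0.0], [3.0, 0.0], [0.0, 3.0]])

stress_perfect = stress(proximities, true_points)
stress_distorted = stress(proximities, distorted_points)
print(stress_perfect, stress_distorted)
```

An MDS program searches for the configuration that makes this value as small as possible for the chosen number of dimensions.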

The different spatial representations of MDS are:

• Classical MDS: also known as “principal coordinates analysis”, this method takes an
input matrix of dissimilarities between pairs of items and produces an output matrix that
minimizes the “strain function”. The method is also called Torgerson-Gower scaling.
• Metric MDS: this representation uses an input matrix of known distances, and the
procedure to obtain the output matrix minimizes a loss function called the “stress function”.
Metric MDS is used for data on metric scales of measurement: interval and ratio.
• Non-metric MDS: non-metric MDS aims to find a monotonic relationship between the
dissimilarities in the input matrix and the Euclidean distances between the items. Non-metric
MDS is used for data on an ordinal scale. The method used to find the relationship is
“isotonic regression”. An example of non-metric MDS is Guttman’s smallest space analysis.
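
Of the three representations, classical MDS is the simplest to compute, because it has a closed-form solution. The following is a minimal sketch of the usual Torgerson procedure (double centering of the squared dissimilarities followed by an eigen-decomposition). The function name `classical_mds` and the example data are hypothetical, and the input is assumed to be a symmetric matrix of Euclidean-like distances.

```python
import numpy as np

def classical_mds(dissimilarities, n_dimensions=2):
    """Classical (Torgerson) MDS: coordinates from a dissimilarity matrix."""
    d = np.asarray(dissimilarities, dtype=float)
    n = d.shape[0]
    centering = np.eye(n) - np.ones((n, n)) / n        # double-centering matrix
    b = -0.5 * centering @ (d ** 2) @ centering        # inner-product matrix
    eigvals, eigvecs = np.linalg.eigh(b)               # ascending eigenvalues
    order = np.argsort(eigvals)[::-1][:n_dimensions]   # keep the largest ones
    top = np.clip(eigvals[order], 0.0, None)           # guard against tiny negatives
    return eigvecs[:, order] * np.sqrt(top)

# Example input: distances among the four corners of a unit square.
corners = np.array([[0.0, 0.0], [1.0, 0.0], [1.0, 1.0], [0.0, 1.0]])
square_distances = np.linalg.norm(corners[:, None, :] - corners[None, :, :], axis=-1)

coords = classical_mds(square_distances, n_dimensions=2)
recovered = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
print(np.round(recovered, 3))
```

The recovered coordinates may be rotated or reflected relative to the original square, since only the distances between points are determined by the input.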

4.3 The steps in MDS application to data:

There are several steps in applying the MDS to the research:

1. Formulating the problem: the aim of the research should be clear, along with the number of
variables that the researcher wishes to compare. Too many variables cause
confusion, whereas too few would not yield valid results.
2. Input data: subjects may be asked a series of questions in which each pair of items is
rated for similarity on a Likert-type scale or a semantic differential scale. The data can
also be gathered by asking about preferences rather than similarities among the pairs.
3. Running the MDS on the obtained data: procedures for running MDS are
available in many statistical software packages. The software offers a choice between
metric and non-metric MDS.
4. Deciding the number of dimensions to be created: the researcher must decide on the
number of dimensions the software should create. A higher number of
dimensions leads to a better statistical fit, but more
dimensions also make the interpretive stage more difficult.
5. Mapping the results and defining the dimensions: after running the MDS software, the
data are plotted as a Shepard diagram (a scatter plot of input proximities against output
distances for every pair of items scaled).
6. Testing the results for reliability and validity: various tests are available for assessing
the validity and reliability of the results obtained from MDS. Some of these
are Kruskal’s stress test, the split data test, the data stability test, etc. R² can also be
calculated to determine the proportion of variance of the scaled data accounted for by the MDS
procedure.
7. Reporting the results comprehensively.
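
The Shepard diagram and R² mentioned in steps 5 and 6 can be computed from the input proximities and the coordinates returned by any MDS program. The sketch below is an illustration under stated assumptions: R² is taken as the squared Pearson correlation between input proximities and output distances, and the proximity matrix and coordinates are invented for the example.

```python
import numpy as np

def shepard_pairs(proximities, coordinates):
    """Input proximity and output distance for every pair of items (i < j)."""
    p = np.asarray(proximities, dtype=float)
    x = np.asarray(coordinates, dtype=float)
    distances = np.linalg.norm(x[:, None, :] - x[None, :, :], axis=-1)
    upper = np.triu_indices_from(p, k=1)
    return p[upper], distances[upper]

def r_squared(input_values, output_values):
    """Squared Pearson correlation between inputs and outputs (assumed definition)."""
    r = np.corrcoef(input_values, output_values)[0, 1]
    return float(r ** 2)

# Hypothetical dissimilarity ratings for four items.
proximities = np.array([[0.0, 1.0, 1.5, 1.1],
                        [1.0, 0.0, 0.9, 1.3],
                        [1.5, 0.9, 0.0, 1.0],
                        [1.1, 1.3, 1.0, 0.0]])

# Hypothetical two-dimensional solution produced by an MDS run.
coordinates = np.array([[0.0, 0.0], [1.0, 0.0], [1.0, 1.0], [0.0, 1.0]])

inputs, outputs = shepard_pairs(proximities, coordinates)
fit = r_squared(inputs, outputs)
print(fit)
```

Plotting `inputs` against `outputs` as a scatter plot gives the Shepard diagram; points lying close to a monotonic line indicate that the map reproduces the input data well.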

4.4 Applications of MDS

MDS is a complex statistical technique that is difficult both to apply and to comprehend.
Nonetheless, it has applications in a wide range of natural and applied sciences. Some of the
areas are:
• Fields of knowledge that require scientific visualization of data, such as cognitive science,
ecology, marketing research and informatics.
• Psychology, where MDS proves its utility in studying cognitive processes,
psychophysics, psychometrics, etc.
• Geostatistics, for modeling the spatial variability of patterns.
• Linguistics, where MDS helps in natural language processing and semantics
by representing concepts in a multidimensional vector space.

5. Summary

• In this module, we have studied two multivariate statistical techniques: Principal
Component Analysis and multidimensional scaling.
• Principal Component Analysis (PCA) is a method for studying, visualizing and analyzing
large data sets.
• The aim of PCA is to transform a set of correlated variables into a new set of
uncorrelated variables.
• In unidimensional scaling the dimension is specified in advance and the responses are
generated accordingly. In reality, however, stimuli vary on a number of dimensions
simultaneously. In MDS the subjects are not informed in advance about the
dimension on which the response is to be generated; rather, the overall similarities or
dissimilarities among the stimuli are the basis for responding.
• The implicit model of how similarity judgments are produced by the brain is that items
have attributes (such as size, viciousness, intelligence, furriness, etc.) in varying degrees,
and the similarity between items is a function of their similarity/dissimilarity in scores
across all attributes.
• Multidimensional scaling is a statistical procedure that aims to visualize the
similarities among the entities in a data set. The basic assumption of MDS is that
every entity in the data set can be represented in a “p”-dimensional Euclidean geometric
space.
• Representing and visualizing the data set is a complex mathematical proposition.
Statistically speaking, MDS attempts to find a set of vectors in a p-dimensional space.
• The different spatial representations of MDS are classical MDS, metric MDS and
non-metric MDS.
• There are several steps in applying MDS to research, which have been discussed in the text.
• MDS is a complex statistical technique that is difficult both to apply and to comprehend.
Nonetheless, it has applications in a wide range of natural and applied sciences.

